On 12/28/2024 11:24 AM, Tim Rentsch wrote:
BGB <cr88192@gmail.com> writes:
On 12/23/2024 3:18 PM, Tim Rentsch wrote:
>
Michael S <already5chosen@yahoo.com> writes:
>
On Mon, 23 Dec 2024 09:46:46 +0100
David Brown <david.brown@hesbynett.no> wrote:
>
And Tim did not rule out using the standard library,
>
Are you sure?
>
I explicitly called out setjmp and longjmp as being excluded.
Based on that, it's reasonable to infer the rest of the
standard library is allowed.
>
Furthermore I don't think it matters. Except for a very small
set of functions -- eg, fopen, fgetc, fputc, malloc, free --
everything else in the standard library either isn't important
for Turing Completeness or can be synthesized from the base
set. The functionality of fprintf(), for example, can be
implemented on top of fputc and non-library language features.
>
If I were to choose a set of primitive functions, probably:
malloc/free and/or realloc
could define, say:
malloc(sz) => realloc(NULL, sz)
free(ptr) => realloc(ptr, 0)
Maybe _msize and _mtag/..., but this is non-standard.
With _msize, can implement realloc on top of malloc/free.
>
For basic IO:
fopen, fclose, fseek, fread, fwrite
>
printf could be implemented on top of vsnprintf and fputs
fputs can be implemented on top of fwrite (via strlen).
With a temporary buffer being used for the printed string.
Most of these aren't needed. I think everything can be
done using only fopen, fclose, fgetc, fputc, and feof.
If you only have fgetc and fputc, IO speeds are going to be unacceptably slow for non-trivial file sizes.
If you try to fake fseek by closing the file, re-opening it, and skipping forward with an fgetc loop, that is also going to be very slow.
Then again, fgetc/fputc as the primary operations could make sense for text files if the implementation is doing some form of format conversion (such as converting between LF-only and CR+LF line endings). Though, admittedly, IMO one is better off treating text files as equivalent to binary files and letting the application deal with any such conversions.
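Say, the sort of conversion meant here amounts to roughly the following (a quick sketch, name made up for illustration):

  #include <stdio.h>

  /* Hypothetical text-mode getc wrapper: folds CR+LF (and bare CR)
     down to LF, passes everything else through unchanged. */
  int txt_getc(FILE *fp)
  {
      int ch = fgetc(fp);
      if (ch == '\r')
      {
          int ch2 = fgetc(fp);          /* look at the next byte */
          if (ch2 != '\n' && ch2 != EOF)
              ungetc(ch2, fp);          /* bare CR: push back, report LF */
          return '\n';
      }
      return ch;
  }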
OTOH:
fgetc and fputc can be implemented via fread and fwrite;
feof (for normal files) can be implemented via fseek (*1);
Similarly, ftell could be treated as a special case of fseek.
*1: Say, if the internal fseek call were made to return the current file position (similar to lseek).
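For illustration, roughly (the my_* names are placeholders; this ignores buffering and most error handling):

  #include <stdio.h>

  /* fgetc/fputc in terms of fread/fwrite. */
  int my_fgetc(FILE *fp)
  {
      unsigned char ch;
      return (fread(&ch, 1, 1, fp) == 1) ? ch : EOF;
  }

  int my_fputc(int ch, FILE *fp)
  {
      unsigned char b = (unsigned char)ch;
      return (fwrite(&b, 1, 1, fp) == 1) ? b : EOF;
  }

  /* feof for a normal (seekable) file, via seek+tell; note that
     unlike the real feof this reports EOF positionally, before a
     read has actually failed. */
  int my_feof(FILE *fp)
  {
      long pos = ftell(fp), end;
      fseek(fp, 0, SEEK_END);
      end = ftell(fp);
      fseek(fp, pos, SEEK_SET);
      return pos >= end;
  }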
...
Well, on another note, I was also recently left facing off with the wonk of UTF-8 normalization for the VFS layer in my project (for paths/filenames). Options:
  Do Nothing: assume valid UTF-8 and that it is sensibly normalized;
    May risk malformed encodings at deeper levels of the VFS though.
  Encoding-only normalization:
    Normalize to an M-UTF-8 variant and call it done (see the sketch after this list).
  Do a subset of the combining-character normalization:
    The full set of Unicode rules would likely be too bulky;
    Filesystem should have no concept of locale;
    The rules should ideally be "semi frozen" once defined.
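As a rough sketch of the encoding-only option (going by a Java-style M-UTF-8 convention where U+0000 is stored as C0 80; the function name is just for illustration):

  #include <stdint.h>
  #include <stddef.h>

  /* Re-emit one codepoint in canonical M-UTF-8 style form.
     Returns the number of bytes written to dst (up to 4). */
  size_t mutf8_emit(uint32_t cp, unsigned char *dst)
  {
      if (cp == 0)
      {   dst[0] = 0xC0; dst[1] = 0x80; return 2; }      /* NUL as C0 80 */
      if (cp < 0x80)
      {   dst[0] = (unsigned char)cp; return 1; }
      if (cp < 0x800)
      {   dst[0] = (unsigned char)(0xC0 | (cp >> 6));
          dst[1] = (unsigned char)(0x80 | (cp & 0x3F));  return 2; }
      if (cp < 0x10000)
      {   dst[0] = (unsigned char)(0xE0 | (cp >> 12));
          dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
          dst[2] = (unsigned char)(0x80 | (cp & 0x3F));  return 3; }
      dst[0] = (unsigned char)(0xF0 | (cp >> 18));
      dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
      dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
      return 4;
  }

The normalization pass would then be: leniently decode each codepoint from the incoming name and re-emit it through something like this, so off-spec or overlong sequences come back out in a single canonical form.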
At present, this is applied at the level of VFS syscalls (like "open()" or "opendir()").
Current thinking is that it will normalize to a variant of M-UTF-8 NFC (characters are stored in composed forms), but:
  Will only apply the rules covering the Latin-1 and Latin Extended A spaces, and a subset of Latin Extended B.
Though, a case could be made for limiting the scope solely to the Latin-1/1252 range (and passing everything beyond this along as-is).
Less sure: I had also added cases for the Roman numeral characters, mostly for decomposing them into ASCII; various ligatures would also be decomposed to ASCII (excluding those which appear as their own glyph, so AE and OE are left as-is, but IJ/DZ/... would be decomposed). A case could also be made for leaving these alone (passing them along unmodified). It depends mostly on the open question of whether or not these convey relevant semantic information (or are merely historical/aesthetic).
At present, the rules are stored as a table, with roughly 8 bytes needed per combiner rule (increases to 12 once initialized, mostly because it allocates a pair of 16-bit hash chains).
Namely: SrcCodepoint1, SrcCodepoint2, DstCodepoint, Flags
  Flags specify when and how the rule is applied.
  SrcCodepoint2 is currently 0x0000 for simple conversion rules.
  DstCodepoint is used as the lookup key when decomposing.
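In C terms, an entry is basically something like this (names and the exact chain layout are approximate here, shown as one plausible arrangement):

  #include <stdint.h>

  /* One combiner/conversion rule, 8 bytes as stored. */
  typedef struct {
      uint16_t SrcCodepoint1;   /* base character (or sole source)         */
      uint16_t SrcCodepoint2;   /* combining mark, 0x0000 for simple rules */
      uint16_t DstCodepoint;    /* composed form; lookup key for decompose */
      uint16_t Flags;           /* when and how the rule is applied        */
  } NormRule;

  /* Roughly 12 bytes once initialized, after attaching the pair of
     16-bit hash-chain links: */
  typedef struct {
      NormRule rule;
      uint16_t NextCompose;     /* next entry in the compose hash chain    */
      uint16_t NextDecompose;   /* next entry in the decompose hash chain  */
  } NormRuleLive;

The idea being that compose lookup hashes (SrcCodepoint1, SrcCodepoint2) and walks one chain, while decompose hashes DstCodepoint and walks the other.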
...
Limiting the scope also likely makes things more repeatable (inconsistent normalization could result in file-lookup issues in cases where the rules differ, if a name happens to step on the offending code points). The goal is mostly to find an acceptable set of rules that can be "mostly frozen". Though, in most cases this is likely N/A, as the majority of filenames tend to be plain ASCII.
The responsibility for any more advanced normalization (or locale-dependent stuff) would be left up at the application level.
Can't seem to find much information about "best practices" in these areas.
It is not certain that normalizing for combining characters is actually a good idea, versus only normalizing the codepoint encoding. The encoding normalization is mostly there to deal with cases where malformed data is submitted to the VFS, or possibly 1252 (if the VFS calls and similar are given something that is not valid UTF-8, it may be assumed to be 1252). Theoretically, the locale code in the C library is expected to handle the 1252-vs-UTF-8 conversion, but ideally the integrity of the VFS should be kept protected from this sort of thing regardless.
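The check for whether something is valid UTF-8 doesn't need to be anything fancy; something along these lines (simplified sketch, name made up; it checks lead/continuation structure but not every overlong form):

  /* Returns 1 if s looks like well-formed UTF-8, else 0 (in which
     case the caller falls back to treating the bytes as 1252). */
  int looks_like_utf8(const unsigned char *s)
  {
      while (*s)
      {
          int n;
          if      (*s < 0x80)           n = 0;   /* ASCII           */
          else if ((*s & 0xE0) == 0xC0) n = 1;   /* 2-byte sequence */
          else if ((*s & 0xF0) == 0xE0) n = 2;   /* 3-byte sequence */
          else if ((*s & 0xF8) == 0xF0) n = 3;   /* 4-byte sequence */
          else return 0;                         /* bad lead byte   */
          s++;
          while (n--)
          {
              if ((*s & 0xC0) != 0x80)
                  return 0;                      /* missing continuation */
              s++;
          }
      }
      return 1;
  }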
The 1252-vs-UTF-8 handling also applies to console printing, which is likewise expected to be handed UTF-8, but may also normalize the strings. Though, there is some wonk with the console here in my case.
Seemingly (from what I can gather):
  Linux:
    It is per FS driver;
    Some are "do nothing", others normalize.
  MacOS:
    Also depends on the filesystem:
      HFS/HFS+: normalizing (as NFD for some reason);
      APFS: does nothing (apparently leads to a lot of hassles).
  Windows:
    FAT32: Depends solely on the OS locale;
    NTFS: Locale rules are baked in when the drive is formatted;
      The relevant tables are held in filesystem metadata.
...