Subject : Re: Rationale for aligning data on even bytes in a Unix shell file?
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.lang.c
Date : 07. May 2025, 11:08:03
Organization : A noiseless patient Spider
Message-ID : <vvfbnj$ulpc$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
User-Agent : Mozilla Thunderbird
On 5/6/2025 9:35 AM, Bonita Montero wrote:
On 29.04.2025 at 08:25, Richard Heathfield wrote:
A dog has a tail, but that doesn't mean a tail is a dog. Whatever UTF-8 may or may not have, it's an encoding, not a locale.
As no one considers UTF-8 for anything different than encoding
Unicode-characters you can say that UTF-8 encodes a charset.
...
FWIW, UTF-8:
0xxxxxxx //U+0000..U+007F
110xxxxx 10xxxxxx //U+0080..U+07FF
1110xxxx 10xxxxxx 10xxxxxx //U+0800..U+FFFF
And, going further:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx //U+010000..U+10FFFF
Technically it is possible to go larger still, but conventionally Unicode stops after U+10FFFF.
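The bit patterns above can be sketched as a small encoder. This is only an illustration: the name utf8_encode is made up for this post, and rejecting surrogates (U+D800..U+DFFF) is an extra check that the table itself does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or anything beyond U+10FFFF).
 * utf8_encode is a hypothetical name, not a standard library function. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* conventionally out of range */
}
```

For example, U+20AC (the Euro sign) comes out as the three bytes E2 82 AC.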
If you know one side is UTF-8 and the other is UTF-16, then conversion does not need to know or care which locale is in effect.
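As a rough sketch of why no locale is involved, here is a UTF-8 to UTF-16 converter that uses nothing but bit arithmetic. The name utf8_to_utf16 is hypothetical, and the code assumes the input is already well-formed UTF-8 (it does no validation).

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n bytes of UTF-8 into UTF-16 code units.
 * ASSUMPTION: src is well-formed UTF-8; malformed input is not detected.
 * dst needs room for up to n code units. Returns the number written.
 * No locale functions are called anywhere. */
size_t utf8_to_utf16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned char b = src[i++];
        uint32_t cp;
        size_t extra;

        /* The lead byte tells us how many continuation bytes follow. */
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }

        /* Each continuation byte contributes its low 6 bits. */
        for (; extra > 0 && i < n; extra--)
            cp = (cp << 6) | (src[i++] & 0x3F);

        if (cp < 0x10000) {
            dst[o++] = (uint16_t)cp;       /* BMP: one code unit */
        } else {
            cp -= 0x10000;                 /* beyond BMP: surrogate pair */
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

The byte sequence F0 9F 98 80 (U+1F600) becomes the surrogate pair D83D DE00, whichever locale happens to be set.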
UTF-8 should not be confused with 8859-1 or CP1252.
8859-1: Directly maps bytes to 0000..00FF
80..9F being a block of pretty much never-used control characters.
CP1252: Same as 8859-1, but remaps 80..9F to other characters.
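To make the difference concrete, a minimal sketch of both decoders follows. The function names are made up for illustration. CP1252 leaves five bytes (81, 8D, 8F, 90, 9D) unassigned; falling back to the 8859-1 value for those is my assumption here, not something the code page mandates.

```c
#include <stdint.h>

/* CP1252 assignments for bytes 0x80..0x9F.
 * A 0 entry marks a byte that CP1252 leaves unassigned. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* 8859-1: the byte value is the code point (hypothetical helper name). */
uint16_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* CP1252: same as 8859-1 except for the 0x80..0x9F block. */
uint16_t cp1252_to_unicode(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F) {
        uint16_t u = cp1252_high[b - 0x80];
        /* ASSUMPTION: unassigned bytes fall back to the 8859-1 control code. */
        return u ? u : b;
    }
    return b;
}
```

So byte 80 is the control code U+0080 under 8859-1 but the Euro sign U+20AC under CP1252, while byte E9 is U+00E9 under both.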
There is a variant of UTF-8 known as M-UTF-8, where:
Anything outside 0000..FFFF is first encoded as in UTF-16 and then encoded as UTF-8 (as opposed to using UTF-8 to directly express values outside of the "Basic Multilingual Plane");
Some special values, like U+0000 (NUL) are encoded as 2 bytes (C0 80), allowing NUL to be expressed within a string.
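The two M-UTF-8 rules above could be sketched roughly as below. The names mutf8_encode and put3 are hypothetical, and passing lone surrogates through unchanged is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Write u (at most 0xFFFF) in the 3-byte form 1110xxxx 10xxxxxx 10xxxxxx. */
static size_t put3(uint32_t u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}

/* Encode one code point in the modified form; out needs 6 bytes.
 * Returns the number of bytes written, or 0 if cp is beyond U+10FFFF.
 * ASSUMPTION: lone surrogates are simply written in the 3-byte form. */
size_t mutf8_encode(uint32_t cp, unsigned char out[6])
{
    if (cp == 0) {                 /* NUL becomes C0 80, never a 00 byte */
        out[0] = 0xC0;
        out[1] = 0x80;
        return 2;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
        return put3(cp, out);
    if (cp > 0x10FFFF)
        return 0;

    /* Outside the BMP: split as UTF-16 would, then encode each half. */
    cp -= 0x10000;
    put3(0xD800 | (cp >> 10), out);
    put3(0xDC00 | (cp & 0x3FF), out + 3);
    return 6;
}
```

With this, U+1F600 takes six bytes (ED A0 BD ED B8 80) rather than the four bytes F0 9F 98 80 of plain UTF-8, and the encoded string never contains a 00 byte.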
As can be noted with UTF-8, when reading a character as bytes:
00..7F: ASCII
C0..DF (one 80..BF byte follows): 0080..07FF
E0..EF (two 80..BF bytes follow): 0800..FFFF
F0..F7 (three 80..BF bytes follow): 010000..10FFFF
If a byte sequence is found that breaks these patterns, we can infer that the text is not UTF-8 (and random CP1252 text is statistically unlikely to also be valid under the UTF-8 encoding rules).
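A minimal sketch of such a check is below. The name looks_like_utf8 is made up, and the tests for overlong forms, surrogates, and values past U+10FFFF are stricter than the byte patterns alone require (they also mean the C0 80 form of M-UTF-8 is rejected).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if s[0..n-1] follows the UTF-8 byte patterns.
 * looks_like_utf8 is a hypothetical name for this sketch. */
bool looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i++];
        uint32_t cp, min;
        size_t extra;

        if (b < 0x80)
            continue;                                  /* ASCII */
        else if (b >= 0xC0 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF7) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else
            return false;          /* stray 80..BF byte, or F8..FF */

        for (; extra > 0; extra--) {
            /* Every expected continuation byte must be in 80..BF. */
            if (i >= n || (s[i] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i++] & 0x3F);
        }

        /* Extra strictness: no overlong forms, surrogates, or > U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }
    return true;
}
```

As an example, "café" stored as CP1252 ends in the byte E9, which looks like the start of a three-byte sequence with nothing after it, so the check fails; the UTF-8 form (ending in C3 A9) passes.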
As for locales and filesystems:
Ideally, the filesystem proper should not need to know nor care.
UI may care, but UI is its own thing.
As for case (in)sensitivity:
Ideally, filesystems should be case sensitive by default;
If someone wants case insensitivity, this can be better handled at the application or file-browser level.
Admittedly, in my project, I took the non-standard option of treating FAT32/VFAT as case sensitive (though it will refuse to create a file if, after case normalization, a file already exists whose name differs solely in case).
So, say, the OS will not treat "Makefile" and "makefile" as the same file in FAT; rather, if one exists, the other may not be created.
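Roughly, the policy looks like the sketch below. This is not the actual code from my project: the names are invented for illustration, the directory is modeled as a plain array of names, and the case folding is ASCII-only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: ASCII-only folding, deliberately independent of the locale. */
static int fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* True if a and b are equal once ASCII letter case is ignored. */
static bool same_after_folding(const char *a, const char *b)
{
    while (*a && *b) {
        if (fold_ascii((unsigned char)*a) != fold_ascii((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* both strings must end together */
}

/* Lookup is case sensitive: only an exact match finds the file.
 * Returns the index of the entry, or -1 if there is none. */
long lookup_exact(const char *name, const char *const *dir, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(name, dir[i]) == 0)
            return (long)i;
    return -1;
}

/* Creation is refused if any existing name matches after folding,
 * which covers both exact duplicates and names differing only in case. */
bool may_create(const char *name, const char *const *dir, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (same_after_folding(name, dir[i]))
            return false;
    return true;
}
```

With a directory holding "Makefile", lookup_exact("makefile", ...) finds nothing, and may_create("makefile", ...) also says no.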
Though, if someone really must make something case-insensitive, a case could be made for only supporting it for maybe Latin, Greek, and Cyrillic. Ideally, this would be better handled in a file-browser or similar, and not in the VFS or FS driver itself.