Subject : Re: Rationale for aligning data on even bytes in a Unix shell file?
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.lang.c
Date : 09. May 2025, 19:31:27
Other headers
Organisation : A noiseless patient Spider
Message-ID : <vvlhvh$2uri7$2@dont-email.me>
User-Agent : Mozilla Thunderbird
On 5/9/2025 12:52 PM, Bonita Montero wrote:
> On 07.05.2025 at 12:08, BGB wrote:
>> If you know one side is UTF-8 and the other is UTF-16, then conversion does not need to know or care which locale is in effect.
>
> Unicode has no locales, i.e. alternative meanings for the same code point.
> Even the characters from 128 to 255 are fixed to Latin-1.
A locale is not an encoding; nor is it a codepage.
A locale is a set of formatting and language-specific rules to apply.
In some past contexts, a locale may have been associated with the use of a specific code page, but code pages are N/A with Unicode. Even so, various language-specific rules may still exist.
For things like case-folding, you may still need to care about which language (AKA, locale) is in effect, as some conversions may apply to some languages but not others.
Some letters case-map differently depending on the language (for example, Turkish and Azerbaijani uppercase 'i' to U+0130 rather than to 'I'), and ligatures may be in effect (which may compose/decompose or map to other ligatures), etc.
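As a rough illustration of why the language can matter, here is a minimal sketch of an uppercase step that special-cases the dotted/dotless i of Turkish and Azerbaijani. The name upper_for_lang and the is_turkic flag are made up for this example; a real implementation would derive that from the locale name and would need to cover far more than one letter.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper (not from any library): uppercase one code point.
   Handles only ASCII a..z, plus the Turkic dotted/dotless i when asked. */
static uint32_t upper_for_lang(uint32_t cp, bool is_turkic)
{
    if (is_turkic) {
        if (cp == 0x0069) return 0x0130;  /* i         -> I with dot above */
        if (cp == 0x0131) return 0x0049;  /* dotless i -> plain I          */
    }
    if (cp >= 0x0061 && cp <= 0x007A)     /* a..z -> A..Z */
        return cp - 0x20;
    return cp;                            /* everything else passes through */
}
```

So the same input, U+0069, yields U+0049 under most languages but U+0130 when the Turkic rule is in effect, and neither the encoding nor a code page has anything to do with it.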
Or, one just throws a lot of this out and uses a simplified set of "mostly language neutral" rules.
Say, case conversion maps:
Upper: 0061..007A -> 0041..005A
Lower: 0041..005A -> 0061..007A
Upper: 00E0..00FE -> 00C0..00DE (except 00F7, the division sign)
Lower: 00C0..00DE -> 00E0..00FE (except 00D7, the multiplication sign)
... (Add a few more, for Greek / Cyrillic / etc)
And, maybe a few special cases, say (*):
009A <-> 008A
009C <-> 008C
009E <-> 008E
00FF <-> 009F
*: Assuming the "CP-1252 mappings in Unicode space, replacing the C1 controls" wonk.
Probably ignore most everything else; it passes through as-is.
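For what it is worth, the table above could be sketched in C roughly as follows. The names simple_toupper and simple_tolower are just placeholders, and only the ASCII and Latin-1 rows plus the special cases are filled in (no Greek or Cyrillic). The sketch also leaves 00D7 and 00F7 alone, since those are the multiplication and division signs rather than letters.

```c
#include <stdint.h>

/* Simplified, mostly language-neutral uppercase mapping (sketch only). */
static uint32_t simple_toupper(uint32_t cp)
{
    if (cp >= 0x0061 && cp <= 0x007A)                  /* a..z */
        return cp - 0x20;
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7)  /* Latin-1, not the division sign */
        return cp - 0x20;
    switch (cp) {  /* special cases, assuming CP-1252 in place of C1 controls */
    case 0x009A: return 0x008A;
    case 0x009C: return 0x008C;
    case 0x009E: return 0x008E;
    case 0x00FF: return 0x009F;
    }
    return cp;     /* everything else passes through as-is */
}

/* Matching lowercase mapping, the inverse of the above. */
static uint32_t simple_tolower(uint32_t cp)
{
    if (cp >= 0x0041 && cp <= 0x005A)                  /* A..Z */
        return cp + 0x20;
    if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)  /* Latin-1, not the multiplication sign */
        return cp + 0x20;
    switch (cp) {
    case 0x008A: return 0x009A;
    case 0x008C: return 0x009C;
    case 0x008E: return 0x009E;
    case 0x009F: return 0x00FF;
    }
    return cp;
}
```

Operating on code points rather than bytes keeps this independent of whether the text arrived as UTF-8 or UTF-16; the decode step happens before it and the encode step after.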