Subject : Re: Rationale for aligning data on even bytes in a Unix shell file?
From : david.brown (at) *nospam* hesbynett.no (David Brown)
Newsgroups : comp.lang.c
Date : 29. Apr 2025, 09:58:46
Organisation : A noiseless patient Spider
Message-ID : <vuq4c6$1ca4v$2@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0
On 29/04/2025 01:13, Janis Papanagnou wrote:
> On 28.04.2025 20:38, Bonita Montero wrote:
>> On 28.04.2025 at 20:05, Janis Papanagnou wrote:
>>> (I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
>>> support full Unicode; I thought it wouldn't, or only old restricted
>>> versions of Unicode.)
>>
>> Windows is UTF-16 since Windows 2000, UCS2 before.
No, Windows has had /some/ UTF-16 support since W2K, with gradual improvements over time to APIs, filesystems, and applications. Later on, it started getting /some/ UTF-8 support, which is a much better choice for most uses.
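To make that concrete, here is a rough sketch (untested, and assuming a
Windows toolchain) of the boundary you constantly deal with there: the
-W APIs take UTF-16, so UTF-8 text has to be converted first, e.g. with
MultiByteToWideChar and the CP_UTF8 code page:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *utf8 = "caf\xC3\xA9";   /* "café" as UTF-8 bytes */

    /* Ask for the required buffer size first (cchWideChar = 0). */
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len <= 0)
        return 1;

    wchar_t wide[16];   /* len is 5 here (4 chars + terminator) */
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);

    wprintf(L"%ls\n", wide);   /* hand UTF-16 to the wide-char API */
    return 0;
}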
> Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
> and a character not necessarily encoded with only one 16 bit word... -
> ...but then I wonder even more where you see an advantage.
When Unicode started, they thought 16 bits would be enough. UCS2 made sense then, because it was a fixed-size encoding - though it had the huge disadvantages of being endian-dependent and totally incompatible with every existing character set. Early Unicode adopters, including Windows NT and NTFS, Java, Qt and Python, used UCS2.
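The endian problem is easy to demonstrate (a minimal sketch; the byte
values in the comment assume the host byte orders shown):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t ucs2[] = { 0x0041, 0x20AC };   /* 'A', Euro sign U+20AC */
    const unsigned char *bytes = (const unsigned char *) ucs2;

    for (size_t i = 0; i < sizeof ucs2; i++)
        printf("%02X ", bytes[i]);
    printf("\n");
    /* Little-endian host: 41 00 AC 20
       Big-endian host:    00 41 20 AC
       Hence the BOM (U+FEFF) for 16-bit text - a UTF-8 byte stream
       has no such ambiguity. */
    return 0;
}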
Once Unicode was extended beyond 16 bits, three new encodings emerged - UCS4 (32-bit fixed size), UTF-8 and UTF-16. UCS4 has the advantage of being fixed size (that turns out to be a minor issue in practice, but was long thought to be important), but like UCS2 it is endian-dependent, and it is inefficient in size (though easily compressed, so that also matters less than many people think). UCS4 covers every Unicode code point in one code unit, but combining characters mean it still cannot represent every character in one code unit.
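For example (sketch only; the code points are the standard precomposed
and combining spellings of "é"):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* The same visible character, "é", in two valid UCS4 spellings: */
    uint32_t precomposed[] = { 0x00E9, 0 };          /* U+00E9 */
    uint32_t combining[]   = { 0x0065, 0x0301, 0 };  /* 'e' + U+0301 */

    size_t n1 = 0, n2 = 0;
    while (precomposed[n1]) n1++;
    while (combining[n2])   n2++;

    /* One "character" to the reader, but 1 vs. 2 code units: */
    printf("precomposed: %zu code unit(s), combining: %zu\n", n1, n2);
    return 0;
}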
UTF-16 is variable length, and can encode any Unicode code point. It has the advantage that UCS2 is a subset of it, making it a natural extension for existing UCS2 systems - but it keeps the same disadvantages: inefficiency for common ASCII text, incompatibility with ASCII, endian-dependence, and the need for dedicated functions for almost everything.
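Roughly, the encoding works like this (utf16_encode is a hypothetical
helper for illustration, not any particular library's API, and it
skips validation - real code must also reject U+D800..U+DFFF):

#include <stdio.h>
#include <stdint.h>

/* Encode one code point as UTF-16; returns the number of code units. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;   /* BMP: identical to the UCS2 subset */
        return 1;
    }
    cp -= 0x10000;                /* 20 bits left, split 10/10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

int main(void)
{
    uint16_t u[2];
    int n = utf16_encode(0x1F600, u);   /* U+1F600, outside the BMP */
    for (int i = 0; i < n; i++)
        printf("0x%04X ", u[i]);
    printf("\n");                       /* prints 0xD83D 0xDE00 */
    return 0;
}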
The biggest problem with UTF-16, IMHO, is that it delayed the adoption of UTF-8 in early Unicode software. Changing something like Qt or Windows from UCS2 to UTF-8 is not easy, but it would have been much better in the long run if that had been done directly, without switching to UTF-16 first.