Subject: Re: Byte Addressability And Beyond
From: monnier (at) *nospam* iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: 04 Jun 2024, 21:28:00
Organization: A noiseless patient Spider
Message-ID: <jwv7cf4mpug.fsf-monnier+comp.arch@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13)
> If I want to validate combining characters or normalize text, I need
> UTF-32 because I have to work with the whole character as a unit.
You can read the code points directly from the UTF-8 sequence almost
as easily as you can from a UTF-32 sequence.
Most of the cost will be in the memory accesses and then in looking up the
various tables to decide how to normalize or whether it's valid, so the
difference between reading the info from UTF-32 or UTF-8 should be lost in
the noise.
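
To make that concrete, here's a minimal decoder sketch (my code, not
from any particular library; it assumes already-validated input, since
a strict decoder would also reject overlong forms, surrogates, and
anything above U+10FFFF):

    #include <stdint.h>

    /* Sketch (mine): pull one code point straight out of a UTF-8
       buffer, assuming well-formed input.  Returns the code point
       and advances *p past it. */
    static uint32_t next_cp(const unsigned char **p)
    {
        const unsigned char *s = *p;
        uint32_t cp;
        if (s[0] < 0x80) {                  /* 0xxxxxxx: ASCII */
            cp = s[0];                                           *p += 1;
        } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx 10xxxxxx */
            cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F); *p += 2;
        } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx + 2 trail bytes */
            cp = ((uint32_t)(s[0] & 0x0F) << 12)
               | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F); *p += 3;
        } else {                            /* 11110xxx + 3 trail bytes */
            cp = ((uint32_t)(s[0] & 0x07) << 18)
               | ((uint32_t)(s[1] & 0x3F) << 12)
               | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F); *p += 4;
        }
        return cp;
    }

The branches are cheap and predictable on mostly-ASCII text, which is
why the up-front conversion to UTF-32 rarely pays for itself.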
UTF-32 might be marginally faster at this specific operation in some
cases (definitely not if your text is mostly ASCII), but I'd be very
surprised if the difference is ever large enough to pay for a conversion
from UTF-8 to UTF-32.
> I was just trying to get people thinking of ways that malformed
> characters might be used to bypass other validation checks in
> their software.
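
(A classic instance, for concreteness: the overlong two-byte sequence
0xC0 0xAF decodes to '/' in a lax decoder, which is how encoded path
separators have slipped past directory-traversal checks in the past.
A strict validator has to refuse it; a tiny sketch of mine:)

    #include <stdio.h>

    /* Sketch (mine): 0xC0 0xAF would decode to U+002F, but the
       shortest encoding of U+002F is the single byte 0x2F, so the
       two-byte form is overlong and must be rejected. */
    static int valid_2byte(unsigned char b0, unsigned char b1)
    {
        if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
            return 0;              /* not a 2-byte sequence at all */
        return b0 >= 0xC2;         /* 0xC0/0xC1 are always overlong */
    }

    int main(void)
    {
        printf("0xC0 0xAF valid? %d\n", valid_2byte(0xC0, 0xAF)); /* 0 */
        printf("0xC3 0xA9 valid? %d\n", valid_2byte(0xC3, 0xA9)); /* 1: U+00E9 */
        return 0;
    }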
Another issue with Unicode is the so-called "confusables": characters
that may look identical (or close enough) on screen yet are different
(and not just because of normalization). E.g. Greek Β vs Latin B,
Cyrillic А vs Latin A, or ∕ (division slash) vs / vs ⁄ (fraction
slash). Unicode ships a ~700kB `confusables.txt` listing such pairs.
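
To see why that matters for, say, identifier or domain-name checks,
compare the raw bytes (another tiny sketch of mine):

    #include <stdio.h>
    #include <string.h>

    /* Sketch (mine): U+0041 LATIN CAPITAL LETTER A and U+0410 CYRILLIC
       CAPITAL LETTER A render identically in most fonts, but their
       UTF-8 bytes differ, so any byte-wise comparison (hashing, table
       lookup, ACL match) treats them as two distinct names. */
    int main(void)
    {
        const char *latin    = "A";          /* 0x41            */
        const char *cyrillic = "\xD0\x90";   /* U+0410 in UTF-8 */
        printf("same bytes? %s\n",
               strcmp(latin, cyrillic) == 0 ? "yes" : "no"); /* "no" */
        return 0;
    }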
Stefan