Re: Byte Addressability And Beyond

Subject: Re: Byte Addressability And Beyond
From: monnier (at) *nospam* iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: 4 Jun 2024, 21:28:00
Organization: A noiseless patient Spider
Message-ID: <jwv7cf4mpug.fsf-monnier+comp.arch@gnu.org>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
User-Agent: Gnus/5.13 (Gnus v5.13)

> If I want to validate combiner codes or normalize characters I need
> UTF-32 because I have to work with the whole character as a unit.

You can read the code points directly from the UTF-8 sequence almost
as easily as you can from a UTF-32 sequence.
Most of the cost will be in the memory accesses and then in looking up the
various tables to decide how to normalize or whether the sequence is valid,
so the difference between reading the code points from UTF-32 or UTF-8
should be lost in the noise.
UTF-32 might be marginally faster at this specific operation in some
cases (definitely not if your text is mostly ASCII), but I'd be very
surprised if the difference is ever large enough to pay for a conversion
from UTF-8 to UTF-32.
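
To make the "read the code points directly" point concrete, here is a minimal sketch in Python of pulling code points straight out of a UTF-8 byte string. The name `decode_utf8` is made up for illustration, and this is not tuned code: it only shows that the work per character is a lead-byte test, a few shifts and masks, and a range check.

```python
def decode_utf8(data: bytes):
    """Yield code points from UTF-8 bytes; raise ValueError on malformed input."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte gives the sequence length and the smallest code point
        # that legitimately needs that length (used to reject overlong forms).
        if b0 < 0x80:                 # 1 byte: ASCII
            cp, extra, min_cp = b0, 0, 0x0
        elif 0xC2 <= b0 <= 0xDF:      # 2 bytes (0xC0/0xC1 would be overlong)
            cp, extra, min_cp = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:      # 3 bytes
            cp, extra, min_cp = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:      # 4 bytes
            cp, extra, min_cp = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1 : i + 1 + extra]:
            if (b & 0xC0) != 0x80:    # continuation bytes look like 10xxxxxx
                raise ValueError(f"bad continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        if cp < min_cp or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"invalid code point at offset {i}")
        yield cp
        i += extra + 1


print([hex(cp) for cp in decode_utf8("é€".encode("utf-8"))])  # ['0xe9', '0x20ac']
```

For mostly-ASCII text the first branch is taken nearly every time, which is why a UTF-32 copy of the same data buys little there.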

> I was just trying to get people thinking of ways that malformed
> characters might be used to bypass other validation checks in
> their software.
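
One well-known instance of that concern, chosen here purely as an illustration, is the overlong encoding: the bytes C0 AF are a malformed two-byte spelling of "/". The Python sketch below assumes a hypothetical sloppy decoder (`lenient_decode`) sitting behind a byte-level filter (`naive_is_safe`); both names are invented for the example.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical sloppy decoder: 1- and 2-byte forms only, no overlong check."""
    out, i = [], 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
        elif (b0 & 0xE0) == 0xC0 and i + 1 < len(data):
            # Bug on purpose: accepts 0xC0/0xC1 lead bytes (overlong forms).
            out.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


def naive_is_safe(raw: bytes) -> bool:
    """Filter that looks for '/' in the raw bytes before any decoding."""
    return b"/" not in raw


def strict_is_safe(raw: bytes) -> bool:
    """Decode strictly first, then check; malformed input is refused outright."""
    try:
        text = raw.decode("utf-8")    # Python's decoder rejects overlong forms
    except UnicodeDecodeError:
        return False
    return "/" not in text


payload = b"..\xc0\xaf..\xc0\xafetc"
print(naive_is_safe(payload))      # True: no 0x2F byte anywhere
print(lenient_decode(payload))     # ../../etc
print(strict_is_safe(payload))     # False
```

The general shape is that a check run on one representation gets bypassed when a later stage interprets the same bytes more generously.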

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization).  E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.
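
A small Python sketch of the "not just because of normalization" part: it compares each look-alike with its plain counterpart after NFKC normalization and prints the result rather than assuming it. The `TOY_SKELETON` table is a hand-written stand-in for this example, not data parsed from `confusables.txt`.

```python
import unicodedata

PAIRS = [
    ("\u0392", "B"),  # GREEK CAPITAL LETTER BETA vs LATIN CAPITAL LETTER B
    ("\u0410", "A"),  # CYRILLIC CAPITAL LETTER A vs LATIN CAPITAL LETTER A
    ("\u2215", "/"),  # DIVISION SLASH vs SOLIDUS
    ("\u2044", "/"),  # FRACTION SLASH vs SOLIDUS
]


def same_after_nfkc(a: str, b: str) -> bool:
    """True if NFKC normalization maps both strings to the same result."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Hypothetical hand-written mapping, standing in for real confusables data.
TOY_SKELETON = {"\u0392": "B", "\u0410": "A", "\u2215": "/", "\u2044": "/"}


def toy_skeleton(text: str) -> str:
    """Replace each character found in the toy table with its look-alike."""
    return "".join(TOY_SKELETON.get(ch, ch) for ch in text)


for odd, plain in PAIRS:
    print(
        f"U+{ord(odd):04X} {unicodedata.name(odd)}: "
        f"NFKC-equal={same_after_nfkc(odd, plain)}, "
        f"toy-skeleton-equal={toy_skeleton(odd) == toy_skeleton(plain)}"
    )
```

Comparing identifiers by such a skeleton, instead of by normalized form alone, is the kind of check the confusables data is meant to support.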


        Stefan

