Subject : Re: Byte Addressability And Beyond
From : terje.mathisen (at) *nospam* tmsw.no (Terje Mathisen)
Newsgroups : comp.arch
Date : 07. Jun 2024, 16:05:42
Organization : A noiseless patient Spider
Message-ID : <v3v7k7$24548$1@dont-email.me>
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2
EricP wrote:
> Stefan Monnier wrote:
>> Another issue with Unicode is the so-called "confusables": things that
>> may look identical (or close enough) on screen yet are different (and
>> not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
>> Unicode comes with a 700kB `confusables.txt` listing such issues.
>
> Eeewww... I didn't even think of that.
> What does one do about them? You can't treat them as equivalent in a
> string compare... the user might want the first B and not second B.
>
> I suppose one would want two compare equal functions,
> an exactly equal, and a visually approximately equal.
> Like using a soundex for words to catch misspellings.
> But then programmers need to decide when to use each compare.
>
> These character and code attribute lookup tables are looking awkward.
> With up to 2M codes, and some base character codes having multiple
> possible combiners, but very sparse. And links between entries
> for upper and lower case, and now links between confusables.
> And we don't want to roll over the L1 cache just to do a string compare.
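For concreteness, the "two compare functions" idea quoted above could be sketched in Python roughly as follows. This is only an illustration: the four-entry `CONFUSABLE_PROTOTYPES` table is a hand-picked stand-in for the real `confusables.txt`, and the function names are invented for the example. The `skeleton` step loosely follows the shape of the skeleton transform in Unicode's security mechanisms (decompose, map each character to a prototype, decompose again).

```python
import unicodedata

# Hypothetical, hand-picked subset for illustration only; the real mapping
# would be loaded from Unicode's confusables.txt.
CONFUSABLE_PROTOTYPES = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u2215": "/",  # DIVISION SLASH
    "\u2044": "/",  # FRACTION SLASH
}


def skeleton(s):
    """Map every character to its look-alike prototype (assumed table)."""
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def exactly_equal(a, b):
    """Code-point-for-code-point comparison."""
    return a == b


def visually_equal(a, b):
    """Approximate comparison: equal once confusables are folded together."""
    return skeleton(a) == skeleton(b)
```

The caller still has to decide which compare applies: `exactly_equal` for identifiers and keys, `visually_equal` for things like spoof detection.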
Years ago I considered case-insensitive Boyer-Moore text search with a wide alphabet and found that the only approach that made sense was to maintain two copies of the string to be searched for, one lower and one upper case, where each "character" was a length-encoded string. This was required to handle things like the German double s which can uppercase into a single letter.
The lookup table for skip lengths was still far shorter than the alphabet size, effectively a very short and fast hash of the current character/codepoint/combined letter.
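A simplified Python sketch of the idea, using the Horspool variant of Boyer-Moore, might look like this. It is not the original code: it assumes every pattern character has a one-to-one lower/upper mapping, so it deliberately rejects patterns containing the German sharp s instead of implementing the length-encoded "characters" described above. The 256-entry table size and the `slot` hash (low bits of the code point) are likewise assumptions made for the example.

```python
TABLE_SIZE = 256  # assumed size, far smaller than the code point space


def slot(ch):
    # Hypothetical hash: the low bits of the code point.
    return ord(ch) & (TABLE_SIZE - 1)


def build(pattern):
    """Return the lower and upper copies of the pattern plus the skip table."""
    lower, upper = pattern.lower(), pattern.upper()
    if not (len(lower) == len(upper) == len(pattern)):
        # E.g. sharp s uppercases to two letters; not handled in this sketch.
        raise ValueError("sketch assumes 1:1 case mappings")
    m = len(pattern)
    skip = [m] * TABLE_SIZE
    for i in range(m - 1):
        dist = m - 1 - i
        for ch in (lower[i], upper[i]):
            s = slot(ch)
            # Colliding characters share a slot; keeping the minimum is safe,
            # it can only make a shift shorter, never skip past a match.
            skip[s] = min(skip[s], dist)
    return lower, upper, skip


def find_caseless(text, pattern):
    """Index of the first case-insensitive match of pattern, or -1."""
    if not pattern:
        return 0
    lower, upper, skip = build(pattern)
    m, n = len(pattern), len(text)
    pos = 0
    while pos + m <= n:
        j = m - 1
        # Compare right to left against both copies of the pattern.
        while j >= 0 and text[pos + j] in (lower[j], upper[j]):
            j -= 1
        if j < 0:
            return pos
        # Shift by the (hashed) skip of the text character under the
        # last pattern position.
        pos += skip[slot(text[pos + m - 1])]
    return -1
```

For example, `find_caseless("Hello WORLD", "world")` returns 6. A hash collision only costs some skip distance, which is why the table can stay small enough to sit comfortably in L1.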
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"