Newsportal USENET - Re: Character non-equivalence, was Byte Addressability And Beyond

Subject : Re: Character non-equivalence, was Byte Addressability And Beyond
From : johnl (at) *nospam* taugh.com (John Levine)
Newsgroups : comp.arch
Date : 07. Jun 2024, 22:26:03

Other headers

Organization : Taughannock Networks
Message-ID : <v3vttb$5tk$1@gal.iecc.com>
References : 1 2 3 4
User-Agent : trn 4.0-test77 (Sep 1, 2010)

It appears that EricP <ThatWouldBeTelling@thevillage.com> said:

> Eeewww... I didn't even think of that.
> What does one do about them? You can't treat them as equivalent in a
> string compare... the user might want the first B and not second B.

People keep rediscovering that when you're using Unicode, nothing is
simple. One of its normalization forms is NFKC, which uses composed
versions of accented characters and applies compatibility equivalence
rules to turn some kinds of characters that look similar into a single
form.
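
A minimal sketch of what NFKC does and does not fold, using Python's
standard unicodedata module. The sample characters are picked for
illustration only and are not necessarily the ones discussed upthread;
the helper name nfkc is made up here.

```python
import unicodedata

def nfkc(s: str) -> str:
    """Normalize to NFKC: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", s)

# 'e' + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
composed_ok = nfkc("e\u0301") == "\u00e9"

# FULLWIDTH LATIN CAPITAL LETTER B (U+FF22) folds to plain ASCII 'B'.
fullwidth_folds = nfkc("\uff22") == "B"

# CYRILLIC CAPITAL LETTER VE (U+0412) looks like 'B' but is a different
# letter, so normalization leaves it alone.
cyrillic_unchanged = nfkc("\u0412") == "\u0412"

print(composed_ok, fullwidth_folds, cyrillic_unchanged)
```

So a compare on NFKC-normalized strings treats the fullwidth B as a B,
but still sees the Cyrillic look-alike as a different character.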

That solves some of the problems but not even close to all of them.
The rules about whether two strings are upper/lower case equivalent
depend on the language and sometimes even the local version of the
language, e.g. French French and Quebec French have different
conventions about accented capital letters.
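
To make the language dependence concrete, here is a rough sketch in
Python. str.casefold() and unicodedata are standard; lower_turkish and
strip_accents are hypothetical helpers written for this example, and
the Turkish and accent-dropping cases are stand-ins for the general
point, not a complete treatment of any language's rules.

```python
import unicodedata

def fold_default(s: str) -> str:
    """Language-neutral case folding, as provided by str.casefold()."""
    return s.casefold()

def lower_turkish(s: str) -> str:
    """Hypothetical helper: lowercase with the Turkish dotted/dotless i rule."""
    # Turkish pairs I with dotless i (U+0131) and dotted I (U+0130) with i.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

def strip_accents(s: str) -> str:
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The default folding handles German sharp s ...
german_match = fold_default("STRASSE") == fold_default("stra\u00dfe")

# ... but lowercases 'I' to 'i', which is not the Turkish convention.
default_i = "DI\u015e".lower()          # ends up with a dotted i
turkish_i = lower_turkish("DI\u015e")   # ends up with a dotless i

# Capitals written without accents do not match the accented lowercase
# word unless the comparison also ignores accents.
plain_match = fold_default("ETE") == fold_default("\u00e9t\u00e9")
loose_match = (strip_accents(fold_default("ETE"))
               == strip_accents(fold_default("\u00e9t\u00e9")))
```

Whether the accent-insensitive compare is the right one depends on who
wrote the text, which is exactly the information a bare string does not
carry.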

The only thing I can say with confidence is that any rule that starts
with "You can just ..." is wrong.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
