Re: Unicode in strings

Subject: Re: Unicode in strings
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 18 May 2024, 07:29:20
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024May18.072920@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> [> Anton Ertl:]
> [>> Thomas Koenig:]
>>> Assume you're implementing a language which has a function of setting
>>> an individual character in a string.
>> That's a design mistake in the language, and I know no language that
>> has this misfeature.
>
> I suspect "individual character" meant "code point" above.

I meant character, not code point, as should have become clear from
the following.  I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".

> Does Unicode even have the notion of "character", really?

AFAIK it does not.  But applications like palindrome checkers care
about characters, not code points.

OTOH, most code can be implemented fine as working on strings, without
knowing how many characters there are in the string (and it then does
not need to know about code points, either).  In other words, it can
be implemented just as well when the strings are represented as
strings of code units (whether UTF-8 (bytes), UTF-16 (16-bit code
units) or UTF-32 (32-bit code units)), and then it does not help to
convert UTF-8 to something else on input and something else to UTF-8
on output.
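
A minimal Python sketch of this point, under the assumption that the
input arrives as UTF-8 bytes (the sample text and variable names are
made up for the illustration).  Searching, splitting and concatenating
work on the code units directly; decoding happens only to print the
result:

```python
# Sketch: handle UTF-8 text as opaque byte strings, never decoding it
# for the actual work.  The sample text is hypothetical.
text = "Grüße aus Wien, König!".encode("utf-8")   # stands in for UTF-8 input
needle = "König".encode("utf-8")

# Searching works directly on the code units (bytes).  UTF-8 is
# self-synchronizing, so a valid needle cannot match starting in the
# middle of a code point of a valid haystack.
pos = text.find(needle)          # byte offset, not a code point index

# Splitting at an ASCII delimiter and concatenating also work on bytes.
parts = text.split(b", ")
joined = b" / ".join(parts)

# Decode only for display.
print(pos, joined.decode("utf-8"))
```

None of these operations needs to know how many code points or
characters the strings contain.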

For the code that cares about characters, if it wants to work
correctly for characters that cannot be precomposed into a single code
point, it has to deal with characters that consist of multiple code
points, i.e., that even in UTF-32 are variable-width.  So given that
you have to bite the variable-width bullet anyway, you can just as
well use UTF-8.
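
As a rough illustration in Python: a palindrome check that reverses
code points gives the wrong answer as soon as a character consists of
several code points.  The helper `characters` below is a hypothetical
simplification (base code point plus following combining marks, found
via `unicodedata.combining`); the full Unicode rules for user-perceived
characters are more involved.

```python
import unicodedata

def characters(s):
    """Group each base code point with the combining marks that follow it.

    Simplification: only code points with a nonzero canonical combining
    class are attached; real grapheme-cluster rules cover more cases.
    """
    chars = []
    for cp in s:
        if chars and unicodedata.combining(cp) != 0:
            chars[-1] += cp          # attach mark to the previous character
        else:
            chars.append(cp)         # start a new character
    return chars

def is_palindrome(s):
    """Compare the sequence of characters with its reverse."""
    chars = characters(s)
    return chars == chars[::-1]

# "a" with two combining marks, "b", and the same marked "a" again
s = "a\u0316\u0308ba\u0316\u0308"

print(len(s))             # number of code points
print(len(characters(s))) # number of characters under the simplification
print(s == s[::-1])       # code-point reversal
print(is_palindrome(s))   # character-level reversal
```

Here the string has 7 code points but 3 characters, and only the
character-level check recognizes it as a palindrome.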

>> Instead, what we see is one language (Python3) that has an even worse
>> misfeature: You can set an individual code point in a string; see
>> above for the things you get when you overwrite code points.
>
> I think it's fairly common for languages that started with strings
> as "arrays of 8bit chars".

Apart from Python3, not in the languages that I have looked at more
closely with respect to this feature.

In particular, C was created by adding a byte type to B, and that type
was called "char".  It was allowed to be wider to cater for
word-addressed machines, but on byte-addressed machines "char" is
invariably a byte.  To cater to Unicode, they used a two-pronged
approach: they added wchar_t and multi-byte functions (IIRC both
already in C89); wchar_t was obviously introduced to cater for the
upcoming Unicode 1.0 (which satisfied code unit=code point=character),
while the multibyte stuff was probably introduced originally for
dealing with the ASCII-compatible East-Asian encodings.

When UTF-8 arrived, the multi-byte functions proved to fit that well;
but of course there is not much usage of those functions, because most
code works fine without knowing about individual code points or
characters.  And UTF-8 turned out to be the answer to dealing with
Unicode that the Unix programmers who had a lot of code working with
strings of chars (i.e., bytes) were looking for.

Then Unicode 2.0 arrived and the Win32 API (which had embraced wchar_t
and defined it as being 16-bit) stuck with 16-bit wchar_t, which
breaks "code unit=code point"; this may not be in line with the
intentions of the inventors of wchar_t (e.g., there are no
multi-wchar_t functions in the C standard last time I looked), but
that has been the existing practice in wchar_t use in C for more than
a quarter-century.
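
To make the broken "code unit=code point" concrete, here is a small
Python sketch that counts the code units of a single code point in each
encoding form (the function name `code_units` is made up for the
illustration):

```python
def code_units(cp):
    """Return (UTF-8, UTF-16, UTF-32) code unit counts for one code point."""
    utf8 = len(cp.encode("utf-8"))            # 8-bit code units
    utf16 = len(cp.encode("utf-16-le")) // 2  # 16-bit code units
    utf32 = len(cp.encode("utf-32-le")) // 4  # 32-bit code units
    return utf8, utf16, utf32

for cp in ["A", "\u00f6", "\u20ac", "\U0001F642"]:
    print(f"U+{ord(cp):04X}", code_units(cp))
```

Code points beyond U+FFFF, such as U+1F642, take two 16-bit code units
(a surrogate pair), so a 16-bit wchar_t cannot hold them in one unit.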

Unix, where wchar_t was (and still is) little used, switched to 32-bit
wchar_t, but

1) given that Unicode at some point (probably already in 2.0) broke
"code point=character", that does not really help software like
palindrome checkers.

2) wchar_t is little-used in Unix-specific code.

3) Code that wants to be portable between Unix and Windows and uses
wchar_t cannot rely on "code unit=code point" anyway.

So, in practice, C code does not make use of the ability to set an
individual code point by overwriting a fixed-size code unit.

Forth has chars that are 8 bits wide in traditional Forth systems on
byte-addressed machines.  The 1994 standard (written in the middle of
the reign of Unicode 1.0, and with lots of Californians on the
standardization committee) provided the option to implement Forth
systems with chars that take a fixed number >1 of bytes, and one
system (JaxForth by Jack Woehr for Windows NT) implemented 16-bit
chars.

However, JaxForth was not very popular, and most code assumed that
1 chars = 1 (i.e., a char takes one address unit: 8 bits on a
byte-addressed machine).  Given that there was no widely available
system that deviated from that, even code that wanted to avoid this
assumption could not be tested.  And given that most code makes this
assumption and would not work on systems with 1 chars > 1, all the
other systems stuck with 1 chars = 1.  A chicken-and-egg problem?
Not really:

When we looked at the problem in 2004, we found that most code works
fine with UTF-8; that's because most code does not care about
characters.  Even code that uses words like C@ (load a char from
memory) typically does it in a way that works with UTF-8.  We proposed
a number of words for dealing with variable-width xchars (what C calls
multi-byte characters), and you can theoretically use them with the
pre-Unicode East-Asian encodings as well as with UTF-8.  These words
were standardized in Forth-2012, but they are actually little-used
(including by me), because most code actually works fine with opaque
strings.
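
For readers unfamiliar with the idea, here is a Python analogue of
stepping through variable-width xchars in a UTF-8 buffer.  This is not
the Forth word set itself; the names `xchar_size` and `xchars` are
invented for the sketch, and it assumes the buffer holds valid UTF-8.

```python
def xchar_size(buf, i):
    """Size in bytes of the UTF-8 encoded code point starting at buf[i].

    Assumes valid UTF-8; only the lead byte is inspected.
    """
    lead = buf[i]
    if lead < 0x80:
        return 1                      # ASCII
    if lead >= 0xF0:
        return 4                      # lead byte of a 4-byte sequence
    if lead >= 0xE0:
        return 3                      # lead byte of a 3-byte sequence
    if lead >= 0xC0:
        return 2                      # lead byte of a 2-byte sequence
    raise ValueError("offset points into the middle of a code point")

def xchars(buf):
    """Yield the byte sequence of each code point in buf, in order."""
    i = 0
    while i < len(buf):
        n = xchar_size(buf, i)
        yield buf[i:i + n]
        i += n

data = "K\u00f6\u20ac\U0001F642".encode("utf-8")
print([len(x) for x in xchars(data)])
```

Code that treats strings as opaque never needs such stepping, which
fits the observation that these words see little use.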

In Gforth, an xchar is a code point, not a character, so these words
are currently less useful for writing palindrome checkers than one
might hope.  Maybe at some point we will look at the problem again,
and provide words for dealing with characters, Unicode normalization,
collating order and such things, but for now the pain is not big
enough to tackle that problem.

Finally, I proposed to standardize the common practice 1 chars = 1;
this proposal was accepted for standardization in 2016.

> Emacs Lisp has this misfeature as well (and so does Common Lisp).  🙁
> It's really hard to get rid of it, even though it's used *very* rarely.
> In ELisp, strings are represented internally as utf-8 (tho it pretends
> to be an array of code points), so an assignment that replaces a single
> char can require reallocating the array!

One way forward might be to also provide a string-oriented API with
byte (code unit) indices, and recommend that people use that instead
of the inefficient code-point-indexed API.  For a high-level language
like Elisp or Python, the internal representation can depend on which
function was last used on the string.  So if code uses only the
string-oriented API, you may be able to avoid the costs of the
code-point API completely.

>> But why would one want to set individual code points?
>
> Because you know your string only contains "characters" made of a single
> code point?

This incorrect "knowledge" may be the reason why Emacs 27.1 displays

K̖̈nig

as if the first three-code-point character actually was three characters.

> E.g. your string contains the representation of the border of a table
> (to be displayed in a tty), and you want to "move" the `+` of a column
> separator (or a prettier version that takes advantage of the wider
> choice offered by Unicode).

These kinds of things involve additional complications.  Not only do
you have to know the difference between code points and characters,
you also have to know the visual width of a character, which is 0-2
cells for fixed-width fonts as used in xterm or the like.  Actually,
if you treat a combining mark as having width 0, you may be able to
work with code points and do not need characters.
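
A rough Python sketch of the width-0 idea: sum a per-code-point cell
width, giving combining marks width 0.  The functions `cell_width` and
`display_width` are hypothetical helpers; the sketch ignores control
characters and other zero-width code points, and a real terminal may
disagree with it in detail.

```python
import unicodedata

def cell_width(cp):
    """Approximate terminal cell width (0, 1 or 2) of one code point."""
    if unicodedata.combining(cp) != 0:
        return 0                       # combining mark: no cell of its own
    if unicodedata.east_asian_width(cp) in ("W", "F"):
        return 2                       # wide or fullwidth: two cells
    return 1

def display_width(s):
    """Sum of cell widths, working on code points only."""
    return sum(cell_width(cp) for cp in s)

print(display_width("K\u0316\u0308nig"))   # "K" carries two combining marks
print(display_width("+----+"))
print(display_width("\u6f22\u5b57"))       # two wide CJK ideographs
```

With this convention the column position of a table border can be
computed without grouping code points into characters first.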

Why do you want to move the column separator and what do you want to
overwrite with it?  This is likely the result of another operation,
and maybe that involves another string replacement; and displaying the
result involves so much overhead that using a string replacement
instead of a fixed-width store is probably not the dominant cost.  And
if the replacement string happens to have as many bytes as the
replaced string (which would happen for, e.g., replacing " " with
"+"), the operation is not so expensive anyway.

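The last point can be sketched in Python with a mutable byte buffer
holding UTF-8.  The helper `replace_at` is made up for the illustration;
it replaces a byte range and reports whether the replacement had the
same size (so that no bytes had to move):

```python
def replace_at(buf, offset, old, new):
    """Replace old by new at byte offset in a UTF-8 bytearray.

    Returns True if old and new have the same byte length, i.e. the
    replacement did not change the size of the buffer.
    """
    end = offset + len(old)
    if buf[offset:end] != old:
        raise ValueError("old string not found at offset")
    buf[offset:end] = new              # slice assignment resizes if needed
    return len(old) == len(new)

border = bytearray(b"+--- ---+")

# " " -> "+": same byte length, the buffer keeps its size
same_size = replace_at(border, 4, b" ", b"+")
print(same_size, border.decode("utf-8"))

# "+" -> U+253C (box drawing cross): 1 byte becomes 3, the buffer grows
same_size_2 = replace_at(border, 4, b"+", "\u253c".encode("utf-8"))
print(same_size_2, len(border), border.decode("utf-8"))
```

Both cases use the same string-replacement operation; the cheap case
falls out when the lengths happen to match.
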
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
