Sujet : Re: Unicode in strings
De : bohannonindustriesllc (at) *nospam* gmail.com (BGB-Alt)
Groupes : comp.archDate : 22. May 2024, 23:15:53
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <v2lqqq$1ckto$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14
User-Agent : Mozilla Thunderbird
On 5/22/2024 2:38 PM, Stefan Monnier wrote:
Assume you're implementing a language which has a function of setting
an individual character in a string.
That's a design mistake in the language, and I know no language that
has this misfeature.
I suspect "individual character" meant "code point" above.
I meant character, not code point, as should have become clear from
the following. I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".
I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings. 🙁
Mostly just codepoints.
One can take their pick between UTF-16 and UTF-32, but on-average UTF-16 uses less memory than UTF-32.
Then there is a schism between the worlds of 16 or 32 bit wchar_t, ...
OTOH, most code can be implemented fine as working on strings, without
knowing how many characters there are in the string (and it then does
not need to know about code points, either).
Indeed, most operations on strings are conversion of things to strings,
concatenation of strings, search (typically for a substring or a regexp),
extraction of substring where the boundaries result from an earlier
search, and parsing (which at the bottom relies often on some sort of
regexp or equivalent system).
All of those work just fine on a UTF-8 sequence of bytes.
Sometimes it depends on context which is best.
For general use in C (or in OS APIs), UTF-8 is a win.
Likewise it is a sensible choice within a filesystem, or for file storage, ...
Sometimes, one has languages that view everything as if it were UTF-16 codepoints. But, UTF-16 wastes memory in many cases. In these cases, the winning option here may end up being to use 8859-1 or 1252 (for string literals), or M-UTF-8 for external storage (UTF-8 encoded UTF-16 strings, with NUL escape-coded as C0-80).
To some extent, my TestKern sub-project is using a hacked version of Unicode:
Text is typically stored (and transmitted to/from OS APIs) using M-UTF-8;
Generally, 0080..009F are interpreted as in 1252 (printable characters rather than extended control codes);
0400..04FF are interpreted as "dense hexadecimal" or "inline raw data" rather than Arabic in certain contexts (*1), ...
*1: Though, this shouldn't break much, as the contexts where dense hexadecimal or raw data would exist are likely mutually exclusive from those that would need Arabic (and failing this could probably use some extra encoding hackery).
Well, and the extra hack that is "Double-encoded UTF-8" (used internally by BGBCC for u8 string literals), where parts of the space are repurposed (mostly to reduce codepoints needing 4-6 bytes, ...).
Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
It's really hard to get rid of it, even though it's used *very* rarely.
In ELisp, strings are represented internally as utf-8 (tho it pretends
to be an array opf code points), so an assignment that replaces a single
char can require reallocating the array!
One way forward might be to also provide a string-oriented API with
byte (code unit) indices, and recommend that people use that instead
of the inefficient code-point-indexed API.
I think the long term solution for ELisp will be to declare strings as
basically immutable.
In general, it makes sense to regard strings as immutable. My own language designs had generally assumed immutable strings (if one wants a mutable string, they use an array of a character type).
Because you know your string only contains "characters" made of a single
code point?
>
This incorrect "knowledge" may be the reason why Emacs 27.1 displays
>
K̖̈nig
>
as if the first three-code-point character actually was three characters.
No, the above seems like a problem in the redisplay code, and that code
is quite aware of combining characters and stuff. You're probably
seeing simply a missing rule to allow composition/shaping of your word.
(the composition/shaping library operates on whole strings at a time,
but Emacs tends to be quite conservative about the string-chunks it
sends to that library).
I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.
E.g. your string contains the representation of the border of a table
(to be displayed in a tty), and you want to "move" the `+` of a column
separator (or a prettier version that takes advantage of the wider
choice offered by Unicode).
These kinds of things involve additional complications.
Very much so, indeed. It usually breaks down in many different ways
because of the common-but-not-guaranteed assumptions.
Stefan
Date | Sujet | # | | Auteur |
1 May 24 | Byte Addressability And Beyond | 590 | | Lawrence D'Oliveiro |
1 May 24 | Re: Byte Addressability And Beyond | 431 | | John Levine |
1 May 24 | Re: Byte Addressability And Beyond | 409 | | Lawrence D'Oliveiro |
1 May 24 | Re: Byte Addressability And Beyond | 3 | | John Levine |
1 May 24 | Re: Byte Addressability And Beyond | 1 | | John Levine |
1 May 24 | Re: Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
1 May 24 | Re: Byte Addressability And Beyond | 1 | | Michael S |
1 May 24 | Re: Byte Addressability And Beyond | 404 | | John Levine |
2 May 24 | Re: Byte Addressability And Beyond | 382 | | Lawrence D'Oliveiro |
2 May 24 | Re: Byte Addressability And Beyond | 4 | | John Levine |
2 May 24 | Re: Byte Addressability And Beyond | 3 | | Lawrence D'Oliveiro |
2 May 24 | Re: Byte Addressability And Beyond | 2 | | John Levine |
5 May 24 | Re: Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
2 May 24 | Re: Byte Addressability And Beyond | 367 | | John Savard |
2 May 24 | Re: Byte Addressability And Beyond | 2 | | MitchAlsup1 |
11 May 24 | Re: Byte Addressability And Beyond | 1 | | John Savard |
4 May 24 | Re: Byte Addressability And Beyond | 364 | | Lawrence D'Oliveiro |
8 May 24 | Re: Byte Addressability And Beyond | 363 | | John Savard |
8 May 24 | Re: Byte Addressability And Beyond | 2 | | Lawrence D'Oliveiro |
10 May 24 | Re: Byte Addressability And Beyond | 1 | | David Brown |
8 May 24 | Re: Byte Addressability And Beyond | 360 | | MitchAlsup1 |
8 May 24 | Re: Byte Addressability And Beyond | 359 | | John Levine |
8 May 24 | Re: Byte Addressability And Beyond | 357 | | Lawrence D'Oliveiro |
9 May 24 | Re: Byte Addressability And Beyond | 356 | | John Levine |
10 May 24 | Re: Byte Addressability And Beyond | 354 | | David Brown |
10 May 24 | Re: Byte Addressability And Beyond | 353 | | Anton Ertl |
11 May 24 | Re: Byte Addressability And Beyond | 352 | | David Brown |
11 May 24 | Re: Byte Addressability And Beyond | 351 | | Anton Ertl |
11 May 24 | Re: Byte Addressability And Beyond | 158 | | David Brown |
11 May 24 | Re: Byte Addressability And Beyond | 1 | | Anton Ertl |
27 May 24 | Re: Byte Addressability And Beyond | 156 | | Lawrence D'Oliveiro |
27 May 24 | Re: Byte Addressability And Beyond | 155 | | John Levine |
27 May 24 | Re: Byte Addressability And Beyond | 154 | | Lawrence D'Oliveiro |
27 May 24 | Re: Byte Addressability And Beyond | 153 | | John Levine |
27 May 24 | Re: Byte Addressability And Beyond | 149 | | John Levine |
27 May 24 | Re: Byte Addressability And Beyond | 1 | | MitchAlsup1 |
28 May 24 | Re: Byte Addressability And Beyond | 147 | | Lawrence D'Oliveiro |
28 May 24 | Re: encoding conversion, Byte Addressability And Beyond | 1 | | John Levine |
28 May 24 | Re: Byte Addressability And Beyond | 145 | | Thomas Koenig |
29 May 24 | Re: Byte Addressability And Beyond | 137 | | Lawrence D'Oliveiro |
29 May 24 | Re: Byte Addressability And Beyond | 136 | | Anton Ertl |
29 May 24 | Re: Byte Addressability And Beyond | 12 | | Stefan Monnier |
29 May 24 | Re: Byte Addressability And Beyond | 10 | | Stefan Monnier |
29 May 24 | Re: Byte Addressability And Beyond | 3 | | John Levine |
30 May 24 | Re: Byte Addressability And Beyond | 2 | | George Neuner |
4 Jun 24 | Re: Byte Addressability And Beyond | 1 | | George Neuner |
30 May 24 | Re: Byte Addressability And Beyond | 6 | | Anton Ertl |
4 Jun 24 | Re: Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
4 Jun 24 | Re: Byte Addressability And Beyond | 4 | | Stefan Monnier |
7 Jun 24 | Re: Byte Addressability And Beyond | 1 | | Terje Mathisen |
7 Jun 24 | Re: Character non-equivalence, was Byte Addressability And Beyond | 2 | | John Levine |
9 Jun 24 | Re: Character non-equivalence, was Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
30 May 24 | Re: Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
30 May 24 | Re: Byte Addressability And Beyond | 117 | | Lawrence D'Oliveiro |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 66 | | John Levine |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Stephen Fuld |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 22 | | Anton Ertl |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 21 | | Thomas Koenig |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 8 | | Michael S |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Thomas Koenig |
30 May 24 | Re: IBM architectural goals, Byte Addressability And Beyond | 5 | | John Levine |
30 May 24 | Re: IBM architectural goals, Byte Addressability And Beyond | 2 | | Michael S |
30 May 24 | Re: IBM architectural goals, Byte Addressability And Beyond | 1 | | John Levine |
30 May 24 | Re: IBM architectural goals, Byte Addressability And Beyond | 2 | | Thomas Koenig |
30 May 24 | Re: IBM architectural goals, Byte Addressability And Beyond | 1 | | John Levine |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Anton Ertl |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 3 | | Anton Ertl |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | John Levine |
30 May 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Thomas Koenig |
31 May 24 | Re: architectural goals, Byte Addressability And Beyond | 5 | | Terje Mathisen |
1 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 4 | | Thomas Koenig |
1 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 3 | | Anton Ertl |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 2 | | John Levine |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Stefan Monnier |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 4 | | Lawrence D'Oliveiro |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | MitchAlsup1 |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Lynn Wheeler |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Stefan Monnier |
31 May 24 | Re: architectural goals, Byte Addressability And Beyond | 42 | | John Savard |
31 May 24 | Re: architectural goals, Byte Addressability And Beyond | 41 | | John Levine |
1 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 31 | | John Savard |
1 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 20 | | Thomas Koenig |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 6 | | John Savard |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 5 | | Thomas Koenig |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 3 | | John Levine |
3 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 2 | | OrangeFish |
3 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | John Levine |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 13 | | Lawrence D'Oliveiro |
5 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 12 | | Lawrence D'Oliveiro |
5 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
6 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 10 | | George Neuner |
6 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 6 | | John Levine |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 4 | | Lawrence D'Oliveiro |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 3 | | Stephen Fuld |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 2 | | Lawrence D'Oliveiro |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Stephen Fuld |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Terje Mathisen |
6 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Lynn Wheeler |
6 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | OrangeFish |
7 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 10 | | John Dallman |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | Michael S |
2 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 1 | | John Dallman |
4 Jun 24 | Re: architectural goals, Byte Addressability And Beyond | 7 | | Lawrence D'Oliveiro |
30 May 24 | Re: Byte Addressability And Beyond | 49 | | Stephen Fuld |
30 May 24 | Re: Byte Addressability And Beyond | 1 | | Anton Ertl |
30 May 24 | Re: Byte Addressability And Beyond | 2 | | Lawrence D'Oliveiro |
30 May 24 | Re: Byte Addressability And Beyond | 4 | | Terje Mathisen |
30 May 24 | Re: Byte Addressability And Beyond | 7 | | Terje Mathisen |
28 May 24 | Re: Byte Addressability And Beyond | 3 | | Lawrence D'Oliveiro |
12 May 24 | Re: python text, Byte Addressability And Beyond | 14 | | John Levine |
12 May 24 | Re: Byte Addressability And Beyond | 178 | | Thomas Koenig |
27 May 24 | Re: Byte Addressability And Beyond | 1 | | Lawrence D'Oliveiro |
8 May 24 | Re: Byte Addressability And Beyond | 1 | | Michael S |
2 May 24 | Re: Byte Addressability And Beyond | 10 | | MitchAlsup1 |
2 May 24 | Re: Byte Addressability And Beyond | 3 | | Michael S |
2 May 24 | Re: Byte Addressability And Beyond | 18 | | Anton Ertl |
1 May 24 | Byte Order (was: Byte Addressability And Beyond) | 4 | | Anton Ertl |
1 May 24 | Re: Byte Addressability And Beyond | 17 | | Stefan Monnier |
1 May 24 | Re: Byte Addressability And Beyond | 40 | | MitchAlsup1 |
1 May 24 | Re: Byte Addressability And Beyond | 15 | | Thomas Koenig |
1 May 24 | Re: Byte Addressability And Beyond | 3 | | Michael S |
2 May 24 | Re: Byte Addressability And Beyond | 4 | | Lawrence D'Oliveiro |
3 May 24 | Re: Byte Addressability And Beyond | 75 | | Anton Ertl |
5 May 24 | Re: Byte Addressability And Beyond | 20 | | John Savard |
5 May 24 | Re: Byte Addressability And Beyond | 1 | | John Savard |