Subject : Re: python text, Byte Addressability And Beyond
From : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups : comp.arch
Date : 12. May 2024, 06:40:45
Organization : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2024May12.074045@mips.complang.tuwien.ac.at>
References : 1 2 3 4 5
User-Agent : xrn 10.11
John Levine <johnl@taugh.com> writes:
>Python3 has a complex internal string format that stores each string
>as 1, 2, or 4 byte values, depending on what the contents of the
>string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
>contain code points beyond UCS-2 are four bytes. It's not clear how
>hard they try to shrink stuff down when taking substrings.
>
>https://peps.python.org/pep-0393/
This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to. In the general case they can have three
representations of the same string: wstr, utf8, and data; only one of
them needs to be non-NULL, and data is canonical if it is non-NULL
(not sure what is canonical if wstr and utf8 are present but data is
not). If data is in latin1 format, but not ASCII, outputting both
UTF-8 and UTF-16 needs conversion (it's just 8bit->16bit expansion in
the UTF-16 case, but that means that a fast block copy is
insufficient). On top of that, they specify both zero termination and
length indicators: length, utf8_length and wstr_length.
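The width selection described in the PEP can be observed from Python itself. The following is a rough sketch, assuming CPython 3.3 or later; the helper name bytes_per_code_point is made up for this illustration, and sys.getsizeof reports implementation details that other Python implementations need not share. It also shows the latin1-but-not-ASCII case, where the stored form takes one byte per code point but both UTF-8 and UTF-16 output take two.

```python
import sys

def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate the per-code-point storage of a string made only of `ch`.

    Comparing a string of 2*n code points with one of n code points
    cancels the fixed object header, so only the growth of the character
    data remains. CPython-specific; hypothetical helper for illustration.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII U+0061": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "beyond BMP U+1F600": "\U0001F600",
}
for label, ch in samples.items():
    print(label, bytes_per_code_point(ch))

# Latin-1 data that is not ASCII: compact in memory, but neither UTF-8
# nor UTF-16 output is a plain block copy of the stored bytes.
latin1 = "\u00e9" * 4
utf8_len = len(latin1.encode("utf-8"))       # 2 bytes per code point
utf16_len = len(latin1.encode("utf-16-le"))  # 2 bytes per code point
print(utf8_len, utf16_len)
```

The subtraction is used instead of a single getsizeof call because the header size differs between string kinds and between CPython versions.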
Of course Python3 has baked this mistake into their API, and once
software has been written for that API, the complexity becomes
necessary.
But if they had decided to just store the data as UTF-8 and use byte
indexes and lengths in their API, and adjusted the rest of their API
accordingly, they could have avoided this complexity and inefficiency,
and only palindrome and anagram programs that limit themselves to
character=codepoint would have become harder to write.
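What a byte-indexed UTF-8 API might feel like can be approximated with Python's existing bytes type. This is only a sketch of the idea, not a proposal for a concrete API: the variable names are invented here, and bytes is standing in for a hypothetical UTF-8 string type.

```python
# The only stored representation in this sketch is UTF-8.
text = "na\u00efve caf\u00e9"            # "naive cafe" with i-diaeresis and e-acute
data = text.encode("utf-8")

needle = "caf\u00e9".encode("utf-8")
start = data.find(needle)                # byte offset, not a code point index
end = start + len(needle)                # lengths are byte lengths too

# Offsets returned by a search always fall on code point boundaries,
# so slicing with them yields valid UTF-8 without any conversion.
match = data[start:end].decode("utf-8")

byte_length = len(data)                  # 12: two code points take 2 bytes each
code_point_length = len(text)            # 10
code_point_index = text.find("caf\u00e9")  # 6, versus byte offset 7
print(start, end, match, byte_length, code_point_length, code_point_index)
```

Searching, slicing and concatenation work unchanged on byte offsets; what is lost is constant-time access to the n-th code point, which is the operation the palindrome and anagram programs mentioned above rely on.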
>Python lets you subscript strings either individual items or
>substrings, and I have written a fair amount of code that does that. I
>realize that if I were doing semantic processing on Greek or Arabic, I
>would not be subscripting and expecting it to return straightforwardly
>useful results.
I don't doubt that the API works; it just leads to unnecessary
complexity in the implementation.
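The caveat in the quoted text about subscripting is easy to reproduce even with a Latin letter. A small sketch using only the standard unicodedata module; the variable names are chosen for this example:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character to a
# reader, two code points to Python.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

first = decomposed[0]   # the bare 'e': the subscript has split the character

print(len(decomposed), len(composed), first == "e", decomposed == composed)
```

Code point subscripting is therefore not character subscripting in general, which is why character=codepoint is stated as a restriction above.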
>The string structure has a field for the length of the string in
>UTF-8, but they don't seem to use it for anything, at least not yet,
My understanding from the PEP is that they use it for specifying the
length of the utf8 representation; of course, they also use zero
termination, so if the utf8 field is only passed to functions that use
zero-termination, the utf8_length field is not used. Since the
contents of the utf8 and wstr fields are no longer used once data has
been initialized (they are not canonical), I expect that the only
function ever called on the utf8 field is the one that converts from
utf8 to the data form.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>