Stefan Monnier <monnier@iro.umontreal.ca> writes:
[> Anton Ertl:]
[>> Thomas Koenig:]
>>> Assume you're implementing a language which has a function of setting
>>> an individual character in a string.
>> That's a design mistake in the language, and I know no language that
>> has this misfeature.
>
> I suspect "individual character" meant "code point" above.
I meant character, not code point, as should have become clear from
the following. I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".
> Does Unicode even have the notion of "character", really?
AFAIK it does not. But applications like palindrome checkers care
about characters, not code points.
OTOH, most code can be implemented fine as working on strings, without
knowing how many characters there are in the string (and it then does
not need to know about code points, either). In other words, it can
be implemented just as well when the strings are represented as
strings of code units (whether UTF-8 (bytes), UTF-16 (16-bit code
units) or UTF-32 (32-bit code units)), and then it does not help to
convert UTF-8 to something else on input and something else to UTF-8
on output.
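E.g., a minimal C sketch of such code (splitting a UTF-8 string at an
ASCII delimiter): it works on the bytes and never needs to know how
many code points or characters the string contains, because no byte of
a UTF-8 multi-byte sequence is in the ASCII range:

#include <stdio.h>
#include <string.h>

int main(void)
{
  const char *line = "K\xcc\x88onig:Wien";   /* "K̈onig:Wien" in UTF-8 */
  const char *colon = strchr(line, ':');     /* byte-wise search */

  if (colon != NULL)
    printf("key has %zu bytes, value is \"%s\"\n",
           (size_t)(colon - line), colon + 1);
  return 0;
}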
For the code that cares about characters, if it wants to work
correctly for characters that cannot be precomposed into a single code
point, it has to deal with characters that consist of multiple code
points, i.e., that even in UTF-32 are variable-width. So given that
you have to bite the variable-width bullet anyway, you can just as
well use UTF-8.
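E.g., "K̈" has no precomposed form, so it consists of two code points
(U+004B U+0308) and is two code units long even in UTF-32; a palindrome
checker that reverses code points would detach the combining mark from
its base letter.  A minimal C sketch:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
  char32_t kdia[] = U"K\u0308";   /* base letter K + combining diaeresis */
  size_t n = 0;

  while (kdia[n] != 0)            /* count UTF-32 code units */
    n++;
  printf("one character, %zu UTF-32 code units\n", n);
  return 0;
}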
>> Instead, what we see is one language (Python3) that has an even worse
>> misfeature: You can set an individual code point in a string; see
>> above for the things you get when you overwrite code points.
>
> I think it's fairly common for languages that started with strings
> as "arrays of 8bit chars".
Apart from Python3, it is not present in those languages that I have
looked at more closely wrt this feature.
In particular, C was created by adding a byte type to B, and that type
was called "char". It was allowed to be wider to cater for
word-addressed machines, but on byte-addressed machines "char" is
invariably a byte. To cater to Unicode, they used a two-pronged
approach: they added wchar_t and multi-byte functions (IIRC both
already in C89); wchar_t was obviously introduced to cater for the
upcoming Unicode 1.0 (which satisfied code unit=code point=character),
while the multibyte stuff was probably introduced originally for
dealing with the ASCII-compatible East-Asian encodings.
When UTF-8 arrived, the multi-byte functions proved to fit that well;
but of course there is not much usage of those functions, because most
code works fine without knowing about individual code points or
characters. And UTF-8 turned out to be the answer to dealing with
Unicode that the Unix programmers who had a lot of code working with
strings of chars (i.e., bytes) were looking for.
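A minimal C sketch of such use, assuming a UTF-8 locale: mbrtowc()
steps through a UTF-8 string one code point at a time:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
  const char *s = "K\xcc\x88onig";   /* "K̈onig" in UTF-8 */
  size_t len, n = strlen(s);
  mbstate_t state;
  wchar_t wc;

  setlocale(LC_CTYPE, "");           /* assumes a UTF-8 locale */
  memset(&state, 0, sizeof state);
  while (n > 0 && (len = mbrtowc(&wc, s, n, &state)) != 0) {
    if (len == (size_t)-1 || len == (size_t)-2) {
      fprintf(stderr, "invalid or incomplete multi-byte sequence\n");
      return EXIT_FAILURE;
    }
    printf("U+%04lX (%zu byte(s))\n", (unsigned long)wc, len);
    s += len;
    n -= len;
  }
  return 0;
}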
Then Unicode 2.0 arrived and the Win32 API (which had embraced wchar_t
and defined it as being 16-bit) stuck with 16-bit wchar_t, which
breaks "code unit=code point"; this may not be in line with the
intentions of the inventors of wchar_t (e.g., there are no
multi-wchar_t functions in the C standard last time I looked), but
that has been the existing practice in wchar_t use in C for more than
a quarter-century.
Unix, where wchar_t was (and still is) little used, switched to 32-bit
wchar_t, but
1) given that Unicode at some point (probably already in 2.0) broke
"code point=character", that does not really help software like
palindrome checkers.
2) wchar_t is little-used in Unix-specific code.
3) Code that wants to be portable between Unix and Windows and uses
wchar_t cannot rely on "code unit=code point" anyway.
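A minimal C sketch of point 3: a code point outside the BMP needs two
16-bit wchar_t code units (a surrogate pair) on Windows, but only one
32-bit wchar_t on typical Unix systems:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
  wchar_t s[] = L"\U0001F600";          /* one code point beyond the BMP */

  printf("sizeof(wchar_t) = %zu, code units = %zu\n",
         sizeof(wchar_t), wcslen(s));   /* 4 and 1 on Unix, 2 and 2 on Windows */
  return 0;
}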
So, in practice, C code does not make use of the ability to set an
individual code point by overwriting a fixed-size code unit.
Forth has chars that are 8 bits wide in traditional Forth systems on
byte-addressed machines. The 1994 standard (written in the middle of
the reign of Unicode 1.0, and with lots of Californians on the
standardization committee) provided the option to implement Forth
systems with chars that take a fixed number >1 of bytes, and one
system (JaxForth by Jack Woehr for Windows NT) implemented 16-bit
chars.
However, JaxForth was not very popular, and most code assumed that
1 chars = 1 (i.e., a char takes 8 bits on a byte-addressed machine),
and given that
there was no widely available system that deviated from that, even
code that wanted to avoid this assumption could not be tested. And
given that most code has this assumption and would not work on systems
with 1 chars > 1, all the other systems stuck with 1 char = 1. A
Chicken-and-Egg problem? Not really:
When we looked at the problem in 2004, we found that most code works
fine with UTF-8; that's because most code does not care about
characters. Even code that uses words like C@ (load a char from
memory) typically does it in a way that works with UTF-8. We proposed
a number of words for dealing with variable-width xchars (what C calls
multi-byte characters), and you can theoretically use them with the
pre-Unicode East-Asian encodings as well as with UTF-8. These words
were standardized in Forth-2012, but they are actually little-used
(including by me), because most code actually works fine with opaque
strings.
In Gforth, an xchar is a code point, not a character, so these words
are currently less useful for writing palindrome checkers than one
might hope. Maybe at some point we will look at the problem again,
and provide words for dealing with characters, Unicode normalization,
collating order and such things, but for now the pain is not big
enough to tackle that problem.
Finally, I proposed to standardize the common practice 1 chars = 1;
this proposal was accepted for standardization in 2016.
> Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
> It's really hard to get rid of it, even though it's used *very* rarely.
> In ELisp, strings are represented internally as utf-8 (tho it pretends
> to be an array of code points), so an assignment that replaces a single
> char can require reallocating the array!
One way forward might be to also provide a string-oriented API with
byte (code unit) indices, and recommend that people use that instead
of the inefficient code-point-indexed API. For a high-level language
like Elisp or Python, the internal representation can depend on which
function was last used on the string. So if code uses only the
string-oriented API, you may be able to avoid the costs of the
code-point API completely.
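A hypothetical sketch in C of what such a byte-indexed operation could
look like (the name string_replace_bytes and its interface are made up
for illustration): replacing a byte range is cheap when the replacement
has the same number of bytes and needs a copy otherwise, and it never
has to know where the code point or character boundaries are:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interface: replace the bytes [start, start+oldlen) of
   the malloc'd, NUL-terminated string *s with the newlen bytes at
   repl; returns 0 on success, -1 on allocation failure. */
int string_replace_bytes(char **s, size_t start, size_t oldlen,
                         const char *repl, size_t newlen)
{
  size_t total = strlen(*s);

  if (newlen == oldlen) {            /* same byte length: in-place store */
    memcpy(*s + start, repl, newlen);
    return 0;
  }
  char *r = malloc(total - oldlen + newlen + 1);
  if (r == NULL)
    return -1;
  memcpy(r, *s, start);                            /* prefix */
  memcpy(r + start, repl, newlen);                 /* replacement */
  memcpy(r + start + newlen, *s + start + oldlen,  /* suffix incl. NUL */
         total - start - oldlen + 1);
  free(*s);
  *s = r;
  return 0;
}

int main(void)
{
  char *s = malloc(16);
  strcpy(s, "| col1 | col2 |");
  string_replace_bytes(&s, 7, 1, "+", 1);   /* same length: no reallocation */
  puts(s);                                  /* "| col1 + col2 |" */
  free(s);
  return 0;
}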
>> But why would one want to set individual code points?
>
> Because you know your string only contains "characters" made of a single
> code point?
This incorrect "knowledge" may be the reason why Emacs 27.1 displays
K̖̈nig
as if the first three-code-point character were actually three characters.
> E.g. your string contains the representation of the border of a table
> (to be displayed in a tty), and you want to "move" the `+` of a column
> separator (or a prettier version that takes advantage of the wider
> choice offered by Unicode).
These kinds of things involve additional complications. Not only do
you have to know the difference between code points and characters,
you also have to know the visual width of a character, which is 0-2
columns for the fixed-width fonts used in xterm or the like. Actually,
if you treat a combining mark as having width 0, you may be able to
work with code points and not need characters at all.
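A minimal C sketch of that approach, assuming a UTF-8 locale and the
POSIX wcwidth() function (combining marks report width 0, East-Asian
wide characters width 2):

#define _XOPEN_SOURCE 700   /* for wcwidth() (POSIX) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
  const char *s = "K\xcc\x88onig";   /* "K̈onig" in UTF-8 */
  size_t len, n = strlen(s);
  mbstate_t state;
  wchar_t wc;
  int columns = 0;

  setlocale(LC_CTYPE, "");           /* assumes a UTF-8 locale */
  memset(&state, 0, sizeof state);
  while (n > 0 && (len = mbrtowc(&wc, s, n, &state)) != 0) {
    if (len == (size_t)-1 || len == (size_t)-2)
      return EXIT_FAILURE;
    int w = wcwidth(wc);             /* 0 for the combining diaeresis */
    if (w > 0)
      columns += w;
    s += len;
    n -= len;
  }
  printf("display width: %d columns\n", columns);   /* 5, not 6 */
  return 0;
}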
Why do you want to move the column separator and what do you want to
overwrite with it? This is likely the result of another operation,
and maybe that involves another string replacement; and displaying the
result involves so much overhead that using a string replacement
instead of a fixed-width store is probably not the dominant cost. And
if the replacement string happens to have as many bytes as the
replaced string (which would happen for, e.g., replacing " " with
"+"), the operation is not so expensive anyway.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>