Subject : Re: multi-byte character - how to make it defined behavior?
From : bc (at) *nospam* freeuk.com (Bart)
Groups : comp.lang.c
Date : 14. Aug 2024, 00:52:13
Organisation : A noiseless patient Spider
Message-ID : <v9grjd$4cjd$1@dont-email.me>
References : 1
User-Agent : Mozilla Thunderbird
On 13/08/2024 15:45, Thiago Adams wrote:
> static_assert('×' == 50071);
> GCC - warning: multi-byte
> CLANG - error: character too large
>
> I think instead of "multi-byte" we need "multi-character" - not bytes.
> We decode the UTF-8 first; then we have the character and can decide
> whether it is multi-character or not.
> Decoding '×' consumes the bytes 195 and 151; the result is the decoded
> Unicode value, 215. It is not multi-byte: 256*195 + 151 = 50071.
> On the other hand, 'ab' is "multi-character", resulting in
> 256 * 'a' + 'b' = 256*97 + 98 = 24930.
> One consequence is that
> 'ab' == '𤤰'
> But I don't think this is a problem. At least everything is defined.
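(A minimal C sketch of the arithmetic in the quoted scheme; the two-byte UTF-8 decode and the byte-packing shown match the numbers above, but note that multi-character constants are implementation-defined in standard C, so the 256*'a'+'b' packing is one common convention, not a guarantee:

#include <assert.h>
#include <stdio.h>

/* Decode a two-byte UTF-8 sequence (leading byte 110xxxxx) to a code point. */
static unsigned decode_utf8_2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

int main(void)
{
    /* '×' is encoded as the bytes 0xC3 0x97 (195, 151) in UTF-8. */
    assert(decode_utf8_2(195, 151) == 215);   /* U+00D7 MULTIPLICATION SIGN */

    /* Naively packing those two bytes instead gives 50071, not 215. */
    assert(256 * 195 + 151 == 50071);

    /* The "multi-character" packing of 'ab' from the quoted post. */
    assert(256 * 'a' + 'b' == 24930);

    puts("ok");
    return 0;
}

)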
What exactly do you mean by multi-byte characters? Is it a literal such as 'ABCD'?
I've no idea what C makes of that, so you will first have to specify what it might represent:
* Is it a single character represented by multiple bytes?
* If so, do those multiple bytes specify a Unicode number (2-3 bytes), or a UTF8 sequence (up to 4 bytes, maybe more)?
* If multiple such sequences are allowed, could you mix ASCII, Unicode and UTF8 characters in one literal?
One problem with UTF8 in C character literals is that I believe those are limited to an 'int' type, so 32 bits. You can't fit much in there. And once you have such a value, how do you print it?
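(One way to print such a value in standard C is to re-encode the code point as UTF-8 yourself and write the bytes; a sketch, assuming a UTF-8 terminal:

#include <assert.h>
#include <stdio.h>

/* Encode a Unicode code point (< 0x110000) as UTF-8; returns the byte count. */
static int encode_utf8(unsigned cp, unsigned char out[4])
{
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int n = encode_utf8(215, buf);          /* U+00D7 '×' */
    assert(n == 2 && buf[0] == 0xC3 && buf[1] == 0x97);
    fwrite(buf, 1, (size_t)n, stdout);      /* writes the × glyph's bytes */
    putchar('\n');
    return 0;
}

)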
Some of this you can take care of in your 'cake' product, superimposing a particular spec on top of C (maybe such literals can be extended to 64 bits), but you probably can't do much about 'printf'.
(In my language, I overhauled this part of it earlier this year. There it works like this:
* Character literals can be 64 bits
* They can represent up to 8 ASCII characters: 'ABCDEFGH'
* They can include escape codes for both Unicode and UTF8, and multiple
such characters can be specified:
'A\u20ACB' # All represent A€B; this is Unicode
'A\h EC 82 AC\B' # This is UTF8
'A\xEC\x82\xACB' # C-style escape
Internally they are stored as UTF8, so the 20AC is converted to UTF8
* The ordering of the characters matches that of the equivalent
"A\e20ACB" string when stored in memory; but this applies only to
little-endian
* Print routines have options to print the first character (which can be
a Unicode one), or the whole sequence)
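(The endianness point has an analogue in C: the in-memory byte order of a multi-character constant's value depends on the machine. A sketch, using the implementation-defined 256*'a'+'b' packing that GCC and Clang happen to use:

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* One common (implementation-defined) value for 'ab'. */
    unsigned v = 256u * 'a' + 'b';
    assert(v == 24930u);

    unsigned char bytes[sizeof v];
    memcpy(bytes, &v, sizeof v);

    /* On a little-endian machine the low-order byte ('b') is stored first,
       i.e. the reverse of the string "ab" - so matching the string's
       memory layout only works out on one endianness. */
    if (bytes[0] == 'b')
        puts("little-endian: low byte 'b' stored first");
    else
        puts("big-endian: high byte stored first");
    return 0;
}

)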
Another aspect is when typing Unicode text directly via your text editor instead of using escape codes; will the C source be UTF8, or some other encoding? This will affect how the text is represented, and how much you can fit into one 32/64-bit literal.