Subject: Re: multi bytes character - how to make it defined behavior?
From: richard (at) *nospam* damon-family.org (Richard Damon)
Newsgroups: comp.lang.c
Date: 14 Aug 2024, 04:44:24
Organization: i2pn2 (i2pn.org)
Message-ID: <1ffb2244967a28423c968f4b4a9fec5a2553f356@i2pn2.org>
References: 1
User-Agent: Mozilla Thunderbird
On 8/13/24 10:45 AM, Thiago Adams wrote:
> static_assert('×' == 50071);
> GCC - warning multi byte
> CLANG - error character too large
> I think instead of "multi bytes" we need "multi characters" - not bytes.
> We decode utf8 then we have the character to decide if it is multi char or not.
> decoding '×' would consume bytes 195 and 151 the result is the decoded Unicode value of 215.
> It is not multi byte : 256*195 + 151 = 50071
> On the other hand 'ab' is "multi character" resulting
> 256 * 'a' + 'b' = 256*97+98= 24930
> One consequence is that
> 'ab' == '𤤰'
> But I don't think this is a problem. At least everything is defined.
When you use plain single quotes ('), you are specifying a character in the narrow character set, typically ASCII, though it might be some other 8-bit character encoding. It cannot specify extended characters beyond those.
You can (if the implementation allows it) place multiple characters in the constant to get an integer value with those characters packed into it; the exact value is implementation-defined.
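To make the arithmetic from the quoted post concrete, here is a small sketch that does the packing and the UTF-8 decoding by hand instead of relying on what a particular compiler does with '×' or 'ab'. The helper names pack2 and utf8_decode2 are made up for illustration, and the "first character in the high byte" order is an assumption that matches what GCC does; other implementations may pack differently.

```c
#include <assert.h>

/* The arithmetic from the quoted post, checked at compile time. */
static_assert(256 * 195 + 151 == 50071, "bytes of UTF-8 '×' packed as two chars");
static_assert(256 * 97 + 98 == 24930, "'a' and 'b' packed, assuming ASCII");

/* Pack two bytes with the first one in the high position.
   Assumption: this mirrors GCC's multi-character constants; the
   standard leaves the value implementation-defined. */
unsigned pack2(unsigned char hi, unsigned char lo)
{
    return 256u * hi + lo;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
   Hypothetical helper: no validation of the input bytes is done. */
unsigned utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}
```

With these, pack2(195, 151) gives the 50071 that GCC reports for '×', while utf8_decode2(195, 151) gives 215, the actual code point.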
When you use plain double quotes ("), you are specifying a string of these narrow characters, although this form might allow multi-byte encodings of some characters, as is done with UTF-8.
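As a rough illustration of a narrow string carrying a multi-byte character, the sketch below spells out the two UTF-8 bytes of U+00D7 as hex escapes, so the result does not depend on the source or execution character set your compiler happens to use. The names times_utf8 and times_utf8_len are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+00D7 (the multiplication sign), written as
   explicit bytes rather than as a literal '×' in the source file. */
const char times_utf8[] = "\xC3\x97";

/* Two bytes for the one character, plus the terminating null. */
static_assert(sizeof times_utf8 == 3, "two bytes plus terminator");

/* strlen counts bytes, not characters, so it reports 2 here. */
size_t times_utf8_len(void)
{
    return strlen(times_utf8);
}
```

One character on screen, two elements in the array: that is fine for a string, but it is exactly what a single narrow character constant has no room for.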
You can specify wide character constants with the syntax L'x', u'x', or U'x'.
L'x' will give you whatever the implementation calls its "wide character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but doesn't need to be.
The u'x' form will always be UCS-2/UTF-16, and U'x' will always be UCS-4/UTF-32.
Like the plain 'x' form, the result from a single character cannot be a multi-unit value, so u'x' can't generate a surrogate pair for a single source character.
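A minimal sketch of the prefixed forms, assuming a C11 compiler that provides <uchar.h> and encodes char16_t and char32_t as UTF-16 and UTF-32 (GCC and Clang on Linux do). Universal character names are used instead of the literal characters so the source encoding does not matter; the function names are hypothetical.

```c
#include <assert.h>
#include <uchar.h>

/* U+00D7 fits in a single UTF-16 code unit, so u'...' can hold it. */
static_assert(u'\u00D7' == 0x00D7, "one UTF-16 unit, value is the code point");

char16_t times_utf16(void)
{
    return u'\u00D7';
}

/* U+24930 is outside the BMP and would need a surrogate pair in UTF-16,
   so only the U'...' form can represent it as a single constant. */
char32_t cjk_utf32(void)
{
    return U'\U00024930';
}
```

Trying u'\U00024930' instead is where you would expect the compiler to complain, since the value does not fit in one char16_t.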
Change the ' to a " and you get wide strings, just like the characters, but now u"xx" and L"xx" can generate characters that use surrogate pairs (or other multi-part encodings for L"xx").
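To see the surrogate pair appear, here is a sketch under the same assumptions as before (C11, <uchar.h> available, char16_t holding UTF-16). It stores U+24930 in a u"" string and a U"" string and counts the code units; the names s16, s32 and the accessor functions are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* One source character, U+24930, in each string form. */
static const char16_t s16[] = u"\U00024930";
static const char32_t s32[] = U"\U00024930";

/* Number of code units, not counting the terminating null. */
size_t utf16_units(void) { return sizeof s16 / sizeof s16[0] - 1; }
size_t utf32_units(void) { return sizeof s32 / sizeof s32[0] - 1; }

/* Read one UTF-16 code unit; the caller must keep i within the array. */
char16_t utf16_unit(size_t i) { return s16[i]; }
```

The UTF-16 string comes out as the two units 0xD852 and 0xDD30 (0x24930 minus 0x10000 is 0x14930; the top ten bits are 0x52 and the low ten bits are 0x130), while the UTF-32 string holds the single unit 0x24930.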