Re: multi bytes character - how to make it defined behavior?

Subject : Re: multi bytes character - how to make it defined behavior?
From : bc (at) *nospam* freeuk.com (Bart)
Newsgroups : comp.lang.c
Date : 14. Aug 2024, 01:52:13
Organization : A noiseless patient Spider
Message-ID : <v9grjd$4cjd$1@dont-email.me>
References : 1
User-Agent : Mozilla Thunderbird
On 13/08/2024 15:45, Thiago Adams wrote:
> static_assert('×' == 50071);
> GCC - warning multi byte
> CLANG - error character too large
> I think instead of "multi bytes" we need "multi characters" - not bytes.
> We decode UTF-8, then we have the character to decide if it is multi char or not.
> Decoding '×' would consume bytes 195 and 151; the result is the decoded Unicode value of 215.
> It is not multi byte: 256*195 + 151 = 50071
> On the other hand 'ab' is "multi character", resulting in
> 256 * 'a' + 'b' = 256*97 + 98 = 24930
> One consequence is that
> 'ab' == '𤤰'
> But I don't think this is a problem. At least everything is defined.
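The arithmetic in the quoted post can be checked with a short C sketch. The helper names `utf8_decode` and `pack_base256` are hypothetical; they come from neither the thread nor any library. The first decodes one UTF-8 sequence to a code point, the second folds raw bytes in base 256, which is how the 50071 and 24930 figures arise. Note that 24930 is 0x6162 in hexadecimal, so under this scheme the code point that would collide with 'ab' is U+6162.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
   Stores the code point in *out and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;                   /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}

/* Hypothetical helper: fold n raw bytes in base 256, first byte most
   significant. Only meaningful for n <= 4; longer inputs wrap around. */
uint32_t pack_base256(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = v * 256 + s[i];
    return v;
}
```

With the bytes 195 and 151, `utf8_decode` yields 215 while `pack_base256` yields 50071; for the bytes of "ab", `pack_base256` yields 24930.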
What exactly do you mean by multi-byte characters? Is it a literal such as 'ABCD'?
I've no idea what C makes of that, so you will first have to specify what it might represent:
* Is it a single character represented by multiple bytes?
* If so, do those multiple bytes specify a Unicode number (2-3 bytes), or a UTF8 sequence (up to 4 bytes, maybe more)?
* If such multi-byte sequences are allowed, could one literal hold more than one character, mixing ASCII, Unicode and UTF8?
One problem with UTF8 in C character literals is that I believe they are limited to type 'int', which is typically 32 bits. You can't fit much in there. And once you have such a value, how do you print it?
Some of this you can take care of in your 'cake' product by superimposing a particular spec on top of C (maybe character literals could be extended to 64 bits), but you probably can't do much about 'printf'.
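On the printing question, one possible approach is to re-encode the code point as UTF-8 and write the bytes out directly, bypassing 'printf' formatting. This is only a sketch under assumptions: `utf8_encode` and `print_code_point` are hypothetical names, and the output is readable only if the terminal or file is interpreted as UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out (4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: write cp to fp as UTF-8 bytes.
   Returns 0 on success, -1 on an invalid code point or a write error. */
int print_code_point(uint32_t cp, FILE *fp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    if (n == 0)
        return -1;
    return fwrite(buf, 1, n, fp) == n ? 0 : -1;
}
```

Calling `print_code_point(215, stdout)` writes the two bytes C3 97, which a UTF-8 terminal shows as ×.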
(In my language, I overhauled this part of it earlier this year. There it works like this:
* Character literals can be 64 bits
* They can represent up to 8 ASCII characters: 'ABCDEFGH'
* They can include escape codes for both Unicode and UTF8, and multiple
   such characters can be specified:
    'A\u20ACB'            # All represent A€B; this is Unicode
    'A\h E2 82 AC\B'      # This is UTF8
    'A\xE2\x82\xACB'      # C-style escape
   Internally they are stored as UTF8, so the 20AC is converted to UTF8
* The ordering of the characters matches that of the equivalent
   "A\e20ACB" string when stored in memory; but this applies only to
   little-endian
* Print routines have options to print the first character (which can be
   a Unicode one), or the whole sequence)
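The layout described above, where a 64-bit literal matches the in-memory string on a little-endian machine, can be imitated in C. This is a sketch of the idea, not the implementation in that language; `pack_chars_le` and `is_little_endian` are hypothetical names. The first character goes into the least significant byte, and the example uses E2 82 AC, the UTF-8 encoding of U+20AC (€).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack up to 8 bytes of s into a 64-bit value with
   the first byte in the least significant position. On a little-endian
   machine the value's memory image then starts with the same bytes as s.
   Bytes beyond the eighth are ignored. */
uint64_t pack_chars_le(const char *s, size_t n)
{
    uint64_t v = 0;
    if (n > 8)
        n = 8;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}

/* Hypothetical helper: returns 1 if the lowest-addressed byte of a
   multi-byte integer holds its least significant bits. */
int is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

For the five bytes of "A\xE2\x82\xAC" "B" the packed value is 0x42AC82E241. The string is split in two because C would otherwise read `\xACB` as a single hex escape.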
Another aspect is when typing Unicode text directly via your text editor instead of using escape codes; will the C source be UTF8, or some other encoding? This will affect how the text is represented, and how much you can fit into one 32/64-bit literal.

