Newsportal USENET - Re: multi bytes character - how to make it defined behavior?

On 14/08/2024 19:28, Thiago Adams wrote:

On 14/08/2024 15:12, Bart wrote:
On 14/08/2024 18:40, Thiago Adams wrote:
On 14/08/2024 14:07, Bart wrote:

That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to clash with some other Unicode character.

My suggestion again. I am using a string, but imagine this working with bytes from a file.

#include <stdio.h>
#include <assert.h>

...
int get_value(const char* s0)
{
    const char * s = s0;
    int value = 0;
    int uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
      if (uc < 0x007F)
      {
         //multichar formula
         value = value*256+uc;
      }
      else
      {
         //single char
         value = uc;
         break; //if more input follows, that should be an error
      }
      s = utf8_decode(s, &uc);
    }
    return value;
}

int main(){
   printf("%d\n", get_value(u8"×"));
   printf("%d\n", get_value(u8"ab"));
}
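
The quoted code elides utf8_decode. For anyone who wants to try the example, here is a minimal sketch of what such a helper might look like. The signature and the "return NULL at the end of the string" convention are assumptions inferred from how get_value() calls it; this is not the original poster's implementation, and it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper, inferred from the call sites in get_value().
   Decodes one UTF-8 sequence starting at s, stores the code point in *pc,
   and returns a pointer to the next sequence. Returns NULL at the
   terminating '\0' or on malformed input. */
const char *utf8_decode(const char *s, int *pc)
{
    const unsigned char *p = (const unsigned char *)s;
    int extra;  /* number of continuation bytes expected */
    int c;

    if (p[0] == 0)
        return NULL;                      /* end of string */
    if (p[0] < 0x80)                { c = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; extra = 3; }
    else
        return NULL;                      /* stray continuation or bad lead byte */

    for (int i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return NULL;                  /* truncated or malformed sequence */
        c = (c << 6) | (p[i] & 0x3F);
    }
    *pc = c;
    return s + extra + 1;
}
```

With this helper, the loop in get_value() sees 0x61 and then 0x62 for "ab", and stops when the helper returns NULL at the end of the string.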

I see your problem. You're mixing things up.
The objective is:
  - make single characters have the Unicode value without having to use U''
  - allow more than one character, like 'ab', in some cases where each character is less than 0x007F. This can break code that uses, for instance, '¼¼',
but I suspect people are not using it this way (I hope).

Obviously that can't work, for example because two printable ASCII characters with codes 32 to 126 will have values from 8224 (0x2020) to 32382 (0x7E7E) when combined in a character literal using value*256+uc. Those are going to clash with Unicode characters with those values.
It won't work either at compile-time or runtime.
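
To make the clash concrete, here is a small sketch. pack2 and is_cjk_unified are hypothetical names; pack2 just restates the value*256+uc formula from the quoted get_value() for two characters. Under that formula 'ab' becomes 0x6162, which is also a code point inside the CJK Unified Ideographs block (U+4E00 to U+9FFF), and two spaces become 0x2020, which is U+2020 DAGGER.

```c
/* Hypothetical restatement of the "multichar formula" from the quoted
   get_value(): combine two ASCII characters as hi*256 + lo. */
int pack2(char hi, char lo)
{
    return (unsigned char)hi * 256 + (unsigned char)lo;
}

/* Nonzero if cp lies in the CJK Unified Ideographs block U+4E00..U+9FFF. */
int is_cjk_unified(int cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}
```

So a value of 0x6162 coming out of such a literal cannot tell you whether the source said 'ab' or a single ideograph.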
You need to choose between Unicode representation and UTF8. Either that or use some prefix to disambiguate in source code, but you still need to decide whether '€' in source code is represented as the Unicode bytes 20 AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and further decide which end of those sequences will be the least significant byte.
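
As an illustration of how many different integers one character can turn into, here is a sketch that encodes a code point as UTF-8 and then packs the bytes in either order. The function names are made up for this example, and the encoder assumes a code point below U+10000 that is not a surrogate.

```c
#include <stddef.h>

/* Hypothetical helper: encode a code point below U+10000 as UTF-8.
   Returns the number of bytes written (1 to 3). Surrogates are not checked. */
size_t utf8_encode_bmp(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Pack n bytes into an integer with the first byte most significant. */
unsigned long pack_first_high(const unsigned char *b, size_t n)
{
    unsigned long v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | b[i];
    return v;
}

/* Pack n bytes into an integer with the first byte least significant. */
unsigned long pack_first_low(const unsigned char *b, size_t n)
{
    unsigned long v = 0;
    for (size_t i = n; i > 0; i--)
        v = (v << 8) | b[i - 1];
    return v;
}
```

For U+20AC the encoder yields E2 82 AC, so '€' could plausibly end up as 0x20AC, 0xE282AC or 0xAC82E2 depending on which choices are made.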

In any case, my suggestion looks dangerous. But meanwhile, this is not well specified in the standard.

It wasn't well-specified even when dealing with 100% ASCII. For example, 'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another, maybe just 0x41 or 0x42 on a third, or even 0x41410000.
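
The first four of those outcomes can be modelled as different policies. This is only a sketch with made-up names (mc_policy, multichar_value); it does not describe what any particular compiler does, and it assumes 1 to 4 ASCII characters so that the result fits in a 32-bit int.

```c
#include <string.h>

/* Hypothetical policies a compiler might apply to a constant like 'AB'. */
enum mc_policy { MC_FIRST_HIGH, MC_FIRST_LOW, MC_FIRST_ONLY, MC_LAST_ONLY };

/* Value of a multi-character constant under the given policy.
   Assumes 1 to 4 ASCII characters; returns 0 for an empty string. */
int multichar_value(const char *chars, enum mc_policy policy)
{
    size_t n = strlen(chars);
    int v = 0;

    if (n == 0)
        return 0;
    switch (policy) {
    case MC_FIRST_HIGH:     /* first character ends up most significant */
        for (size_t i = 0; i < n; i++)
            v = v * 256 + (unsigned char)chars[i];
        break;
    case MC_FIRST_LOW:      /* first character ends up least significant */
        for (size_t i = n; i > 0; i--)
            v = v * 256 + (unsigned char)chars[i - 1];
        break;
    case MC_FIRST_ONLY:     /* everything after the first character is dropped */
        v = (unsigned char)chars[0];
        break;
    case MC_LAST_ONLY:      /* everything before the last character is dropped */
        v = (unsigned char)chars[n - 1];
        break;
    }
    return v;
}
```

For "AB" the four policies give 0x4142, 0x4241, 0x41 and 0x42 respectively, so portable code cannot rely on any one of them.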
