Re: multi bytes character - how to make it defined behavior?

Subject: Re: multi bytes character - how to make it defined behavior?
From: bc (at) *nospam* freeuk.com (Bart)
Newsgroups: comp.lang.c
Date: 14 Aug 2024, 20:32:31
Organization: A noiseless patient Spider
Message-ID: <v9j0oe$in82$1@dont-email.me>
User-Agent: Mozilla Thunderbird
On 14/08/2024 19:28, Thiago Adams wrote:
On 14/08/2024 15:12, Bart wrote:
On 14/08/2024 18:40, Thiago Adams wrote:
On 14/08/2024 14:07, Bart wrote:
>
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to clash with some other Unicode character.
>
>
>
My suggestion again. I am using a string here, but imagine this working with bytes read from a file.
>
>
#include <stdio.h>
#include <assert.h>
>
...
int get_value(const char* s0)
{
    const char * s = s0;
    int value = 0;
    int  uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
      if (uc < 0x007F)
      {
         //multichar formula
         value = value*256+uc;
      }
      else
      {
         //single char
         value = uc;
         break; //check if there is more then error..
      }
      s = utf8_decode(s, &uc);
    }
    return value;
}
>
int main(){
   printf("%d\n", get_value(u8"×"));
   printf("%d\n", get_value(u8"ab"));
}
>
I see your problem. You're mixing things up.
  The objectives are:
  - make single characters have the Unicode value without having to use U''
  - allow more than one character, as in 'ab', in cases where each character is less than 0x007F. This can break existing code such as '¼¼',
but I suspect people are not using it this way (I hope).
Obviously that can't work. For example, two printable ASCII characters with codes 32 to 96 will have values from 8224 to 24672 (32*256+32 to 96*256+96) when combined in a character literal using that formula. Those are going to clash with the Unicode characters that have those values.
It won't work either at compile-time or runtime.
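To make the clash concrete, here is a minimal sketch that applies the quoted `value*256+uc` rule to plain ASCII bytes. The names `pack_ascii` and `is_unicode_scalar` are hypothetical helpers introduced only for illustration; they are not part of the code under discussion.

```c
#include <stdint.h>

/* Hypothetical helper: packs the bytes of an ASCII string using the
   quoted multichar rule, value = value*256 + uc. */
static uint32_t pack_ascii(const char *s)
{
    uint32_t value = 0;
    for (; *s != '\0'; ++s)
        value = value * 256u + (unsigned char)*s;
    return value;
}

/* Hypothetical helper: returns 1 if value is also a valid Unicode
   scalar value (in range and not a surrogate), 0 otherwise. */
static int is_unicode_scalar(uint32_t value)
{
    return value <= 0x10FFFFu && !(value >= 0xD800u && value <= 0xDFFFu);
}
```

Under these assumptions, `pack_ascii("ab")` yields 0x6162, which falls inside the CJK Unified Ideographs block (U+4E00 to U+9FFF), and two spaces yield 0x2020, the code point of the dagger sign. The value alone cannot say which form was written in the source.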
You need to choose between Unicode representation and UTF-8. Either that or use some prefix to disambiguate in source code, but you still need to decide whether '€' in source code is represented as the Unicode bytes 20 AC (or maybe 00 20 AC) or the UTF-8 sequence E2 82 AC, and further decide which end of those sequences will be the least significant byte.
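The two representations of '€' can be checked with a small encoder. This is only a sketch of standard UTF-8 encoding; `utf8_encode` is a hypothetical name and is unrelated to the `utf8_decode` used in the quoted code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point cp into
   out and returns the number of bytes written (0 if cp is not a valid
   Unicode scalar value). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        if (cp >= 0xD800u && cp <= 0xDFFFu)
            return 0; /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {
        out[0] = (unsigned char)(0xF0u | (cp >> 18));
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;
}
```

For the code point 0x20AC this produces the three bytes E2 82 AC, so the same character is either a 16-bit value or a 3-byte sequence depending on which representation the literal is defined to use.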

In any case, my suggestion looks dangerous. But meanwhile, this is not well specified in the standard.
It wasn't well-specified even when dealing with 100% ASCII. For example, 'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another, maybe just 0x41 or 0x42 on a third, or even 0x41410000.
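Since the value of a multi-character constant such as 'AB' is implementation-defined, code that needs a particular packing can spell it out explicitly. The sketch below models the first two outcomes mentioned above; `pack_first_high` and `pack_first_low` are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: first character ends up in the more significant
   byte, so ('A', 'B') gives 0x4142. */
static uint32_t pack_first_high(char a, char b)
{
    return ((uint32_t)(unsigned char)a << 8) | (uint32_t)(unsigned char)b;
}

/* Hypothetical helper: first character ends up in the less significant
   byte, so ('A', 'B') gives 0x4241. */
static uint32_t pack_first_low(char a, char b)
{
    return ((uint32_t)(unsigned char)b << 8) | (uint32_t)(unsigned char)a;
}
```

Unlike the literal 'AB', each of these functions gives the same result on every conforming compiler, at the cost of writing the byte order out by hand.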
