Subject : Re: Simple string conversion from UCS2 to ISO8859-1
From : pozzugno (at) *nospam* gmail.com (pozz)
Newsgroups : comp.lang.c
Date : 21. Feb 2025, 15:53:02

Other headers

Organization : A noiseless patient Spider
Message-ID : <vpa40d$3a0k4$6@dont-email.me>
References : 1 2
User-Agent : Mozilla Thunderbird

On 21/02/2025 15:23, David Brown wrote:

On 21/02/2025 12:40, pozz wrote:
I want to write a simple function that converts UCS2 string into ISO8859-1:
>
void ucs2_to_iso8859p1(char *ucs2, size_t size);
>
ucs2 is a hex-encoded string such as "00480065006C006C006F" for "Hello". I'm passing size because ucs2 isn't null-terminated.
>
I know I could use iconv(), but I'm on an embedded platform without an OS and without an iconv() function.
>
It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.
>
It isn't so simple to convert higher codes. For example, the small e with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as 0x27 in ISO8859-1.
>
Is there a simplified mapping table that can be written with if/switch?
>
if (code < 0x80) {
   *dst++ = (char)code;
} else {
   switch (code) {
     case 0x2019: *dst++ = 0x27; break; // Apostrophe
     case 0x...: *dst++ = ...; break;
     default: *dst++ = ' ';
   }
}
>
I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.
>
>
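A minimal sketch of the if/switch idea above, assuming the input is hex-digit text with four digits per UCS-2 code unit, most significant digit first, as in the "00480065006C006C006F" example. The names hex_value, ucs2_unit_to_latin1 and ucs2_hex_to_latin1 are hypothetical, the few punctuation fallbacks are only an illustrative selection, and '?' is used here as the substitution character instead of the space in the snippet above. Code units up to 0xFF are copied directly, since ISO8859-1 matches the first 256 Unicode code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one ASCII hex digit, or -1 if c is not a hex digit.
   Assumes an ASCII-compatible execution character set. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Map one UCS-2 code unit to an ISO8859-1 byte; '?' marks anything unmapped. */
static unsigned char ucs2_unit_to_latin1(uint16_t code)
{
    if (code <= 0xFF) return (unsigned char)code;  /* direct mapping */
    switch (code) {
    case 0x2018: case 0x2019: return '\'';  /* single quotation marks */
    case 0x201C: case 0x201D: return '"';   /* double quotation marks */
    case 0x2013: case 0x2014: return '-';   /* en dash, em dash */
    default: return '?';
    }
}

/* Convert hex-encoded UCS-2 text in place. Every 4 input characters become
   1 output byte, so the output never overtakes the input. A trailing group
   of fewer than 4 characters is ignored. Returns the number of bytes written;
   the result is not NUL-terminated. */
size_t ucs2_hex_to_latin1(char *buf, size_t size)
{
    size_t out = 0;
    for (size_t i = 0; i + 4 <= size; i += 4) {
        uint16_t code = 0;
        int ok = 1;
        for (size_t j = 0; j < 4; j++) {
            int v = hex_value(buf[i + j]);
            if (v < 0) { ok = 0; break; }
            code = (uint16_t)((code << 4) | (unsigned)v);
        }
        buf[out++] = ok ? (char)ucs2_unit_to_latin1(code) : '?';
    }
    return out;
}
```

Called on a 20-character buffer holding "00480065006C006C006F", this returns 5 and leaves "Hello" in the first five bytes. The caller adds a terminator if one is needed.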
<https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
As has been mentioned by others, 0 - 0xff should be a direct translation (with the possible exception of Latin-9 differences).
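To illustrate the "Latin-9 differences": ISO 8859-15 replaces eight ISO 8859-1 positions with other characters, so a converter that targets Latin-9 cannot copy those eight code points directly. A small sketch, where the function name unicode_to_latin9 and the -1 "no mapping" return value are assumptions made for illustration:

```c
#include <stdint.h>

/* Map a Unicode code point (BMP only) to its ISO 8859-15 byte value,
   or return -1 if Latin-9 has no such character. */
int unicode_to_latin9(uint16_t code)
{
    switch (code) {
    /* The eight characters Latin-9 adds, and the slots they occupy. */
    case 0x20AC: return 0xA4;  /* euro sign */
    case 0x0160: return 0xA6;  /* S with caron */
    case 0x0161: return 0xA8;  /* s with caron */
    case 0x017D: return 0xB4;  /* Z with caron */
    case 0x017E: return 0xB8;  /* z with caron */
    case 0x0152: return 0xBC;  /* OE ligature */
    case 0x0153: return 0xBD;  /* oe ligature */
    case 0x0178: return 0xBE;  /* Y with diaeresis */
    /* The Latin-1 characters displaced from those slots are not in Latin-9. */
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
        return -1;
    default:
        /* Everything else below 0x100 is identical in Latin-1 and Latin-9. */
        return code <= 0xFF ? (int)code : -1;
    }
}
```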
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80 - 0xff), you will quickly see that virtually none of them make any sense to support in the way you are thinking. Just because a couple of the characters in the Thaana block look a bit like quotation marks does not mean it makes any sense to try to transliterate them. Realistically, you can at most make use of a few punctuation symbols (like 0x2019 above), and maybe approximate forms for some extended Latin alphabet characters that you will never see in practice. Oh, and you might be able to support those spam emails that use Greek and other letters that look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's assuming you have output support for the full Latin-1 or Latin-9 range.
Unicode is rarely much use unless you want and can provide good support for non-Latin alphabets. Otherwise your translations are going to be so limited and simple that they are barely worth the effort and won't cover anything useful.
So here I would say that whoever provides the text should provide it in Latin-9 encoding. There's no point in allowing external translators to use whatever characters they feel are best in their language, and then having your code make some kind of odd approximation that gives results that look different. If someone really wants to use the letter "ā" that is found in the Latin Extended A block, how do /you/ know whether the best Latin-9 match is "a", "ã", "ä", or something different like "aa" or an alternative spelling of the word? Maybe the rules are different for Latvian and Anglicised Mandarin.
When we have worked with multiple languages on small embedded systems (too small for big fonts and UTF-8), we have used one of three techniques:
1. Insist that the external translators provide strings in Latin-9 only (or even just ASCII when the system was more restricted).
2. Use primarily ASCII, with a few user-defined characters per language (that's useful for old-style character displays with space for perhaps 8 user-defined characters).
3. Use a PC program to figure out the characters actually used in the strings, and put them into a single table indexing a generated list of bitmap glyphs, also generated by the program (from freely available fonts). The source is, naturally, UTF-8 - the strings stored in the embedded system are not in any standard encoding representing characters, but now hold glyph table indices.
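The PC-side step of technique 3 could be sketched roughly as follows. This is not the actual tool described above, only an illustration under stated assumptions: the input is well-formed, NUL-terminated UTF-8 (no validation is done), at most 256 distinct characters are used so that an index fits in one byte, and all names (MAX_GLYPHS, glyph_table, utf8_next, glyph_index, encode_glyphs) are hypothetical. Generating the bitmap glyphs themselves is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GLYPHS 256  /* assumed limit so that an index fits in one byte */

static uint32_t glyph_table[MAX_GLYPHS];  /* code points in first-seen order */
static size_t glyph_count;

/* Decode one UTF-8 sequence at *p, advance *p, and return the code point.
   Assumes well-formed input; this sketch performs no validation. */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Return the table index for cp, adding it if new; -1 if the table is full. */
static int glyph_index(uint32_t cp)
{
    for (size_t i = 0; i < glyph_count; i++)
        if (glyph_table[i] == cp) return (int)i;
    if (glyph_count == MAX_GLYPHS) return -1;
    glyph_table[glyph_count] = cp;
    return (int)glyph_count++;
}

/* Translate a UTF-8 string into glyph-table indices.
   Returns the number of indices written, or -1 if out or the table is full. */
int encode_glyphs(const char *utf8, uint8_t *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;
    while (*p != 0) {
        int idx = glyph_index(utf8_next(&p));
        if (idx < 0 || n == out_size) return -1;
        out[n++] = (uint8_t)idx;
    }
    return (int)n;
}
```

After all strings have been encoded, glyph_table lists exactly the characters for which bitmaps are needed, and the embedded system only ever sees the index arrays.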
Your idea here sounds to me like a lot of work for virtually no benefit.

Yes, you're right. My question comes from an SMS text received by a 4G network modem. The reply to the AT+CMGR command for a specific SMS reported the text in UCS2. The SMS was sent by the mobile operator and contained the balance of the prepaid SIM card.
The text included the apostrophe coded as U+2019 instead of U+0027. I suspect the developer who wrote the text in the mobile operator's systems was using UTF-8 (or UTF-16) and inserted exactly U+2019 (perhaps by mistake).
Anyway I think I can live without that.
