On 21/02/2025 12:40, pozz wrote:
Yes, you're right. My question comes from an SMS text received by a 4G network modem. The reply to AT+CMGR command for a specific SMS reported the text in UCS2. The SMS was one sent by the mobile operator with balance of the prepaid SIM card.
I want to write a simple function that converts UCS2 string into ISO8859-1:
<https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
>
void ucs2_to_iso8859p1(char *ucs2, size_t size);
>
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing size because ucs2 isn't null terminated.
>
I know I can use iconv() feature, but I'm on an embedded platform without an OS and without iconv() function.
>
It is trivial to convert "0000"-"007F" chars: it's a simple cast from unsigned int to char.
>
It isn't so simple to convert higher codes. For example, the small e with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial again. But I saw the code "2019" (apostrophe) that can be rendered as 0x27 in ISO8859-1.
>
Is there a simplified mapping table that can be written with if/switch?
>
if (code < 0x80) {
    *dst++ = (char)code;
} else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' ';
    }
}
>
I'm not looking for a very detailed and correct mapping, just a "sufficient" implementation.
>
>
As has been mentioned by others, 0 - 0xff should be a direct translation (with the possible exception of Latin-9 differences).
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
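The direct translation described above could look roughly like the following. This is only a minimal sketch, assuming the input is the four-hex-digits-per-character text shown in the question. The names `hex_val`, `ucs2_to_latin1_char` and `ucs2_hex_to_iso8859p1` are made up for illustration, as are the separate output buffer, the use of '?' for anything unmappable, and the handful of punctuation fallbacks. It targets ISO 8859-1; for Latin-9 output, the eight positions that differ (0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE) would need separate handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Map one UCS-2 code unit to an ISO 8859-1 byte (hypothetical helper). */
static unsigned char ucs2_to_latin1_char(uint16_t code)
{
    if (code <= 0xFF)
        return (unsigned char)code;  /* first 256 code points match Latin-1 */

    switch (code) {                  /* a few assumed punctuation fallbacks */
    case 0x2018: case 0x2019: return 0x27;  /* single quotes -> apostrophe */
    case 0x201C: case 0x201D: return 0x22;  /* double quotes -> '"' */
    case 0x2013: case 0x2014: return 0x2D;  /* en/em dash -> hyphen-minus */
    default:                  return '?';   /* no sensible equivalent */
    }
}

/*
 * Convert `size` hex characters (4 per code unit) from `ucs2` into `dst`.
 * `dst` must have room for size / 4 + 1 bytes. Trailing characters that do
 * not form a complete group of four are ignored.
 * Returns the number of bytes written, not counting the terminator.
 */
size_t ucs2_hex_to_iso8859p1(const char *ucs2, size_t size, char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i + 4 <= size; i += 4) {
        uint16_t code = 0;
        int ok = 1;

        for (size_t j = 0; j < 4; j++) {
            int v = hex_val(ucs2[i + j]);
            if (v < 0) { ok = 0; break; }
            code = (uint16_t)((code << 4) | (unsigned)v);
        }
        dst[n++] = ok ? (char)ucs2_to_latin1_char(code) : '?';
    }
    dst[n] = '\0';
    return n;
}
```

Called with "00480065006C006C006F" and a size of 20, this writes "Hello" and returns 5. Since the output is always shorter than the input, the same loop could also convert in place if memory is tight.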
When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80 - 0xff), you will quickly see that virtually none of them make any sense to support in the way you are thinking. Just because a couple of the characters in the Thaana block look a bit like quotation marks does not mean it makes any sense to try to transliterate them. Realistically, you can at most make use of a few punctuation symbols (like 0x2019 above), and maybe approximate forms for some extended Latin alphabet characters that you will never see in practice. Oh, and you might be able to support those spam emails that use Greek and other letters that look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's assuming you have output support for the full Latin-1 or Latin-9 range.
Unicode is rarely much use unless you want and can provide good support for non-Latin alphabets. Otherwise your translations are going to be so limited and simple that they are barely worth the effort and won't cover anything useful.
So here I would say that whoever provides the text, provides it in Latin-9 encoding. There's no point in allowing external translators to use whatever characters they feel are best in their language, and then having your code make some kind of odd approximation giving results that look different. If someone really wants to use the letter "ā" that is found in the Latin Extended A block, how do /you/ know whether the best Latin-9 match is "a", "ã", "ä", or something different like "aa" or an alternative spelling of the word? Maybe the rules are different for Latvian and Anglicised Mandarin.
When we have worked with multiple languages on small embedded systems (too small for big fonts and UTF-8), we have used one of three techniques:
1. Insist that the external translators provide strings in Latin-9 only (or even just ASCII when the system was more restricted).
2. Use primarily ASCII, with a few user-defined characters per language (that's useful for old-style character displays with space for perhaps 8 user-defined characters).
3. Use a PC program to figure out the characters actually used in the strings, and put them into a single table indexing a generated list of bitmap glyphs, also generated by the program (from freely available fonts). The source is, naturally, UTF-8 - the strings stored in the embedded system are not in any standard encoding representing characters, but now hold glyph table indices.
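To illustrate the third technique, here is a rough sketch of the string side of such a PC tool: it walks the UTF-8 source strings, collects each distinct code point into a table, and re-encodes the strings as one-byte indices into that table. Everything here is an assumption for illustration, not a description of any actual tool: the names `utf8_next`, `glyph_index` and `encode_glyphs`, the 255-entry limit, the use of 0 as a terminator, and the minimal UTF-8 decoder that trusts its input. Generating the bitmap glyphs from a font is a separate step that is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GLYPHS 255  /* indices 1..255 fit in a byte; 0 terminates a string */

static uint32_t glyph_table[MAX_GLYPHS];  /* code point of glyph (index - 1) */
static size_t   glyph_count;

/* Decode one UTF-8 sequence at *p and advance *p past it.
 * Minimal decoder: assumes the input is valid UTF-8. */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else                            { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

    *p = s;
    return cp;
}

/* Return the 1-based table index for cp, adding it if it is new.
 * Returns 0 if the table is full. */
static unsigned char glyph_index(uint32_t cp)
{
    for (size_t i = 0; i < glyph_count; i++)
        if (glyph_table[i] == cp)
            return (unsigned char)(i + 1);
    if (glyph_count == MAX_GLYPHS)
        return 0;
    glyph_table[glyph_count++] = cp;
    return (unsigned char)glyph_count;
}

/* Re-encode a UTF-8 string as glyph indices, terminated by 0.
 * `out` needs one byte per character plus one for the terminator.
 * Returns the number of indices written. */
size_t encode_glyphs(const char *utf8, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;

    while (*p != 0) {
        unsigned char idx = glyph_index(utf8_next(&p));
        if (idx == 0)
            break;  /* table full: stop here; a real tool should report this */
        out[n++] = idx;
    }
    out[n] = 0;
    return n;
}
```

After all strings have been processed, `glyph_table` lists exactly the characters that need bitmaps, in index order, and the embedded system only ever sees the index strings.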
Your idea here sounds to me like a lot of work for virtually no benefit.