Subject: Re: Simple string conversion from UCS2 to ISO8859-1
From: david.brown (at) *nospam* hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Date: 25 Feb 2025, 17:16:23
Organization: A noiseless patient Spider
Message-ID : <vpkqcn$22c6h$1@dont-email.me>
References : 1 2 3 4 5 6 7 8
User-Agent : Mozilla Thunderbird
On 25/02/2025 15:53, pozz wrote:
On 25/02/2025 13:18, Richard Damon wrote:
On 2/25/25 2:35 AM, pozz wrote:
On 24/02/2025 21:13, Lawrence D'Oliveiro wrote:
On Mon, 24 Feb 2025 16:57:24 +0100, pozz wrote:
>
On 22/02/2025 14:18, David Brown wrote:
>
My understanding here is that the OP is getting the UCS-2 encoded
string in from a modem, almost certainly on a serial line. The UCS-2
encoded data is itself a binary sequence of 16-bit code units, and the
modem firmware is sending those as four hex digits.
>
Exactly. This is the reply to AT+CMGR command that is standardized in
3GPP TS 27.005.
>
Anything that is specifying the use of UCS-2 encoding automatically dates
itself to about the early-to-mid 1990s.
>
Honestly, I don't know why or since when, but the LTE modem I'm using (Simcom A7672E) replies to AT+CMGR in two different formats:
>
- what is described as the GSM 7-bit alphabet (but it's really UTF-8 when non-ASCII chars are present)
>
- UCS2
>
Of course, in the header, it specifies the <dcs> (data coding scheme) so the receiver on the UART can correctly interpret all the data.
>
>
Are you sure it is UCS2 and not UTF-16?
>
Can it not handle characters not in the BMP?
>
The difference between UCS2 and UTF-16 is that UCS2 is the character set that predates the surrogate-pairs added to extend it. It is very much the equivalent relationship of ASCII to UTF-8.
Honestly, I don't know; the standard says UCS2.
The standard used by modems here is UCS2, not UTF-16. As you point out, this was all standardised in the early 1990s (before UTF-16), as a standardisation of things that had already been in use before that. And once a telecom standard is made, it is set in stone and never changed. Unlike some things that adopted Unicode early using UCS2 (like Windows NT, Java, Qt, Python), the UCS2 usage in established standard modem commands (like AT+CMGR) could not be, and was not, extended to UTF-16. There might be other AT commands supported by some modems that /do/ support UTF-8 or UTF-16, but existing standardised commands don't change.
For all Unicode code points supported by UCS2, the coding is the same as for UTF-16 (as Richard says, it's like the ASCII subset of UTF-8). So you can always treat UCS2 as UTF-16. Unicode characters outside this set simply have no representation in UCS2.
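For the conversion in the subject line, this means something like the rough sketch below. It assumes the modem sends each UCS-2 code unit as four hex digits, most significant digit first, as described earlier in the thread. The function name ucs2hex_to_latin1 and the '?' replacement character are illustrative choices only, not taken from any standard or from the OP's code. It relies on the fact that ISO 8859-1 is identical to the first 256 Unicode code points, so any code unit up to 0x00FF can be copied directly and everything else (including surrogate values, should a modem send UTF-16 after all) has no Latin-1 representation.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode a NUL-terminated string of hex-encoded
 * UCS-2 code units (four hex digits each, big-endian digit order)
 * into a NUL-terminated ISO 8859-1 string in out.
 *
 * Code units above 0x00FF cannot be represented and become '?'.
 * Returns the number of characters written (excluding the NUL),
 * or -1 on malformed input or if out is too small.
 */
int ucs2hex_to_latin1(const char *hex, unsigned char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0) return -1;

    while (*hex != '\0') {
        uint16_t unit = 0;

        for (int i = 0; i < 4; i++) {
            int v = hex_val(hex[i]);
            /* A truncated group stops here too: '\0' is not a hex digit. */
            if (v < 0) return -1;
            unit = (uint16_t)((unit << 4) | (unsigned)v);
        }
        hex += 4;

        if (n + 1 >= out_size) return -1;   /* keep room for the NUL */
        out[n++] = (unit <= 0xFF) ? (unsigned char)unit : (unsigned char)'?';
    }

    out[n] = '\0';
    return (int)n;
}
```

With this sketch, the input "004300690061006F" would decode to "Ciao", while a code unit such as 20AC (the euro sign) would come out as '?' because it lies outside ISO 8859-1. Whether to substitute, skip, or reject such characters is a design decision for the application.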