As the OP explained in a reply to one of my posts, he is getting data in
UCS-2 format from SMSs via a modem. Somewhere along the line, either in
the modem firmware or in the code sending the SMSs, characters beyond
the BMP are being used needlessly. So his first idea of manually
handling a few cases (like code 0x2019) looks like the right approach.

On 21.02.2025 20:40, Keith Thompson wrote:
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
> [...]
>> BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
>> of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
>> contains a few other characters like the € (Euro Sign). If that is
>> possible for your context you have to map a handful of characters.
>
> Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
> not, which would make the translation more difficult.

Yes, that had already been pointed out upthread.

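The one-to-one mapping for the first 256 values, combined with manual handling of a few cases such as 0x2019, might be sketched as follows. This is only an illustration under stated assumptions: the names `ucs2_to_latin1` and `latin1_fallback` are hypothetical, the input is assumed to be already in host byte order, and the fallback entries are guesses, not a list of what the OP's modem actually sends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fallback table: a few code points outside Latin-1 that
   have an obvious single-octet stand-in.  The entries are illustrative
   assumptions only. */
static unsigned char latin1_fallback(uint16_t cp)
{
    switch (cp) {
    case 0x2018:            /* LEFT SINGLE QUOTATION MARK  */
    case 0x2019:            /* RIGHT SINGLE QUOTATION MARK */
        return 0x27;        /* APOSTROPHE */
    case 0x201C:            /* LEFT DOUBLE QUOTATION MARK  */
    case 0x201D:            /* RIGHT DOUBLE QUOTATION MARK */
        return 0x22;        /* QUOTATION MARK */
    case 0x2013:            /* EN DASH */
    case 0x2014:            /* EM DASH */
        return 0x2D;        /* HYPHEN-MINUS */
    default:
        return 0x3F;        /* QUESTION MARK: no replacement known */
    }
}

/* Convert n UCS-2 code units (host byte order) to Latin-1 octets.
   dst must have room for n octets.  Returns how many code units
   had to go through the fallback path. */
size_t ucs2_to_latin1(const uint16_t *src, size_t n, unsigned char *dst)
{
    size_t replaced = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] <= 0xFF) {
            /* Latin-1 is identical to the first 256 code points. */
            dst[i] = (unsigned char)src[i];
        } else {
            dst[i] = latin1_fallback(src[i]);
            replaced++;
        }
    }
    return replaced;
}
```

Returning the number of replaced code units lets the caller log or reject messages that lost information, rather than silently emitting '?'.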
The (open) question is whether it makes sense to convert to "Latin 1"
only because it maps one-to-one onto the first 256 UCS-2 code points,
or whether the underlying application of the OP needs contemporary
characters such as the € (Euro) sign, which "Latin 9" provides.

> <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
> the 8 characters that differ between Latin-1 and Latin-9.
>
> If at all possible, it would be better to convert to UTF-8. The
> conversion is exact and reversible, and UTF-8 has largely superseded the
> various Latin-* character encodings.

Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family consists of single-octet character sets.

> I'm curious why the OP needs ISO8859-1 and can't use UTF-8.

I think this, or why he can't use "Latin 9", are essential questions.

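To illustrate the difference in kind: a UCS-2 code unit becomes one to three octets in UTF-8, and no character is lost. A minimal sketch, assuming the input is a single UCS-2 code unit in host byte order; the function name `ucs2_to_utf8` is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-2 code unit as UTF-8.  out must have room for
   3 octets.  Returns the number of octets written, or 0 if cp is a
   surrogate value (0xD800..0xDFFF), which is not a character. */
size_t ucs2_to_utf8(uint16_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* surrogates are rejected */
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

With this, 0x2019 becomes the three octets E2 80 99 and 0x00E9 becomes C3 A9, so the output is variable-width, which is exactly what a consumer expecting one octet per character cannot handle.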
This seems to have become clearer after a subsequent post by the OP:
some message/data source apparently provides characters from the upper
planes of Unicode, and the OP has to (or wants to) somehow map them to
some fixed single-octet character set. Yet no information has been
provided about which Unicode characters without a representation in
Latin 1 or Latin 9 the OP will actually encounter from that source.

As described, it all seems to make little sense.

Janis