Subject : Re: Sorting problem with Unix sort(1) with UTF-8 punctuation characters - locale issue
From : Lem (at) *nospam* none.invalid (Lem Novantotto)
Newsgroups : comp.unix.shell
Date : 20. Feb 2025, 12:14:42
Organization : A noiseless patient Spider
Message-ID : <vp72r2$2pift$1@dont-email.me>
References : 1
User-Agent : Pan/0.160 (Toresk; )
On Wed, 19 Feb 2025 12:27:18 +0100, Janis Papanagnou wrote:
> I've been sorting punctuation characters on one Unix system and it did
> not produce the expected result. Switching to another system did it as
> expected.
> The second system (not working "properly") is treating all dots as equal,
> so it sorts just the letters.
My system doesn't sort properly either. On my system:
$ locale
LANG=it_IT.UTF-8
LANGUAGE=it_IT
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=
Let's see. In my /usr/share/i18n/locales/it_IT, I have this section:
LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE
On your second system, you have LC_COLLATE=en_US or de_DE. It makes no
difference: the corresponding locale files always contain the same section:
LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE
But in /usr/share/i18n/locales/C there is:
LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison. This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE
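To compare these sections side by side on a given machine, something like the following could be used. It is only a sketch: the function name show_collate is made up for illustration, and it assumes the glibc locale sources are installed under /usr/share/i18n/locales, which is not the case everywhere.

```shell
# show_collate DIR NAME...
# Hypothetical helper: print the LC_COLLATE section of each locale
# source file DIR/NAME, or a note if the file cannot be read.
show_collate() {
    dir=$1
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '== %s ==\n' "$name"
            # Print from the line starting the section to its END line.
            sed -n '/^LC_COLLATE/,/^END LC_COLLATE/p' "$dir/$name"
        else
            printf '== %s: not readable here ==\n' "$name"
        fi
    done
}

# Assumed path (the one quoted in this post); adjust if your system differs.
show_collate /usr/share/i18n/locales C it_IT en_US de_DE
```

A locale whose section contains codepoint_collation compares strings by code point; one that copies "iso14651_t1" uses the ISO 14651 collation template instead.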
And indeed:
$ LC_COLLATE=C sort yada yada
gives the correct order.
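A small self-contained check of the byte-order behaviour, as a sketch. The sample lines are invented for illustration. It sets LC_ALL rather than LC_COLLATE because LC_ALL, when set, overrides every LC_* variable; in the locale output above LC_ALL is empty, so LC_COLLATE=C alone is enough there.

```shell
# Hypothetical sample data: letters with and without leading dots.
sample=$(printf '%s\n' 'b' '.c' 'a' '..b')

# In the C locale, sort compares bytes: '.' (0x2E) comes before
# every ASCII letter, so the dotted lines sort first.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)

printf '%s\n' "$c_sorted"
```

With byte-order collation the dots are significant, so this prints ..b, .c, a, b, one per line. Running the same pipeline without LC_ALL=C under a locale that copies "iso14651_t1" may give a different order.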
-- 
Bye, Lem
Talis erit dies qualem egeris