Newsportal USENET - Re: Sorting problem with Unix sort(1) with UTF-8 punctuation characters

Subject : Re: Sorting problem with Unix sort(1) with UTF-8 punctuation characters - locale issue
From : Lem (at) *nospam* none.invalid (Lem Novantotto)
Newsgroups : comp.unix.shell
Date : 20. Feb 2025, 12:14:42

Other headers

Organization : A noiseless patient Spider
Message-ID : <vp72r2$2pift$1@dont-email.me>
References : 1
User-Agent : Pan/0.160 (Toresk; )

On Wed, 19 Feb 2025 12:27:18 +0100, Janis Papanagnou wrote:

> I've been sorting punctuation characters on one Unix system and it did
> not produce the expected result. Switching to another system did it as
> expected.

> The second system (not working "properly") is treating all dots as equal,
> so it sorts just the letters.

My system doesn't sort properly either. On my system:

$ locale
LANG=it_IT.UTF-8
LANGUAGE=it_IT
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=
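
In the output above, the quoted values are implied by LANG rather than set explicitly, and LC_ALL is empty. A minimal sketch of how the collation locale is picked, assuming the usual POSIX precedence (LC_ALL, then LC_COLLATE, then LANG); the helper name effective_collate is made up for illustration:

```shell
# Hypothetical helper: print the locale name that governs collation.
# Precedence: LC_ALL overrides LC_COLLATE, which overrides LANG.
# If none is set, fall back to "C" (the usual default; strictly
# speaking the default is implementation-defined).
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

With the environment shown above it would print it_IT.UTF-8, taken from LANG.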

Let's see. In my /usr/share/i18n/locales/it_IT, I have this section:

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

On your second system, you have LC_COLLATE=en_US or de_DE. It makes no
difference: the corresponding locale files all contain the same section:
LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

But in /usr/share/i18n/locales/C there is:

LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison. This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE
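
The sections quoted above can be extracted from the locale source files with a short loop. This is only a sketch: it assumes the glibc sources are installed under /usr/share/i18n/locales (the path used in this post), and the function name collate_section is made up for illustration.

```shell
# Hypothetical helper: print the LC_COLLATE section of a locale source file.
# $1 is a path such as /usr/share/i18n/locales/it_IT.
collate_section() {
    sed -n '/^LC_COLLATE/,/^END LC_COLLATE/p' "$1"
}

# Show the section for each locale mentioned in this thread,
# silently skipping files that are not installed on this system.
for name in it_IT en_US de_DE C; do
    file=/usr/share/i18n/locales/$name
    if [ -r "$file" ]; then
        echo "== $name"
        collate_section "$file"
    fi
done
```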

And here it is:

$ LC_COLLATE=C sort yada yada

gives the correct sorting.
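
A small sketch of the difference, using made-up sample lines (not the data from the original post). It sets LC_ALL=C instead of LC_COLLATE=C only so that it also works where LC_ALL happens to be set; with LC_ALL empty, as in the locale output above, LC_COLLATE=C is enough.

```shell
# Made-up sample: three different "dot" characters, each followed by a letter.
sample=$(printf '%s\n' '•a' '.b' '·c')

# Byte-wise (strcmp) order: '.' is 0x2E, '·' starts with 0xC2, '•' with 0xE2.
sorted_c=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$sorted_c"

# Same input under the current locale. The result depends on the installed
# collation tables; if punctuation is ignored, only the letters decide.
printf '%s\n' "$sample" | sort
```

The first sort prints .b, ·c, •a in that order, one per line.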
--
Bye, Lem
Talis erit dies qualem egeris
