Subject : Sorting problem with Unix sort(1) with UTF-8 punctuation characters - locale issue
From : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Newsgroups : comp.unix.shell
Date : 19. Feb 2025, 12:27:18
Organization : A noiseless patient Spider
Message-ID : <vp4f6o$288ui$1@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
I've been sorting lines of punctuation characters on one Unix system
and it did not produce the expected result. Another system sorted
them as expected.
The test program (it contains non-ASCII middle-dot characters) was:
sort -t $'\t' <<EOT
····**·······**················< abc1
···········**······**··········< efg2
·**·························**·< hij3
············**·················< klm4
···**····················**····< nop5
···**···················**·**··< qrs6
··**··········**·········**····< tuv7
**·····························< wxy8
EOT
Running it on an older system - with sort (GNU coreutils) 8.13 - produced:
**·····························< wxy8
·**·························**·< hij3
··**··········**·········**····< tuv7
···**···················**·**··< qrs6
···**····················**····< nop5
····**·······**················< abc1
···········**······**··········< efg2
············**·················< klm4
On a newer system - with sort (GNU coreutils) 8.28 - it left these
lines completely unsorted[*]:
····**·······**················< abc1
···········**······**··········< efg2
·**·························**·< hij3
············**·················< klm4
···**····················**····< nop5
···**···················**·**··< qrs6
··**··········**·········**····< tuv7
**·····························< wxy8
One hypothesis was that it's some locale issue. So I've copied the
LC_* settings to the newer system and disabled them one by one.
Strangely, the one that was responsible for the effect was LC_TIME!
On the correctly sorting system it was defined as
LC_TIME=de_DE.UTF-8@isodate
and the one that worked improperly had
LC_TIME=de_DE.UTF-8
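The one-by-one elimination described above can be sketched as a small loop. This is only an illustration, not the exact commands that were used: the function name check_lc_vars and the file name input.txt are made up for the example, and "env -u" is assumed to be available (it is a GNU coreutils option).

```shell
#!/bin/sh
# Sketch: remove one LC_* variable at a time and report whether sort's
# output changes. Assumes GNU env (for "env -u").

# check_lc_vars FILE (hypothetical helper): compare sort's output with
# and without each LC_* variable currently present in the environment.
check_lc_vars() {
    file=$1
    tab=$(printf '\t')
    baseline=$(sort -t "$tab" "$file")
    for var in $(env | sed -n 's/^\(LC_[A-Z_]*\)=.*/\1/p'); do
        # Drop exactly one variable for this single sort invocation.
        result=$(env -u "$var" sort -t "$tab" "$file")
        if [ "$result" = "$baseline" ]; then
            echo "$var: no change"
        else
            echo "$var: output differs"
        fi
    done
}

# Hypothetical usage: input.txt holds the test lines shown above.
# check_lc_vars input.txt
```

Only variables that are exported in the current environment are visited, so a category that is merely inherited from LANG will not show up in the loop.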
Now I'm puzzled in many ways...
If anything, I'd have expected LC_COLLATE to have an effect on sorting.
Also, there's no locale with @isodate installed on the system that
fails to sort. And neither clearing LC_TIME nor removing the "@isodate"
part changed anything; LC_TIME has to name a non-existent locale for
sort to work correctly on the otherwise incorrectly sorting system.
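Whether a locale name is actually installed, and which values the LC_* categories effectively have, can be checked with locale(1). The following is a minimal sketch; locale_installed is a hypothetical helper name, and the exact-match test is an assumption about how the name is spelled in the listing.

```shell
#!/bin/sh
# Sketch: show the effective locale categories and check whether a given
# locale name is listed as installed.

# locale_installed NAME (hypothetical helper): succeed if NAME appears
# verbatim in the output of "locale -a".
locale_installed() {
    locale -a 2>/dev/null | grep -Fxq -- "$1"
}

# Effective LANG / LC_* values as programs see them.
locale 2>/dev/null || echo "locale(1) not available"

# Caveat: "locale -a" may spell the codeset differently (for example
# "utf8" instead of "UTF-8"), so an exact match can miss a locale that
# does exist under another spelling.
if locale_installed 'de_DE.UTF-8@isodate'; then
    echo "de_DE.UTF-8@isodate is listed"
else
    echo "de_DE.UTF-8@isodate is not listed by locale -a"
fi
```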
Does anyone have an idea what's going on here?
I'm reluctant to globally set LC_TIME=de_DE.UTF-8@isodate
(since there is no file with that name in the locale directories).
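For experiments that avoid a global change, a setting can be applied to a single sort invocation only. A sketch under assumptions: sort_with is a hypothetical helper and input.txt a made-up file name. Running the same input with LC_ALL=C gives a plain byte-order baseline to compare against, since LC_ALL overrides all other LC_* variables and LANG.

```shell
#!/bin/sh
# Sketch: apply a locale setting to one sort invocation only, leaving the
# session's environment untouched.

# sort_with VAR=VALUE FILE (hypothetical helper): run sort with one extra
# environment assignment that exists only for that command.
sort_with() {
    env "$1" sort -t "$(printf '\t')" "$2"
}

# Hypothetical usage (input.txt holds the test lines shown above):
# sort_with LC_TIME=de_DE.UTF-8@isodate input.txt
# sort_with LC_ALL=C input.txt     # byte-order comparison as a baseline
```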
Thanks.
Janis
[*] Lines with additional other contents than the depicted payload
were sorted correctly.