Subject : Re: Sorting problem with Unix sort(1) with UTF-8 punctuation characters - locale issue
From : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Newsgroups : comp.unix.shell
Date : 20. Feb 2025, 01:54:15
Organisation : A noiseless patient Spider
Message-ID : <vp5ufo$2h4ql$1@dont-email.me>
References : 1 2
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 19.02.2025 21:22, Christian Weisgerber wrote:
On 2025-02-19, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
If anything, I'd expected LC_COLLATE to have an effect on sorting.
Then there's no locale with @isodate on that sort-defunct system.
And clearing that LC_TIME locale or removing the "@isodate" part
did not change anything; it needs that setting to a non-existing
locale file to work correctly on the otherwise not correctly
sorting system.
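For reference, a minimal sketch of how one might inspect what a system actually has, assuming a POSIX locale(1) utility is installed. The helper name have_locale is made up for illustration, and the spelling reported by "locale -a" differs between systems (e.g. "de_DE.utf8" vs. "de_DE.UTF-8"), so an exact-match check like this can miss an existing locale.

```shell
#!/bin/sh
# Show the effective value of every locale category. Values printed in
# double quotes are implied by LANG or LC_ALL rather than set directly.
locale

# have_locale: hypothetical helper; succeeds if the exact name given
# as $1 appears in the list of installed locales.
have_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

# Report which of these names the system lists.
for name in POSIX de_DE.UTF-8 de_DE.UTF-8@isodate; do
    if have_locale "$name"; then
        echo "$name: listed"
    else
        echo "$name: not listed"
    fi
done
```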
My working hypothesis would be that setting LC_TIME to a nonexistent
locale causes an error that invalidates the _whole_ locale setting
and causes a fallback to a default setting, likely the "C" locale.
You can check that sorting with LC_ALL=C or an invalid value like
LC_ALL=foobar will produce your "correct" result.
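A minimal sketch of that check, using made-up sample lines rather than the data from this thread. LC_ALL takes precedence over LANG and over every individual LC_* variable, so the first run compares plain byte values. That the invalid name falls back to the same order is an assumption about the local sort(1), not a guarantee.

```shell
#!/bin/sh
# Hypothetical sample data, not the data from the original post.
sample() { printf '%s\n' 'b' 'A' '_x' 'a' 'B'; }

# LC_ALL overrides LANG and all individual LC_* variables,
# so this run sorts by byte value.
c_sorted=$(sample | LC_ALL=C sort)
printf 'LC_ALL=C:\n%s\n' "$c_sorted"

# An invalid locale name. Assumption: sort falls back to the "C"
# behaviour here; any diagnostic on stderr is discarded.
bad_sorted=$(sample | LC_ALL=foobar sort 2>/dev/null)
printf 'LC_ALL=foobar:\n%s\n' "$bad_sorted"
```

If both runs print the same order but a run without LC_ALL differs, the collation of the configured locale is the likely cause.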
That was actually also my own first locale-based hypothesis, and
setting LC_ALL=C was the first thing I tried (before identifying
the strange LC_TIME "solution"). But that setting did not change
that strange behavior. (But see below.)
A corollary from this would be that your "sort-defunct" system uses
a different collation order than your "correctly" sorting system
for the de_DE.UTF-8 locale.
Right. The point is that the two systems I'm using are handled by
me in different ways. The old system is one where I changed on a
system level all deficiencies I encountered; the @isodate locale
is such a beast. (It works on that system.) The newer system is
one that got standard updates and less (or hardly any) "fixes" by
me, so that I'd expect to work better "as designed". (But the
opposite is the case.)
On the old system I've explicitly defined
LC_TIME=de_DE.UTF-8@isodate
LC_COLLATE=C.UTF-8
and on the new system the collation is
LC_TIME=de_DE.UTF-8
LC_COLLATE=en_US.UTF-8
I'm sure there was a reason why the setting is now "en_US" instead
of "de_DE" (like almost all other LC-settings), so I'm reluctant
to change that. (But setting LC_COLLATE to "C.UTF-8" works as well.)
I think I'll have to use a local (not system wide) LC-change to fix
the issue to behave as I'd expect without touching the rest.
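Such a local change could look roughly like the sketch below, which sets LC_COLLATE only for the single sort invocation. The function name csort is hypothetical, and the sample lines are made up. It uses LC_COLLATE=C rather than C.UTF-8 because "C" exists everywhere; since the byte order of UTF-8 matches code point order, the two should agree for a plain sort, though I have only reasoned about that and not tested it on every system.

```shell
#!/bin/sh
# csort: hypothetical helper name. Runs sort(1) with byte-wise
# collation for this one invocation; the calling shell's locale
# variables are left as they are.
csort() {
    (
        # LC_ALL, if set, would override LC_COLLATE, so drop it
        # inside this subshell only. Side effect: categories that
        # LC_ALL had been forcing fall back to their own LC_* / LANG.
        unset LC_ALL
        LC_COLLATE=C sort "$@"
    )
}

# Made-up sample data.
result=$(printf '%s\n' 'b' 'A' 'a' 'B' | csort)
printf '%s\n' "$result"
```

The same function accepts the usual sort options and file arguments, e.g. "csort -u file.txt".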
On the FreeBSD 14-STABLE system I'm typing this on, sorting your
example data with my typical C.UTF-8 locale produces your expected
result, sorting with de_DE.UTF-8 (or en_US.UTF-8) produces a different
order.
····**·······**················< abc1
···········**······**··········< efg2
·**·························**·< hij3
Also, I have no idea what could be considered the "correct" sorting
order for this.
Unless all used punctuation characters are disregarded or treated as
having all the same sorting order it should IMO be obvious that the
original unsorted form is not correct.
Thanks for your reply. It helped to find another setting that produces
the desired result.
Janis