<vp5ufo$2h4ql$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Newsgroups: comp.unix.shell
Subject: Re: Sorting problem with Unix sort(1) with UTF-8 punctuation
 characters - locale issue
Date: Thu, 20 Feb 2025 01:54:15 +0100
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <vp5ufo$2h4ql$1@dont-email.me>
References: <vp4f6o$288ui$1@dont-email.me>
 <slrnvrcfcl.3e0.naddy@lorvorc.mips.inka.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 20 Feb 2025 01:54:16 +0100 (CET)
Injection-Info: dont-email.me; posting-host="21d22e278c11729412f6eed56de0f37b";
	logging-data="2659157"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+QUwiI6AEK6TTIJdZqSv6i"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
Cancel-Lock: sha1:JyKNRFraGpjDqSQN2ByxQLmZwnA=
In-Reply-To: <slrnvrcfcl.3e0.naddy@lorvorc.mips.inka.de>
X-Enigmail-Draft-Status: N1110
Bytes: 4192

On 19.02.2025 21:22, Christian Weisgerber wrote:
> On 2025-02-19, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> 
>> If anything, I'd expected LC_COLLATE to have an effect on sorting.
>> Then there's no locale with @isodate on that sort-defunct system.
>> And clearing that LC_TIME locale or removing the "@isodate" part
>> did not change anything; it needs that setting to a non-existing
>> locale file to work correctly on the otherwise not correctly
>> sorting system.
> 
> My working hypothesis would be that setting LC_TIME to a nonexistent
> locale causes an error that invalidates the _whole_ locale setting
> and causes a fallback to a default setting, likely the "C" locale.
> You can check that sorting with LC_ALL=C or an invalid value like
> LC_ALL=foobar will produce your "correct" result.

That was actually also my own first locale-based hypothesis, and
setting LC_ALL=C was the first thing I tried (before identifying
the strange LC_TIME "solution"). But that setting did not change
that strange behavior. (But see below.)
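
For reference, the LC_ALL=C check suggested above can be sketched as
follows. The sample lines and the helper names (sample, sort_bytewise)
are made up for illustration and are not the data from this thread;
the sketch only shows what byte-wise collation does with ASCII
punctuation.

```sh
# Made-up ASCII sample data, not the poster's original file.
sample() { printf '%s\n' '..b' '.*a' '*.c'; }

# LC_ALL=C forces byte-wise comparison for this one sort call only:
# '*' (0x2A) sorts before '.' (0x2E), and punctuation is not ignored.
sort_bytewise() { LC_ALL=C sort "$@"; }

sample | sort_bytewise
```

Running the same pipeline with plain "sort" shows whether the current
locale orders the lines differently.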

> 
> A corollary from this would be that your "sort-defunct" system uses
> a different collation order than your "correctly" sorting system
> for the de_DE.UTF-8 locale.

Right. The point is that I maintain the two systems in different
ways. On the old system I fixed, at the system level, every
deficiency I encountered; the @isodate locale is such a beast. (It
works on that system.) The newer system got standard updates and
fewer (or hardly any) "fixes" by me, so I'd expect it to work
better, "as designed". (But the opposite is the case.)

On the old system I've explicitly defined
  LC_TIME=de_DE.UTF-8@isodate
  LC_COLLATE=C.UTF-8
and on the new system the settings are
  LC_TIME=de_DE.UTF-8
  LC_COLLATE=en_US.UTF-8
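
A minimal sketch for inspecting such settings, assuming the POSIX
locale(1) utility is installed; the helper name effective_category is
hypothetical. Note that locale(1) prints a value in double quotes
when it is only implied by LANG or LC_ALL rather than set explicitly;
the helper strips those quotes.

```sh
# Print the effective value of one locale category, e.g. LC_COLLATE.
# "effective_category" is a hypothetical helper, not a standard tool.
effective_category() {
    locale | sed -n "s/^$1=//p" | tr -d '"'
}

effective_category LC_COLLATE
effective_category LC_TIME
```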

I'm sure there was a reason why the setting is now "en_US" instead
of "de_DE" (like almost all other LC settings), so I'm reluctant
to change that. (But setting LC_COLLATE to "C.UTF-8" works as well.)

I think I'll have to use a local (not system-wide) LC change to fix
the issue, so that sorting behaves as I'd expect without touching
the rest.
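
Such a local change could look like the sketch below. The wrapper
name csort is hypothetical, and it assumes a C.UTF-8 locale is
available on the system. Because LC_ALL, if set, overrides
LC_COLLATE, the wrapper clears LC_ALL inside a subshell, so the
calling shell's environment stays untouched.

```sh
# Hypothetical wrapper: sort with C.UTF-8 collation for this call only.
csort() {
    # LC_ALL takes precedence over LC_COLLATE, so drop it here.
    # The parentheses run this in a subshell; nothing leaks out.
    ( unset LC_ALL; LC_COLLATE=C.UTF-8 sort "$@" )
}

# Made-up ASCII sample data for illustration.
printf '%s\n' '..b' '.*a' '*.c' | csort
```

Clearing LC_ALL this way also affects the other categories for that
one call, which should not matter for plain sorting.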

> 
> On the FreeBSD 14-STABLE system I'm typing this on, sorting your
> example data with my typical C.UTF-8 locale produces your expected
> result, sorting with de_DE.UTF-8 (or en_US.UTF-8) produces a different
> order.
> 
>>> ····**·······**················<	abc1
>>> ···········**······**··········<	efg2
>>> ·**·························**·<	hij3
> 
> Also, I have no idea what could be considered the "correct" sorting
> order for this.

Unless all the punctuation characters used are disregarded or treated
as having the same sort weight, it should IMO be obvious that the
original unsorted order is not correct.

Thanks for your reply. It helped to find another setting that produces
the desired result.

Janis