Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Lem Novantotto <Lem@none.invalid>
Newsgroups: comp.unix.shell
Subject: Re: Sorting problem with Unix sort(1) with UTF-8 punctuation
 characters - locale issue
Date: Thu, 20 Feb 2025 11:14:42 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <vp72r2$2pift$1@dont-email.me>
References: <vp4f6o$288ui$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 20 Feb 2025 12:14:42 +0100 (CET)
Injection-Info: dont-email.me; posting-host="0baac5ddd7aa045ad2b1697eaedd31f4";
	logging-data="2935293"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18HBdhjrdJvvN72UA8pC731UyE/TIm39ns="
User-Agent: Pan/0.160 (Toresk; )
Cancel-Lock: sha1:eRPE1c5Tl8A0AmIIyDVT30/251M=
Bytes: 2530

On Wed, 19 Feb 2025 12:27:18 +0100, Janis Papanagnou wrote:

> I've been sorting punctuation characters on one Unix system and it did
> not produce the expected result. Switching to another system did it as
> expected.

The second system (the one not sorting "properly") is treating all dots
as equal, so it sorts by the letters only.

My system doesn't sort properly either. On my system:

$ locale
LANG=it_IT.UTF-8
LANGUAGE=it_IT
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=

Let's see. In my /usr/share/i18n/locales/it_IT, I have this section:

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

On your second system, you have LC_COLLATE=en_US or de_DE. It makes no
difference: the corresponding locale files all contain the same section:
LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

But in /usr/share/i18n/locales/C there is:

LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison.  This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE
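
To compare these sections without opening each file, something like the
following could be used. It is only a sketch: it assumes the glibc locale
sources live under /usr/share/i18n/locales (as on my system), and
collate_section is a made-up helper name, not a standard tool.

```shell
# Print the LC_COLLATE ... END LC_COLLATE section of a locale source file.
# collate_section is a hypothetical helper, named here for illustration.
collate_section() {
    sed -n '/^LC_COLLATE/,/^END LC_COLLATE/p' "$1"
}

# Show the section for the locales discussed above.
# Files that are not installed are silently skipped.
for loc in it_IT en_US de_DE C; do
    f=/usr/share/i18n/locales/$loc
    if [ -r "$f" ]; then
        echo "== $loc"
        collate_section "$f"
    fi
done
```

The sed range prints every line from the opening LC_COLLATE keyword through
the matching END line, so any comments inside the section are shown too.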

And indeed:

$ LC_COLLATE=C sort yada yada

gives the correct sorting.
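
A small way to see the difference, as a sketch only: the sample lines are
made up, and LC_ALL is used instead of LC_COLLATE because LC_ALL, when set,
overrides LC_COLLATE (on my system LC_ALL is empty, so either would work).

```shell
# Made-up sample lines, for illustration only.
sample() {
    printf '%s\n' '.b' '-a' 'b' 'a'
}

# Byte-order collation: '-' (0x2D) < '.' (0x2E) < 'a' (0x61) < 'b' (0x62),
# so this prints -a, .b, a, b (one per line).
sample | LC_ALL=C sort

# Same input under whatever locale is currently set. The order depends on
# that locale's collation tables, so no expected output is given here.
sample | sort
```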
-- 
Bye, Lem
		Talis erit dies qualem egeris