From: Lem Novantotto <Lem@none.invalid>
Newsgroups: comp.unix.shell
Subject: Re: Sorting problem with Unix sort(1) with UTF-8 punctuation characters - locale issue
Date: Thu, 20 Feb 2025 11:14:42 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <vp72r2$2pift$1@dont-email.me>
References: <vp4f6o$288ui$1@dont-email.me>

On Wed, 19 Feb 2025 12:27:18 +0100, Janis Papanagnou wrote:

> I've been sorting punctuation characters on one Unix system and it did
> not produce the expected result. Switching to another system did it as
> expected.

The second system (the one not working "properly") is treating all dots
as equal, so it sorts by the letters only.

My system doesn't sort properly either. On my system:

$ locale
LANG=it_IT.UTF-8
LANGUAGE=it_IT
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=

Let's see. In my /usr/share/i18n/locales/it_IT, I have this section:

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

On your second system, you have LC_COLLATE=en_US or de_DE.
It's the same: the corresponding locale files always contain the same
section:

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

But in /usr/share/i18n/locales/C there is:

LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison. This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE

And here it is:

$ LC_COLLATE=C sort yada yada

gives the correct sorting.

-- 
Bye, Lem
Talis erit dies qualem egeris
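
A minimal sketch of the comparison described in the post, assuming a POSIX sort. The sample lines and the sample() helper are made up for illustration, since the original input is not shown in the thread. The script uses LC_ALL=C rather than LC_COLLATE=C because LC_ALL, when set, takes precedence over LC_COLLATE; on a system where LC_ALL is empty, as in the locale output quoted in the post, the two should behave the same for sort.

```shell
#!/bin/sh
# Hypothetical sample data: letters prefixed by different ASCII punctuation.
sample() {
    printf '%s\n' '.b' ',a' '-c' 'a'
}

# Ordering under whatever locale the environment selects.
# The result depends on that locale's LC_COLLATE rules.
env_sorted=$(sample | sort)

# Byte-value ordering (strcmp-style), forced via the C locale.
c_sorted=$(sample | LC_ALL=C sort)

printf 'environment locale:\n%s\n\n' "$env_sorted"
printf 'C locale:\n%s\n' "$c_sorted"
```

In the C locale the lines come out in byte order (comma 0x2C, hyphen 0x2D, dot 0x2E, then the letter a at 0x61), so punctuation is never skipped during comparison. If the two listings differ, the environment's collation table is what changes the order.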