Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Wed, 7 May 2025 23:03:44 +0200
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <vvghrg$18321$1@dont-email.me>
References: <vuih43$2agfa$1@dont-email.me> <vuml73$1riea$1@dont-email.me>
 <vun04h$2fjrn$2@raubtier-asyl.eternal-september.org>
 <vun1nh$22hc5$3@dont-email.me>
 <vunak2$2p980$1@raubtier-asyl.eternal-september.org>
 <vunbgo$2q5u8$1@dont-email.me>
 <vunbjg$2q72n$1@raubtier-asyl.eternal-september.org>
 <vund1f$2rh3j$1@dont-email.me>
 <vungko$2uoa2$1@raubtier-asyl.eternal-september.org>
 <X9MPP.1383458$f81.819466@fx48.iad>
 <vuobri$3o38b$1@raubtier-asyl.eternal-september.org>
 <XtOPP.2986761$t84d.2537581@fx11.iad>
 <vuohq9$3tlhf$1@raubtier-asyl.eternal-september.org>
 <vuoig5$3ub4j$1@dont-email.me>
 <vuorpf$6tnn$1@raubtier-asyl.eternal-september.org>
 <vup2nt$bi1k$2@dont-email.me>
 <vupofl$13pg2$2@raubtier-asyl.eternal-september.org>
 <vuprce$15sqo$2@dont-email.me>
 <vvd6n5$353gs$1@raubtier-asyl.eternal-september.org>
 <vvfbnj$ulpc$1@dont-email.me> <vvflec$11b72$1@dont-email.me>
 <vvg8uq$1647n$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 07 May 2025 23:03:45 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="32b705cac158ab76c251e0573503a3c2";
	logging-data="1313857"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19AnngUsEJ1YwhAfmNQP7yRfDA2q8HTUsM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:FPeoZ2c9xgnT6DgjrNowUjphxSs=
Content-Language: en-GB
In-Reply-To: <vvg8uq$1647n$2@dont-email.me>
Bytes: 4672

On 07/05/2025 20:26, BGB wrote:
> On 5/7/2025 7:58 AM, Janis Papanagnou wrote:
>> On 07.05.2025 12:08, BGB wrote:
>>> [...]
>>>
>>> Though, if someone really must make something case-insensitive, a case
>>> could be made for only supporting it for maybe Latin, Greek, and
>>> Cyrillic.
>>
>> I don't understand what you want to say here; it just sounds strange
>> to me. - Mind to elaborate?
>>
> 
> Latin, Greek, and Cyrillic, are the main alphabets which actually have a 
> useful and reasonably well defined concept of "case", and thus "case 
> folding" actually makes sense for these.
> 
> For most other places, it does not, and one can likely ignore rules for 
> things outside of these alphabets. Can eliminate a bunch of rules for 
> alphabets that don't actually have "case" as we would understand it.
> 
> 
> By limiting rules in these ways, a simpler and more manageable set of 
> rules is possible. Vs, say, actual Unicode rules, which tend to have 
> stuff going on all over the place.
> 
> 
> Ligatures pose an issue though, but presumably option is one of:
>    Case fold between ligatures, when both variants exist;
>    Treat the ligature as its own character;
>    Decompose and compare.
> 
> 
> Though, FWIW, in my normalization code, I mostly ignored ligatures, as 
> while they could be decomposed in many cases, they could only be 
> recomposed for locales that actually use said ligature (like, in 
> English, if AE and IJ started spontaneously merging into new characters, 
> this would be weird and out of place; and having a filesystem layer that 
> merely decomposed any ligatures it encountered would not be ideal).
> 
> 
>>> Ideally, this would be better handled in a file-browser or
>>> similar, and not in the VFS or FS driver itself.
>>
>> Janis
>>
> 

No matter how you choose to do it, you will get it wrong sometimes.
Case-insensitive comparison has language-specific details on top of the
per-character data in the Unicode tables.  Should the lower-case version
of "SS" be "ss" or "ß"?  That depends on the language and the position
of the letters.  Should the capital of "ß" be "SS" or "ẞ"?  Should the
capital of "i" be "I" or "İ"?  Some languages have a letter "dz" - some
of those capitalise it as "DZ", others as "Dz".
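
To make the "i" case concrete, here is a minimal C sketch.  The function
name upper_i_for_lang and the two-letter language tags are hypothetical,
chosen only for illustration.  The point is that the caller has to supply
the language, which is exactly the information a filesystem does not have.

```c
#include <string.h>

/* Hypothetical helper: returns the upper-case form of "i" as a UTF-8
   string, selected by a language tag such as "tr" or "en".
   It only illustrates that the answer depends on the language. */
const char *upper_i_for_lang(const char *lang)
{
    /* Turkish and Azerbaijani pair "i" with dotted capital I (U+0130). */
    if (strcmp(lang, "tr") == 0 || strcmp(lang, "az") == 0)
        return "\xC4\xB0";          /* U+0130 encoded as UTF-8 */

    /* Assumption for this sketch: every other tag gets plain "I". */
    return "I";
}
```

The same input byte gives two different, equally correct answers, so a
single language-neutral folding table cannot satisfy both groups of users.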

About the only case-normalisation you can reasonably do without risk of
getting things wrong (except for the Turkish i/ı) is for the plain 26
letters in ASCII.  For everything else you would be of little help to
anyone, and would introduce mistakes for some languages.  Case
normalisation, like ordering, is language-dependent and does not belong
in a filesystem or other low-level parts of a system.
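
For what it is worth, an ASCII-only comparison of that kind might look
roughly like the sketch below.  The names ascii_fold and ascii_casecmp
are made up for this example, and the code assumes an ASCII-compatible
execution character set.  It deliberately avoids tolower(), whose result
depends on the current locale.

```c
/* Fold only 'A'..'Z' to 'a'..'z'; every other byte value is returned
   unchanged.  Assumes the letters are contiguous, as they are in ASCII. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Compare two strings, ignoring case for the 26 ASCII letters only.
   Returns <0, 0 or >0 in the manner of strcmp(). */
int ascii_casecmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        unsigned char ca = ascii_fold((unsigned char)*a);
        unsigned char cb = ascii_fold((unsigned char)*b);

        if (ca != cb)
            return (int)ca - (int)cb;
        if (ca == '\0')             /* both strings ended together */
            return 0;
    }
}
```

Bytes outside 'A'..'Z' are compared exactly, so UTF-8 sequences for "Ä"
and "ä" stay distinct.  It still treats "I" and "i" as equal, which is
the Turkish exception mentioned above.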