Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: transpiling to low level C
Date: Sat, 28 Dec 2024 13:59:24 -0600
Organization: A noiseless patient Spider
Lines: 156
Message-ID: <vkplb6$g94b$1@dont-email.me>
References: <vjlh19$8j4k$1@dont-email.me>
 <vjn9g5$n0vl$1@raubtier-asyl.eternal-september.org>
 <vjnhsq$oh1f$1@dont-email.me> <vjnq5s$pubt$1@dont-email.me>
 <vjpn29$17jub$1@dont-email.me> <86ikrdg6yq.fsf@linuxsc.com>
 <vk78it$77aa$1@dont-email.me> <vk8a0e$l8sq$1@paganini.bofh.team>
 <vk9q1p$oucu$1@dont-email.me> <vkb81n$14frj$1@dont-email.me>
 <20241223134008.000058cf@yahoo.com> <86frmedrof.fsf@linuxsc.com>
 <vkgk0u$2bh1n$1@dont-email.me> <865xn3d8lb.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 28 Dec 2024 20:59:35 +0100 (CET)
Injection-Info: dont-email.me; posting-host="d66f7f33a087be869dd0efce3472a412";
	logging-data="533643"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX198eu2kIFLUuVWMhRcAmNCAk1ghZA3Otow="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8U04SrUIC6Ju6RIeu2c5saPDa2E=
Content-Language: en-US
In-Reply-To: <865xn3d8lb.fsf@linuxsc.com>
Bytes: 7815

On 12/28/2024 11:24 AM, Tim Rentsch wrote:
> BGB <cr88192@gmail.com> writes:
> 
>> On 12/23/2024 3:18 PM, Tim Rentsch wrote:
>>
>>> Michael S <already5chosen@yahoo.com> writes:
>>>
>>>> On Mon, 23 Dec 2024 09:46:46 +0100
>>>> David Brown <david.brown@hesbynett.no> wrote:
>>>>
>>>>> And Tim did not rule out using the standard library,
>>>>
>>>> Are you sure?
>>>
>>> I explicitly called out setjmp and longjmp as being excluded.
>>> Based on that, it's reasonable to infer the rest of the
>>> standard library is allowed.
>>>
>>> Furthermore I don't think it matters.  Except for a very small
>>> set of functions -- eg, fopen, fgetc, fputc, malloc, free --
>>> everything else in the standard library either isn't important
>>> for Turing Completeness or can be synthesized from the base
>>> set.  The functionality of fprintf(), for example, can be
>>> implemented on top of fputc and non-library language features.
>>
>> If I were to choose a set of primitive functions, probably:
>>    malloc/free and/or realloc
>>      could define, say:
>>        malloc(sz) => realloc(NULL, sz)
>>        free(ptr) => realloc(ptr, 0)
>>      Maybe _msize and _mtag/..., but this is non-standard.
>>        With _msize, can implement realloc on top of malloc/free.
>>
>> For basic IO:
>>    fopen, fclose, fseek, fread, fwrite
>>
>> printf could be implemented on top of vsnprintf and fputs
>>    fputs can be implemented on top of fwrite (via strlen).
>>    With a temporary buffer being used for the printed string.
> 
> Most of these aren't needed.  I think everything can be
> done using only fopen, fclose, fgetc, fputc, and feof.


If you only have fgetc and fputc, IO speeds are going to be unacceptably 
slow for non-trivial file sizes.

If you try to fake fseek by closing, re-opening, and skipping forward 
with an fgetc loop, that is also going to be very slow.
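To illustrate the cost (a hypothetical sketch; "slow_seek" is not a real 
API, just a name for the emulation): faking an absolute seek with only 
fclose/fopen/fgetc means re-reading the file from the start on every 
seek, so each seek costs O(pos) instead of O(1).

```c
#include <stdio.h>

/* Hypothetical sketch: emulate fseek(fp, pos, SEEK_SET) using only
 * fclose/fopen/fgetc. Every call re-opens the file and skips forward
 * one byte at a time, so seeking is O(pos) per call. */
FILE *slow_seek(FILE *fp, const char *path, long pos)
{
    fclose(fp);
    fp = fopen(path, "rb");
    if (!fp)
        return NULL;
    for (long i = 0; i < pos; i++) {
        if (fgetc(fp) == EOF)   /* ran off the end of the file */
            break;
    }
    return fp;
}
```

For random access into a large file this degenerates badly: N seeks at 
average offset M cost roughly N*M fgetc calls.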


Then again, fgetc/fputc as the primary operations could make sense for 
text files if the implementation is doing some form of format conversion 
(such as converting between LF only and CR+LF), though admittedly IMO 
one is better off treating text files as equivalent to binary files (and 
letting the application deal with any conversions here).


OTOH:
   fgetc and fputc can be implemented via fread and fwrite;
   feof (for normal files) can be implemented via fseek (*1);
     Similarly, ftell could be treated as a special case of fseek.

*1: Say, if the internal fseek call were made to return the current file 
position (similar to lseek).
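For example (hypothetical names; just a rough sketch of the idea, using 
the standard fseek/ftell pair rather than a position-returning fseek):

```c
#include <stdio.h>

/* Sketch: fgetc/fputc expressed in terms of fread/fwrite. */
static int my_fgetc(FILE *fp)
{
    unsigned char c;
    return (fread(&c, 1, 1, fp) == 1) ? c : EOF;
}

static int my_fputc(int ch, FILE *fp)
{
    unsigned char c = (unsigned char)ch;
    return (fwrite(&c, 1, 1, fp) == 1) ? ch : EOF;
}

/* feof for normal (seekable) files via fseek/ftell: compare the
 * current position against the end-of-file position, then restore
 * the position.  A position-returning fseek would fold the ftell
 * calls away. */
static int my_feof(FILE *fp)
{
    long pos = ftell(fp);
    long end;
    fseek(fp, 0, SEEK_END);
    end = ftell(fp);
    fseek(fp, pos, SEEK_SET);
    return pos >= end;
}
```

Note this my_feof differs from the standard feof in that it reports 
"at end" before a failed read, rather than after one, which is why it 
only works for normal seekable files.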

....





Well, on another note: I was also recently left facing off with the 
wonk of UTF-8 normalization for the VFS layer in my project (for 
paths/filenames). Options:
   Do Nothing, assume valid UTF-8 and that it is sensibly normalized;
     May risk malformed encodings at deeper levels of the VFS though.
   Encoding only normalization:
     Normalize to an M-UTF-8 variant and call it done.
   Do a subset of normalizing combining characters.
     The full set of Unicode rules would likely be too bulky;
     Filesystem should have no concept of locale;
     The rules should ideally be "semi frozen" once defined.

At present, this is applied at the level of VFS syscalls (like "open()" 
or "opendir()").


Current thinking is that it will normalize to a variant of M-UTF-8 NFC 
(characters are stored in composed forms), but:
Will only apply the rules covering the Latin-1 and Latin Extended A 
spaces, and a subset of Latin Extended B.

Though, a case could be made for limiting the scope solely to the 
Latin-1/1252 range (and passing everything beyond this along as-is).

Less certain: I had also added cases for the Roman-numeral characters, 
mostly for decomposing them into ASCII; various ligatures would also be 
decomposed to ASCII (excluding those which appear as their own glyph, 
so AE and OE are left as-is, but IJ/DZ/... would be decomposed). A case 
could also be made for leaving these alone (passing them along 
unmodified). It depends mostly on the open question of whether or not 
these convey relevant semantic information (or are merely 
historical/aesthetic).

At present, the rules are stored as a table, with roughly 8 bytes needed 
per combiner rule (increases to 12 once initialized, mostly because it 
allocates a pair of 16-bit hash chains).
   Namely: SrcCodepoint1, SrcCodepoint2, DstCodepoint, Flags
     Flags specify when and how the rule is applied.
     SrcCodepoint2 is currently 0x0000 for simple conversion rules.
     DstCodepoint is used for lookup for decompose.
     ...
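A minimal sketch of such a rule table (the field widths and names here 
are my assumption; the post only names the four fields, and the 16-bit 
hash chains it mentions are omitted in favor of a linear scan):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of an ~8-byte combiner rule as described above. */
typedef struct {
    uint16_t src1;   /* SrcCodepoint1: base character             */
    uint16_t src2;   /* SrcCodepoint2: combining mark, 0 if none  */
    uint16_t dst;    /* DstCodepoint: composed (NFC-style) result */
    uint16_t flags;  /* when/how the rule is applied              */
} NormRule;

/* Toy table: a couple of Latin-1 compositions. */
static const NormRule rules[] = {
    { 0x0065, 0x0301, 0x00E9, 0 },  /* e + combining acute -> e-acute */
    { 0x0061, 0x0300, 0x00E0, 0 },  /* a + combining grave -> a-grave */
};

/* Compose lookup: returns the composed codepoint, or 0 if no rule
 * matches (leave the pair as-is).  Decompose would do the reverse
 * lookup keyed on dst. */
static uint32_t compose(uint32_t a, uint32_t b)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
        if (rules[i].src1 == a && rules[i].src2 == b)
            return rules[i].dst;
    return 0;
}
```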

Limiting the scope also likely makes things more repeatable (where 
inconsistent normalization could result in file-lookup issues in cases 
where the rules differ, if the offending code-points are stepped on). 
The goal is mostly to find an acceptable set of rules that can be 
"mostly frozen". Though, in most cases this is likely N/A, as the 
majority of filenames tend to be plain ASCII.

The responsibility for any more advanced normalization (or 
locale-dependent stuff) would be left up to the application level.


Can't seem to find much information about "best practices" in these areas.

It is not certain that normalizing for combining characters is actually 
a good idea, vs only normalizing for codepoint encoding. The latter is 
mostly to deal with cases where malformed data is submitted to the VFS, 
or possibly 1252 (if the VFS calls and similar are given something that 
is invalid UTF-8, it may be assumed to be 1252). Theoretically, the 
locale code in the C library is expected to normalize between 1252 and 
UTF-8, but ideally the integrity of the VFS should be kept protected 
from this sort of thing.
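A quick sketch of the sort of validity scan that could drive the 
"assume 1252 if not valid UTF-8" fallback (hypothetical helper; this 
only checks lead/continuation byte structure, not overlong forms or 
surrogate ranges):

```c
/* Returns 1 if the NUL-terminated string is structurally valid UTF-8
 * (lead bytes followed by the right number of 10xxxxxx continuation
 * bytes), 0 otherwise.  Invalid input would fall back to being
 * interpreted as CP-1252. */
static int looks_like_utf8(const unsigned char *s)
{
    while (*s) {
        unsigned char c = *s++;
        int n;
        if (c < 0x80)                n = 0;  /* ASCII            */
        else if ((c & 0xE0) == 0xC0) n = 1;  /* 2-byte sequence  */
        else if ((c & 0xF0) == 0xE0) n = 2;  /* 3-byte sequence  */
        else if ((c & 0xF8) == 0xF0) n = 3;  /* 4-byte sequence  */
        else return 0;                       /* invalid lead byte */
        while (n--) {
            if ((*s & 0xC0) != 0x80)         /* bad continuation */
                return 0;
            s++;
        }
    }
    return 1;
}
```

Bare CP-1252 high bytes (say, 0xE9 for e-acute) fail the continuation 
check, which is what makes this heuristic usable for the fallback.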

This also applies to console printing, which is also expected to be 
handed UTF-8, but may also normalize the strings. Though, there is some 
wonk with the console here in my case.


Seemingly (from what I can gather):
   Linux:
     It is per FS driver;
     Some are "do nothing", others normalize.
   MacOS:
     Also depends on filesystem:
       HFS/HFS+, normalizing (as NFD for some reason);
       APFS, does nothing (apparently leads to a lot of hassles).
   Windows:
     FAT32: Depends solely on OS locale;
     NTFS: Locale rules are baked-in when the drive is formatted.
       The relevant tables are held in filesystem metadata.

....