Path: ...!eternal-september.org!feeder3.eternal-september.org!news.quux.org!news.nk.ca!rocksolid2!i2pn2.org!.POSTED!not-for-mail
From: Richard Damon <richard@damon-family.org>
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Fri, 21 Feb 2025 20:05:22 -0500
Organization: i2pn2 (i2pn.org)
Message-ID: <fff7432902b9f6c8daa6b1cc8369632e064187d7@i2pn2.org>
References: <vp9oml$3a0k5$1@dont-email.me>
 <7bf2c66d1f1ef9e92c00f44320bb998f3cea2183@i2pn2.org>
 <vp9sb4$3a0k4$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 22 Feb 2025 01:05:23 -0000 (UTC)
Injection-Info: i2pn2.org;
	logging-data="1177159"; mail-complaints-to="usenet@i2pn2.org";
	posting-account="diqKR1lalukngNWEqoq9/uFtbkm5U+w3w6FQ0yesrXg";
User-Agent: Mozilla Thunderbird
Content-Language: en-US
X-Spam-Checker-Version: SpamAssassin 4.0.0
In-Reply-To: <vp9sb4$3a0k4$5@dont-email.me>
Bytes: 3611
Lines: 69

On 2/21/25 7:42 AM, pozz wrote:
> Il 21/02/2025 13:05, Richard Damon ha scritto:
>> On 2/21/25 6:40 AM, pozz wrote:
>>> I want to write a simple function that converts UCS2 string into 
>>> ISO8859-1:
>>>
>>> void ucs2_to_iso8859p1(char *ucs2, size_t size);
>>>
>>> ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm 
>>> passing size because ucs2 isn't null terminated.
>>
>> Typically UCS2 strings ARE null terminated, it's just that the null is two 
>> bytes long.
> 
> Sure, but this isn't an issue here.
> 
> 
>>> I know I can use iconv() feature, but I'm on an embedded platform 
>>> without an OS and without iconv() function.
>>>
>>> It is trivial to convert "0000"-"007F" chars: it's a simple cast from 
>>> unsigned int to char.
>>
>> Note, I think you will find that it is 0000-00FF that match (as I 
>> remember, ISO8859-1 was the base for starting Unicode).
>>
>>>
>>> It isn't so simple to convert higher codes. For example, the small e 
>>> with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's 
>>> trivial again. But I saw the code "2019" (apostrophe) that can be 
>>> rendered as 0x27 in ISO8859-1.
>>
>> To be correct, u2019 isn't 0x27, it's just a character that looks a lot 
>> like it.
> 
> Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

And as such, it is a subjective decision that you need to make.

> 
> 
>>> Is there a simplified mapping table that can be written with if/switch?
>>>
>>> if (code < 0x80) {
>>>    *dst++ = (char)code;
>>> } else {
>>>    switch (code) {
>>>      case 0x2019: *dst++ = 0x27; break;  // Apostrophe
>>>      case 0x...: *dst++ = ...; break;
>>>      default: *dst++ = ' ';
>>>    }
>>> }
>>>
>>> I'm not searching a very detailed and correct mapping, but just a 
>>> "sufficient" implementation.
>>
>> Then you have to decide which are sufficient mappings. No character 
>> above FF *IS* the character below, but some have a close 
>> approximation, so you will need to decide what to map.
> 
> Yes, I have to decide, but it is a very big problem (there are thousands 
> of Unicode symbols that can be approximated to another ISO8859-1 code). 
> I'm wondering if such an approximation is just implemented somewhere.
> 
> For example, what does iconv() do in this case?

Just look at its code; there are open source versions of it.

The two real options are to reject anything above 0xFF, or to have a big 
table/switch that handles some chosen list of characters that are "close 
enough".
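
As a rough illustration of the second option, here is a minimal sketch. It assumes the input is the hex-digit text form from the original question ("00480065006C006C006F" for "Hello"), four digits per character, and an ASCII-compatible execution character set. The names ucs2_hex_to_iso8859p1, map_code, hex_value and the fallbacks table are made up for this example, and the table entries are only a sample of what someone might consider "close enough"; the real list is the subjective part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fallback table: which approximations are "close enough"
   is a subjective choice, and these entries are only examples. */
struct fallback { unsigned code; unsigned char latin1; };

static const struct fallback fallbacks[] = {
    { 0x2018, 0x27 }, /* left single quotation mark  -> apostrophe   */
    { 0x2019, 0x27 }, /* right single quotation mark -> apostrophe   */
    { 0x201C, 0x22 }, /* left double quotation mark  -> double quote */
    { 0x201D, 0x22 }, /* right double quotation mark -> double quote */
    { 0x2013, 0x2D }, /* en dash -> hyphen-minus */
    { 0x2014, 0x2D }  /* em dash -> hyphen-minus */
};

/* Value of one hex digit, or -1 if c is not a hex digit (assumes ASCII). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Map one UCS-2 code to an ISO8859-1 byte: 0000-00FF pass through
   unchanged, listed codes use the table, everything else becomes '?'. */
static unsigned char map_code(unsigned code)
{
    size_t i;

    if (code <= 0xFF)
        return (unsigned char)code;
    for (i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (fallbacks[i].code == code)
            return fallbacks[i].latin1;
    return '?';
}

/* Convert size hex digits (4 per character) from ucs2 into dst.
   dst needs room for size / 4 bytes and is not null terminated.
   Returns the number of bytes written; conversion stops early at the
   first group that contains a non-hex character. */
size_t ucs2_hex_to_iso8859p1(const char *ucs2, size_t size, unsigned char *dst)
{
    size_t in, out = 0;

    assert(ucs2 != NULL && dst != NULL);
    for (in = 0; in + 4 <= size; in += 4) {
        unsigned code = 0;
        int k;

        for (k = 0; k < 4; k++) {
            int v = hex_value(ucs2[in + k]);
            if (v < 0)
                return out;
            code = code * 16 + (unsigned)v;
        }
        dst[out++] = map_code(code);
    }
    return out;
}
```

Calling it with "00480065006C006C006F" and size 20 writes the five bytes of "Hello". For the first option (reject anything above 0xFF), the table lookup in map_code would be replaced by an error return instead of a substitution.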