Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Fri, 21 Feb 2025 14:06:03 +0100
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <vp9tnr$3dca2$1@dont-email.me>
References: <vp9oml$3a0k5$1@dont-email.me>
 <7bf2c66d1f1ef9e92c00f44320bb998f3cea2183@i2pn2.org>
 <vp9sb4$3a0k4$5@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 Feb 2025 14:06:03 +0100 (CET)
Injection-Info: dont-email.me; posting-host="61413e013f1df834c4b6d261a6846824";
	logging-data="3584322"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18xRuLxz0uJFAwbaKyMD5Tr"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
Cancel-Lock: sha1:5vl23/KUEBrg4LOALRzLWSxA8Cg=
In-Reply-To: <vp9sb4$3a0k4$5@dont-email.me>
X-Enigmail-Draft-Status: N1110
Bytes: 3936

On 21.02.2025 13:42, pozz wrote:
> On 21/02/2025 13:05, Richard Damon wrote:
>> On 2/21/25 6:40 AM, pozz wrote:
>>> I want to write a simple function that converts UCS2 string into
>>> ISO8859-1:
>>>
>>> void ucs2_to_iso8859p1(char *ucs2, size_t size);
>>>
>>> ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
>>> passing size because ucs2 isn't null terminated.
>>
[...]
>>>
>>> It is trivial to convert "0000"-"007F" chars: it's a simple cast from
>>> unsigned int to char.
>>
>> Note, I think you will find that it is 0000-00FF that match (as
>> I remember, ISO8859-1 was the base for starting Unicode).

I second that.

>>>
>>> It isn't so simple to convert higher codes. For example, the small e
>>> with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
>>> trivial again. But I saw the code "2019" (apostrophe) that can be
>>> rendered as 0x27 in ISO8859-1.
>>
>> To be correct, u2019 isn't 0x27, it's just a character that looks a lot
>> like it.
> 
> Yes, but as a first approximation, 0x27 is much better than '?' for u2019.

Note that there are _standard names_ assigned to the characters.
These names are normative for what the characters represent. I
strongly suggest not twisting these standards by substituting
different characters; you will do no one a favor and will only
cause confusion and harm.

> 
>>> Is there a simplified mapping table that can be written with if/switch?
>>>
>>> if (code < 0x80) {
>>>    *dst++ = (char)code;
>>> } else {
>>>    switch (code) {
>>>      case 0x2019: *dst++ = 0x27; break;  // Apostrophe
>>>      case 0x...: *dst++ = ...; break;
>>>      default: *dst++ = ' ';
>>>    }
>>> }
>>>
>>> I'm not searching a very detailed and correct mapping, but just a
>>> "sufficient" implementation.
>>
>> Then you have to decide which are sufficient mappings. No character
>> above FF *IS* the character below, but some have a close
>> approximation, so you will need to decide what to map.
> 
> Yes, I have to decide, but it is a very big problem (there are thousands
> of Unicode symbols that can be approximated to another ISO8859-1 code).
> I'm wondering if such an approximation is just implemented somewhere.

I've just compared the character lists of UCS-2 and ISO 8859-1,
based on their normative names, and, as already mentioned above,
they match one-to-one in the ranges 0000-00FF and 00-FF respectively.
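
To illustrate the one-to-one range, here is a minimal sketch in C. It
assumes the input is a run of hex digits, four per character, as in the
"00480065006C006C006F" example. The names hex_digit and
ucs2hex_to_latin1, the separate output buffer, and the choice of '?'
for anything above 00FF are my own assumptions, not the signature from
the original question.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one.
   Assumes an ASCII-compatible execution character set. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Convert a UCS-2 string given as hex digits (four per character) to
 * ISO 8859-1.  'size' is the number of hex digits in 'ucs2'; 'dst'
 * must have room for size / 4 bytes.  Returns the number of bytes
 * written.  Code points above 00FF and malformed groups become '?'.
 * A trailing incomplete group (fewer than four digits) is ignored.
 */
size_t ucs2hex_to_latin1(const char *ucs2, size_t size, unsigned char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i + 4 <= size; i += 4) {
        unsigned code = 0;
        int ok = 1;

        for (size_t j = 0; j < 4; j++) {
            int d = hex_digit(ucs2[i + j]);
            if (d < 0) {
                ok = 0;
                break;
            }
            code = (code << 4) | (unsigned)d;
        }
        /* 0000-00FF map to 00-FF unchanged; nothing else is representable. */
        dst[n++] = (ok && code <= 0xFF) ? (unsigned char)code : '?';
    }
    return n;
}
```

Called with the "Hello" string above and size 20, this writes the five
bytes 'H' 'e' 'l' 'l' 'o'. The output is not null-terminated.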

BTW, you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is largely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible in your context, you only have to map a handful of characters.
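
As a rough sketch of that handful in C: ISO 8859-15 differs from
ISO 8859-1 in eight positions, so a per-character mapping needs only a
small switch. The function name ucs2_to_latin9 and the convention of
returning -1 for "not representable" are assumptions of mine.

```c
/*
 * Map one UCS-2 code point to its ISO 8859-15 (Latin 9) byte value.
 * Returns -1 if the character has no Latin 9 encoding.
 */
int ucs2_to_latin9(unsigned code)
{
    switch (code) {
    /* The eight characters Latin 9 introduced. */
    case 0x20AC: return 0xA4;  /* EURO SIGN */
    case 0x0160: return 0xA6;  /* LATIN CAPITAL LETTER S WITH CARON */
    case 0x0161: return 0xA8;  /* LATIN SMALL LETTER S WITH CARON */
    case 0x017D: return 0xB4;  /* LATIN CAPITAL LETTER Z WITH CARON */
    case 0x017E: return 0xB8;  /* LATIN SMALL LETTER Z WITH CARON */
    case 0x0152: return 0xBC;  /* LATIN CAPITAL LIGATURE OE */
    case 0x0153: return 0xBD;  /* LATIN SMALL LIGATURE OE */
    case 0x0178: return 0xBE;  /* LATIN CAPITAL LETTER Y WITH DIAERESIS */

    /* The Latin 1 characters they displaced are no longer available. */
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
        return -1;

    default:
        /* Everything else in 0000-00FF is identical to Latin 1. */
        return code <= 0xFF ? (int)code : -1;
    }
}
```

The caller decides what to emit for -1 (for example '?'), which keeps
the policy for unmappable characters separate from the table itself.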

Janis

> For example, what does iconv() do in this case?