Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Sat, 22 Feb 2025 12:29:59 +0100
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <vpccfn$3to51$1@dont-email.me>
References: <vp9oml$3a0k5$1@dont-email.me>
 <7bf2c66d1f1ef9e92c00f44320bb998f3cea2183@i2pn2.org>
 <vp9sb4$3a0k4$5@dont-email.me> <vp9tnr$3dca2$1@dont-email.me>
 <87frk7m6h5.fsf@nosuchdomain.example.com> <vpav4f$3jdl6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 22 Feb 2025 12:30:00 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c7fa0a28977b5f488f5523ebf65c845d";
	logging-data="4120737"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+EQi5zXBwRT7KzxYxCkwpvcAhMtKaqnDs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/PMJNzCegkwlnATH+fH7m0y2j0k=
Content-Language: en-GB
In-Reply-To: <vpav4f$3jdl6$1@dont-email.me>
Bytes: 4770

On 21/02/2025 23:35, Janis Papanagnou wrote:
> On 21.02.2025 20:40, Keith Thompson wrote:
>> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
>> [...]
>>> BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
>>> of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
>>> contains a few other characters like the € (Euro Sign). If that is
>>> possible for your context you have to map a handful of characters.
>>
>> Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
>> not, which would make the translation more difficult.
> 
> Yes, that had already been pointed out upthread.
> 
> The (open) question is whether it makes sense to convert to "Latin 1"
> only because it has a one-to-one mapping concerning the first UCS-2
> characters, or if the underlying application of the OP wants support
> of contemporary information by e.g. providing the € (Euro) sign with
> "Latin 9".
> 
>>
>> <https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
>> the 8 characters that differ between Latin-1 and Latin-9.
>>
>> If at all possible, it would be better to convert to UTF-8.  The
>> conversion is exact and reversible, and UTF-8 has largely superseded the
>> various Latin-* character encodings.
> 
> Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
> while the ISO 8859-X family represents single octet representations.
> 
>> I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
> 
> I think this, or why he can't use "Latin 9", are essential questions.
> 
> It seems to have got clear after a subsequent post of the OP; some
> message/data source seems to provide characters from the upper planes
> of Unicode and the OP has to (or wants to) somehow map them to some
> constant octet character set. - Yet there's no information provided
> what Unicode characters - characters that don't have a representation
> in Latin 1 or Latin 9 - the OP will encounter or not from that source.
> 
> As it sounds it all seems to make little sense.
> 
> Janis
> 

As the OP explained in a reply to one of my posts, he is getting data in 
UCS-2 format from SMS's from a modem.  Somewhere along the line, 
either in the firmware of the modem or in the code sending the SMS's, 
characters outside Latin-1 (though still within the BMP, as UCS-2 
requires) are being used needlessly.  So his first idea of manually 
handling a few cases (like code 0x2019) looks like the right approach.
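
To make the manual approach concrete, here is a minimal sketch of such a 
conversion.  The function names and the particular substitutions (an 
ASCII apostrophe for 0x2019, '?' for anything unmapped) are my own 
illustrative choices, not anything the OP posted, and the sketch assumes 
the code units are already in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one UCS-2 code unit to ISO 8859-1.
 * Values 0x00..0xFF are identical in both sets.  A few typographic
 * characters get hand-picked ASCII substitutes (illustrative choices,
 * not a standard table); everything else becomes '?'. */
unsigned char ucs2_to_latin1(uint16_t c)
{
    if (c <= 0xFF)
        return (unsigned char)c;

    switch (c) {
    case 0x2018:                /* LEFT SINGLE QUOTATION MARK */
    case 0x2019:                /* RIGHT SINGLE QUOTATION MARK */
        return '\'';
    case 0x201C:                /* LEFT DOUBLE QUOTATION MARK */
    case 0x201D:                /* RIGHT DOUBLE QUOTATION MARK */
        return '"';
    case 0x2013:                /* EN DASH */
    case 0x2014:                /* EM DASH */
        return '-';
    default:
        return '?';
    }
}

/* Convert n code units into dst and terminate the result.
 * dst must have room for n + 1 bytes: one per input unit plus '\0'. */
void ucs2_str_to_latin1(const uint16_t *src, size_t n, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = ucs2_to_latin1(src[i]);
    dst[n] = '\0';
}
```

The output never needs more than one byte per input code unit, so a 
fixed buffer of n + 1 bytes is always enough.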

Whether Latin-1 or Latin-9 is better will depend on his application. 
The additional characters in Latin-9, with the exception of the Euro 
symbol, are pretty obscure - it's unlikely that you'd need them without 
also needing a good many other characters (i.e., support for much more 
of Unicode).
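
If Latin-9 does turn out to be the better target, the extra work is 
small.  The sketch below (function name hypothetical) covers the eight 
positions where Latin-9 differs from Latin-1.  The eight Latin-1 
characters that were displaced have no Latin-9 code, so they fall back 
to '?' here, which is an arbitrary choice on my part.

```c
#include <stdint.h>

/* Map one UCS-2 code unit to ISO 8859-15 (Latin-9); '?' if unmappable. */
unsigned char ucs2_to_latin9(uint16_t c)
{
    switch (c) {
    /* The eight characters that Latin-9 adds. */
    case 0x20AC: return 0xA4;   /* EURO SIGN */
    case 0x0160: return 0xA6;   /* S WITH CARON */
    case 0x0161: return 0xA8;   /* s WITH CARON */
    case 0x017D: return 0xB4;   /* Z WITH CARON */
    case 0x017E: return 0xB8;   /* z WITH CARON */
    case 0x0152: return 0xBC;   /* LIGATURE OE */
    case 0x0153: return 0xBD;   /* LIGATURE oe */
    case 0x0178: return 0xBE;   /* Y WITH DIAERESIS */

    /* The eight Latin-1 characters they displaced. */
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
        return '?';

    default:
        /* All other values up to 0xFF are the same as in Latin-1. */
        return c <= 0xFF ? (unsigned char)c : '?';
    }
}
```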

As for why not use UTF-8, the answer is clearly simplicity.  The OP is 
working with a resource-constrained embedded system.  I don't know what 
he is doing with the characters after converting them from UCS-2, but it 
is massively simpler to use an 8-bit character set if they are going to 
be used for display on a small system.  It also keeps memory management 
simpler, and that is essential on such systems - one UCS-2 character 
maps to exactly one byte with Latin-9 here.  The space needed for UTF-8 
is harder to predict (one to three bytes per UCS-2 character), and the 
OP will want to avoid any kind of malloc() or dynamic allocation where 
possible.
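
For comparison, here is a rough sketch of encoding a single UCS-2 code 
unit as UTF-8.  The encoder itself is not hard, but the output length 
varies from one to three bytes, so a static buffer has to be sized for 
the worst case.  MAX_UNITS is a made-up limit for illustration, the 
names are hypothetical, and surrogate values are not treated specially.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_UNITS 70    /* hypothetical message limit, in UCS-2 code units */

/* Static buffer sizes, including the terminating '\0'.  The 8-bit buffer
 * is exact; the UTF-8 buffer must assume 3 bytes for every code unit. */
#define LATIN_BUF_SIZE (MAX_UNITS + 1)
#define UTF8_BUF_SIZE  (3 * MAX_UNITS + 1)

/* Encode one UCS-2 code unit as UTF-8.  Returns the number of bytes
 * written to out, which is 1, 2 or 3 depending on the value of c. */
size_t ucs2_to_utf8(uint16_t c, unsigned char out[3])
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

With these numbers the 8-bit buffer is 71 bytes and the UTF-8 one is 211, 
most of which will usually go unused.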

If the incoming SMS's are just being logged, or passed on in some other 
way, then UTF-8 may be a convenient alternative.