<vpa40d$3a0k4$6@dont-email.me>


Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: pozz <pozzugno@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Fri, 21 Feb 2025 15:53:02 +0100
Organization: A noiseless patient Spider
Lines: 107
Message-ID: <vpa40d$3a0k4$6@dont-email.me>
References: <vp9oml$3a0k5$1@dont-email.me> <vpa29o$3e5jo$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 Feb 2025 15:53:02 +0100 (CET)
Injection-Info: dont-email.me; posting-host="136b3ca6734aaf1ce4acc4b9b573137d";
	logging-data="3474052"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/VRwD2GrBI2hWEonYDzFSHKnxV+dwo2Q0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:jfUd7H8mCFJrYJ0Kdr4RK4LUb8Y=
In-Reply-To: <vpa29o$3e5jo$1@dont-email.me>
Content-Language: it
Bytes: 5850

On 21/02/2025 15:23, David Brown wrote:
> On 21/02/2025 12:40, pozz wrote:
>> I want to write a simple function that converts UCS2 string into 
>> ISO8859-1:
>>
>> void ucs2_to_iso8859p1(char *ucs2, size_t size);
>>
>> ucs2 is a hex-encoded string such as "00480065006C006C006F" for "Hello". 
>> I'm passing size because ucs2 isn't null terminated.
>>
>> I know I could use iconv(), but I'm on an embedded platform 
>> without an OS and without an iconv() function.
>>
>> It is trivial to convert "0000"-"007F" chars: it's a simple cast from 
>> unsigned int to char.
>>
>> It isn't so simple to convert higher codes. For example, the small e 
>> with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's 
>> trivial again. But I saw the code "2019" (apostrophe) that can be 
>> rendered as 0x27 in ISO8859-1.
>>
>> Is there a simplified mapping table that can be written with if/switch?
>>
>> if (code < 0x80) {
>>    *dst++ = (char)code;
>> } else {
>>    switch (code) {
>>      case 0x2019: *dst++ = 0x27; break;  // Apostrophe
>>      case 0x...: *dst++ = ...; break;
>>      default: *dst++ = ' ';
>>    }
>> }
>>
>> I'm not looking for a very detailed and correct mapping, just a 
>> "sufficient" implementation.
>>
>>
> 
> <https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
> 
> As has been mentioned by others, 0 - 0xff should be a direct translation 
> (with the possible exception of Latin-9 differences).
> 
> <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
> 
> 
> When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80 
> - 0xff), you will quickly see that virtually none of them make any sense 
> to support in the way you are thinking.  Just because a couple of the 
> characters in the Thaana block look a bit like quotation marks, does not 
> mean it makes any sense to try to transliterate them.  Realistically, 
> you can at most make use of a few punctuation symbols (like 0x2019 
> above), and maybe approximate forms for some extended Latin alphabet 
> characters that you will never see in practice.  Oh, and you might be 
> able to support those spam emails that use Greek and other letters that 
> look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters.  And that's 
> assuming you have output support for the full Latin-1 or Latin-9 range.
> 
> 
> Unicode is rarely much use unless you want and can provide good support 
> for non-Latin alphabets.  Otherwise your translations are going to be so 
> limited and simple that they are barely worth the effort and won't cover 
> anything useful.
> 
> 
> So here I would say that whoever provides the text, provides it in 
> Latin-9 encoding.  There's no point in allowing external translators to 
> use whatever characters they feel are best in their language, and then 
> your code makes some kind of odd approximation giving results that look 
> different.  If someone really wants to use the letter "ā" that is found 
> in the Latin Extended A block, how do /you/ know whether the best 
> Latin-9 match is "a", "ã", "ä", or something different like "aa" or an 
> alternative spelling of the word?  Maybe the rules are different for 
> Latvian and Anglicised Mandarin.
> 
> 
> When we have worked with multiple languages on small embedded systems 
> (too small for big fonts and UTF-8), we have used one of three techniques:
> 
> 1. Insist that the external translators provide strings in Latin-9 only 
> (or even just ASCII when the system was more restricted).
> 
> 2. Use primarily ASCII, with a few user-defined characters per language 
> (that's useful for old-style character displays with space for perhaps 8 
> user-defined characters).
> 
> 3. Use a PC program to figure out the characters actually used in the 
> strings, and put them into a single table indexing a generated list of 
> bitmap glyphs, also generated by the program (from freely available 
> fonts).  The source is, naturally, UTF-8 - the strings stored in the 
> embedded system are not in any standard encoding representing 
> characters, but now hold glyph table indices.
> 
> 
> Your idea here sounds to me like a lot of work for virtually no benefit.

Yes, you're right. My question comes from an SMS text received by a 4G 
network modem. The reply to the AT+CMGR command for a specific SMS reported 
the text in UCS2. The SMS was sent by the mobile operator and contained the 
balance of the prepaid SIM card.
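
For illustration, here is a minimal sketch of how such a hex-encoded UCS2 
reply could be turned into 16-bit code units before any conversion. The 
names hex_digit() and ucs2_hex_decode() are hypothetical. It assumes four 
hex digits per code unit with the most significant digit first (as in 
"0048" for 'H') and an ASCII execution character set.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the value of one hex digit, or -1 if c is not a hex digit.
 * Assumes ASCII, where 'A'..'F' and 'a'..'f' are contiguous. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decodes `size` hex characters (not null terminated) into `units`.
 * Returns the number of code units written, or -1 if `size` is not a
 * multiple of 4, a non-hex character is found, or `max_units` is too
 * small to hold the result. */
static long ucs2_hex_decode(const char *hex, size_t size,
                            uint16_t *units, size_t max_units)
{
    if (size % 4 != 0 || size / 4 > max_units) {
        return -1;
    }
    size_t n = size / 4;
    for (size_t i = 0; i < n; i++) {
        unsigned value = 0;
        for (size_t j = 0; j < 4; j++) {
            int d = hex_digit(hex[4 * i + j]);
            if (d < 0) {
                return -1;
            }
            value = (value << 4) | (unsigned)d;
        }
        units[i] = (uint16_t)value;
    }
    return (long)n;
}
```

With the "Hello" example from the original question (20 hex characters), 
this would yield the five code units 0x0048, 0x0065, 0x006C, 0x006C and 
0x006F.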

The text included the apostrophe coded as U+2019 instead of U+0027. I 
suspect the developer who wrote the text in the mobile operator's systems 
was using UTF-8 (or UTF-16) and inserted exactly U+2019 (perhaps by 
mistake).
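
A minimal sketch of the "sufficient" mapping discussed in this thread 
could look like the following. The function name ucs2_to_latin1() and the 
choice of fallback characters are assumptions made for illustration. Only 
a handful of punctuation marks are approximated, and everything else above 
0xFF becomes a space, as in the snippet from the original question.

```c
#include <stdint.h>

/* Maps one UCS-2 code unit to an ISO 8859-1 byte.
 * The fallback cases are a hypothetical, deliberately small selection. */
static unsigned char ucs2_to_latin1(uint16_t code)
{
    if (code <= 0x00FF) {
        /* ISO 8859-1 matches the first 256 Unicode code points. */
        return (unsigned char)code;
    }
    switch (code) {
    case 0x2018: /* LEFT SINGLE QUOTATION MARK */
    case 0x2019: /* RIGHT SINGLE QUOTATION MARK */
        return 0x27; /* ASCII apostrophe */
    case 0x201C: /* LEFT DOUBLE QUOTATION MARK */
    case 0x201D: /* RIGHT DOUBLE QUOTATION MARK */
        return 0x22; /* ASCII double quote */
    case 0x2013: /* EN DASH */
    case 0x2014: /* EM DASH */
        return 0x2D; /* ASCII hyphen-minus */
    default:
        return ' ';  /* no reasonable approximation */
    }
}
```

Anything beyond punctuation (for example accented letters outside Latin-1) 
is left out on purpose, since the right approximation depends on the 
language, as noted in the quoted reply.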

Anyway, I think I can live without that.