Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: pozz <pozzugno@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Fri, 21 Feb 2025 15:53:02 +0100
Organization: A noiseless patient Spider
Lines: 107
Message-ID: <vpa40d$3a0k4$6@dont-email.me>
References: <vp9oml$3a0k5$1@dont-email.me> <vpa29o$3e5jo$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 Feb 2025 15:53:02 +0100 (CET)
Injection-Info: dont-email.me; posting-host="136b3ca6734aaf1ce4acc4b9b573137d"; logging-data="3474052"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/VRwD2GrBI2hWEonYDzFSHKnxV+dwo2Q0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:jfUd7H8mCFJrYJ0Kdr4RK4LUb8Y=
In-Reply-To: <vpa29o$3e5jo$1@dont-email.me>
Content-Language: it
Bytes: 5850

On 21/02/2025 15:23, David Brown wrote:
> On 21/02/2025 12:40, pozz wrote:
>> I want to write a simple function that converts UCS2 string into
>> ISO8859-1:
>>
>> void ucs2_to_iso8859p1(char *ucs2, size_t size);
>>
>> ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
>> passing size because ucs2 isn't null terminated.
>>
>> I know I can use iconv() feature, but I'm on an embedded platform
>> without an OS and without iconv() function.
>>
>> It is trivial to convert "0000"-"007F" chars: it's a simple cast from
>> unsigned int to char.
>>
>> It isn't so simple to convert higher codes. For example, the small e
>> with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
>> trivial again. But I saw the code "2019" (apostrophe) that can be
>> rendered as 0x27 in ISO8859-1.
>>
>> Is there a simplified mapping table that can be written with if/switch?
>>
>> if (code < 0x80) {
>>   *dst++ = (char)code;
>> } else {
>>   switch (code) {
>>     case 0x2019: *dst++ = 0x27; break; // Apostrophe
>>     case 0x...: *dst++ = ...; break;
>>     default: *dst++ = ' ';
>>   }
>> }
>>
>> I'm not searching a very detailed and correct mapping, but just a
>> "sufficient" implementation.
>>
>>
>
> <https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
>
> As has been mentioned by others, 0 - 0xff should be a direct translation
> (with the possible exception of Latin-9 differences).
>
> <https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
>
>
> When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80
> - 0xff), you will quickly see that virtually none of them make any sense
> to support in the way you are thinking. Just because a couple of the
> characters in the Thaana block look a bit like quotation marks, does not
> mean it makes any sense to try to transliterate them. Realistically,
> you can at most make use of a few punctuation symbols (like 0x2019
> above), and maybe approximate forms for some extended Latin alphabet
> characters that you will never see in practice. Oh, and you might be
> able to support those spam emails that use Greek and other letters that
> look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's
> assuming you have output support for the full Latin-1 or Latin-9 range.
>
>
> Unicode is rarely much use unless you want and can provide good support
> for non-Latin alphabets. Otherwise your translations are going to be so
> limited and simple that they are barely worth the effort and won't cover
> anything useful.
>
>
> So here I would say that whoever provides the text, provides it in
> Latin-9 encoding. There's no point in allowing external translators to
> use whatever characters they feel is best in their language, and then
> your code makes some kind of odd approximation giving results that look
> different.
> If someone really wants to use the letter "ā" that is found
> in the Latin Extended A block, how do /you/ know whether the best
> Latin-9 match is "a", "ã", "ä", or something different like "aa" or an
> alternative spelling of the word? Maybe the rules are different for
> Latvian and Anglicised Mandarin.
>
>
> When we have worked with multiple languages on small embedded systems
> (too small for big fonts and UTF-8), we have used one of three techniques:
>
> 1. Insist that the external translators provide strings in Latin-9 only
> (or even just ASCII when the system was more restricted).
>
> 2. Use primarily ASCII, with a few user-defined characters per language
> (that's useful for old-style character displays with space for perhaps 8
> user-defined characters).
>
> 3. Use a PC program to figure out the characters actually used in the
> strings, and put them into a single table indexing a generated list of
> bitmap glyphs, also generated by the program (from freely available
> fonts). The source is, naturally, UTF-8 - the strings stored in the
> embedded system are not in any standard encoding representing
> characters, but now hold glyph table indices.
>
>
> Your idea here sounds to me like a lot of work for virtually no benefit.

Yes, you're right. My question comes from an SMS received by a 4G
network modem. The reply to the AT+CMGR command for one specific SMS
reported the text in UCS2. The SMS was sent by the mobile operator and
contained the balance of the prepaid SIM card.

The text included an apostrophe coded as U+2019 instead of U+0027. I
suspect the developer who wrote the text in the mobile operator's
systems was using UTF-8 (or UTF-16) and inserted exactly U+2019 (maybe
by mistake).

Anyway, I think I can live without that.
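
For reference, a minimal sketch of what the "sufficient" mapping
discussed in this thread might look like. It assumes the modem delivers
the text as hex digits, four per UCS2 code unit (as in
"00480065006C006C006F"). Codes 0 - 0xff are copied unchanged, a handful
of punctuation marks are approximated, and everything else becomes '?'.
The function name, the separate output buffer (the prototype in the
original question has none), the helper names, the '?' fallback and the
choice of punctuation are all assumptions made for illustration, not a
complete or authoritative table.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if invalid.
   Assumes an ASCII-compatible execution character set. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical mapping: 0 - 0xff is a direct translation to
   ISO8859-1; a few punctuation marks are approximated with ASCII;
   anything else becomes '?'. */
static unsigned char map_code(unsigned code)
{
    if (code <= 0xFF) return (unsigned char)code;
    switch (code) {
    case 0x2018: case 0x2019: return 0x27; /* single quotes -> apostrophe */
    case 0x201C: case 0x201D: return 0x22; /* double quotes -> quotation mark */
    case 0x2013: case 0x2014: return 0x2D; /* en/em dash -> hyphen-minus */
    default:                  return '?';
    }
}

/* Converts `size` hex digits at `ucs2` (not null terminated) into a
   null-terminated ISO8859-1 string in `dst`, which holds `dst_size`
   bytes. Returns the number of characters written, excluding the
   terminator. Trailing digits that do not form a full group of four
   are ignored; a group with an invalid digit becomes '?'. */
size_t ucs2_hex_to_latin1(const char *ucs2, size_t size,
                          char *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0) return 0;

    for (size_t i = 0; i + 4 <= size && n + 1 < dst_size; i += 4) {
        unsigned code = 0;
        int ok = 1;

        for (size_t j = 0; j < 4; j++) {
            int d = hex_digit(ucs2[i + j]);
            if (d < 0) { ok = 0; break; }
            code = (code << 4) | (unsigned)d;
        }
        dst[n++] = ok ? (char)map_code(code) : '?';
    }
    dst[n] = '\0';
    return n;
}
```

Called with the 20 hex digits "00480065006C006C006F" and a large enough
buffer, this writes "Hello" and returns 5. Note that codes 0x80 - 0x9f
are passed through as the ISO8859-1 C1 control characters, and a code
of 0x0000 would put a terminator in the middle of the output; a real
implementation would have to decide what to do with both.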