From: Janis Papanagnou
Newsgroups: comp.lang.c
Subject: Re: Simple string conversion from UCS2 to ISO8859-1
Date: Fri, 21 Feb 2025 14:06:03 +0100

On 21.02.2025 13:42, pozz wrote:
> On 21/02/2025 13:05, Richard Damon wrote:
>> On 2/21/25 6:40 AM, pozz wrote:
>>> I want to write a simple function that converts UCS2 string into
>>> ISO8859-1:
>>>
>>> void ucs2_to_iso8859p1(char *ucs2, size_t size);
>>>
>>> ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
>>> passing size because ucs2 isn't null terminated.
>> [...]
>>>
>>> It is trivial to convert "0000"-"007F" chars: it's a simple cast from
>>> unsigned int to char.
>>
>> Note, I think you will find that it is that 0000-00FF that match. (as
>> I remember ISO8859-1 was the base for starting Unicode).

I second that.

>>>
>>> It isn't so simple to convert higher codes. For example, the small e
>>> with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
>>> trivial again. But I saw the code "2019" (apostrophe) that can be
>>> rendered as 0x27 in ISO8859-1.
>>
>> To be correct, u2019 isn't 0x27, its just character that looks a lot
>> like it.
>
> Yes, but as a first approximation, 0x27 is much better than '?'
> for u2019.

Note that there are _standard names_ assigned to the characters. These
names are normative for what the characters represent. I strongly
suggest not twisting these standards by assigning different characters;
you will do no one a favor and will only inflict confusion and harm.

>
>>> Is there a simplified mapping table that can be written with if/switch?
>>>
>>> if (code < 0x80) {
>>>   *dst++ = (char)code;
>>> } else {
>>>   switch (code) {
>>>     case 0x2019: *dst++ = 0x27; break; // Apostrophe
>>>     case 0x...: *dst++ = ...; break;
>>>     default: *dst++ = ' ';
>>>   }
>>> }
>>>
>>> I'm not searching a very detailed and correct mapping, but just a
>>> "sufficient" implementation.
>>
>> Then you have to decide which are sufficient mappings. No character
>> above FF *IS* the character below, but some have a close
>> approximation, so you will need to decide what to map.
>
> Yes, I have to decide, but it is a very big problem (there are thousands
> of Unicode symbols that can be approximated to another ISO8859-1 code).
> I'm wondering if such an approximation is just implemented somewhere.

I've just made a run across the names of UCS-2 and ISO 8859-1, based on
their normative names and, as mentioned above already, they match
one-to-one in the ranges 0000-00FF and 00-FF respectively.

BTW, you may want to consider using ISO 8859-15 (Latin 9) instead of
ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9 contains
a few other characters like the € (Euro Sign). If that is possible in
your context, you have to map a handful of characters.

Janis

> For example, what iconv() does in this case?
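
PS: To make the one-to-one range and the Latin 9 remark concrete, here
is a minimal sketch in C. It is not a definitive implementation. It
assumes the input is text of four hex digits per character, as in the
"00480065006C006C006F" example, converts in place, and writes '?' for
anything it cannot represent. The names hex_value, ucs2_to_latin1,
ucs2_to_latin9 and ucs2_hex_convert are made up for illustration; the
eight Latin 9 cases are the positions where ISO 8859-15 differs from
ISO 8859-1.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one hex digit; returns -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* ISO 8859-1: codes 0000-00FF map to bytes 00-FF unchanged.
   Everything else becomes '?' (an assumed replacement character). */
unsigned char ucs2_to_latin1(unsigned code)
{
    return code <= 0xFF ? (unsigned char)code : (unsigned char)'?';
}

/* ISO 8859-15: like Latin 1, except for eight byte positions. */
unsigned char ucs2_to_latin9(unsigned code)
{
    switch (code) {
    case 0x20AC: return 0xA4; /* EURO SIGN */
    case 0x0160: return 0xA6; /* LATIN CAPITAL LETTER S WITH CARON */
    case 0x0161: return 0xA8; /* LATIN SMALL LETTER S WITH CARON */
    case 0x017D: return 0xB4; /* LATIN CAPITAL LETTER Z WITH CARON */
    case 0x017E: return 0xB8; /* LATIN SMALL LETTER Z WITH CARON */
    case 0x0152: return 0xBC; /* LATIN CAPITAL LIGATURE OE */
    case 0x0153: return 0xBD; /* LATIN SMALL LIGATURE OE */
    case 0x0178: return 0xBE; /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    /* The Latin 1 characters displaced above do not exist in Latin 9. */
    case 0xA4: case 0xA6: case 0xA8: case 0xB4:
    case 0xB8: case 0xBC: case 0xBD: case 0xBE:
        return (unsigned char)'?';
    default:
        return code <= 0xFF ? (unsigned char)code : (unsigned char)'?';
    }
}

/* Convert size hex characters in buf in place, using map for each
   decoded code. Returns the number of bytes written. A trailing group
   of fewer than four characters is ignored; a group containing a
   non-hex character yields '?'. The output is not null terminated. */
size_t ucs2_hex_convert(char *buf, size_t size,
                        unsigned char (*map)(unsigned))
{
    assert(buf != NULL && map != NULL);

    size_t out = 0;
    for (size_t i = 0; i + 4 <= size; i += 4) {
        unsigned code = 0;
        int ok = 1;
        for (size_t j = 0; j < 4; j++) {
            int v = hex_value(buf[i + j]);
            if (v < 0) { ok = 0; v = 0; }
            code = (code << 4) | (unsigned)v;
        }
        /* out never runs ahead of i, so writing in place is safe. */
        buf[out++] = (char)(ok ? map(code) : (unsigned char)'?');
    }
    return out;
}
```

Called as ucs2_hex_convert(s, 20, ucs2_to_latin1) on the "Hello" example
this leaves the five bytes of "Hello" at the start of s. Passing the
mapping as a function pointer is just one way to switch between Latin 1
and Latin 9; any approximations such as 2019 would be additional case
labels that you have to decide on yourself.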