From: jak <nospam@please.ty>
Newsgroups: comp.lang.python
Subject: Re: UTF_16 question
Date: Wed, 1 May 2024 19:07:02 +0200
Organization: A noiseless patient Spider
Message-ID: <v0tsrm$39c2q$1@dont-email.me>
References: <v0jh4g$h14g$1@dont-email.me> <08F2BE28-1252-4BD6-AED0-2323E112E0A1@damon-family.org> <mailman.1.1714409701.3326.python-list@python.org>
In-Reply-To: <mailman.1.1714409701.3326.python-list@python.org>

Richard Damon wrote:
>> On Apr 29, 2024, at 12:23 PM, jak via Python-list <python-list@python.org> wrote:
>>
>> Hi everyone,
>> something is happening to me that I do not understand: I have some text
>> files with different characteristics, among them that they are encoded
>> as utf_32_le, utf_32_be, utf_16_le or utf_16_be, all of them
>> without a BOM. With the utf_32_xx files I have no problem, but with the
>> utf_16_xx ones I do. If I have a utf_16_le encoded file and I read it with
>> encoding='utf_16_le', I read it without problems; with
>> encoding='utf_16_be' I can also read it without any error, even though the
>> data I receive has its bytes swapped. The same thing happens with a
>> utf_16_be encoded file: I can read it both with encoding='utf_16_be' and
>> with 'utf_16_le' without errors, but in the latter case the bytes are
>> swapped.
>> What did I not understand? What am I doing wrong?
>>
>> thanks in advance
>>
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>
> That is why the BOM was created. A lot of files can be “correctly” read as either UTF-16-LE or UTF-16-BE, since most 16-bit codes are valid in either byte order. Unless the wrong encoding happens to hit something invalid (most likely something that looks like a surrogate pair without a match), there is no error in reading the file. The BOM character was specifically designed to be an invalid code if read with the wrong encoding (if you ignore the possibility of the file having a NUL right after the BOM).
>
> If you know the files likely contain a lot of “ASCII” characters, then you might be able to detect that you got it wrong by seeing a lot of 0xXX00 characters and few 0x00XX characters, but that doesn’t normally create an “error”.
>

Thanks to you too for the reply. I was actually looking for a way to
distinguish "utf16le" texts from "utf16be" ones. Unfortunately, whoever
created this log file archive thought that the BOM was not important and
omitted it. Now they want to switch to "utf8" and also keep the
existing files. Fortunately I can be sure that the text of the log files
is in some European language, so after converting a file to "utf8" I
check that most of the bytes are below 0x7F; if they are not, I convert
the file again, swapping "utf16le" for "utf16be" or vice versa. The
strategy seems to be working. In the future, by writing files in "utf8",
they will no longer have problems like this.
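
For anyone who finds this thread later, the check can be sketched in Python roughly like this. It is only an illustration under my own assumptions, not the exact script used on the archive: the function names ascii_ratio and guess_utf16_byte_order and the sample string are invented for the example, and it assumes BOM-less text that is mostly ASCII.

```python
def ascii_ratio(raw: bytes, encoding: str) -> float:
    """Decode raw with the given encoding, re-encode as UTF-8 and
    return the fraction of resulting bytes below 0x80.

    Returns 0.0 if decoding fails (for example on an unpaired
    surrogate) or if there is no data.
    """
    try:
        utf8 = raw.decode(encoding).encode("utf-8")
    except UnicodeError:
        return 0.0
    if not utf8:
        return 0.0
    return sum(b < 0x80 for b in utf8) / len(utf8)


def guess_utf16_byte_order(raw: bytes) -> str:
    """Return the UTF-16 codec name that gives the more ASCII-like UTF-8.

    Hypothetical helper: only meaningful for text that is mostly ASCII,
    such as logs written in a European language.
    """
    le = ascii_ratio(raw, "utf_16_le")
    be = ascii_ratio(raw, "utf_16_be")
    return "utf_16_le" if le >= be else "utf_16_be"


# Made-up sample line; these codecs do not write a BOM when encoding.
sample = "Fehler: Verbindung zum Server getrennt, bitte prüfen\n"
data_le = sample.encode("utf_16_le")
data_be = sample.encode("utf_16_be")

# Decoding with the wrong byte order raises no error here,
# it just produces different (mostly non-Latin) characters.
wrong = data_le.decode("utf_16_be")
print(wrong == sample)                   # False

print(guess_utf16_byte_order(data_le))   # utf_16_le
print(guess_utf16_byte_order(data_be))   # utf_16_be
```

Comparing the two ratios instead of testing one against a fixed threshold avoids having to choose a cut-off value, but it still gives a meaningless answer for text that is not mostly ASCII (CJK logs, for instance).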