From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Wed, 5 Jun 2024 08:02:26 +1000
Message-ID: <mailman.88.1717538560.2909.python-list@python.org>
In-Reply-To: <20240604122134.2696c36d@fedora>

On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list
<python-list@python.org> wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards <grant.b.edwards@gmail.com> wrote:
>
> > On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> > wrote:
> >
> > > The Gutenberg Project publishes "plain text". That's another
> > > problem, because "plain text" means UTF-8....and that means
> > > unicode...and that means running some sort of unicode-to-ascii
> > > conversion in order to get something like "words". A couple of
> > > hours....a couple of hundred lines of C....problem solved!
> >
> > I'm curious. Why does it need to be converted from Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
> >
>
> Well.....when using the file linux.words as a useful master list of
> "words".....linux.words is strict ASCII........
>

Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.

There is no need to "convert from Unicode to ASCII", which makes no sense.

ChrisA
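
To illustrate the point, here is a minimal sketch (not code from this thread) of finding words that occur exactly once while keeping everything as Unicode text. The function names, the regular expression, and the choice of case-folding are all assumptions made for the example; what counts as a "word" is a judgment call.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters in any script, optionally
# joined by internal apostrophes (e.g. "don't"). [^\W\d_] means
# "a word character that is neither a digit nor an underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def words_occurring_once(text):
    """Return the set of case-folded words that appear exactly once."""
    counts = Counter(w.casefold() for w in WORD_RE.findall(text))
    return {w for w, n in counts.items() if n == 1}


def load_wordlist(path):
    """Read a one-word-per-line dictionary file, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().casefold() for line in f if line.strip()}


def non_ascii_entries(words):
    """Return the sorted entries that contain any non-ASCII character."""
    return sorted(w for w in words if not w.isascii())


if __name__ == "__main__":
    # In-memory stand-ins for the book text and the dictionary file.
    sample_text = "Café au lait, café noir, naïve résumé naïve"
    sample_dict = {"au", "lait", "noir", "résumé", "café"}

    once = words_occurring_once(sample_text)
    print(sorted(once & sample_dict))      # singletons found in the dictionary
    print(non_ascii_entries(sample_dict))  # dictionary entries beyond ASCII
```

With real files, something like load_wordlist("/usr/share/dict/words") and open("JoyceUlysses.txt", encoding="utf-8").read() would replace the in-memory samples; both paths are assumptions and vary by system. No ASCII conversion step appears anywhere: the text and the word list are both decoded to str and compared directly.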