Path: ...!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Wed, 5 Jun 2024 08:02:26 +1000
Lines: 32
Message-ID: <mailman.88.1717538560.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
 <26202.4083.590062.42312@ixdm.fritz.box>
 <32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
 <mailman.81.1717270463.2909.python-list@python.org>
 <20240603104742.1664b37c@fedora> <4VtNKZ70YdznVGW@mail.python.org>
 <mailman.83.1717441107.2909.python-list@python.org>
 <20240604122134.2696c36d@fedora>
 <CAPTjJmomgE02LpfiMi5ZdORkeMrA5NbTp4VdPn3_9v68F2BfMQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
In-Reply-To: <20240604122134.2696c36d@fedora>

On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list
<python-list@python.org> wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards <grant.b.edwards@gmail.com> wrote:
>
> > On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> > wrote:
> >
> > > The Gutenberg Project publishes "plain text".  That's another
> > > problem, because "plain text" means UTF-8....and that means
> > > Unicode...and that means running some sort of Unicode-to-ASCII
> > > conversion in order to get something like "words".  A couple of
> > > hours....a couple of hundred lines of C....problem solved!
> >
> > I'm curious.  Why does it need to be converted from Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
> >
>
> Well.....when using the file linux.words as a useful master list of
> "words".....linux.words is strict ASCII........
>

Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded as UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.
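
If you want to check this on your own machine, something like the
following would do. It is only a rough sketch: the function names
classify_encoding and scan_word_lists are made up for illustration, and
it assumes the word lists live in /usr/share/dict, which varies between
distributions.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the given raw bytes."""
    if data.isascii():
        # Pure ASCII is also valid UTF-8, but report it separately.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; could be Latin-1 or some other legacy encoding.
        return "other"
    return "utf-8"


def scan_word_lists(directory: str) -> dict[str, str]:
    """Classify every regular file directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    return {
        p.name: classify_encoding(p.read_bytes())
        for p in sorted(root.iterdir())
        if p.is_file()
    }


if __name__ == "__main__":
    # Assumed location; adjust for your system.
    for name, kind in scan_word_lists("/usr/share/dict").items():
        print(f"{name}: {kind}")
```

A file reported as "ascii" needs no special handling at all, since the
same bytes decode identically as UTF-8.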

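There is no need to "convert from Unicode to ASCII", and the idea makes no
sense anyway: ASCII text is already valid UTF-8, and Python decodes both
to the same str type when it reads the file.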
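
To make that concrete, here is a minimal sketch of the "words occurring
exactly once" task done directly on Unicode text. The names WORD_RE,
words_once and load_text are invented for this example, the regular
expression is just one plausible definition of a "word", and the file is
assumed to be UTF-8.

```python
import re
from collections import Counter

# One possible definition of a word: runs of Unicode letters, optionally
# joined by an apostrophe (so "don't" stays in one piece).
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def words_once(text: str) -> list[str]:
    """Return the words that occur exactly once, ignoring case."""
    counts = Counter(w.casefold() for w in WORD_RE.findall(text))
    return sorted(w for w, n in counts.items() if n == 1)


def load_text(path: str, encoding: str = "utf-8") -> str:
    """Read a text file; Python decodes the bytes to str for us."""
    with open(path, encoding=encoding) as f:
        return f.read()


if __name__ == "__main__":
    sample = "The caf\u00e9 was quiet. The cat left the caf\u00e9."
    print(words_once(sample))
```

Filtering against a master word list is then an ordinary set lookup on
str values, e.g. [w for w in words_once(text) if w in known_words], and
it works the same whether the list on disk was ASCII or UTF-8.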

ChrisA