Path: ...!eternal-september.org!feeder3.eternal-september.org!fu-berlin.de!uni-berlin.de!not-for-mail
From: Mats Wichmann <mats@wichmann.us>
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Sat, 1 Jun 2024 13:34:11 -0600
Lines: 51
Message-ID: <mailman.81.1717270463.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
 <26202.4083.590062.42312@ixdm.fritz.box>
 <32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: news.uni-berlin.de DJmMM0oasuPZ7IFUDrdNnw+2A99v5KhTG+ucGhUbljRQ==
Cancel-Lock: sha1:Ad9o6rLizb0FKFNGpBoiEzJ8zX4= sha256:2JhIKkPIblCrSc4zZeEIOD8/7/ug8d6TJrweQguzH/s=
Return-Path: <mats@wichmann.us>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
Authentication-Results: mail.python.org; dkim=pass
 reason="1024-bit key; unprotected key"
 header.d=pobox.com header.i=@pobox.com header.b=TUVszAXP;
 dkim-adsp=none (unprotected policy); dkim-atps=neutral
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <26202.4083.590062.42312@ixdm.fritz.box>
X-Pobox-Relay-ID: ECE497A6-204D-11EF-8BF4-B84BEB2EC81B-81526775!pb-smtp1.pobox.com
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: General discussion list for the Python programming language
 <python-list.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-list>,
 <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <https://mail.python.org/pipermail/python-list/>
List-Post: <mailto:python-list@python.org>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-list>,
 <mailto:python-list-request@python.org?subject=subscribe>
X-Mailman-Original-Message-ID: <32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
X-Mailman-Original-References: <v3am2l$1qf6m$3@dont-email.me>
 <26202.4083.590062.42312@ixdm.fritz.box>
Bytes: 8042

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent. 
Just in case it hasn't All Been Said Already, here's the retry:

> HenHanna wrote at 2024-5-30 13:03 -0700:
>>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
> 
> Your task can be split into several subtasks:
>   * parse the text into words
> 
>     This depends on your notion of "word".
>     In the simplest case, a word is any maximal sequence of non-whitespace
>     characters. In this case, you can use `split` for this task

This piece is by far "the hard part", because of the ambiguity. For 
example, if I just say non-whitespace, then words followed by 
punctuation come out as distinct words. What about hyphenation? There 
are both the compound word forms and the ones at the end of lines, if 
the source text has been formatted that way.  Are all-lowercase words 
different from the same word starting with a capital?  What about 
non-initial capitals, as happens a fair bit in modern usage with 
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc. What about 
accented letters?
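
To make the ambiguity concrete, here is a minimal sketch. The sample 
sentence is made up for illustration (it is not a quote from the file), 
and the normalisation shown is only one possible choice, not "the" 
right one:

```python
import string

# Made-up sample text, purely for illustration.
sample = "Stately, plump Buck Mulligan. Buck said: stately!"

# Plain whitespace splitting keeps punctuation and case as-is,
# so 'Stately,' and 'stately!' are two different "words".
raw = sample.split()
distinct_raw = set(raw)

# One possible normalisation (an assumption): strip leading and
# trailing punctuation, then fold to lowercase.
normalised = [w.strip(string.punctuation).lower() for w in raw]
distinct_norm = set(normalised)

print(len(distinct_raw), sorted(distinct_raw))
print(len(distinct_norm), sorted(distinct_norm))
```

The two sets differ in size, which is exactly the "what is a word" 
question: the answer to "which words occur once" changes with the 
definition.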

If you want what's at least a quick starting point to play with, you 
could use a very simple regex - a fair amount of thought has gone into 
what a "word character" is (\w), so it deals with excluding both 
punctuation and whitespace.

import re
from collections import Counter

# Lowercase the whole text, then count every run of "word characters".
with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their 
occurrence counts (by this definition) in the document. You can fish 
through that to answer the questions asked (find entries with a count of 
1, 2, 3, etc.)
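
As a sketch of that "fishing" step, the snippet below picks out the 
words with a count of exactly 1. It uses a short stand-in string 
instead of the novel so it runs on its own; the names `text` and `once` 
are just illustrative:

```python
import re
from collections import Counter

# Stand-in text so the sketch runs without the novel's file.
text = "yes I said yes I will Yes"

# Same counting approach as above: lowercase, then \w+ runs.
wordcount = Counter(re.findall(r"\w+", text.lower()))

# Keep only the words whose count is exactly 1, sorted for readability.
once = sorted(word for word, n in wordcount.items() if n == 1)
print(once)
```

Changing the `n == 1` test to `n == 2`, `n == 3`, and so on answers the 
other variations of the question.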

Some people Go Big and use something that actually tries to recognize 
the language, as opposed to making assumptions from ranges of 
characters.  nltk is a choice there.  But at this point it's not really 
"simple" any longer (though nltk experts might end up disagreeing with 
that).