From: Mats Wichmann <mats@wichmann.us>
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Sat, 1 Jun 2024 13:34:11 -0600
Message-ID: <mailman.81.1717270463.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me> <26202.4083.590062.42312@ixdm.fritz.box> <32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent.
Just in case it hasn't All Been Said Already, here's the retry:

> HenHanna wrote at 2024-5-30 13:03 -0700:
>>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>
> Your task can be split into several subtasks:
>  * parse the text into words
>
>    This depends on your notion of "word".
>    In the simplest case, a word is any maximal sequence of non-whitespace
>    characters. In this case, you can use `split` for this task

This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then a word followed by
punctuation counts as a distinct word from the same word without it.
What about hyphenation - of which there are both the compound word forms
and the ones at the end of lines, if the source text has been formatted
that way? Are all-lowercase words different from the same word starting
with a capital? What about non-initial capitals, as happens a fair bit
in modern usage with acronyms, trademarks (perhaps not in Ulysses? :-) ),
etc.? What about accented letters?

If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone into
what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.

    import re
    from collections import Counter

    # Lowercase the whole text, then count every run of "word characters".
    with open("JoyceUlysses.txt", "r") as f:
        wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count of
1, 2, 3, etc.)

Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of characters.
nltk is a choice there. But at this point it's not really "simple" any
longer (though nltk experts might end up disagreeing with that).
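
For the "fish through that" step, here is a minimal sketch, assuming the same \w+ notion of a word as above. The function names and the sample sentence are made up for illustration, and the split-based variant is only there to show how the choice of definition changes the answer. Note that \w also matches digits and the underscore, so "words" here is a loose term.

```python
import re
from collections import Counter


def words_occurring_once(text):
    """Words seen exactly once, where a word is a run of regex word characters."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


def words_occurring_once_by_split(text):
    """Same idea, but a word is any maximal run of non-whitespace characters."""
    counts = Counter(text.lower().split())
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical sample text; for the real thing, read JoyceUlysses.txt instead.
sample = "Stately, plump Buck Mulligan came from the stairhead. Buck came. Well-read?"

print(words_occurring_once(sample))
print(words_occurring_once_by_split(sample))
```

With the regex definition, "came" occurs twice and drops out of the result, and "Well-read" is counted as two words. With plain split, "came" and "came." are different words and both show up as occurring once, which is exactly the punctuation ambiguity described above.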