Re: From JoyceUlysses.txt -- words occurring exactly once

Subject : Re: From JoyceUlysses.txt -- words occurring exactly once
From : hackbeard (at) *nospam* linuxmail.org (Edward Teach)
Newsgroups : comp.lang.python
Date : 03. Jun 2024, 10:47:42
Other headers
Organization : A noiseless patient Spider
Message-ID : <20240603104742.1664b37c@fedora>
References : 1 2 3 4
User-Agent : Claws Mail 4.2.0 (GTK 3.24.42; x86_64-redhat-linux-gnu)
On Sat, 1 Jun 2024 13:34:11 -0600
Mats Wichmann <mats@wichmann.us> wrote:

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
 
hmmm, I "sent" this but there was some problem and it remained
unsent. Just in case it hasn't All Been Said Already, here's the
retry:
 
HenHanna wrote at 2024-5-30 13:03 -0700: 
>
Given a text file of a novel (JoyceUlysses.txt) ...
>
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once? 
 
Your task can be split into several subtasks:
  * parse the text into words
 
    This depends on your notion of "word".
    In the simplest case, a word is any maximal sequence of
non-whitespace characters. In this case, you can use `split` for
this task 
 
This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by
punctuation count as distinct words. What about hyphenation - there
are both the compound word forms and the ones at the end of lines if
the source text has been formatted that way.  Are all-lowercase words
different than the same word starting with a capital?  What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc. What about
accented letters?
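
To make the ambiguity concrete, here is a minimal sketch. The sample
strings are made up for illustration, and the normalization shown
(strip ASCII punctuation, fold case, rejoin words hyphenated across a
line break) is only one possible choice, not "the" right definition of
a word:

```python
import re
import string

sample = "Yes. yes, Yes! He said yes I will Yes."

# Naive approach: a "word" is any maximal run of non-whitespace.
naive = sample.split()

# One possible normalization (an assumption): strip leading/trailing
# ASCII punctuation and fold case, then drop anything left empty.
normalized = [w.strip(string.punctuation).casefold() for w in naive]
normalized = [w for w in normalized if w]

print(len(set(naive)))       # 8 distinct "words"
print(len(set(normalized)))  # 5 distinct words

# Rejoin words split by a hyphen at a line break. Caveat: this also
# glues together a genuine compound that happens to break at line end.
wrapped = "a state-\nment and well-known"
rejoined = re.sub(r"-\n", "", wrapped)
print(rejoined)              # a statement and well-known
```

Whether "Yes." and "yes" should be the same word is exactly the
judgment call described above; the code only shows how much the answer
moves the counts.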
 
If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone
into what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.
 
import re
from collections import Counter
 
# Read the whole file, lowercase it, and count every run of \w characters.
with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))
 
Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count
of 1, 2, 3, etc.)
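
As a sketch of the "fishing" step, the words with a count of exactly 1
can be pulled out with a comprehension over the Counter. The short
string below is a made-up stand-in for the contents of the file, so
the snippet runs on its own:

```python
import re
from collections import Counter

# Stand-in text (an assumption); in practice this would be f.read().
text = "Stately, plump Buck Mulligan came from the stairhead. Buck came."
wordcount = Counter(re.findall(r"\w+", text.lower()))

# Keep only the words whose count is exactly 1, sorted for readability.
once = sorted(word for word, n in wordcount.items() if n == 1)
print(once)
# ['from', 'mulligan', 'plump', 'stairhead', 'stately', 'the']
```

Changing `n == 1` to `n == 2` and so on answers the follow-up
questions about words occurring exactly twice, three times, etc.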
 
Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters.  nltk is a choice there.  But at this point it's not
really "simple" any longer (though nltk experts might end up
disagreeing with that).
 
 

Project Gutenberg publishes "plain text".  That's another problem,
because "plain text" means UTF-8....and that means Unicode...and that
means running some sort of Unicode-to-ASCII conversion in order to get
something like "words".  A couple of hours....a couple of hundred lines
of C....problem solved!
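
For comparison, a minimal Python sketch of one such conversion,
assuming that folding accented letters to plain ASCII is actually what
is wanted. The helper name fold_to_ascii and the sample string are
made up for illustration. Note that in Python 3 a str is already
Unicode and \w matches accented letters, so the folding step is
optional:

```python
import re
import unicodedata

def fold_to_ascii(text):
    """Decompose accented characters, then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Escapes used so the literal is unambiguous: "naive cafe" with accents.
sample = "na\u00efve caf\u00e9"

# \w on a str pattern is Unicode-aware, so both words survive intact.
unicode_words = re.findall(r"\w+", sample)

# After folding, the same text is plain ASCII.
ascii_text = fold_to_ascii(sample)
print(ascii_text)  # naive cafe
```

The folding is lossy: characters with no ASCII decomposition are
silently dropped, which may or may not be acceptable for a word count.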


