Subject : Re: From JoyceUlysses.txt -- words occurring exactly once
From : dieter.maurer (at) *nospam* online.de
Newsgroups : comp.lang.python
Date : 04. Jun 2024, 18:13:47
Other headers
Message-ID : <mailman.84.1717519110.2909.python-list@python.org>
References : 1 2 3 4 5 6
User-Agent : VM 8.0.12-devo-585 under 21.4 (patch 24) "Standard C" XEmacs Lucid (x86_64-linux-gnu)
Edward Teach wrote at 2024-6-3 10:47 +0100:
...
> The Gutenburg Project publishes "plain text". That's another problem,
> because "plain text" means UTF-8....and that means unicode...and that
> means running some sort of unicode-to-ascii conversion in order to get
> something like "words". A couple of hours....a couple of hundred lines
> of C....problem solved!

Unicode supports the notion of a "word" even better than ASCII does.
For example, the `\w` (word character) regular expression character class
works for Unicode just as it does for ASCII (of course with enhanced sets
of letters, digits, punctuation, etc.).
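
To illustrate, here is a minimal sketch in Python, not a definitive solution to the original task. It assumes the text has already been decoded to a `str`. The helper name `words_occurring_once`, the NFC normalization step and the case folding are my own choices for the example, not something taken from this thread.

```python
import re
import unicodedata
from collections import Counter

# For str patterns, \w is Unicode-aware by default in Python 3.
WORD_RE = re.compile(r"\w+")


def words_occurring_once(text):
    """Return the sorted list of words that occur exactly once in text."""
    # Assumption: compose accents first, so that "e" followed by a
    # combining accent becomes a single letter that \w will match.
    text = unicodedata.normalize("NFC", text)
    # Assumption: "The" and "the" should count as the same word.
    counts = Counter(word.casefold() for word in WORD_RE.findall(text))
    return sorted(word for word, n in counts.items() if n == 1)


if __name__ == "__main__":
    sample = "Caf\u00e9 caf\u00e9 na\u00efve the THE"
    # Unicode matching keeps accented words in one piece.
    print(WORD_RE.findall(sample))
    # With re.ASCII, \w is limited to [a-zA-Z0-9_] and splits such words.
    print(re.findall(r"\w+", sample, flags=re.ASCII))
    print(words_occurring_once(sample))
```

With the default Unicode matching, "naïve" stays one word; with `re.ASCII` it is split into "na" and "ve", which is the kind of damage a unicode-to-ascii conversion step would have to work around.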