Subject : Re: Chardet oddity
From : nntp.mbourne (at) *nospam* spamgourmet.com (Mark Bourne)
Newsgroups : comp.lang.python
Date : 23 Oct 2024, 21:42:00
Organisation : A noiseless patient Spider
Message-ID : <vfbjia$28es4$1@dont-email.me>
References : 1 2
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 SeaMonkey/2.53.19
Albert-Jan Roskam wrote:
> Today I used chardet.detect in the REPL and it returned windows-1252
> (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
> chardet as a script (which uses UniversalDetector) it returned
> MacRoman. Isn't chardet.detect the correct way? I've used this method many
> times.
> # Interpreter
> >>> contents = open(FILENAME, "rb").read()
> >>> chardet.detect(content)
Is that copied and pasted from the terminal, or retyped with possible transcription errors? As written, you've assigned the bytes read from the file to `contents`, but passed `content` (with no "s") to `chardet.detect`, so the result would depend on whatever was previously assigned to `content`.
> {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
> # Terminal
> $ python -m chardet FILENAME
> FILENAME: MacRoman with confidence 0.7167379080370483
> Thanks!
> Albert-Jan
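As an aside, a detector's guess can be sanity-checked with the standard library alone, by attempting a strict decode with each candidate encoding. The sketch below is only an illustration and is not part of chardet: the helper names and the sample bytes are made up, and it assumes Python's built-in `cp1252` and `mac_roman` codecs. Python's `cp1252` leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, which is one way a Windows-1252 guess can end in a UnicodeDecodeError, whereas `mac_roman` maps every byte value.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without errors."""
    try:
        data.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def first_bad_offset(data: bytes, encoding: str):
    """Return the offset of the first undecodable byte, or None if there is none."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


if __name__ == "__main__":
    # Hypothetical sample bytes, NOT taken from the file in the original post.
    sample = b"caf\x8e \x8d"
    for candidate in ("cp1252", "mac_roman"):
        print(candidate,
              decodes_cleanly(sample, candidate),
              first_bad_offset(sample, candidate))
```

A clean decode does not prove the guess is right (MacRoman accepts any byte sequence), but a failed decode does show that the guess is wrong for that data.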
-- Mark.