Subject: Re: Chardet oddity
From: ram (at) *nospam* zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 23. Oct 2024, 19:43:51
Organization: Stefan Ram
Message-ID: <script-20241023184256@ram.dialup.fu-berlin.de>
References: 1
Albert-Jan Roskam <sjeik_appie@hotmail.com> wrote or quoted:
> Today I used chardet.detect in the repl and it returned windows-1252
> (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
> chardet as a script (which uses UniversalLineDetector) this returned
> MacRoman. Isn't chardet.detect the correct way? I've used this method many
> times.

Oof, that's a head-scratcher! Looks like chardet's throwing
you a curveball. Usually, chardet.detect() is the go-to method,
but it seems to be off its game here.

The script version uses UniversalDetector (the class you called
UniversalLineDetector), feeding it the file line by line and
stopping as soon as the detector is done, which might be giving
it an edge in this case. It's weird that the confidence levels
are so close, though. Maybe the file's got some quirks that are
tripping up the simpler detect() call.

I'd say stick with the script's approach for now if it's giving
you better results.

Here's how you can use it in your code:

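from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(FILENAME, 'rb') as file:  # FILENAME: path of the file to inspect
    for line in file:
        detector.feed(line)
        if detector.done:  # stop early once the detector has decided
            break
detector.close()  # finalize the guess
print(detector.result)  # dict with 'encoding' and 'confidence', among others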
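
Since the windows-1252 guess only blew up later with a
UnicodeDecodeError, it might also be worth trial-decoding the bytes
before trusting any guess. What follows is just a rough sketch using
the standard library only, not anything chardet itself provides: the
helper name first_strict_decode, the candidate list, and the sample
bytes are all made up for illustration. Keep in mind that a clean
decode does not prove the guess is right, because codecs like
mac_roman accept every byte value; it only rules out the guesses that
cannot be right.

```python
def first_strict_decode(data, candidates):
    """Return (encoding, text) for the first candidate that decodes
    data without error, or (None, None) if every candidate fails."""
    for encoding in candidates:
        try:
            # bytes.decode() uses errors='strict' by default, so any
            # byte the codec cannot map raises UnicodeDecodeError.
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None


# Hypothetical sample: byte 0x8F is unassigned in Python's cp1252
# codec, but maps to a letter in mac_roman.
sample = b"tr\x8fs bien"
encoding, text = first_strict_decode(sample, ["ascii", "cp1252", "mac_roman"])
print(encoding, text)
```

In practice you would put the encoding reported by the detector first
in the candidate list and only fall back to the others if it fails.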