Newsportal USENET - Re: Chardet oddity

Subject : Re: Chardet oddity
From : ram (at) *nospam* zedat.fu-berlin.de (Stefan Ram)
Newsgroups : comp.lang.python
Date : 23. Oct 2024, 18:43:51

Other headers

Organization : Stefan Ram
Message-ID : <script-20241023184256@ram.dialup.fu-berlin.de>
References : 1

Albert-Jan Roskam <sjeik_appie@hotmail.com> wrote or quoted:

Today I used chardet.detect in the REPL and it returned windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalLineDetector) this returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.

Oof, that's a head-scratcher! Looks like chardet's throwing
you a curveball. Usually, chardet.detect() is the go-to method,
but it seems to be off its game here.

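The script version drives a UniversalDetector (that's the actual
class name, not UniversalLineDetector) line by line and, as far
as I can tell, stops as soon as the detector reports it's done,
so it may be judging different input than a single detect() call
on the whole buffer.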

It's weird that the confidence levels are so close, though.
Maybe the file's got some quirks that are tripping up the
simpler detect() method.

I'd say stick with the script version for now if it's giving
you better results.

Here's how you can use it in your code:

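from chardet.universaldetector import UniversalDetector

FILENAME = 'example.txt'  # placeholder: path of the file to inspect

detector = UniversalDetector()
with open(FILENAME, 'rb') as file:
    for line in file:
        # Feed raw bytes line by line until the detector is confident.
        detector.feed(line)
        if detector.done:
            break
# close() finalizes the guess; result is a dict with 'encoding' and 'confidence'.
detector.close()
print(detector.result)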
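
Since the windows-1252 guess only blew up later with a
UnicodeDecodeError, it might be worth checking any guess by
decoding strictly right away. Here's a minimal sketch using only
the standard library; first_decodable is a made-up helper name,
not part of chardet, and the sample bytes are invented for
illustration:

```python
def first_decodable(data, candidates):
    """Return (encoding, text) for the first candidate that strictly decodes data."""
    for encoding in candidates:
        try:
            # errors='strict' is the default, so bad bytes raise immediately.
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0x81 is unassigned in Python's cp1252 codec but valid in mac_roman.
sample = b"caf\x8e \x81"
encoding, text = first_decodable(sample, ["cp1252", "mac_roman"])
print(encoding, text)
```

Keep in mind that a clean decode only rules a candidate out or
leaves it in the running. mac_roman assigns a character to every
byte, so it never raises, and the text still needs a sanity check
by eye.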

Date	Subject	#	Author
23 Oct 24	Chardet oddity	3	Albert-Jan Roskam
23 Oct 24	Re: Chardet oddity	1	Stefan Ram
23 Oct 24	Re: Chardet oddity	1	Mark Bourne