Re: Chardet oddity

Subject: Re: Chardet oddity
From: ram (at) *nospam* zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 23 Oct 2024, 19:43:51
Organization: Stefan Ram
Message-ID: <script-20241023184256@ram.dialup.fu-berlin.de>
References: 1
Albert-Jan Roskam <sjeik_appie@hotmail.com> wrote or quoted:
> Today I used chardet.detect in the repl and it returned windows-1252
> (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
> chardet as a script (which uses UniversalLineDetector) this returned
> MacRoman. Isn't charset.detect the correct way? I've used this method many
> times.

  Oof, that's a head-scratcher! Looks like chardet's throwing
  you a curveball. Usually, chardet.detect() is the go-to method,
  but it seems to be off its game here.

  The script version's using UniversalDetector under the hood
  (the class you called UniversalLineDetector) and feeds it the
  file line by line, which might be giving it an edge in this case.

  It's weird that the confidence levels are so close, though.
  Maybe the file's got some quirks that are tripping up the
  simpler detect() method.
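
  If you want to see which quirks those are, a rough sketch like
  this lists the bytes a given codec chokes on. It only uses the
  standard library; find_bad_bytes is a made-up helper name, not
  part of chardet, and the sample bytes are invented for the demo.

```python
def find_bad_bytes(data: bytes, encoding: str) -> list:
    """Return (offset, byte value) pairs that `encoding` cannot decode.

    Hypothetical helper. Resuming one byte after each failure is only
    precise for single-byte code pages such as cp1252.
    """
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)  # strict decoding of the remainder
            break                        # the rest decodes cleanly
        except UnicodeDecodeError as exc:
            offset = pos + exc.start     # exc.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1             # skip the offending byte and retry
    return bad


# Invented sample: 0x81 and 0x9D are undefined in Python's cp1252 codec.
sample = b"caf\xe9 \x81 ok \x9d"
print(find_bad_bytes(sample, "cp1252"))  # [(5, 129), (10, 157)]
```

  Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
  undefined, so one of those turning up in the file would be one
  possible explanation for the UnicodeDecodeError.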

  I'd say stick with the script version for now if it's giving
  you better results.

  Here's how you can use it in your code:

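from chardet.universaldetector import UniversalDetector

FILENAME = 'example.txt'  # placeholder: path of the file to inspect

detector = UniversalDetector()
with open(FILENAME, 'rb') as file:  # binary mode: the detector needs raw bytes
    for line in file:
        detector.feed(line)
        if detector.done:  # stop early once the detector has made up its mind
            break
detector.close()  # finalizes the guess
print(detector.result)  # dict with the guessed encoding and its confidence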
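
  Since a guess can still be wrong, it might be worth checking it
  against the actual bytes before trusting it. Here's a minimal
  sketch, assuming you collect a few candidate names yourself (for
  instance the detector's guess plus a fallback). The helper
  first_working_encoding is hypothetical, not a chardet function,
  and the sample bytes are made up.

```python
def first_working_encoding(data: bytes, candidates):
    """Return the first candidate that decodes `data` strictly, else None."""
    for encoding in candidates:
        if not encoding:           # a detector may report None
            continue
        try:
            data.decode(encoding)  # strict mode raises on undecodable bytes
        except (UnicodeDecodeError, LookupError):
            continue               # bad bytes or unknown codec name: try next
        return encoding
    return None


# Invented sample: 0x81 is undefined in cp1252 but valid in MacRoman.
sample = b"caf\x8e \x81"
print(first_working_encoding(sample, ["cp1252", "mac_roman"]))  # mac_roman
```

  Keep in mind that a clean decode doesn't prove the guess is
  right: MacRoman and Latin-1 accept every byte value, so they
  never raise. It only rules out guesses that can't be correct.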


