Re: Chardet oddity

Subject: Re: Chardet oddity
From: roland.em0001 (at) *nospam* googlemail.com (Roland Mueller)
Groups: comp.lang.python
Date: 24 Oct 2024, 17:51:47
Message-ID: <mailman.36.1729785122.4695.python-list@python.org>
On Wed, 23 Oct 2024 at 20:11, Albert-Jan Roskam via Python-list
(python-list@python.org) wrote:

   Today I used chardet.detect in the repl and it returned windows-1252
   (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
   chardet as a script (which uses UniversalDetector) this returned
   MacRoman. Isn't chardet.detect the correct way? I've used this method many
   times.
   # Interpreter
   >>> contents = open(FILENAME, "rb").read()
   >>> chardet.detect(contents)
   {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
   # Terminal
   $ python -m chardet FILENAME
   FILENAME: MacRoman with confidence 0.7167379080370483
   Thanks!
   Albert-Jan

The entry point for the chardet module is chardet.cli.chardetect:main, and
main() calls the function description_of(lines, name).
Here 'lines' is a file opened in mode 'rb' and 'name' holds the filename.

I think the crucial difference is that description_of(lines, name) reads
the opened file line by line and stops as soon as the detector reports
that it is done, which can happen after only some of the lines.

Reading the whole file into the variable contents and passing it to
detect() in one call can therefore give a different result, depending on
the input.
I was not able to reproduce this difference, though: in the interactive
session below both ways return the same result for my test file.
I am assuming that you used the same Python for both tests.

>>> from chardet.cli import chardetect
>>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
'some file: ascii with confidence 1.0'

Your approach:
>>> from chardet import detect
>>> detect(open('/tmp/DATE','rb').read())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
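
Since the Windows-1252 guess later ended in a UnicodeDecodeError, a guess
from either approach can be checked by actually decoding the bytes. A
minimal sketch, assuming the detected name is one that Python's codec
registry knows; can_decode is a hypothetical helper, not part of chardet:

```python
def can_decode(data, encoding):
    """Return True if `data` decodes under `encoding` without errors."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError: Python does not know a codec by that name.
        return False
    return True
```

Python's cp1252 codec leaves a few byte values (0x81, for example)
unmapped, while mac_roman maps every byte, so a file containing such a
byte decodes as MacRoman but not as Windows-1252. That may or may not be
what happened with your file.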


From /usr/lib/python3/dist-packages/chardet/cli/chardetect.py:

def description_of(lines, name='stdin'):
    u = UniversalDetector()
    for line in lines:
        line = bytearray(line)
        u.feed(line)
        # shortcut out of the loop to save reading further - particularly
        # useful if we read a BOM.
        if u.done:
            break
    u.close()
    result = u.result
    ...
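
To see whether this early exit matters for a particular file, the loop
above can be mirrored with a counter and compared against a single
detect() call. This is only a sketch: feed_by_lines and compare are
hypothetical helper names, not part of chardet, and chardet is imported
inside compare so that feed_by_lines works with any object that offers
feed(), done, close() and result.

```python
def feed_by_lines(detector, lines):
    """Feed lines to a detector the way description_of() does.

    `detector` is assumed to behave like chardet's UniversalDetector
    (feed(), done, close(), result). Returns the detector's result and
    the number of lines that were fed before the loop stopped.
    """
    fed = 0
    for line in lines:
        detector.feed(bytearray(line))
        fed += 1
        # Same early exit as in description_of().
        if detector.done:
            break
    detector.close()
    return detector.result, fed


def compare(path):
    """Return (whole-file result, line-by-line result, lines fed)."""
    # Imported here so feed_by_lines() can be used without chardet.
    import chardet
    from chardet.universaldetector import UniversalDetector

    with open(path, "rb") as f:
        whole = chardet.detect(f.read())
    with open(path, "rb") as f:
        by_line, fed = feed_by_lines(UniversalDetector(), f)
    return whole, by_line, fed
```

Calling compare(FILENAME) on the problematic file would show whether the
two results differ and after how many lines the line-by-line loop stopped.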


--
https://mail.python.org/mailman/listinfo/python-list