Newsportal USENET - Re: Chardet oddity

   On Oct 24, 2024 17:51, Roland Mueller via Python-list
   <python-list@python.org> wrote:

   On Wed, 23 Oct 2024 at 20:11, Albert-Jan Roskam via Python-list
   (python-list@python.org) wrote:

   >    Today I used chardet.detect in the REPL and it returned windows-1252
   >    (incorrect, because it later resulted in a UnicodeDecodeError). When I
   >    ran chardet as a script (which uses UniversalDetector) this returned
   >    MacRoman. Isn't chardet.detect the correct way? I've used this method
   >    many times.
   >    # Interpreter
   >    >>> contents = open(FILENAME, "rb").read()
   >    >>> chardet.detect(contents)
   >    {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401,
   >    'language': ''}
   >    # Terminal
   >    $ python -m chardet FILENAME
   >    FILENAME: MacRoman with confidence 0.7167379080370483
   >    Thanks!
   >    Albert-Jan
   >

   The entry point for the module chardet is chardet.cli.chardetect:main,
   and main() calls the function description_of(lines, name).
   'lines' is a file opened in 'rb' mode and 'name' holds the filename.

   I tried this in interactive mode as shown below. I think the crucial
   difference is that description_of(lines, name) reads the opened file
   line by line and stops as soon as the detector reports that it is done
   (u.done), which can happen partway through the file.

   Reading the whole file into the variable contents and passing it to
   detect() may give a different result, depending on the input.
   I was not able to reproduce this behaviour, though.
   I am assuming that you used the same Python for both tests.

   >>> from chardet.cli import chardetect
   >>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
   'some file: ascii with confidence 1.0'
   >>>

   Your approach
   >>> from chardet import detect
   >>> detect(open('/tmp/DATE','rb').read())
   {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

   from /usr/lib/python3/dist-packages/chardet/cli/chardetect.py

   def description_of(lines, name='stdin'):
       u = UniversalDetector()
       for line in lines:
           line = bytearray(line)
           u.feed(line)
           # shortcut out of the loop to save reading further -
           # particularly useful if we read a BOM.
           if u.done:
               break
       u.close()
       result = u.result
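
The BOM shortcut mentioned in the comment above can be illustrated without chardet. What follows is only a minimal sketch using the standard library, not a description of how chardet works internally. The names sniff_bom and bom_is_consistent are hypothetical helpers made up for this illustration. The sketch looks for a leading BOM and then checks that the data really decodes with the implied codec, since a BOM on its own proves nothing.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return the codec name implied by a leading BOM, or None (hypothetical helper)."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


def bom_is_consistent(data):
    """True only if a BOM is present and the data decodes with the implied codec."""
    codec_name = sniff_bom(data)
    if codec_name is None:
        return False
    try:
        data.decode(codec_name)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

With these helpers, b"\xef\xbb\xbfabc" is reported as "utf-8-sig" and is consistent, while b"\xef\xbb\xbfcaf\xe9" carries a UTF-8 BOM but fails the consistency check because the trailing 0xE9 byte is not valid UTF-8 there.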

   =============
   Hi Mark, Roland,
   Thanks for your replies. I experimented a bit with both methods and the
   derived encoding still differed, even after I removed the "if u.done:
   break" (I removed that because I've seen cp1252 files with a UTF-8 BOM in
   the past. I kid you not!). BUT the next day, on closer inspection, I saw
   that the file was quite a mess. It contained mojibake. So I don't blame
   chardet for not being able to figure out the encoding.
   Albert-Jan
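
A detector confidence of around 0.72 is a guess, so a candidate encoding can be checked by strictly decoding the bytes. The following is a minimal sketch with the standard library only. decodes_cleanly is a hypothetical helper and the sample bytes are invented, not taken from the file discussed in this thread. It also shows why a clean decode proves little: UTF-8 bytes read with the wrong single-byte codec raise no error and simply become mojibake.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes under strict error handling (hypothetical helper)."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined,
# so data containing one of these bytes cannot be decoded as Windows-1252.
undefined_in_cp1252 = b"\x81"
print(decodes_cleanly(undefined_in_cp1252, "cp1252"))   # False
print(decodes_cleanly(undefined_in_cp1252, "latin-1"))  # True

# Invented sample: UTF-8 bytes decoded with the wrong codec do not fail,
# they silently turn into mojibake.
utf8_bytes = "na\u00efve caf\u00e9".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")
# mojibake == "naÃ¯ve cafÃ©"
```

Codecs that define all 256 byte values, such as latin-1, never raise, so a clean decode is a necessary check rather than a sufficient one. The decoded text still has to be inspected.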

Date	Subject	#		Author
25 Oct 24	Re: Chardet oddity	1		Albert-Jan Roskam