Decoding bytes to text strings in Python 2

Subject: Decoding bytes to text strings in Python 2
From: usenet202101 (at) *nospam* magic-cookie.co.ukNOSPAMPLEASE (Rayner Lucas)
Newsgroups: comp.lang.python
Date: 21 Jun 2024, 17:49:08
Organization: The Lumber Cartel (TINLC)
Message-ID: <MPG.40dfb14de0110a999896df@news.eternal-september.org>
User-Agent: MicroPlanet-Gravity/3.0.4

I'm curious about something I've encountered while updating a very old
Tk app (originally written in Python 1, but I've ported it to Python 2
as a first step towards getting it running on modern systems). The app
downloads emails from a POP server and displays them. At the moment, the
code is completely unaware of character encodings (which is something I
plan to fix), and I have found that I don't understand what Python is
doing when no character encoding is specified.

To demonstrate, I have written this short example program that displays
a variety of UTF-8 characters to check whether they are decoded
properly:

---- Example Code ----
import Tkinter as tk

window = tk.Tk()

mytext = """
  \xc3\xa9 LATIN SMALL LETTER E WITH ACUTE
  \xc5\x99 LATIN SMALL LETTER R WITH CARON
  \xc4\xb1 LATIN SMALL LETTER DOTLESS I
  \xef\xac\x84 LATIN SMALL LIGATURE FFL
  \xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q
  \xc2\xbd VULGAR FRACTION ONE HALF
  \xe2\x82\xac EURO SIGN
  \xc2\xa5 YEN SIGN
  \xd0\x96 CYRILLIC CAPITAL LETTER ZHE
  \xea\xb8\x80 HANGUL SYLLABLE GEUL
  \xe0\xa4\x93 DEVANAGARI LETTER O
  \xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57
  \xe2\x99\xa9 QUARTER NOTE
  \xf0\x9f\x90\x8d SNAKE
  \xf0\x9f\x92\x96 SPARKLING HEART
"""

mytext = mytext.decode(encoding="utf-8")
greeting = tk.Label(text=mytext)
greeting.pack()

window.mainloop()
---- End Example Code ----

This works exactly as expected, with all the characters displaying
correctly.

However, if I comment out the line
'mytext = mytext.decode(encoding="utf-8")', the program still displays
*almost* everything correctly. All of the characters appear correctly
apart from the two four-byte emoji characters at the end, each of which
instead displays as four separate characters. For example, the "SNAKE"
character actually displays as:
U+00F0 LATIN SMALL LETTER ETH
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
U+FF90 HALFWIDTH KATAKANA LETTER MI
U+FF8D HALFWIDTH KATAKANA LETTER HE
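
The pattern in those four code points is easier to see when they are set
next to the UTF-8 bytes of SNAKE from the example program. The following
is only a sketch of that arithmetic. The names snake_bytes, observed and
offsets are made up for illustration, and the code describes the numbers
without claiming to explain why Tk produces them. It should run
unchanged on Python 2.7 and Python 3.

```python
from __future__ import print_function

# UTF-8 encoding of U+1F40D SNAKE, as written in the example program.
snake_bytes = bytearray(b"\xf0\x9f\x90\x8d")

# Code points that were actually displayed (copied from the list above).
observed = [0x00F0, 0xFF9F, 0xFF90, 0xFF8D]

# Distance between each displayed code point and the byte it lines up with.
offsets = [shown - byte for shown, byte in zip(observed, snake_bytes)]

for byte, shown, offset in zip(snake_bytes, observed, offsets):
    print("byte 0x%02X -> U+%04X (offset 0x%04X)" % (byte, shown, offset))
```

The lead byte comes through unchanged, while each of the three
continuation bytes shows up offset by 0xFF00.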

What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii',
but it's clearly not attempting to display the bytes as ASCII (or
cp1252, or ISO-8859-1). How is it deciding on some sort of almost-but-
not-quite UTF-8 decoding?
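
As a rough check of the claim that this is not an ASCII, cp1252 or
ISO-8859-1 interpretation, here is a small sketch using the first test
character. The names sample, try_decode and results are hypothetical,
and the code exercises Python's own codecs only, not whatever Tkinter
does internally.

```python
from __future__ import print_function

# UTF-8 bytes for U+00E9 LATIN SMALL LETTER E WITH ACUTE, from the example.
sample = b"\xc3\xa9"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decode the same two bytes under each candidate encoding.
results = {}
for name in ("ascii", "cp1252", "iso-8859-1", "utf-8"):
    results[name] = try_decode(sample, name)
    print(name, repr(results[name]))
```

ASCII rejects the bytes outright, and cp1252 and ISO-8859-1 each give
two characters (U+00C3 and U+00A9) where the window shows a single
U+00E9, so none of the three matches what is displayed.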

I am using Python 2.7.18 on a Windows 10 system. If there's any other
relevant information I should provide please let me know.

Many thanks,
Rayner

Date       #   Author               Subject
21 Jun 24  10  Rayner Lucas         * Decoding bytes to text strings in Python 2
21 Jun 24   7  Chris Angelico       +* Re: Decoding bytes to text strings in Python 2
22 Jun 24   6  Rayner Lucas         i`* Re: Decoding bytes to text strings in Python 2
23 Jun 24   1  Lawrence D'Oliveiro  i +- Re: Decoding bytes to text strings in Python 2 (Posting On Python-List Prohibited)
24 Jun 24   1  Chris Angelico       i +- Re: Decoding bytes to text strings in Python 2
24 Jun 24   1  MRAB                 i +- Re: Decoding bytes to text strings in Python 2
24 Jun 24   1  Chris Angelico       i +- Re: Decoding bytes to text strings in Python 2
24 Jun 24   1  Peter J. Holzer      i `- Tkinter and astral characters (was: Decoding bytes to text strings in Python 2)
21 Jun 24   2  Stefan Ram           `* Re: Decoding bytes to text strings in Python 2
22 Jun 24   1  Rayner Lucas          `- Re: Decoding bytes to text strings in Python 2
