Subject: Re: Newsgroups files
From: iulius (at) *nospam* nom-de-mon-site.com.invalid (Julien ÉLIE)
Newsgroups: news.admin.peering
Date: 03 Mar 2025, 22:55:15
Organization: Groupes francophones par TrigoFACILE
Message-ID: <vq58g3$1nji2$3@news.trigofacile.com>
References: 1
User-Agent: Mozilla Thunderbird

Hi Nigel,

> One sample group from 16 peers. the first thing, so many different
> encodings. I've got ASCII, UTF-8, ISO-8859-1, WINDOWS-1252, even one
> identifying as GB18030.
> Next, 8 servers agree on one description, 3 on another, 2 more on yet
> another, and finally 3 think the group is moderated.
> How did things get in such a mixed up state?

Because originally there was no standard for the encoding of control articles. Most of them did not declare any charset, so the encoding commonly used locally by the sender was assumed: gb18030 for cn.*, koi8-u for ukr.* [my sympathy to them!], big5 for tw.*, iso-8859-15 for fr.*, cp1252 for most of the others, etc.
Only "recently" did a new version of the standard recommend the use of UTF-8.
That is why you end up seeing mixed and inconsistent encodings on existing news servers. Not all of them run a version that implements the current interoperable state of the art (UTF-8) when parsing control articles. And where the descriptions predate the receipt of new control articles, not all news administrators have manually converted them to UTF-8. (No blame intended, just a fact.)
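
If you want to normalize descriptions collected from several peers yourself, here is a minimal sketch of the idea: try UTF-8 first, then fall back to the legacy charset that was typically assumed for the hierarchy. The function name `decode_description` and the table `LEGACY_CHARSETS` are hypothetical, and the mapping only contains the examples mentioned above. It is a heuristic, not a guarantee: some legacy byte sequences happen to be valid UTF-8 and will be misread.

```python
# Hypothetical fallback table, limited to the examples mentioned in this post.
LEGACY_CHARSETS = {
    "cn": "gb18030",
    "ukr": "koi8-u",
    "tw": "big5",
    "fr": "iso-8859-15",
}
# Assumption: cp1252 for every hierarchy not listed above.
DEFAULT_LEGACY_CHARSET = "cp1252"


def decode_description(group: str, raw: bytes) -> str:
    """Decode a raw newsgroup description, preferring UTF-8."""
    try:
        # Plain ASCII is valid UTF-8, so this branch covers both.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Fall back to the charset usually assumed for the top-level hierarchy.
    hierarchy = group.split(".", 1)[0]
    charset = LEGACY_CHARSETS.get(hierarchy, DEFAULT_LEGACY_CHARSET)
    # Bytes undefined in the legacy charset become U+FFFD instead of raising.
    return raw.decode(charset, errors="replace")
```

For instance, `decode_description("fr.test", b"Caf\xe9")` is not valid UTF-8, so it is decoded as iso-8859-15.
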

> What is even worse when trying to automate this, is when the majority
> of servers have the wrong description or it's half and half.

Just use
https://raw.githubusercontent.com/Julien-Elie/usenet-hierarchies/refs/heads/main/website/data/newsgroups.utf8 :)
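
Once you have downloaded that file, loading it could look like the sketch below. It assumes the usual newsgroups file layout (group name, whitespace, description, one group per line, UTF-8 text) and the convention that descriptions of moderated groups end with "(Moderated)". The function name `parse_newsgroups` is hypothetical.

```python
def parse_newsgroups(text: str) -> dict[str, tuple[str, bool]]:
    """Map each group name to (description, is_moderated)."""
    groups: dict[str, tuple[str, bool]] = {}
    for line in text.splitlines():
        # Split on the first run of whitespace (usually one or more tabs).
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        description = parts[1].strip() if len(parts) > 1 else ""
        # Assumption: moderated groups carry the "(Moderated)" suffix.
        moderated = description.endswith("(Moderated)")
        groups[name] = (description, moderated)
    return groups
```

Read the file with `open(path, encoding="utf-8")` and pass its contents to the function; the resulting dictionary can then serve as the reference when your peers disagree.
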

-- 
Julien ÉLIE

"He who knows that he does not know, teach him. He who knows that he knows, listen to him. He who does not know that he knows, wake him. He who does not know that he does not know, flee from him." (Chinese proverb)