Subject: Archiving Usenet 2003-2025
From: jsevans (at) *nospam* sdf.org (Jason Evans)
Newsgroups: news.software.misc
Date: 01. Jun 2025, 14:34:55
Message-ID: <slrn103olo7.gbt.jsevans@jbsd1.home.local>
User-Agent: slrn/1.0.3 (OpenBSD)

A few months ago, I posted about my Usenet archiver application. Since then,
I have completely retooled it and rewritten it in Python, and it is now a very
capable tool.
In January, I began a project that I had started many times before but never
finished: archiving Usenet newsgroups from 2003 to the present. To do this, I
am using a paid Usenet provider, downloading all newsgroups in mbox format,
and compressing them with gzip. You might be wondering why I am still not done
after working on this since January. That's because paid Usenet providers
prioritize binary groups over text groups. I am not archiving binary groups,
but when one slips under my radar, I can easily see that far more of it has
been downloaded than of other newsgroups in the same amount of time.
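
For anyone curious what "mbox compressed with gzip" can look like in practice,
here is a minimal sketch. It is not taken from the archiver itself: the
function name append_article, the MAILER-DAEMON placeholder sender, and the
choice to append one gzip member per article are all assumptions made for
illustration.

```python
import gzip
import re
import time

# Body lines that look like an mbox "From " separator get a ">" prefix
# (mboxrd-style quoting), so readers do not split an article in the middle.
_FROM_LINE = re.compile(rb"^(>*From )", re.MULTILINE)


def append_article(mbox_gz_path, raw_article, sender="MAILER-DAEMON"):
    """Append one raw article (bytes) to a gzip-compressed mbox file.

    Hypothetical helper: the real archiver may store articles differently.
    """
    # Traditional mbox separator date, e.g. "Sun Jun  1 14:34:55 2025".
    stamp = time.asctime(time.gmtime()).encode("ascii")
    # Normalize network line endings, then quote "From " lines.
    article = _FROM_LINE.sub(rb">\1", raw_article.replace(b"\r\n", b"\n"))
    if not article.endswith(b"\n"):
        article += b"\n"
    # Opening in "ab" mode adds a new gzip member at the end of the file;
    # gzip tools and Python's gzip module read the members back as one stream.
    with gzip.open(mbox_gz_path, "ab") as fh:
        fh.write(b"From " + sender.encode("ascii") + b" " + stamp + b"\n")
        fh.write(article)
        fh.write(b"\n")  # blank line that ends the message
```

Calling append_article("comp.misc.mbox.gz", raw_bytes) once per downloaded
article yields a file that zcat can turn back into a plain mbox. Writing many
articles through a single open handle would compress better; the per-article
append is only there to keep the sketch short.
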
Anyway, since January, I have downloaded approximately 2 TB of newsgroups.
Which newsgroups have I downloaded? The list so far is in my GitHub
repository, linked below. If any well-known groups are missing, please let me
know, and I will add them to my queue. You might be wondering where I get my
list of newsgroups. I began with the semi-official list from isc.org
(https://ftp.isc.org/usenet/CONFIG/newsgroups.gz). I have omitted only the
following: test groups (e.g., misc.test), binary groups, and some alt groups
that deal with pedophilia. Next, I got a list of the newsgroups carried by
eternal-september and started a new queue based on it, downloading all of the
groups that are not in the isc.org list. There are a lot of them, and I'm
hoping to have them done in the coming weeks. I am downloading approximately
95 newsgroups at a time in parallel. The limit from my Usenet provider is 100
downloads at a time.
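
The queue-building and the parallel cap described above could be sketched
roughly like this. Everything here is illustrative: the function names, the
assumption that binary groups can be spotted by a "binaries" name component,
the explicit blocklist parameter, and the use of a thread pool are guesses,
not a description of how the archiver actually does it. fetch_group stands in
for whatever callable downloads a single group.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 95  # stays under the provider's limit of 100 downloads


def parse_newsgroups(text):
    """Return group names from 'name<whitespace>description' lines."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            names.append(line.split(None, 1)[0])
    return names


def is_wanted(group, blocklist=frozenset()):
    """Drop test groups, binary groups, and anything explicitly blocked."""
    parts = group.split(".")
    if "test" in parts:        # e.g. misc.test
        return False
    if "binaries" in parts:    # assumption: binary groups are named this way
        return False
    return group not in blocklist


def build_queues(isc_groups, other_groups, blocklist=frozenset()):
    """First queue: filtered isc.org list. Second queue: groups only the
    other server carries, filtered the same way."""
    first = [g for g in isc_groups if is_wanted(g, blocklist)]
    seen = set(isc_groups)
    second = [g for g in other_groups
              if g not in seen and is_wanted(g, blocklist)]
    return first, second


def download_all(groups, fetch_group, max_parallel=MAX_PARALLEL):
    """Run fetch_group over every group, at most max_parallel at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(fetch_group, groups))
```

Keeping the worker count a little below the provider's limit leaves a few
connections free for retries or for checking on a group by hand.
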
I'll update again later when I begin uploading them to the Internet Archive.
https://github.com/tgeek77/usenet_archiver/blob/main/fetch_log.txt