On 03/10/2024 09:12 PM, David Chmelik wrote:
On Sat, 9 Mar 2024 10:01:52 -0800, Ross Finlayson wrote:
>
Hello. I'd like to start by saying thanks to Usenet administrators
and originators.
Usenet has a lot of perceived value as a cultural artifact, and is also
a great experiment in free speech, association, and press.
>
Here I'm mostly interested in text Usenet,
not binaries: text Usenet is a great artifact and experiment in
speech, association,
and press.
>
When I saw this example that may have a lot of old Usenet, it
aligned with an idea that started as a vanity-press notion, about
an archive of a single group.
Now, though, I wonder how to define an "archive any and all text Usenet",
AAATU,
filesystem convention, as a sort of "Library Filesystem Format", LFF.
[...]
>
Sounds good; I'm interested in a full archive of the text newsgroups I use
(1300+), but don't know whether free Usenet servers even go back to when I
started (1996, though I tried the Internet in a museum before Eternal
September). I'm aware I could use commercial ones that may, but don't know
which, nor the cost/space. Is Google Groups the only one going back to
1981? I hope other servers managed to save that before Google disconnected
from peers, or some might turn up going back to 1979.
>
Accessing some old binary groups would be nice also, but these days people
use commercial servers for those, which probably didn't save even back to
the '90s... an archive of those (even though I'm uninterested in most,
apart from a few relating to the history of science and some types of
art/graphics & music) would presumably be too large except for data centres.
>
Hey, thanks for writing.
Estimates, and, you know, reliable estimates,
would help a lot in scoping out the scale,
the order, of things.
For example, in units of dollars per message
stored per month: if it's about 25 dollars per
month per million messages, then an estimate
of how many millions of messages there are
gives the running cost, noting that the original
economies of the system have since seen
exponential growth in the availability of storage
and an exponential decrease in its cost, more or
less. These sorts of estimates are traditionally
euphemized "napkin-back", "ball-park", out to
"wild-ass-guess".
First then is "how many groups are in the Big 8
and alt", then "minus how many of those are under
alt.binaries or otherwise effectively binaries",
then along the lines of "how many national,
corp, or institutional groups are in the public
space", to get an idea of the order of groups.
(The order of things is usually enough log 10 or
log 2 or log e, called log, lg, or ln.)
Once upon a time, an outfit called DejaNews
did Usenet a real solid, a favor: for quite some
years it had the best archives, and served them
up. Its value proposition came across so great
that a giant behemoth bought it up, and,
apocryphally, the "DejaNews CDs" were compact
discs that held all the contents of DejaNews.
Then, several commercial providers today
have Big 8 text back about 10 years, more
or less. These are just online and can
be slowly and gently and thoroughly suck-fed,
or, you know, a leeching action, where the old
ratio of downloads/uploads is called the leech
ratio, like "that lurker has an infinite leech ratio",
these kinds of cultural contexts. The point here
being that these are the middle ages, and the
DejaNews CDs the land before time, so that one
might think of the eras as "land before time",
"DejaNews CDs", "middle ages", and
"right about now", basically 1980-something to date.
We might be able to get from Usenet admins
something like "here is the list of groups,
and maybe here are all the groups there ever
were", apart from local and site-policy matters,
these kinds of things.
So, these days storage is available, and
basically a volume will store 4BB named
items (MM = millions, BB = billions). Because
Usenet text messages are pretty small, on the
order of linear in 4KB buffers, and compression
takes off about one order, the idea is usually
that a computer can mount multiple volumes,
vis-a-vis whatever it can fit in memory.
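Another napkin-back sketch, where the sizes and
the message count are assumptions:

    # Napkin-back storage estimate; sizes and counts are assumptions.
    avg_message_bytes = 4 * 1024      # assume about one 4KB buffer per post
    compression_order = 10            # compression takes off ~one order
    est_messages = 300_000_000        # hypothetical total text posts
    raw_tb = est_messages * avg_message_bytes / 1e12
    print(f"{raw_tb:.1f} TB raw, ~{raw_tb / compression_order:.2f} TB packed")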
One idea, while the filesystem value representation
is so great, is that, though it's slow, and subject
to these sorts of limits and planning factors, it
pretty much never needs to occupy memory at all,
which helps a lot when the most metered costs
of the runtime are 1) network I/O egress, 2) RAM,
3) CPU, 4) object store or various services, and 5) disk.
One thing about this kind of data is that it's
"write-once-read-many" or, you know, "write-
once-read-never". Because there are natural
coordinates, group and date, once the idea is
that all the posts have been found, then they can
live in a filesystem, all packed up as files,
here with the idea that "LFF's only purpose is
to serve as a place to store packed-up files,
then you can load them how you want".
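For example, one possible layout, where the directory
and file naming here is just a hypothetical convention,
not anything settled:

    comp.lang.c/
        2020/
            2020-01-01.mbox.gz   # that group's posts for that day, packed
            2020-01-02.mbox.gz
    sci.math/
        2020/
            2020-01-01.mbox.gz

Any old file tools work on a tree like that, which is
the point of the charter.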
Then, the idea is to make a system that has
more or less a plan, which is basically
a list of groups, and a matrix, group x date. The
goal is to fill up, for each group x date, all its posts
in the file system; then, when it's more or less
reached a consensus, it's figured they have all
landed there and live there. What results is
that basically the LFF has an edition each day,
and what's in it is, according to that matrix,
the counts, and then its lineage, what the
sources were and what the quality of the data was,
and then, behind that, the data.
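As a sketch, a daily edition record might look like
the following, where the field names and shape are
illustrative, not a settled format:

    # Sketch of a daily edition record; field names are hypothetical.
    edition = {
        "date": "2024-03-10",
        "counts": {                      # the group x date matrix, as counts
            ("comp.lang.c", "2020-01-01"): 148,
            ("sci.math", "2020-01-01"): 212,
        },
        "lineage": [                     # sources and data quality, per source
            {"source": "provider-A", "quality": "headers and bodies verified"},
        ],
    }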
Then, for something like "well, we can pretty much
fit 4BB entries on one volume, and can hire any number
of volumes, and they have plenty of space", the idea is
that the inputs are count-groups times all the days, the
coordinates, the group-days, and then, at each coordinate,
the post-depth, the count of posts by message-ID,
heuristically < 8 or so. It's heuristic that the total of
the post-depths >> the group-days, and that a usual sort of
volume can store > 4BB/post-depth of those coordinates.
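In napkin-back numbers, with the group count and
the mean depth assumed:

    # Coordinate arithmetic; group count and mean depth are assumptions.
    groups = 10_000                   # hypothetical count of text groups
    days = 50 * 365 + 12              # fifty years of daily coordinates
    group_days = groups * days        # ~182.6MM coordinates
    mean_depth = 8                    # assumed posts per group-day
    volume_items = 4_000_000_000      # ~4BB named items per volume
    print(group_days, "group-days;", volume_items // mean_depth,
          "fit per volume")

So at these guesses, one volume covers the whole
matrix with room to spare.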
The usual idea of an "object store" is "hey, as long as
you give it a unique name and don't expect to
file-tree-walk the entire store, an object store
will gladly store its path, segmented, in a binary
tree, which results in log 2 or better lookup", with
the idea that the group-date coordinates, keyed
off the message-ID, will look up the message.
The idea is that an LFF edition is a list of
message-IDs for each group-date, for example for
checking that they each exist, checking they're
well-formed, and validating them.
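A sketch of such a key scheme, where the
group/date/message-ID layout is an assumption:

    # Sketch of an object-store key; the path convention is hypothetical.
    def object_key(group: str, date: str, message_id: str) -> str:
        # Message-IDs customarily arrive wrapped in angle brackets.
        return f"{group}/{date}/{message_id.strip('<>')}"

    print(object_key("sci.math", "2020-01-01", "<abc123@example.com>"))
    # -> sci.math/2020-01-01/abc123@example.com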
The date Jan 1 1970 is called "the epoch", and often
an Internet time date is given as "since the epoch".
Here this is that Jan 1 2020 = Jan 1 1970 + 18262 days.
So, fifty years of retention, daily, is that group-days
is about groups * days, though groups kind of come
and go, and some group-date coordinates of course will
be empty, vis-a-vis the "dense" and the "sparse".
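That day count checks out with the standard library:

    # Checking the epoch arithmetic.
    from datetime import date
    print((date(2020, 1, 1) - date(1970, 1, 1)).days)   # 18262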
Another thing about data is backing it up, or moving it.
I.e., at the time, something like the DejaNews CDs was
a pretty monumental amount of data.
So it was with some great happiness that the other
day it was suggested there's even some of the
"land before time" in great archives, something
like 3 or 4 terabytes, TB, uncompressed.
Then, with regard to building out estimates, it's
mostly about having a _design_, after a sort of charter,
of "LFF: library filesystem format conventions
for AAATU: archive any and all text Usenet",
the point being that "it works on any kind of filesystem,
and any old file tools work on it".
If any care to say "hey, here's what you should do"
and this kind of thing, I'll thank you. Basically,
I wonder how many groups there are, with
the idea that my question is: under each given
org, like "rec", "soc", "comp", "sci", "news",
"alt minus binaries", ..., then also
the national and corp and institutional hierarchies,
how many newsgroups are under those, and, also,
are there any limits on those?
If there were, for each group on Usenet, its name,
which looks like a.b.c, and the date that it was born
or its first post was made, that's basically the
origin of the coordinates, to make estimates of the
order of the coordinates, and the order of the items,
then the order of their sizes.
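As a last sketch, those per-group origins would
generate the coordinates, where the group name
and dates here are placeholders:

    # Sketch: a group's origin record generates its group-date coordinates.
    from datetime import date, timedelta

    def coordinates(group: str, born: date, until: date):
        d = born
        while d <= until:
            yield (group, d.isoformat())    # one coordinate per group-day
            d += timedelta(days=1)

    # A placeholder origin; real first-post dates would come from admins.
    for coord in coordinates("sci.math", date(2024, 3, 1), date(2024, 3, 3)):
        print(coord)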