Re: Meta: a usenet server just for sci.math

Subject: Re: Meta: a usenet server just for sci.math
From: ross.a.finlayson (at) *nospam* gmail.com (Ross Finlayson)
Newsgroups: sci.math
Date: 23 Mar 2024, 05:30:45
Message-ID: <Hp-cnUAirtFtx2P4nZ2dnZfqnPednZ2d@giganews.com>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
On 03/02/2024 01:44 PM, Ross Finlayson wrote:
On 02/20/2024 08:38 PM, Ross Finlayson wrote:
>
>
Alright then, about the SFF, "summary" file-format,
"sorted" file-format, "search" file-format, the idea
here is to figure out normal forms of summary,
that go with the posts, with the idea that "a post's
directory is on the order of contained size of the
size of the post", while, "a post's directory is on
a constant order of entries", here is for sort of
summarizing what a post's directory looks like
in "well-formed BFF", then as with regards to
things like Intermediate file-formats as mentioned
above here with the goal of "very-weakly-encrypted
at rest as constant contents", then here for
"SFF files, either in the post's-directory or
on the side, and about how links to them get
collected to directories in a filesystem structure
for the conventions of the concatenation of files".
>
So, here the idea so far is that BFF has a normative
form for each post, which has a particular opaque
globally-universal unique identifier, the Message-ID,
then that the directory looks like MessageId/ then its
contents were as these files.
>
id hd bd yd td rd ad dd ud xd
id, header, body, year-to-date, thread, referenced, authored, dead,
undead, expired
>
or just files named
>
i h b y t r a d u x
>
which, according to the presence of the files and
their contents, indicate that the presence of the
MessageId/ directory indicates the presence of
a well-formed message, contingent on its not being expired.
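
A minimal sketch of that presence check, assuming (this is a reading of the file list above, not a settled convention) that 'i', 'h' and 'b' must exist and that an 'x' file marks expiry. The directory name and the helper `is_well_formed` are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical reading of the single-letter file names listed above.
REQUIRED = ("i", "h", "b")   # id, header, body
EXPIRED_MARKER = "x"         # expired

def is_well_formed(message_dir: Path) -> bool:
    """True if the directory has id/header/body files and no 'x' file."""
    if not message_dir.is_dir():
        return False
    has_required = all((message_dir / name).is_file() for name in REQUIRED)
    return has_required and not (message_dir / EXPIRED_MARKER).exists()

with tempfile.TemporaryDirectory() as root:
    post = Path(root) / "example-message-id"
    post.mkdir()
    for name in REQUIRED:
        (post / name).write_bytes(b"")
    live = is_well_formed(post)
    (post / EXPIRED_MARKER).write_bytes(b"")
    expired = is_well_formed(post)
```

The same test is easy to write with shell built-ins, which is the point of keeping the layout this plain.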
>
... Where hd bd are the message split into its parts,
with regards to the composition of messages by
concatenating those back together with the computed
message numbers and this kind of thing, with regards to
the site, and the idea that they're stored at-rest pre-compressed,
then knowledge of the compression algorithm makes for
concatenating them in message-composition as compressed.
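
As a sketch of why pre-compressed parts can be concatenated as-is, assuming gzip as the compression (the post does not name an algorithm): gzip streams may be laid end to end, and a reader yields the joined plain text. The header and body bytes here are made up for illustration.

```python
import gzip

header = b"Message-ID: <example@invalid>\r\nSubject: test\r\n\r\n"
body = b"Hello, sci.math.\r\n"

# Each part is compressed once and stored at rest.
hd = gzip.compress(header)
bd = gzip.compress(body)

# Composition is plain byte concatenation of the compressed parts.
wire = hd + bd

# A gzip reader sees two members and yields the concatenated plain text.
restored = gzip.decompress(wire)
```

Site-specific lines, such as computed message numbers, could be compressed as one more small member and placed between the stored ones.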
>
Then, there are variously already relations of the
posts, according to groups, then here as above that
there's a perceived requirement for date, and author.
I.e. these are files on the order of the counts of posts,
or span in time, or count of authors.
>
(About threading and relating posts, is the idea of
matching subjects not-so-much but employing the
References header, then as with regards to IMAP and
parity as for IMAP's THREAD extension, ...,
www.rfc-editor.org/rfc/rfc5256.html , cf SORT and THREAD.
There's a usual sort of notion that sorted, threaded
enumeration is either in date order or thread-tree
traversal order, usually more sensibly date order,
with regards to breaking out sub-threads, variously.
"It's all one thread." IMAP: "there is an implicit sort
criterion of sequence number".)
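
A small sketch of threading off the References header and of the two enumeration orders mentioned, date order and thread-tree traversal order. This is not the RFC 5256 algorithm, only the simplest version of the idea; the tuple layout and the sample posts are assumptions.

```python
from collections import defaultdict

def build_threads(posts):
    """posts: list of (message_id, references list, date string)."""
    known = {mid for mid, _, _ in posts}
    children = defaultdict(list)
    roots = []
    for mid, refs, date in sorted(posts, key=lambda p: p[2]):
        # Walk References from the end: the last entry is the direct parent;
        # fall back to the nearest ancestor that is actually held.
        parent = next((r for r in reversed(refs) if r in known), None)
        if parent is None:
            roots.append(mid)
        else:
            children[parent].append(mid)
    return roots, dict(children)

def walk(roots, children):
    """Depth-first thread-tree traversal order."""
    out = []
    stack = list(reversed(roots))
    while stack:
        mid = stack.pop()
        out.append(mid)
        stack.extend(reversed(children.get(mid, [])))
    return out

posts = [
    ("<c>", ["<a>", "<b>"], "2024-03-04"),
    ("<a>", [], "2024-03-01"),
    ("<d>", ["<a>"], "2024-03-03"),
    ("<b>", ["<a>"], "2024-03-02"),
]
roots, children = build_threads(posts)
date_order = [mid for mid, _, _ in sorted(posts, key=lambda p: p[2])]
tree_order = walk(roots, children)
```

Here date order gives a, b, d, c while tree order gives a, b, c, d, which is the difference between "it's all one thread" and breaking out sub-threads.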
>
>
Then, similarly is for defining models for the sort, summary,
search, SFF, that it sort of (ha) rather begins with sort,
about the idea that it's sort of expected that there will
be a date order partition either as symlinks or as an index file,
or as with regards to that messages date is also stored in
the yd file, then as with regards to "no file-times can be
assumed or reliable", with regards to "there's exactly one
file named YYYY-MM-DD-HH-MM-SS in MessageId/", these
kinds of things. There's a real goal that it works easy
with shell built-ins and text-utils, or "command line",
to work with the files.
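
A sketch of the date-named file, assuming the stamp is normalized to UTC (the post does not say which zone). Zero-padded fields mean a plain text sort is also a date sort, which is what makes it work with text-utils.

```python
from datetime import datetime, timezone

STAMP = "%Y-%m-%d-%H-%M-%S"

def stamp_name(when: datetime) -> str:
    """File name for a post's date, normalized to UTC (an assumption)."""
    return when.astimezone(timezone.utc).strftime(STAMP)

def parse_stamp(name: str) -> datetime:
    """Inverse of stamp_name."""
    return datetime.strptime(name, STAMP).replace(tzinfo=timezone.utc)

names = [
    stamp_name(datetime(2024, 3, 23, 5, 30, 45, tzinfo=timezone.utc)),
    stamp_name(datetime(2024, 3, 2, 13, 44, 0, tzinfo=timezone.utc)),
    stamp_name(datetime(2024, 2, 20, 20, 38, 0, tzinfo=timezone.utc)),
]
# Plain lexicographic sort equals chronological sort for this format.
in_date_order = sorted(names)
```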
>
>
So, sort pretty well goes with filtering.
If you're familiar with the context, of, "data tables",
with a filter-predicate and a sort-predicate,
they're different things but then go together.
It's figured that they get front-ended according
to the quite most usual "column model" of the
"table model" then "yes/no/maybe" row filtering
and "multi-sort" row sorting. (In relational algebra, ...,
or as rather with 'relational algebra with rows and nulls',
this most usual sort of 'composable filtering' and 'multi-sort').
>
Then in IMAP, the THREAD command is "a variant of
SEARCH with threading semantics for the results".
This is where both posts and emails work off the
References header, but it looks like in the wild there
is something like "a vendor does poor-man's subject
threading for you and stuffs in a X-References",
this kind of thing, here with regards to that
instead of concatenation, is that intermediate
results get sorted and threaded together,
then those, get interleaved and stably sorted
together, that being sort of the idea, with regards
to search results in or among threads.
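
The interleaving of already-sorted intermediate results can be sketched with the standard library's `heapq.merge`; the hit tuples and partition names are made up for illustration.

```python
import heapq

# Each partition's hits are already sorted by date: (date, message_id).
archive_hits = [("2024-02-20", "<a>"), ("2024-03-02", "<c>")]
active_hits = [("2024-03-02", "<b>"), ("2024-03-23", "<d>")]

# heapq.merge interleaves sorted inputs lazily; on equal keys, items from
# earlier inputs come first, so the merge is stable across partitions.
merged = list(heapq.merge(archive_hits, active_hits, key=lambda hit: hit[0]))
merged_ids = [mid for _, mid in merged]
```

Nothing is concatenated and re-sorted; each partition is read once, in order.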
>
(Cf www.jwz.org/doc/threading.html as
via www.rfc-editor.org/rfc/rfc5256.html ,
with regards to In-Reply-To and References.
There are some interesting articles there
about "mailbox summarization".)
>
About the summary of posts, one way to start
as for example an interesting article about mailbox
summarization gets into, is, all the necessary text-encodings
to result UTF-8, of Unicode, after UCS-2 or UCS-4 or ASCII,
or CP-1252, in the base of BE or LE BOMs, or anything to
do with summarizing the character data, of any of the
headers, or the body of the text, figuring of course
that everything's delivered as it arrives, as with regards
to the opacity usually of everything vis-a-vis its inspection.
>
This could be a normative sort of file that goes in the MessageId/
folder.
>
cd: character-data, a summary of whatever form of character
encoding, or requirements of unfolding or unquoting, in
the headers or the body or anywhere involved, with
a stamp indicating each of the encodings or character sets.
>
Then, the idea is that it's a pretty deep inspection to
figure out how the various attributes, what are their
encodings, and the body, and the contents, with regards
to a sort of, "a normalized string indicating the necessary
character encodings necessary to extract attributes and
given attributes and the body and given sections", for such
matters of indicating the needful for things like sort,
and collation, in internationalization and localization,
aka i18n and l10n. (Given that the messages are stored
as they arrived and undisturbed.)
>
The idea is that "the cd file doesn't exist for messages
in plain ASCII7, but for anything anywhere else, breaks
out what results how to get it out". This is where text
is often in a sort of format like this.
>
Ascii
it's keyboard characters
ISO8859-1/ISO8859-15/CP-1252
it's Latin1 often though with the Windows guys
Sideout
it's Ascii with 0-127 gigglies or upper glyphs
Wideout
it's 0-255 with any 256 wide characters in upper Unicode planes
Unicode
it's Unicode
>
Then there are all sorts of encodings, this is according to
the rules of Messages with regards to header and body
and content and transfer-encoding and all these sorts
of things, it's Unicode.
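
A first cut at "the cd file doesn't exist for plain ASCII7", as a sketch only: it looks at raw bytes and does not inspect encoded-words or transfer encodings, which a real inspection would have to. The stamp format and the names `needs_cd` and `cd_stamp` are hypothetical.

```python
def needs_cd(raw: bytes) -> bool:
    """True when the stored message has any byte outside 7-bit ASCII."""
    return not raw.isascii()

def cd_stamp(raw: bytes, declared: str = "") -> str:
    """A hypothetical one-line stamp; empty means no cd file is written."""
    if not needs_cd(raw):
        return ""
    try:
        raw.decode("utf-8")
        observed = "utf-8"
    except UnicodeDecodeError:
        observed = "unknown-8bit"
    return f"declared={declared or '-'} observed={observed}"

ascii_stamp = cd_stamp(b"Subject: hi\r\n\r\nplain text")
utf8_stamp = cd_stamp("caf\u00e9".encode("utf-8"), "utf-8")
latin1_stamp = cd_stamp("caf\u00e9".encode("latin-1"))
```

The stored message is never rewritten; the stamp only records what a reader would need to get the text out.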
>
Then, another thing to get figured out is lengths,
the size of contents or counts or lengths, figuring
that it's a great boon to message-composition to
allocate exactly what it needs for when, as a sum
of invariant lengths.
>
Then the MessageId/ files still have unused 'l' and 's',
and though 'l' looks too close to '1', here it's
sort of unambiguous.
>
ld: lengthed, the coded and uncoded lengths of attributes and parts
>
The idea here is to make it easiest for something like
"consult the lengths and allocate it raw, concatenate
the message into it, consult the lengths and allocate
it uncoded, uncode the message into it".
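
A sketch of "consult the lengths and allocate", assuming gzip-coded parts and an ld record of (coded, uncoded) lengths per part; both are assumptions for illustration. `gzip.decompress` still makes a temporary here, so this shows the sizing, not a zero-copy decoder.

```python
import gzip

parts = [b"Subject: test\r\n\r\n", b"Hello, sci.math.\r\n"]
coded = [gzip.compress(p) for p in parts]

# What a hypothetical ld file would record: (coded, uncoded) per part.
ld = [(len(c), len(p)) for c, p in zip(coded, parts)]

# Allocate the raw buffer exactly once from the sum of coded lengths.
raw = bytearray(sum(c_len for c_len, _ in ld))
pos = 0
for chunk in coded:
    raw[pos:pos + len(chunk)] = chunk
    pos += len(chunk)

# Allocate the uncoded buffer exactly once, then uncode into it.
plain = bytearray(sum(u_len for _, u_len in ld))
pos = 0
for chunk, (_, u_len) in zip(coded, ld):
    plain[pos:pos + u_len] = gzip.decompress(chunk)
    pos += u_len
```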
>
So, getting into the SFF, is that basically
"BFF indicates well-formed messages or their expiry",
"SFF is derived via a common algorithm for all messages",
and "some SFF lives next to BFF and is also write-once-read-many",
vis-a-vis that "generally SFF is discardable because it's derivable".
>
>
>
>
>
So, figuring that BFF then is about designed,
basically for storing Internet messages with
regards to MessageId, then about ContentId
and external resources separately, then here
the idea again becomes how to make for
the SFF files, what results, intermediate, tractable,
derivable, discardable, composable data structures,
in files of a format with regards to write-once-read-many,
write-once-read-never, and, "partition it", in terms of
natural partitions like time intervals and categorical attributes.
>
>
There are some various great open-source search
engines, here with respect to something like Lucene
or SOLR or ElasticSearch.
>
The idea is that there are attributes searches,
and full-text searches, those resulting hits,
to documents apiece, or sections of their content,
then backward along their attributes, like
threads and related threads, and authors and
their cliques, while across groups and periods
of time.
>
There's not much of a notion of "semantic search",
though, it's expected to sort of naturally result,
here as for usually enough least distance, as for
"the terms of matching", and predicates from what
results a filter predicate, here with what I call,
"Yes/No/Maybe".
>
Now, what is, "yes/no/maybe", one might ask.
Well, it's the query specification, of the world
of results, to filter to the specified results.
The idea is that there's an accepter network
for "Yes" and a rejector network for "No"
and an accepter network for "Maybe" and
then the rest are rejected.
>
The idea is that the search, is a combination
of a bunch of yes/no/maybe terms, or,
sure/no/yes, to indicate what's definitely
included, what's not, and what is, then that
the term, results that it's composable, from
sorting the terms, to result a filter predicate
implementation, that can run anywhere along
the way, from the backend to the frontend,
this way being a, "search query specification".
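
One way to read sure/no/yes as a composable filter predicate, as a sketch: "sure" keeps a row outright, "no" tosses it, "yes" keeps it, and anything left over is rejected. That precedence is an assumption drawn from "what's definitely included, what's not, and what is"; the rows and predicates are made up.

```python
from typing import Callable, Iterable

Row = dict
Pred = Callable[[Row], bool]

def make_filter(sure: Iterable[Pred], no: Iterable[Pred], yes: Iterable[Pred]) -> Pred:
    """Build one keep/toss predicate from three lists of match terms."""
    sure, no, yes = list(sure), list(no), list(yes)

    def keep(row: Row) -> bool:
        if any(p(row) for p in sure):    # definitely included
            return True
        if any(p(row) for p in no):      # rejected
            return False
        return any(p(row) for p in yes)  # included; the rest are rejected

    return keep

rows = [
    {"author": "bob", "subject": "plutonium"},
    {"author": "carol", "subject": "plutonium"},
    {"author": "carol", "subject": "pinned notice"},
    {"author": "mark", "subject": "geometry"},
]
keep = make_filter(
    sure=[lambda r: "pinned" in r["subject"]],
    no=[lambda r: r["author"] == "carol"],
    yes=[lambda r: "plutonium" in r["subject"]],
)
kept = [r["author"] + ": " + r["subject"] for r in rows if keep(r)]
```

Because the result is an ordinary predicate over a row, the same object can run in the front-end over displayed rows or in the back-end over a partition.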
>
>
There are notions like, "*", and single match
and multimatch, about basically columns and
a column model, of documents, that are
basically rows.
>
>
The idea of course is to build an arithmetic expression,
that also is exactly a natural expression,
for "matches", and "ranges".
>
"AP"|Archimedes|Plutonium in first|last
>
Here, there is a search, for various names, that
it composes this way.
>
AP first
AP last
Archimedes first
Archimedes last
Plutonium first
Plutonium last
>
As you can see, these "match terms", just naturally
break out, then that what gets into negations,
breaks out and doubles, and what gets into ranges,
then, well that involves partitions and ranges,
duplicating and breaking that out.
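
The breaking-out of the example above is a cross product of terms and columns. A minimal sketch, assuming the only syntax is `terms in columns` with `|` as the separator; the helper name `expand` is hypothetical.

```python
from itertools import product

def expand(spec: str):
    """Expand 'a|b|c in x|y' into (term, column) match terms."""
    terms_part, _, columns_part = spec.partition(" in ")
    terms = [t.strip().strip('"') for t in terms_part.split("|")]
    columns = [c.strip() for c in columns_part.split("|")]
    return list(product(terms, columns))

match_terms = expand('"AP"|Archimedes|Plutonium in first|last')
```

Three terms over two columns give the six match terms listed, in the same order.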
>
It results though a very fungible and normal form
of a search query specification, that rebuilds the
filter predicate according to sorting those, then
has very well understood runtime according to
yes/no/maybe and the multimatch, across and
among multiple attributes, multiple terms.
>
>
This sort of enriches a usual sort of query
"exact full hit", with this sort of "ranges and conditions,
exact full hits".
>
So, the Yes/No/Maybe, is the generic search query
specification, overall, just reflecting an accepter/rejector
network, with a bit on the front to reflect keep/toss,
that it's very practical and of course totally commonplace
and easily written broken out as find or wildmat specs.
>
For then these the objects and the terms relating
the things, there's about maintaining this, while
refining it, that basically there's an ownership
and a reference count of the filter objects, so
that various controls according to the syntax of
the normal form of the expression itself, with
most usual English terms like "is" and "in" and
"has" and "between", and "not", with & for "and"
and | for "or", makes that this should be the kind
of filter query specification that one would expect
to be general purpose on all such manners of
filter query specifications and their controls.
>
So, a normal form for these filter objects, then
gets relating them to the SFF files, because, an
SFF file of a given input corpus, satisfies some
of these specifications, the queries, or for example
doesn't, about making the language and files
first of the query, then the content, then just
mapping those to the content, which are built
off extractors and summarizers.
>
I already thought about this a lot. It results
that it sort of has its own little theory,
thus what can result its own little normal forms,
for making a fungible SFF description, what
results for any query, going through those,
running the same query or as so filtered down
the query for the partition already, from the
front-end to the back-end and back, a little
noisy protocol, that delivers search results.
>
>
Wondering about how to implement SFF or summary
and search, the idea seems "well you just use Lucene
like everybody else", and it's like, well, I sort of have
this idea about a query language already, and there's
that I might or might not have the use case of cluster
computing a whole Internet, and pretty much figure
that it's just some partitions and then there's not much
to be usually having massive-memory on-line clusters,
vis-a-vis, low or no traffic, then for the usual idea
that the implementation should auto-scale, be
elastic as it were, and that it should even fall back
to just looking through files or naive search, vis-a-vis
indices.  The idea of partitions is that they indicate
the beginning through the end of changes to data,
that archive partitions can have enduring search indices,
while active partitions have growing search indices.
So, the main idea is that searches make matches make
hits, then the idea that there's a partitions concordance,
then with regards to the index of a document its terms,
then with regards to the most usual sorts of the fungible
forms the inverse document frequency setup, in the middle.
https://en.wikipedia.org/wiki/Concordance
What this gets into then is "growing file / compacting file".
The idea is that occurrences accumulate in the growing
file, forward, and (linear) searches of the growing file
are backward, though what it entails, is that the entries
get accumulated, then compacting is to deduplicate those,
or just pick off the last, then put that into binary tree
or lexicographic, or about the associations of the terms.
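
A sketch of compaction as "pick off the last, then sort", using the "See Spot Run" lines that appear further on in this post. Lower-casing the terms is an assumption, made so the result matches the phone-book listing.

```python
def compact(growing_lines):
    """Keep the last entry per term, then sort lexicographically."""
    latest = {}
    for line in growing_lines:
        term, _, postings = line.partition(":")
        latest[term.strip().lower()] = postings.strip()
    return [f"{term}: {postings}" for term, postings in sorted(latest.items())]

growing = ["See: 1", "Spot: 1", "Run: 1", "See: 1,2", "Spot: 1,2", "Run: 1,2"]
compacted = compact(growing)
```

Six accumulated lines compact to three, one per term, each carrying its fullest posting list.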
"The quick brown fox jumped over the lazy dog."
This is a usual example sentence, "The quick brown
fox jumped over the lazy dog", vis-a-vis, "Lorem ipsum".
https://en.wikipedia.org/wiki/Lorem_ipsum
Ah, it's, "the quick brown fox jumps over the lazy dog",
specifically as a, "pangram", a sentence containing each
letter of the alphabet.
https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
So, the idea is basically to write lines, appending those,
that basically there's a serial appender, then that search
on the active partition, searches backward so can find
the last most full line, which the appender can also do,
with regards to a corresponding "reverse line reader",
with regards to a line-index file, fixed-length offsets
to each line, with regards to memory-mapping the
file, and forward and reverse iterators.
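
A sketch of the serial appender with a line-index file of fixed-length offsets, and a reverse line reader over it. It uses plain seeks rather than memory-mapping, and the file names and 8-byte little-endian offset format are assumptions.

```python
import os
import struct
import tempfile

OFFSET = struct.Struct("<Q")  # one fixed-length 8-byte offset per line

def append_line(data_path, index_path, text):
    """Append one line; record where it starts in the line-index file."""
    with open(data_path, "ab") as data, open(index_path, "ab") as index:
        start = data.seek(0, os.SEEK_END)
        data.write(text.encode("utf-8") + b"\n")
        index.write(OFFSET.pack(start))

def read_lines_reversed(data_path, index_path):
    """Yield lines newest-first by walking the index backward."""
    with open(data_path, "rb") as data, open(index_path, "rb") as index:
        end = data.seek(0, os.SEEK_END)
        count = index.seek(0, os.SEEK_END) // OFFSET.size
        for n in range(count - 1, -1, -1):
            index.seek(n * OFFSET.size)
            (start,) = OFFSET.unpack(index.read(OFFSET.size))
            data.seek(start)
            yield data.read(end - start).rstrip(b"\n").decode("utf-8")
            end = start

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "growing")
    index_path = os.path.join(tmp, "growing.idx")
    for line in ["See: 1", "Spot: 1", "See: 1,2"]:
        append_line(data_path, index_path, line)
    newest_first = list(read_lines_reversed(data_path, index_path))
    # Searching backward finds the most recent, fullest entry first.
    last_see = next(l for l in newest_first if l.startswith("See:"))
```

Since every index entry is the same size, the n-th line is found by arithmetic alone, forward or backward.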
document 1 See Spot Run
document 2 See Spot Run
See: 1
Spot: 1
Run: 1
See: 1,2
Spot: 1,2
Run: 1,2
That for individual terms, blows up very quickly.  Yet,
the idea is that most terms are in archive partitions,
where then those would be stored in a format
basically with lexicographic or phone-book sorting,
seems for something like, "anagram phonebook",
ees: see 1,2
nru: run 1,2
post: spot 1,2
vis-a-vis "plain phone-book",
run: 1,2
see: 1,2
spot: 1,2
the idea that to look up a word, to look up its letters,
or for example its distinct letters,
es: see 1,2
nru: run 1,2
post: spot 1,2
with regards to a pretty agnostic setting of words, by letters.
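
The two keys can be sketched as sorted letters and sorted distinct letters. One assumption to flag: strictly sorting the letters of "spot" gives "opst", where the listings above write "post"; either way the key is shared by all of a word's anagrams.

```python
def anagram_key(word: str) -> str:
    """All letters, sorted: anagrams share one key."""
    return "".join(sorted(word.lower()))

def distinct_key(word: str) -> str:
    """Distinct letters only, sorted."""
    return "".join(sorted(set(word.lower())))

words = ["see", "run", "spot"]
by_letters = {w: anagram_key(w) for w in words}
by_distinct = {w: distinct_key(w) for w in words}
```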
Getting into etymology and stemming, and roots and
the whole shebang of parts-of-speech and synonymity,
would seem to get involved, vis-a-vis symbols and terms,
that in terms of letters like ideograms, results ideograms
work out about the same, as with regards to contents of
single- and multiple-letter, or glyph, words, and these
kinds of things, and for example emojis and the range.
Then another idea that gets involved for close matches
and these kinds of things, is a distance between the minimal
letters, though with regards to hits and misses.
e
es: see 1,2
n
nr
nru: run 1,2
p
po
pos
post: spot 1,2
e 12
es 2
n 345
nr 45
nru 5
p 6789
po 789
pos 89
post 9
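
One reading of the second table, offered as an assumption: each prefix is followed by the positions, 1 through 9, of the lines in the first listing that begin with it. Under that reading it can be rebuilt like this.

```python
keys = ["e", "es", "n", "nr", "nru", "p", "po", "pos", "post"]

# For each prefix, the 1-based positions of the listed keys that start with it.
table = {
    prefix: [n for n, key in enumerate(keys, start=1) if key.startswith(prefix)]
    for prefix in keys
}
rendered = [f"{p} {''.join(map(str, rows))}" for p, rows in table.items()]
```

A shorter prefix covers more lines, so the length of each list is a cheap measure of how close a partial match is to a hit.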
https://en.wikipedia.org/wiki/Nonparametric_statistics
https://en.wikipedia.org/wiki/Summary_statistics
The idea for statistics is to help result when it's
possible for "found the hits", vis-a-vis, "not found
the hits", then also as that search queries and search
results also, become "growing files / compacting files"
in the "active partition / archive partition", of search
results, then with regards to "matching queries /
matching hits", with regards to duplicated queries,
and usual and ordinary queries having compiled hits
for their partitions.  (Active query hits for each
partition.)  This gets into MRU, LRU, this kind of
thing, usual notions of cache affinity and coherency.
https://en.wikipedia.org/wiki/Frecency
Now that's a new one, I never heard of "frecency" before,
but the idea of combining MRU and MFU, most-recently
and most-frequently, makes a lot of sense.
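
A sketch of one way to combine the two, as a sum of exponentially decayed hits. This is not any particular product's formula; the half-life and the query log are made up.

```python
def frecency(hit_times, now, half_life=7.0):
    """Sum of decayed hits: frequent and recent both score high."""
    return sum(0.5 ** ((now - t) / half_life) for t in hit_times)

# Days on which each cached query was run (hypothetical log).
log = {
    "plutonium": [1, 2, 3, 4],   # frequent but old
    "archimedes": [29, 30],      # few but recent
    "spot": [2],                 # rare and old
}
ranked = sorted(log, key=lambda q: frecency(log[q], now=30), reverse=True)
```

Queries at the top of such a ranking are the ones worth keeping compiled hits for, per partition.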
Then this idea for search queries, is to break it down,
or to have a default sort of plan, what results then
the terms search in the sub-query, get composable,
vis-a-vis, building the results.
https://en.wikipedia.org/wiki/Indexed_file
https://en.wikipedia.org/wiki/Inverted_index
The idea for binary tree, seems to find the
beginning and end of ranges, then search
the linear part inside that with two or
alternated iterators, that "exact-match
is worst-case", or middle of the range,
yet it works out that most aren't that bad.
I.e., average case.
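
Finding the beginning and end of a range by binary search, then reading the linear part inside it, can be sketched with the standard `bisect` module. This uses one forward iterator rather than two or alternated ones, and the entries are made up.

```python
from bisect import bisect_left, bisect_right

# A sorted archive file, one (term, posting) per line.
entries = [("run", 1), ("run", 2), ("see", 1), ("see", 2), ("spot", 1), ("spot", 2)]
terms = [term for term, _ in entries]

def lookup(term):
    """Exact match: bisect both ends, then read the slice between them."""
    lo = bisect_left(terms, term)    # beginning of the range
    hi = bisect_right(terms, term)   # one past the end of the range
    return [posting for _, posting in entries[lo:hi]]

def lookup_prefix(prefix):
    """Prefix match: bisect to the start, then scan until the prefix ends."""
    lo = bisect_left(terms, prefix)
    hits = []
    for term, posting in entries[lo:]:
        if not term.startswith(prefix):
            break
        hits.append((term, posting))
    return hits

see_hits = lookup("see")
s_hits = lookup_prefix("s")
```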
https://en.wikipedia.org/wiki/Bag-of-words_model
So, this seems sort of a bag-of-letters model,
about things like common letters and words,
and usual means of reducing words to unambiguous
representations removing "redundant" letters,
about rdndnt lttrs though litters.  I.e. it would
be dictionariological, dictionarial, with here that
being secondary, and after stemming and etymology.
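
The "rdndnt lttrs though litters" point, as a sketch: dropping vowels after the first letter shortens words but collides some of them. The rule and the name `squeeze` are illustrative assumptions, not a real shorthand system.

```python
VOWELS = set("aeiou")

def squeeze(word: str) -> str:
    """Drop vowels after the first letter, shorthand-style."""
    word = word.lower()
    return word[:1] + "".join(ch for ch in word[1:] if ch not in VOWELS)

short = [squeeze(w) for w in ["redundant", "letters", "litters"]]
```

"letters" and "litters" reduce to the same key, which is why this would sit after stemming and a dictionary rather than replace them.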
https://en.wikipedia.org/wiki/Shorthand
https://en.wikipedia.org/wiki/Stemming
(As far as stemming goes, I'm still trying to
figure out plurals, or plural forms.)
https://en.wikipedia.org/wiki/Z39.50
Huh, haven't heard of Z39.50 in a while.
So, it's like, "well this isn't the usual idea of
making Lucene-compatible input files and
making a big old data structure in memory
and a bit of a multi-cast topology and scaling
by exploding" and it isn't, this is much more
of a "modestly accommodate indices to implement
search with growing and compacting files
and natural partitions with what results
sort of being readable and self-describing".
The query format is this idea of "Sure/No/Yes"
which makes for that the match terms,
and the Boolean, or conjunctive and disjunctive,
of course has a sort of natural language
representation into what queries may be,
then about the goals of results of surveying
the corpus for matching the query.
So, part of surveying the corpus, is hits,
direct deep hits to matches.  The other,
is prompts, that given a query term that
matches many, to then refine those.
Then the idea is to select of among those
putting the result into "Sure", then refine
the query, that the query language, supports
a sort of query session, then to result bulk
actions on the selections.
The query language then, is about as simple
and associative as it can be, for example,
by example, then with regards to that there
are attribute-limited searches, or as with
respect to "columns", about rows and columns,
and then usually with regards to the front-end
doing selection and filtering, and sorting,
and the back-end doing this sort of accumulation
of the query session in terms of the refinements
or iterations of the query, to what should result
the idea that then the query is decomposable,
to reflect that then over the partitions over
the SFF files, as it were, the summary and search
data, and then into the documents themselves,
or as with regards to the concordance the
sections, making for a model of query as
both search and selection, and filtering and sorting,
front-end and back-end, that it's pretty usual
in all sorts of "data table" and "search and browse"
type use-cases, or applications.
Archimedes Plutonium
Name Plutonium?
Subject Plutonium?
Body Plutonium?
The usual idea with prompts is to fill the suggestion
bar with question marks, then to use space
to toggle into those, but that gets involved
with "smart search" and "smart bar" implementations.
Name is Archimedes or Plutonium
Subject has Archimedes or Plutonium
Body has Archimedes or Plutonium
bob not carol joan mark
bob joan mark
not carol
bob
not carol joan mark
bob -carol joan mark
Name is Bob, Role is Job
Archimedes Plutonium
* Archimedes * Plutonium
* *
*
See, the idea is that each term is "column*, term*",
then that those are "or" inside, and "and" outside.
Name bob carol joan mark Role job
Then the various ideas of "or" as combining and
"and" as "excluding outside the or", make and
keep things simple, then also as that when
there are ambiguities, then ambiguities can
be presented as alternatives, then those picked out.
cell|desk 206|415 address local <- different columns, "and", implicit
phone is local, or, address local <- different columns, "or", explicit
The idea is that for a corpus, there are only so
many column names, all else being values,
or term-match-predicate inputs.
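
A sketch of "column*, term*" parsing with "or" inside and "and" outside, relying on the point that a corpus has only so many column names. The column set, the substring matching, and the sample rows are assumptions.

```python
COLUMNS = {"name", "role", "subject", "body"}  # hypothetical column names

def parse(query: str):
    """Split 'Name bob carol Role job' into [(column, [terms...]), ...]."""
    groups = []
    for token in query.split():
        if token.lower() in COLUMNS:
            groups.append((token.lower(), []))
        elif groups:
            groups[-1][1].append(token.lower())
        else:
            groups.append(("*", [token.lower()]))  # no column yet: any column
    return groups

def matches(row: dict, groups) -> bool:
    """'or' inside each group, 'and' across groups."""
    def group_ok(column, terms):
        values = row.values() if column == "*" else [row.get(column, "")]
        return any(term in str(v).lower() for term in terms for v in values)
    return all(group_ok(column, terms) for column, terms in groups)

groups = parse("Name bob carol joan mark Role job")
rows = [
    {"name": "Bob", "role": "Job"},
    {"name": "Carol", "role": "Judge"},
    {"name": "Dave", "role": "Job"},
]
kept = [r["name"] for r in rows if matches(r, groups)]
```

Anything that is not a known column name is a value, so the parser needs no quoting or punctuation for the common case.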
2010-   Archimedes Plutonium
It's figured that "between" gets involved in
ranges, either time ranges or lexicographic/alphabetic
ranges, that it's implemented as "not less than"
and "not greater than", that the expression
gets parsed down to these simpler sorts of
match terms, so that then those all combine
then for the single and multiple column cases,
with multiplicity in disjoint ranges, this is sort
of how it is when I designed this and implemented
much of a smart search bit for all the usual use-cases.
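
A sketch of "between" parsed down to the two simpler match terms, reading "2010-" as a range open at the top; that reading, the year-only values, and the helper names are assumptions.

```python
def parse_range(text: str):
    """'2010-' -> (2010, None); '2010-2015' -> (2010, 2015); '-2015' -> (None, 2015)."""
    lo, _, hi = text.partition("-")
    return (int(lo) if lo else None, int(hi) if hi else None)

def between(value, bounds) -> bool:
    """True when value is not less than lo and not greater than hi."""
    lo, hi = bounds
    not_less = lo is None or not value < lo
    not_greater = hi is None or not value > hi
    return not_less and not_greater

years = [2008, 2010, 2013, 2024]
since_2010 = [y for y in years if between(y, parse_range("2010-"))]
early_2010s = [y for y in years if between(y, parse_range("2010-2015"))]
```

The same two comparisons work for lexicographic ranges if the bounds are left as strings.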
"Yes No Maybe", ..., with reference-counted search
control owners in a combined selection, search,
and filtering model, for the front-end and back-end,
both the same data structure, "query session",
then mostly about usual match terms and operators.
It's so sensible that it should be pretty much standard,
basically as follows being defined by a column model.
I.e., it's tabular data.
"Prompts" then is figuring out prompts and tops,
column-tops in a column model, then as with
regards to "Excerpts", is that in this particular
use case, messages almost always include both
references in their threads, and, excerpts in
the replies, to associate the excerpts with their
sources, that being as well a sort of matching,
though that it's helped by the convention,
the so-many-deep so-many-back block-quoting
convention, which though is subject to
not following the convention.
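
A sketch of reading the block-quoting convention: the count of leading '>' markers gives how many replies back an excerpt is expected to come from. It assumes posts follow the convention, which, as said, they may not.

```python
def quote_depth(line: str) -> int:
    """Count leading '>' markers, allowing spaces between them."""
    depth = 0
    for ch in line:
        if ch == ">":
            depth += 1
        elif ch != " ":
            break
    return depth

def excerpts(body: str):
    """Yield (depth, text) for each quoted, non-empty line."""
    for line in body.splitlines():
        depth = quote_depth(line)
        text = line.lstrip("> ").strip()
        if depth and text:
            yield depth, text

reply = "> > the original claim\n> a first reply\nmy new text\n>\n"
found = list(excerpts(reply))
```

Each (depth, text) pair can then be matched against the message that many entries back in References, falling back to the excerpt itself when that source is missing.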
Here then this is for one of the BFF files, if
you might recall or it's here in this thread,
about that block-quoting is a convention,
vis-a-vis the usual top-posting and bottom-posting
and the usual full-excerpt or partial-excerpt
and the usual convention and the destroyed,
that the search hit goes to the source, only
falling back to the excerpt, when the source
doesn't exist, or that it sticks out as "broken"
the 'misquoted out of context', bit.
Yet, the BFF is mostly agnostic and that means
ignorant of anything but "message contents,
one item".  So how the BFF and SFF are co-located,
gets into these things, where there's sort of
1-SFF, that's derivative of one message, 2-SFF,
that's pairwise two messages, then as with
regards to n-SFF, is about the relations of
those, with regards to N-SFF the world of those,
then though P-SFF particularly, the partition
of those, and the pair-wise relations which
explode, and the partition summaries which enclose.
These kinds of things, ....
