On 03/04/2024 11:23 AM, Ross Finlayson wrote:
>
So, figuring that BFF then is about designed,
basically for storing Internet messages with
regards to MessageId, then about ContentId
and external resources separately, then here
the idea again becomes how to make for
the SFF files, what results, intermediate, tractable,
derivable, discardable, composable data structures,
in files of a format with regards to write-once-read-many,
write-once-read-never, and, "partition it", in terms of
natural partitions like time intervals and categorical attributes.
>
>
There are various great open-source search
engines, here with respect to something like Lucene
or SOLR or ElasticSearch.
>
The idea is that there are attribute searches,
and full-text searches, those resulting hits,
to documents apiece, or sections of their content,
then backward along their attributes, like
threads and related threads, and authors and
their cliques, while across groups and periods
of time.
>
There's not much of a notion of "semantic search",
though, it's expected to sort of naturally result,
here as for usually enough least distance, as for
"the terms of matching", and predicates from what
results a filter predicate, here with what I call,
"Yes/No/Maybe".
>
Now, what is, "yes/no/maybe", one might ask.
Well, it's the query specification, of the world
of results, to filter to the specified results.
The idea is that there's an accepter network
for "Yes" and a rejector network for "No"
and an accepter network for "Maybe" and
then the rest are rejected.
>
The idea is that the search, is a combination
of a bunch of yes/no/maybe terms, or,
sure/no/yes, to indicate what's definitely
included, what's not, and what is, then that
the terms result that it's composable, from
sorting the terms, to result a filter predicate
implementation, that can run anywhere along
the way, from the backend to the frontend,
this way being a, "search query specification".
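Here's a minimal sketch of that, in Java, assuming a document
is just a map of attribute to value; the names are hypothetical,
not the actual wire form. The point is that the specification is
data, so the same filter can be rebuilt and run anywhere from
back-end to front-end.

    import java.util.*;

    enum Mode { SURE, NO, YES }

    // One match term: "in" here means substring match.
    record Term(Mode mode, String attribute, String substring) {
        boolean matches(Map<String, String> doc) {
            String v = doc.get(attribute());
            return v != null && v.contains(substring());
        }
    }

    // The search query specification: composable by sorting
    // the terms into their three kinds.
    record QuerySpec(List<Term> terms) {
        List<Term> of(Mode m) {
            return terms().stream().filter(t -> t.mode() == m).toList();
        }
    }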
>
>
There are notions like, "*", and single match
and multimatch, about basically columns and
a column model, of documents, that are
basically rows.
>
>
The idea of course is to build an arithmetic expression,
that also is exactly a natural expression,
for "matches", and "ranges".
>
"AP"|Archimedes|Plutonium in first|last
>
Here, there is a search, for various names, that
it composes this way.
>
AP first
AP last
Archimedes first
Archimedes last
Plutonium first
Plutonium last
>
As you can see, these "match terms", just naturally
break out, then that what gets into negations,
break out and double, and what gets into ranges,
then, well that involves for partitions and ranges,
duplicating and breaking that out.
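A little sketch of that breakout, with hypothetical names: the
value alternates crossed with the column alternates, resulting
exactly the list above.

    import java.util.*;

    final class Breakout {
        // Cross each value alternate with each column alternate,
        // one match term apiece.
        static List<String[]> cross(List<String> values, List<String> columns) {
            List<String[]> terms = new ArrayList<>();
            for (String v : values)
                for (String c : columns)
                    terms.add(new String[] { v, c });
            return terms;
        }

        public static void main(String[] args) {
            for (String[] t : cross(List.of("AP", "Archimedes", "Plutonium"),
                                    List.of("first", "last")))
                System.out.println(t[0] + " " + t[1]);
        }
    }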
>
It results though a very fungible and normal form
of a search query specification, that rebuilds the
filter predicate according to sorting those, then
has very well understood runtime according to
yes/no/maybe and the multimatch, across and
among multiple attributes, multiple terms.
>
>
This sort of enriches a usual sort of query
"exact full hit", with this sort "ranges and conditions,
exact full hits".
>
So, the Yes/No/Maybe, is the generic search query
specification, overall, just reflecting an accepter/rejector
network, with a bit on the front to reflect keep/toss,
that it's very practical and of course totally commonplace
and easily written broken out as find or wildmat specs.
>
For then these objects and the terms relating
the things, there's about maintaining this, while
refining it, that basically there's an ownership
and a reference count of the filter objects, so
that various controls according to the syntax of
the normal form of the expression itself, with
most usual English terms like "is" and "in" and
"has" and "between", and "not", with & for "and"
and | for "or", makes that this should be the kind
of filter query specification that one would expect
to be general purpose on all such manners of
filter query specifications and their controls.
>
So, a normal form for these filter objects, then
gets relating them to the SFF files, because, an
SFF file of a given input corpus, satisfies some
of these specifications, the queries, or for example
doesn't, about making the language and files
first of the query, then the content, then just
mapping those to the content, which are built
off extractors and summarizers.
>
I already thought about this a lot. It results
that it sort of has its own little theory,
thus what can result its own little normal forms,
for making a fungible SFF description, what
results for any query, going through those,
running the same query or as so filtered down
the query for the partition already, from the
front-end to the back-end and back, a little
noisy protocol, that delivers search results.
>
>
>
>
The document is an element of the corpus.
Here each message is a document. Now,
there's a convention in Internet messages,
not always followed, in that the ignorant,
or lacking etiquette, or just plain different,
don't follow it or break it: a convention
of attribution, quoting the
content that's replied to, and, this is
variously "block" or "inline".
>
From the outside though, the document here
has the "overview" attributes, the key-value
pairs of the headers those being, and the
"body" or "document" itself, which can as
well have extracted attributes, vis-a-vis
otherwise its, "full text".
>
https://en.wikipedia.org/wiki/Search_engine_indexing
>
>
The key thing here for partitioning is to
make for date-range partitioning, while,
the organization of the messages by ID is
essentially flat, and constant-time to access one
but linear to trawl through them, although parallelizable,
for example with a parallelizable filter predicate
like yes/no/maybe, before getting into the
inter-document of terms, here the idea is that
there's basically
>
date partition
group partition
>
then as with regards to
>
threads
authors
>
that these are each having their own linear organization,
or as with respect to time-series partitions, and the serial.
>
Then, there are two sorts of data structures
to build with:
>
binary trees,
bit-maps.
>
So, the idea is to build indexes for date ranges
and then just search separately, either linear
or from an in-memory currency, the current.
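For a sketch of those two together, assuming articles get serial
numbers within a date partition, and with names that are mine
not BFF's: a binary tree (TreeMap) keyed by date finds the
partitions in a range, and a bit-map (BitSet) per term marks
which serials hit.

    import java.time.LocalDate;
    import java.util.*;

    final class DateRangeIndex {
        // partition start date -> (term -> bitmap of serials that hit)
        private final TreeMap<LocalDate, Map<String, BitSet>> partitions = new TreeMap<>();

        void add(LocalDate partition, String term, int serial) {
            partitions.computeIfAbsent(partition, d -> new HashMap<>())
                      .computeIfAbsent(term, t -> new BitSet())
                      .set(serial);
        }

        // The hits per partition for a term, only in the date range.
        SortedMap<LocalDate, BitSet> search(String term, LocalDate from, LocalDate to) {
            SortedMap<LocalDate, BitSet> hits = new TreeMap<>();
            for (var e : partitions.subMap(from, true, to, true).entrySet()) {
                BitSet b = e.getValue().get(term);
                if (b != null && !b.isEmpty()) hits.put(e.getKey(), b);
            }
            return hits;
        }
    }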
>
I'm not too interested in "rapid results" as
much as "thoroughly parallelizable and
effectively indexed", and "providing
incremental results" and "full hits".
>
The idea here is to relate date ranges,
to an index file for the groups files,
then to just search the date ranges,
and for example as maybe articles expire,
which here they don't as it's archival,
to relate dropping old partitions with
updating the groups indexes.
>
For NNTP and IMAP then there's,
OVERVIEW and SEARCH. So, the
key attributes relevant to those protocols,
are here to make it so that messages
have an abstraction of an extraction,
those being fixed as what results,
then those being very naively composable,
with regards to building data structures
of those, what with regards to match terms,
evaluate matches in ranges on those.
>
Now, NNTP is basically write-once-read-many,
though I suppose it's mostly write-once-read-
maybe-a-few-times-then-never, while IMAP
basically adds to the notion of the session,
what's read and un-read, and, otherwise
with regards to flags, IMAP flags. I.e. flags
are variables, all this other stuff being constants.
>
>
So, there's an idea to build a sort of, top-down,
or onion-y, layered, match-finder. This is where
it's naively composable to concatenate the
world of terms, in attributes, of documents,
in date ranges and group partitions, to find
"there is a hit" then to dive deeper into it,
figuring the idea is to horizontally scale
by refining date partitions and serial collections,
then parallelize those, where as well that serial
algorithms work the same on those, e.g., by
concatenating those and working on that.
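As a sketch of that, hypothetical types: the same filter predicate
is handed to each partition in parallel, and concatenating the
results is the same as running the serial algorithm over the
concatenation.

    import java.util.*;
    import java.util.concurrent.*;
    import java.util.function.Predicate;

    final class PartitionSearch {
        // Copy the same predicate to each partition, gather hits.
        static <D> List<D> search(List<List<D>> partitions, Predicate<D> filter)
                throws InterruptedException, ExecutionException {
            ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
            try {
                List<Future<List<D>>> futures = new ArrayList<>();
                for (List<D> p : partitions)
                    futures.add(pool.submit(() -> p.stream().filter(filter).toList()));
                List<D> hits = new ArrayList<>();
                for (Future<List<D>> f : futures)
                    hits.addAll(f.get());  // concatenation of partition results
                return hits;
            } finally {
                pool.shutdown();
            }
        }
    }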
>
This is where a group and a date partition
each have a relatively small range, of overview
attributes, and their values, then that for
noisy values, like timestamps, to detect those
and work out what are small cardinal categories
and large cardinal ergodic identifiers.
>
It's sort of like, "Why don't you check out the
book Information Retrieval and read that again",
and, in a sense, it's because I figure that Google
has littered all their no-brainer patterns with junk patents
that instead I expect to clean-room and prior-art this.
Maybe that's not so, I just wonder sometimes how
they've arrived at monopolizing what's a totally
usual sort of "fetch it" routine.
>
>
So, the goal is to find hits, in conventions of
documents, inside the convention of quoting,
with regards to
bidirectional relations of correspondence, and,
unidirectional relations of nesting, those
being terms for matching, and building matching,
then that the match document, is just copied
and sent to each partition in parallel, each
resulting its hits.
>
The idea is to show a sort of search plan, over
the partitions, then that there's incremental
progress and expected times displayed, and
incremental results gathered, digging it up.
>
There's basically for partitions "has-a-hit" and
"hit-count", "hit-list", "hit-stream". That might
sound sort of macabre, but it means search hits
not mob hits, then for the keep/toss and yes/no/maybe,
that partitions are boundaries of sorts, on down
to ideas of "document-level" and "attribute-level"
aspects of, "intromissive and extromissive visibility".
>
>
https://lucene.apache.org/core/3_5_0/fileformats.html
>
https://solr.apache.org/guide/solr/latest/configuration-guide/index-location-format.html
>
>
It seems sort of sensible to adapt to Lucene's index file format,
or, it's pretty sensible, then with regards to default attributes
and this kind of thing, and the idea that threads are
documents for searching in threads and finding the
content actually aside the quotes.
>
Lucene's index file format isn't a data structure itself,
in terms of a data structure built for b-tree/b-map, where
the idea is to result a file, that's a serialization of a data
structure, within it, the pointer relations as to offsets
in the file, so that, it can be loaded into memory and
run, or that, I/O can seek through it and run, but especially
that, it can be mapped into memory and run.
>
I.e., "implementing the lookup" as following pointer offsets
in files, vis-a-vis a usual idea that the pointers are just links
in the tree or off the map, is one of these "SFF" files.
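For instance, a sketch of following pointer offsets in a mapped
file, with a made-up node layout (int key, then long offsets to
left and right children, -1 for null), and a made-up file name:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class MappedTree {
        // node layout: int key | long left | long right  (20 bytes)
        static boolean contains(MappedByteBuffer buf, long root, int key) {
            long at = root;
            while (at >= 0) {
                int k = buf.getInt((int) at);
                if (k == key) return true;
                at = buf.getLong((int) at + (key < k ? 4 : 12));  // follow the offset
            }
            return false;
        }

        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("index.sff"),
                    StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                System.out.println(contains(buf, 0L, 42));
            }
        }
    }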
>
So, for an "index", it's really sort of only the terms then
that they're inverted from the documents that contain
them, to point back to them.
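The inversion itself is about this simple, sketched in memory as
what such an index file would serialize:

    import java.util.*;

    final class Inverter {
        // From document -> terms, to term -> documents containing it.
        static Map<String, SortedSet<Integer>> invert(Map<Integer, List<String>> docs) {
            Map<String, SortedSet<Integer>> index = new HashMap<>();
            docs.forEach((doc, terms) -> {
                for (String term : terms)
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc);
            });
            return index;
        }
    }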
>
Then, because there are going to be index files for each
partition, is that there are terms and there are partitions,
with the idea that the query's broken out by organization,
so that search proceeds only when there's matching partitions,
then into matching terms.
>
AP 2020-2023
>
* AP
!afore(2020)
!after(2023)
>
AP 2019, 2024
>
* AP
!afore(2019)
!after(2019)
>
* AP
!afore(2024)
!after(2024)
>
>
Here for example the idea is to search the partitions
according to they match "natural" date terms, vis-a-vis,
referenced dates, and matching the term in any fields,
then that the range terms result either one query or
two, in the sense of breaking those out and resulting
that then their results get concatenated.
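Sketching that breakout, with my own names for it: each disjoint
range becomes its own separately partitionable query, "not afore
the start, not after the end", and the results concatenate.

    import java.util.*;

    record YearRange(int start, int end) {}

    final class RangeBreakout {
        static List<String> queries(String match, List<YearRange> ranges) {
            List<String> out = new ArrayList<>();
            for (YearRange r : ranges)
                out.add("* " + match + "\n!afore(" + r.start() + ")\n!after(" + r.end() + ")");
            return out;  // each range is one query, results get concatenated
        }

        public static void main(String[] args) {
            // "AP 2019, 2024" breaks out into two queries
            queries("AP", List.of(new YearRange(2019, 2019), new YearRange(2024, 2024)))
                .forEach(q -> System.out.println(q + "\n"));
        }
    }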
>
You can see that "in", here, as "between", for example
in terms of range, is implemented as "not out", for
that this way the Yes/No/Maybe, Sure/No/Yes, runs
>
match _any_ Sure: yes
match _any_ No: no
match _all_ Yes: yes
no
>
I.e. it's not a "Should/Must/MustNot Boolean" query.
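Against the Term sketch from earlier, the run order is about
this, a sketch rather than the definitive evaluator:

    import java.util.*;

    final class YesNoMaybe {
        static boolean accept(Map<String, String> doc,
                              List<Term> sure, List<Term> no, List<Term> yes) {
            for (Term t : sure) if (t.matches(doc)) return true;   // match _any_ Sure: yes
            for (Term t : no)   if (t.matches(doc)) return false;  // match _any_ No: no
            if (yes.isEmpty()) return false;
            for (Term t : yes)  if (!t.matches(doc)) return false; // match _all_ Yes: yes
            return true;
        }
    }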
>
What happens is that this way everything sort
of "or's" together "any", then when are introduced
no's, then those double about, when introduced
between's, those are no's, and when disjoint between's,
those break out otherwise redundant but separately
partitionable, queries.
>
AP not subject|body AI
>
not subject AI
not body AI
AP
>
Then the filter objects have these attributes:
owner, refcount, sure, not, operand, match term.
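Sketched directly off that list, as a plain shape of it:

    // owner and refcount maintain the filter objects while refining,
    // sure/not are the keep/toss and negation bits, then the operand
    // is the attribute and the match term the value. Hypothetical shape.
    final class FilterObject {
        Object owner;
        int refcount;
        boolean sure;
        boolean not;
        String operand;    // e.g. "subject"
        String matchTerm;  // e.g. "AI"
    }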
>
This is a fundamental sort of accepter/rejector that
I wrote up quite a bit on sci.logic, and here a bit.
>
Then this is that besides terms, a given file, has
for partitions, to relate those in terms of dates,
and skip those that don't apply, having that inside
the file, vis-a-vis, having it alongside the file,
pulling it from a file. Basically a search is to
identify SFF files as they're found going along,
then search through those.
>
The term frequency / inverse document frequency,
gets into summary statistics of terms in the documents
of the corpus, here as about those building up out
of partitions, and summing the summaries
with either concatenation or categorical closures.
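A sketch of that summing, names my own: per-partition document
counts and document frequencies just add, so the corpus-wide
statistics are the concatenation of the partition summaries.

    import java.util.*;

    record TermStats(long docCount, Map<String, Long> docFrequency) {
        static TermStats sum(List<TermStats> partitions) {
            long docs = 0;
            Map<String, Long> df = new HashMap<>();
            for (TermStats p : partitions) {
                docs += p.docCount();
                p.docFrequency().forEach((term, n) -> df.merge(term, n, Long::sum));
            }
            return new TermStats(docs, df);
        }

        // The usual inverse document frequency, from the summed summary.
        double idf(String term) {
            return Math.log((double) docCount()
                / (1 + docFrequency().getOrDefault(term, 0L)));
        }
    }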
>
So, about the terms, and the content, here it's
plainly text content, and there is a convention,
the quoting convention. This is where, a reference
is quoted in part or in full, then the content is
either after-article (the article convention), afore-article
(the email convention) or "amidst-article", inline,
interspersed, or combinations thereof.
>
afore-article: reference follows
amidst-article: article split
after-article: reference precedes
>
The idea in the quoting convention, is that
nothing changes in the quoted content,
which is indicated by the text convention.
>
This gets into the idea of sorting the hits for
relevance, and origin, about threads, or references,
when terms are introduced into threads, then
to follow those references, returning threads,
that have terms for hits.
>
The idea is to implement a sort of article-diff,
according to discovering quoting character
conventions, about what would be fragments,
of articles as documents, and documents,
their fragments by quoting, referring to
references, as introducing terms.
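A naive sketch of that splitting, assuming just the one "> "
text convention, where it's the fresh fragments that introduce
terms:

    import java.util.*;

    final class QuoteSplit {
        record Fragment(boolean quoted, String text) {}

        // Split an article into runs of quoted and fresh lines.
        static List<Fragment> split(String article) {
            List<Fragment> out = new ArrayList<>();
            StringBuilder run = new StringBuilder();
            Boolean inQuote = null;
            for (String line : article.split("\n", -1)) {
                boolean q = line.startsWith(">");
                if (inQuote != null && q != inQuote) {
                    out.add(new Fragment(inQuote, run.toString()));
                    run.setLength(0);
                }
                run.append(line).append('\n');
                inQuote = q;
            }
            if (inQuote != null) out.add(new Fragment(inQuote, run.toString()));
            return out;
        }
    }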
>
The references thread then as a data structure,
has at least two ways to look at it. The reference
itself is indicated by a directed-acyclic-graph or
tree built as links, it's a primary attribute, then
there's time-series data, then there's matching
of the subject attribute, and even as that search
results are a sort of thread.
>
In this sense then a thread, is abstractly of threads,
threads have heads, about that hits on articles,
are also hits on their threads, with each article
being head of a thread.
>
>
About common words, basically gets into language.
These are the articles (the definite and indefinite
articles of language), the usual copulas, the usual
prepositions, and all such words of parts-of-speech
that are syntactical and implement referents, and
about how they connect meaningful words, and
into language, in terms of sentences, paragraphs,
fragments, articles, and documents.
>
The idea is that a long enough article will eventually
contain all the common words. It's much structurally
about language, though, and usual match terms of
Yes/No/Maybe or the match terms of the Boolean,
are here for firstly exact match then secondarily
into "fuzzy" match and about terms that comprise
phrases, that the goal is that SFF makes data that
can be used to relate these things, when abstractly
each document is in a vacuum of all the languages
and is just an octet stream or character stream.
>
The, multi-lingual, then, basically figures to have
either common words of multiple languages,
and be multi-lingual, or meaningful words from
multiple languages, then that those are loanwords.
>
So, back to NNTP WILDMAT and IMAP SEARCH, ....
>
https://www.rfc-editor.org/rfc/rfc2980.html#section-3.3
https://datatracker.ietf.org/doc/html/rfc3977#section-4.2
>
If you've ever spent a lot of time making regexes
and running find to match files, wildmat is sort
of sensible and indeed a lot like Yes/No/Maybe.
Kind of like, sed accepts a list of commands,
and sometimes tr, when find, sed, and tr are the tools.
Anyways, WILDMAT is to be implemented
according to SFF backing it, then a reference algorithm.
The match terms of Yes/No/Maybe, don't really have
wildcards. They match substrings. For example
"equals" is equals and "in" is substring and "~" for
"relates" is by default "in". Then, there's either adding
wildcards, or adding anchors, to those, where the
anchors would be "^" for front and "$" for end.
Basically though WILDMAT is a sequence (Yes|No),
indicated by Yes terms not starting with '!' and No
terms marked with '!', then in reverse order,
i.e., right-to-left, any Yes match is yes and any No
match is no, and default is no. So, in Yes/No/Maybe,
it's a stack of Yes/No/Maybe's.
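A sketch of that run, with the wildcards going through a naive
regex translation, which is an assumption for brevity:

    import java.util.List;
    import java.util.regex.Pattern;

    final class Wildmat {
        // Right-to-left, the first pattern that matches decides:
        // '!' marks the No patterns, default is no.
        static boolean matches(List<String> patterns, String text) {
            for (int i = patterns.size() - 1; i >= 0; i--) {
                String p = patterns.get(i);
                boolean negated = p.startsWith("!");
                if (negated) p = p.substring(1);
                String regex = "\\Q" + p.replace("*", "\\E.*\\Q")
                                        .replace("?", "\\E.\\Q") + "\\E";
                if (Pattern.matches(regex, text)) return !negated;
            }
            return false;
        }
    }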
>
Mostly though NNTP doesn't have SEARCH,
so, .... And, wildmat is as much a match term, as
an accepter/rejector, for accepter/rejector algorithms,
that compose as queries.
>
https://datatracker.ietf.org/doc/html/rfc3501#section-6.4.4
>
IMAP defines "keys", these being the language of
the query, then as for expressions in those. Then
most of those get into the flags, counters, and
with regards to the user, session, that get into
the general idea that NNTP's session is just a
notion of "current group and current article",
that IMAP's user and session have flags and counters
applied to each message.
>
Search, then, basically is into search and selection,
and accumulating selection, and refining search,
that basically Sure accumulates as the selection
and No/Yes is the search. This gets relevant in
the IMAP extensions of SEARCH for selection,
then with the idea of commands on the selection.
>
>
>
Relevance: gets into "signal, and noise". That is
to say, back-and-forth references that don't
introduce new terms, are noise, and it's the
introduction of terms, and following that
their reference, that's relevance.
>
For attributes, this basically is for determining
low cardinality and high cardinality attributes,
that low cardinality attributes are categories,
and high cardinality attributes are identifiers.
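A sketch of telling them apart, threshold my own: if the distinct
values stay few against the total, it's a category, else an
identifier.

    import java.util.*;

    final class Cardinality {
        static boolean isCategory(Collection<String> values, double threshold) {
            if (values.isEmpty()) return true;
            Set<String> distinct = new HashSet<>(values);
            return (double) distinct.size() / values.size() <= threshold;
        }

        public static void main(String[] args) {
            List<String> groups = List.of("sci.math", "sci.math", "sci.logic");
            System.out.println(isCategory(groups, 0.5));  // true: a category
        }
    }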
>
This gets into "distance", and relation, then to
find close relations in near distances, helping
to find the beginnings and ends of things.
>
>
So, I figure BFF is about designed, so to carry
it out, and then get into SFF, that to have in
the middle something MFF metadata file-format
or session and user-wise, and the collection documents
and the query documents, yet, the layout of
the files and partitions, should be planned about
that it will grow, either the number of directories
or files, or the depth thereof, and it should be
partitionable, so that it results being able to add
or drop partitions by moving folders or making
links, about that mailbox is a file and maildir is
a directory and here the idea is "unbounded
retention and performant maintenance".
>
It involves read/write, instead of write-once-read-many.
Rather, it involves read/write, or growing files,
and critical transactionality of serialization of
parallel routine, vis-a-vis the semantics of atomic move.
>
Then, for, "distance", is the distances of relations,
about how to relate things, and how to find
regions, that result a small distance among them,
like words and roots and authors and topics
and these kinds of things, to build summary statistics
that are discrete and composable, then that those
naturally build both summaries as digests and also
histograms, not so much "data mining" as "towers of relation".
>
So, for a sort of notion of, "network distance",
is that basically there is time-series data and
auto-association of equality.
>
Then, it's sort of figured out what is a sort
of BFF that results then a "normal physical
store with atomic file semantics".
The partitioning seems essentially date-ranged,
with regards to then getting figured how to
have the groups and overview file made into
delivering the files.
The SFF seems to make for author->words
and thread->words, author<->thread, and
about making intermediate files what result
running longer searches in the unbounded,
while also making for usual sorts of simple
composable queries.
Then, with that making for the data, then
is again to the consideration of the design
of the server runtime, basically about that
there's to be the layers of protocols, that
result the layers indicate the at-rest formats,
i.e. compressed or padded for encryption,
then to make it so that the protocols per
connection mostly get involved with the
"attachment" per connection, which is
basically the private data structure.
This is where the attachment has for
the protocol as much there is of the
session, about what results that
according to the composability of protocols,
in terms of their message composition
and transport in commands, is to result
that the state-machine of the protocol
layering is to result a sort of stack of
protocols in the attachment, here for
that the attachment is a minimal amount
of data associated with a connection,
and would be the same in a sort of
thread-per-connection model, for
a sort of
intra-protocol,
inter-protocol,
infra-protocol,
that the intra-protocol reflects the
command layer, the inter-protocols
reflect message composition and transport,
and the infra-protocol reflects changes
in protocol.
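A sketch of that attachment, interfaces hypothetical: a stack of
protocol layers, the top handling the command layer (intra), the
ones beneath composing and transporting (inter), and pushing or
popping being the change in protocol (infra).

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Deque;

    interface Protocol {
        ByteBuffer handle(ByteBuffer input);  // the command layer
    }

    final class Attachment {
        // The minimal per-connection data: the protocol stack.
        final Deque<Protocol> layers = new ArrayDeque<>();

        ByteBuffer onRead(ByteBuffer input) {
            ByteBuffer b = input;
            for (Protocol p : layers) b = p.handle(b);  // top of stack first
            return b;
        }

        void push(Protocol p) { layers.push(p); }  // change in protocol
        void pop() { layers.pop(); }
    }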
It's similar then with the connection itself,
intra, inter, infra, with regards to the
semantics of flows, and session, with
regards to single connections and their
flows, and multiple connections and
their session.
Then, the layering of protocol seems
much about one sort of command set,
and various sorts of transport encoding,
while relating the session, then another
notion of layering of protocol involves
when one protocol is used to fulfill
another protocol directly, figuring
that instead that's "inside" what reflects
usually upstream/downstream, or request/
response, here about IMAP backed by NNTP
and mail2news and this kind of thing.