Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: olegsivokon (at) *nospam* gmail.com (Left Right)
Newsgroups: comp.lang.python
Date: 01. Oct 2024, 22:03:01
Message-ID: <mailman.23.1727817087.3018.python-list@python.org>

> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.

And what is that external function going to do with this information?
The point is you didn't parse anything if you just sent the digit.
You just delegated the parsing further. Parsing is only meaningful if
you extracted some information, but your idea is, essentially "what if
I do nothing?".

> Under that constraint, I'm not sure I can parse anything. How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?

Nobody says that parsing a number is the only pathological case. You,
however, exaggerate by saying you cannot parse _anything_. You can
parse booleans or null, for example. There's no problem there.
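
To make the number case concrete, here is a minimal sketch (not from the thread; the class name `DigitAccumulator` and the `on_number` callback are made up for illustration, and it only handles unsigned decimal integers). It accumulates digits as proposed, but it cannot call the external function until a terminating character arrives, so the state has to survive across chunk boundaries:

```python
DIGITS = "0123456789"


class DigitAccumulator:
    """Accumulates unsigned decimal integers from text fed in arbitrary chunks."""

    def __init__(self, on_number):
        # Hypothetical callback: invoked once per *complete* number.
        self.on_number = on_number
        # None means "not currently inside a number".
        self.value = None

    def feed(self, chunk):
        for ch in chunk:
            if ch in DIGITS:
                # Multiply the accumulated result by 10 and add the next digit.
                self.value = (self.value or 0) * 10 + int(ch)
            elif self.value is not None:
                # Only a non-digit tells us the magnitude is final.
                self.on_number(self.value)
                self.value = None

    def close(self):
        # End of input also terminates a pending number.
        if self.value is not None:
            self.on_number(self.value)
            self.value = None


seen = []
acc = DigitAccumulator(seen.append)
acc.feed("[12")                      # nothing can be emitted yet: 12 may still grow
pending_after_first_chunk = list(seen)
acc.feed("34, 5")                    # "," finalizes 1234; 5 is still pending
acc.feed("]")                        # "]" finalizes 5
acc.close()
```

Whether holding `self.value` between chunks still counts as "streaming" is exactly the point under dispute here; the sketch only shows where the state lives.
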

Again, I think you misunderstand what streaming is for. Let me remind you:
it's for processing information as it comes, potentially
indefinitely. This has far more important implications than what you
find in computer science. For example, some mathematicians use the
same argument to show that real numbers are either fiction or useless:
consider adding two real numbers (where real numbers are potentially
infinite strings of decimal digits after the period) -- there's no way
to prove that such an addition is possible because you would need an
infinite proof for that (because you need to start adding from the
least significant digit).
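
The dependence on the least significant digits can be illustrated with a small sketch (an illustration only, not a statement about what those mathematicians actually prove; the function name is hypothetical). Given equally long prefixes of the digits after the decimal point of two numbers, it reports the carry into the integer part of their sum when that carry is already decided, and `None` when digits that have not arrived yet could still change it:

```python
def carry_into_integer_part(a_digits, b_digits):
    """Return 0 or 1 if the carry out of the fractional part is decided, else None.

    a_digits and b_digits are equally long strings of decimal digits: the
    known prefixes after the decimal point. More digits may follow each one.
    """
    if len(a_digits) != len(b_digits):
        raise ValueError("prefixes must have the same length")
    scale = 10 ** len(a_digits)
    # Lower bound of the sum of the fractional parts, in units of 1/scale.
    low = int(a_digits or "0") + int(b_digits or "0")
    if low >= scale:
        return 1      # already at least 1, whatever follows
    if low + 2 < scale:
        return 0      # each unseen tail adds at most one unit; still below 1
    return None       # the unseen digits decide


# 0.4999... + 0.5000...: every finite prefix leaves the carry open.
undecided = [carry_into_integer_part("4" + "9" * k, "5" + "0" * k) for k in range(5)]

# One differing digit far to the right settles it, in either direction.
decided_up = carry_into_integer_part("4999", "5001")
decided_down = carry_into_integer_part("4999", "4990")
```
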

In principle, any language that has infinite words will have the same
problem with streaming. If you ever pondered hardware or low-level
protocols such as SCSI or IP, you'd see that they are specifically
designed in such a way as to never have infinite words (because they
must be amenable to streaming). Consider also an interesting
consequence of SCSI not being able to have infinite words: this means,
among other things, that fsync() is nonsense! :) If you aren't
familiar with the concept: the UNIX filesystem API suggests that it's
possible to destage an arbitrarily large file (or a chunk of a file) to
disk. But SCSI is built of finite "words", and to describe an arbitrarily
large file you'd need to list all the blocks that constitute the file! And
that's why fsync() and family are so hated by people who deal with
storage: the only way to implement fsync() in compliance with the
standard is to sync _everything_ (and it hurts!)
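
For readers who haven't met the call, this is roughly how fsync() is reached from Python. It is a minimal sketch: `write_durably` is a made-up helper name, and what the kernel and the device actually do in response is outside the program's view.

```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes to path and ask the OS to push them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to destage this file to the device


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "example.bin")
    write_durably(target, b"x" * 4096)
    size = os.path.getsize(target)
```

Note that the call names only a file descriptor, not a list of blocks; mapping that onto device commands is left to the filesystem and the driver.
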

On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list
<python-list@python.org> wrote:
>
> On 2024-09-30 at 21:34:07 +0200,
> Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
> Left Right via Python-list <python-list@python.org> wrote:
>
> > > What am I missing? Handwavingly, start with the first digit, and as
> > > long as the next character is a digit, multiply the accumulated result
> > > by 10 (or the appropriate base) and add the next value. Oh, and handle
> > > scientific notation as a special case, and perhaps fail spectacularly
> > > instead of recovering gracefully in certain edge cases. And in the
> > > pathological case of a single number with 60 billion digits, run out of
> > > memory (and complain loudly to the person who claimed that the file
> > > contained a "dataset"). But why do I need to start with the least
> > > significant digit?
> >
> > You probably forgot that it has to be _streaming_. Suppose you parse
> > the first digit: can you hand this information over to an external
> > function to process the parsed data? -- No! because you don't know the
> > magnitude yet. What about two digits? -- Same thing. You cannot
> > leave the parser code until you know the magnitude (otherwise the
> > information is useless to the external code).
>
> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.
>
> > So, even if you have enough memory and don't care about special cases
> > like scientific notation: yes, you will be able to parse it, but it
> > won't be a streaming parser.
>
> Under that constraint, I'm not sure I can parse anything. How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?
>
> How much state can a parser maintain (before it invokes an external
> function) and still be considered streaming? I fear that we may be
> getting hung up on terminology rather than solving the problem at hand.
--
https://mail.python.org/mailman/listinfo/python-list