Sujet : Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
De : 2QdxY4RzWzUUiLuE (at) *nospam* potatochowder.com
Groupes : comp.lang.pythonDate : 01. Oct 2024, 17:47:24
Autres entêtes
Message-ID : <mailman.21.1727797649.3018.python-list@python.org>
References : 1 2 3 4 5 6
On 2024-09-30 at 21:34:07 +0200,
Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
Left Right via Python-list <
python-list@python.org> wrote:
What am I missing? Handwavingly, start with the first digit, and as
long as the next character is a digit, multipliy the accumulated result
by 10 (or the appropriate base) and add the next value. Oh, and handle
scientific notation as a special case, and perhaps fail spectacularly
instead of recovering gracefully in certain edge cases. And in the
pathological case of a single number with 60 billion digits, run out of
memory (and complain loudly to the person who claimed that the file
contained a "dataset"). But why do I need to start with the least
significant digit?
You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the
magnitude yet. What about two digits? -- Same thing. You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).
If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.
So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?
How much state can a parser maintain (before it invokes an external
function) and still be considered streaming? I fear that we may be
getting hung up on terminology rather than solving the problem at hand.