Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: olegsivokon (at) *nospam* gmail.com (Left Right)
Newsgroups: comp.lang.python
Date: 30 Sep 2024, 21:34:07
Message-ID: <mailman.19.1727796506.3018.python-list@python.org>
> What am I missing?  Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated result
> by 10 (or the appropriate base) and add the next value.  Oh, and handle
> scientific notation as a special case, and perhaps fail spectacularly
> instead of recovering gracefully in certain edge cases.  And in the
> pathological case of a single number with 60 billion digits, run out of
> memory (and complain loudly to the person who claimed that the file
> contained a "dataset").  But why do I need to start with the least
> significant digit?

You probably forgot that it has to be _streaming_.  Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data?  No, because you don't know the
magnitude yet.  What about two digits?  Same thing.  You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).

So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
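
To make that concrete, here is a minimal sketch (my own illustration,
not anyone's library code; it assumes plain unsigned decimal integers,
with no sign, no fractions and no scientific notation).  The scanner can
update its state on every character, but it has nothing meaningful to
hand to a consumer until it sees the delimiter that ends the number:

def scan_int(chars):
    """Consume digits from a character iterator; return (value, delimiter)."""
    value = 0
    for ch in chars:
        if ch.isdigit():
            # The accumulated value is provisional: its magnitude
            # changes with every further digit we read.
            value = value * 10 + int(ch)
        else:
            # Only here, at the first non-digit, is `value` final and
            # safe to hand over to external code.
            return value, ch
    # Input exhausted mid-number: the pathological 60-billion-digit
    # "dataset" keeps us accumulating until the very end.
    return value, None

# e.g. scan_int(iter("123456,")) returns (123456, ',')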

On Mon, Sep 30, 2024 at 9:30 PM Left Right <olegsivokon@gmail.com> wrote:
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it.  Once unzipped it will be even
> > larger, and all in memory.
>
> GZip is specifically designed to be streamed.  So, that's not a
> problem (in principle), but you would need to have a streaming GZip
> parser; a quick search on PyPI revealed this package:
> https://pypi.org/project/gzip-stream/ .
>
> On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list
> <python-list@python.org> wrote:
> > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> > > On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list
> > > <python-list@python.org> wrote:
> > > >
> > > > import polars as pl
> > > > pl.read_json("file.json")
> > >
> > > This is not going to work unless the computer has a lot more than
> > > 60 GiB of RAM.
> > >
> > > As later suggested, a streaming parser is required.
> >
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it.  Once unzipped it will be even
> > larger, and all in memory.
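
To make the gzip point concrete: the standard library can already do
the streaming decompression, with no third-party package required.  A
minimal sketch (it assumes a single-member gzip stream; the file name
and chunk size are placeholders):

import zlib

def stream_ungzip(chunks):
    # wbits = MAX_WBITS|16 tells zlib to expect a gzip header and
    # trailer, so bytes can be decompressed as they arrive.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in chunks:
        yield d.decompress(chunk)
    yield d.flush()

with open("file.json.gz", "rb") as f:  # or chunks from an HTTP response
    for piece in stream_ungzip(iter(lambda: f.read(64 * 1024), b"")):
        ...  # feed `piece` to an incremental JSON parser here

Each decompressed piece can then be handed to an incremental JSON
parser (ijson is one such package), so the whole 60 GB never has to be
in memory at once.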
