<mailman.19.1727796506.3018.python-list@python.org>

Path: ...!fu-berlin.de!uni-berlin.de!not-for-mail
From: Left Right <olegsivokon@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
 GB) from Kenna API
Date: Mon, 30 Sep 2024 21:34:07 +0200
Lines: 58
Message-ID: <mailman.19.1727796506.3018.python-list@python.org>
References: <CADrxXXmHUwsQbWqNrwzyKWLyTK0J3Hf0z8hAhGwKYoF2PwK7QA@mail.gmail.com>
 <082705B5-7C14-4D33-BF38-73F9CB166293@barrys-emacs.org>
 <9dfcd123-c31d-4207-869c-d5466487cba4@tompassin.net>
 <CAJQBtgkLVyNK+vw4u3bFCFEQDH8T3rpyTL+ERyyYHZJskQR6PQ@mail.gmail.com>
 <CAJQBtgnpNkpg-mF2yFCS4P4GYAYsKQ9nEw3Xygja=SE3-=N2Dw@mail.gmail.com>
In-Reply-To: <CAJQBtgkLVyNK+vw4u3bFCFEQDH8T3rpyTL+ERyyYHZJskQR6PQ@mail.gmail.com>

> What am I missing?  Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated result
> by 10 (or the appropriate base) and add the next value.  Oh, and handle
> scientific notation as a special case, and perhaps fail spectacularly
> instead of recovering gracefully in certain edge cases.  And in the
> pathological case of a single number with 60 billion digits, run out of
> memory (and complain loudly to the person who claimed that the file
> contained a "dataset").  But why do I need to start with the least
> significant digit?

You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the
magnitude yet.  What about two digits? -- Same thing.  You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).

So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
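
To make the point concrete, here is a minimal sketch (not a full JSON
number parser) of an incremental reader for unsigned integers. The
function name stream_integers and the chunked-string input are
assumptions made for illustration. It shows that digits have to be held
back until a terminating character arrives, so the buffer grows with the
length of the number:

```python
from typing import Iterable, Iterator


def stream_integers(chunks: Iterable[str]) -> Iterator[int]:
    """Yield each unsigned integer only once a non-digit terminates it.

    Hypothetical helper for illustration: no signs, fractions or exponents.
    """
    pending = []  # digits seen so far for the number still in progress
    for chunk in chunks:
        for ch in chunk:
            if "0" <= ch <= "9":
                # The magnitude is still unknown, so nothing can be emitted.
                pending.append(ch)
            elif pending:
                # A delimiter fixes the magnitude; now the value can leave.
                yield int("".join(pending))
                pending.clear()
    if pending:
        # End of input also terminates the last number.
        yield int("".join(pending))


# A number may straddle chunk boundaries: "12" + "34" is one value, 1234.
print(list(stream_integers(["12", "34,5", "6 7"])))
```

The caller sees whole numbers only, and the memory used by pending is
proportional to the longest run of digits in the input.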

On Mon, Sep 30, 2024 at 9:30 PM Left Right <olegsivokon@gmail.com> wrote:
>
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
>
> GZip is specifically designed to be streamed.  So that's not a
> problem (in principle), but you would need a streaming GZip
> parser.  A quick search on PyPI turned up this package:
> https://pypi.org/project/gzip-stream/ .
>
> On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list
> <python-list@python.org> wrote:
> >
> > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> > >
> > >
> > >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
> > >>
> > >>
> > >> import polars as pl
> > >> pl.read_json("file.json")
> > >>
> > >>
> > >
> > > This is not going to work unless the computer has a lot more than 60 GiB of RAM.
> > >
> > > As later suggested, a streaming parser is required.
> >
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
> > --
> > https://mail.python.org/mailman/listinfo/python-list
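
On the gzip point quoted above: the standard library can also
decompress gzip data incrementally, without the third-party package. The
sketch below assumes the compressed bytes arrive as an iterable of
chunks (from a socket, an HTTP response body, or a file read in blocks);
the helper name gunzip_stream is hypothetical. It handles a single gzip
member only, which is the usual case for one compressed response.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def gunzip_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress gzip data chunk by chunk (single-member streams only)."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


# Simulate a download that arrives in 64-byte pieces.
payload = b'{"id": 1}\n' * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(gunzip_stream(pieces))
print(restored == payload)
```

In real use the decompressed pieces would be passed straight to a
streaming parser rather than joined, so neither the compressed nor the
uncompressed data has to fit in memory at once.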