From: Left Right
Newsgroups: comp.lang.python
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
Date: Mon, 30 Sep 2024 21:34:07 +0200
References: <082705B5-7C14-4D33-BF38-73F9CB166293@barrys-emacs.org> <9dfcd123-c31d-4207-869c-d5466487cba4@tompassin.net>

> What am I missing?  Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated result
> by 10 (or the appropriate base) and add the next value.  Oh, and handle
> scientific notation as a special case, and perhaps fail spectacularly
> instead of recovering gracefully in certain edge cases.  And in the
> pathological case of a single number with 60 billion digits, run out of
> memory (and complain loudly to the person who claimed that the file
> contained a "dataset").  But why do I need to start with the least
> significant digit?

You probably forgot that it has to be _streaming_.  Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No, because you don't know the
magnitude yet.  What about two digits? -- Same thing.
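
A minimal sketch of the point, in Python (the function name
accumulate_digits is made up for illustration): the accumulator can
report a running value after every digit, but each intermediate value
is provisional, because the place value of the digits already seen
depends on how many more follow.

```python
def accumulate_digits(chunks):
    """Yield the running integer value after each digit arrives.

    `chunks` is any iterable of strings, standing in for pieces of a
    number that arrive one network read at a time.
    """
    value = 0
    for chunk in chunks:
        for ch in chunk:
            if ch not in "0123456789":
                raise ValueError(f"not a decimal digit: {ch!r}")
            # Multiply-and-add, exactly as described in the quoted text.
            value = value * 10 + int(ch)
            yield value


# The number 12345 arriving in three pieces.
partials = list(accumulate_digits(["12", "3", "45"]))
print(partials)
```

The leading 1 is first reported as 1 and only turns out to stand for
10000 once the last digit has arrived, so none of the intermediate
values is something an external consumer could act on.
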
You cannot leave the parser code until you know the magnitude
(otherwise the information is useless to the external code).  So, even
if you have enough memory and don't care about special cases like
scientific notation: yes, you will be able to parse it, but it won't
be a streaming parser.

On Mon, Sep 30, 2024 at 9:30 PM Left Right wrote:
>
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
>
> GZip is specifically designed to be streamed.  So, that's not a
> problem (in principle), but you would need to have a streaming GZip
> parser, quick search in PyPI revealed this package:
> https://pypi.org/project/gzip-stream/ .
>
> On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list
> wrote:
> >
> > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> > >
> > >
> > >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
> > >>
> > >>
> > >> import polars as pl
> > >> pl.read_json("file.json")
> > >>
> > >>
> > >
> > > This is not going to work unless the computer has a lot more than 60GiB of RAM.
> > >
> > > As later suggested a streaming parser is required.
> >
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
> > --
> > https://mail.python.org/mailman/listinfo/python-list
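
For the gzip part of the thread, a rough sketch using only the standard
library rather than the gzip-stream package mentioned above:
zlib.decompressobj can decompress incrementally, so neither the
compressed nor the decompressed data has to be held in full.  The helper
names iter_gunzip and iter_chunks and the sample payload are assumptions
for illustration, and the sketch handles a single-member gzip stream
only.

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decomp.flush()
    if tail:
        yield tail


def iter_chunks(payload, size):
    """Stand-in for a network read loop: slice bytes into fixed-size pieces."""
    for start in range(0, len(payload), size):
        yield payload[start:start + size]


# Made-up sample data; a real download would supply the chunks instead.
original = b'{"id": 1}\n' * 1000
compressed = gzip.compress(original)
restored = b"".join(iter_gunzip(iter_chunks(compressed, 16)))
print(len(compressed), len(restored))
```

In real use the pieces yielded by iter_gunzip would be fed to a
streaming JSON parser instead of being joined back together.
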