From: <avi.e.gross@gmail.com>
Newsgroups: comp.lang.python
Subject: RE: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
Date: Tue, 1 Oct 2024 19:26:52 -0400
Message-ID: <mailman.24.1727825216.3018.python-list@python.org>

This discussion has become less useful.

We can all agree that in Computer Science, real infinities are avoided and, frankly, need not be taken seriously in any serious program.

You can store all kinds of infinities quite compactly, as in a transcendental number you can derive to as many decimal places as you like. Want 1/7 to a thousand decimal places? No problem. You can be given a digit 1 and a digit 7 and asked to do a division to as many digits as you wish in a deterministic manner. I can think of quite a few generators that could easily supply the next digit, or just keep giving the next element from 142857 each time from a circular loop.
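
To make the 1/7 point concrete, here is a minimal sketch of both ideas: a long-division generator that yields one digit at a time, and a circular loop over the repeating block. The function name decimal_digits is just something made up for illustration, not taken from any library.

```python
from itertools import cycle, islice


def decimal_digits(numerator, denominator):
    """Yield the digits after the decimal point of numerator/denominator, forever."""
    remainder = numerator % denominator
    while True:
        # Ordinary schoolbook long division: bring down a zero, emit a digit.
        remainder *= 10
        yield remainder // denominator
        remainder %= denominator


# Take only as many digits as needed; the generator itself never ends.
first_twelve = list(islice(decimal_digits(1, 7), 12))
print(first_twelve)  # [1, 4, 2, 8, 5, 7, 1, 4, 2, 8, 5, 7]

# The "circular loop" variant: just replay the repeating block 142857.
looped = "".join(islice(cycle("142857"), 8))
print(looped)  # 14285714
```

Either way, the "infinite" expansion costs a few bytes of state, and the consumer decides when to stop asking.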

Sines, cosines, pi, e and so on can often be calculated to arbitrary precision by evaluating things like infinite Taylor series for as many terms as needed, up to the precision of the data holding the number as you move along. Similar ideas allow generators to give you as many primes as you want, and no more.

So, if you can store arbitrary Python code as part of your JSON, you can send quite a bit of somewhat compressed data.

The real problem is how the JSON is set up. If you take umpteen data structures and wrap them all in something like a list, then it may be a tad hard to stream, as you may not necessarily be examining the contents till the list finishes gigabytes later. But if, instead, you send lots of smaller parts, such as perhaps sending each row of something like a data.frame individually, the other side can recombine them incrementally into a larger structure such as a data.frame and do some logic on it as it streams, such as keeping only some columns and discarding the rest, or applying filters that only keep rows you care about.

And, of course, all rows could be appended to one, and perhaps more, .CSV files as well, so if you need multiple passes on the data, it can now be processed locally in various modes, including "streamed".

I think that for some purposes, it makes some sense to not stream anything but results. I mean, consider any database that allows a remote login and SQL commands that only stream results. If I only want info on records about company X between July 1 and September 15 of a particular year and only if the amount paid remains zero or is less than the amount owed, ...
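
The row-at-a-time idea above can be sketched roughly as follows. This assumes the sender emits one JSON object per line (often called JSON Lines), which is my assumption and not something the Kenna API is known to do. The field names company, paid, owed and note are hypothetical, and io.StringIO stands in for both the incoming stream and the output .CSV file so the example runs on its own.

```python
import csv
import io
import json


def stream_rows(lines, keep_columns, keep_row):
    """Parse one JSON object per line; yield only wanted columns of wanted rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # only this one row is held in memory
        if keep_row(record):
            yield {name: record.get(name) for name in keep_columns}


# Hypothetical sample standing in for a huge network response or file.
sample = io.StringIO(
    '{"company": "X", "paid": 0, "owed": 120, "note": "a"}\n'
    '{"company": "Y", "paid": 50, "owed": 50, "note": "b"}\n'
    '{"company": "X", "paid": 200, "owed": 150, "note": "c"}\n'
)


def wanted(record):
    """Company X only, and only if nothing was paid or less was paid than owed."""
    return record["company"] == "X" and (
        record["paid"] == 0 or record["paid"] < record["owed"]
    )


columns = ["company", "paid", "owed"]
out = io.StringIO()  # stands in for an open .csv file
writer = csv.DictWriter(out, fieldnames=columns)
writer.writeheader()
for row in stream_rows(sample, columns, wanted):
    writer.writerow(row)  # append as we go; later passes can reread the CSV

csv_text = out.getvalue()
print(csv_text)
```

If the data really does arrive as one giant list, the standard json module will not help here, since json.load reads the whole document before returning anything; an incremental parser would be needed instead.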

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Greg Ewing via Python-list
Sent: Tuesday, October 1, 2024 5:48 PM
To: python-list@python.org
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

On 1/10/24 8:34 am, Left Right wrote:
> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet.

By that definition of "streaming", no parser can ever be streaming, because there will be some constructs that must be read in their entirety before a suitably-structured piece of output can be emitted.

The context of this discussion about integers is the claim that they *could* be parsed incrementally if they were written little endian instead of big endian, but the same argument applies either way.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list

========== REMAINDER OF ARTICLE TRUNCATED ==========