From: Left Right <olegsivokon@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
Date: Tue, 1 Oct 2024 23:03:01 +0200
Message-ID: <mailman.23.1727817087.3018.python-list@python.org>
In-Reply-To: <ZvwZjATEdx8hLhxT@anomaly>
References: <CADrxXXmHUwsQbWqNrwzyKWLyTK0J3Hf0z8hAhGwKYoF2PwK7QA@mail.gmail.com> <082705B5-7C14-4D33-BF38-73F9CB166293@barrys-emacs.org> <9dfcd123-c31d-4207-869c-d5466487cba4@tompassin.net> <CAJQBtgkLVyNK+vw4u3bFCFEQDH8T3rpyTL+ERyyYHZJskQR6PQ@mail.gmail.com> <CAJQBtgnpNkpg-mF2yFCS4P4GYAYsKQ9nEw3Xygja=SE3-=N2Dw@mail.gmail.com> <ZvwZjATEdx8hLhxT@anomaly> <CAJQBtgnjespF-W64mBDYAybvOas12-7zPCjA2=iQuxMMfF73vw@mail.gmail.com>

> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.

And what is that external function going to do with this information? The point is you didn't parse anything if you just sent the digit. You just delegated the parsing further. Parsing is only meaningful if you extracted some information, but your idea is, essentially, "what if I do nothing?".

> Under that constraint, I'm not sure I can parse anything. How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?

Nobody says that parsing a number is the only pathological case. You, however, exaggerate by saying you cannot parse _anything_. You can parse booleans or null, for example. There's no problem there.

Again, I think you misunderstand what streaming is for. Let me remind you: it's for processing information as it comes, potentially indefinitely. This has implications far beyond computer science.
For example, some mathematicians use the same argument to show that real numbers are either fiction or useless: consider adding two real numbers (where real numbers are potentially infinite strings of decimal digits after the period) -- there's no way to prove that such an addition is possible because you would need an infinite proof for that (because you need to start adding from the least significant digit).

In principle, any language that has infinite words will have the same problem with streaming. If you ever pondered hardware or low-level protocols such as SCSI or IP, you'd see that they are specifically designed in such a way as to never have infinite words (because they must be amenable to streaming).

Consider also an interesting consequence of SCSI not being able to have infinite words: this means, among other things, that fsync() is nonsense! :) If you aren't familiar with the concept: the UNIX filesystem API suggests that it's possible to destage an arbitrarily large file (or a chunk of a file) to disk. But SCSI is built of finite "words", and to describe an arbitrarily large file you'd need to list all the blocks that constitute the file! And that's why fsync() and family are so hated by people who deal with storage: the only way to implement fsync() in compliance with the standard is to sync _everything_ (and it hurts!)

On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list <python-list@python.org> wrote:
>
> On 2024-09-30 at 21:34:07 +0200,
> Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
> Left Right via Python-list <python-list@python.org> wrote:
>
> > > What am I missing? Handwavingly, start with the first digit, and as
> > > long as the next character is a digit, multiply the accumulated result
> > > by 10 (or the appropriate base) and add the next value. Oh, and handle
> > > scientific notation as a special case, and perhaps fail spectacularly
> > > instead of recovering gracefully in certain edge cases. And in the
> > > pathological case of a single number with 60 billion digits, run out of
> > > memory (and complain loudly to the person who claimed that the file
> > > contained a "dataset"). But why do I need to start with the least
> > > significant digit?
> >
> > You probably forgot that it has to be _streaming_. Suppose you parse
> > the first digit: can you hand this information over to an external
> > function to process the parsed data? -- No! because you don't know the
> > magnitude yet. What about two digits? -- Same thing. You cannot
> > leave the parser code until you know the magnitude (otherwise the
> > information is useless to the external code).
>
> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.
>
> > So, even if you have enough memory and don't care about special cases
> > like scientific notation: yes, you will be able to parse it, but it
> > won't be a streaming parser.
>
> Under that constraint, I'm not sure I can parse anything. How can I
========== REMAINDER OF ARTICLE TRUNCATED ==========
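
For readers following the exchange, here is a minimal sketch of the digit-accumulation approach described in the quoted text (multiply the running result by 10 and add the next digit), fed one chunk at a time. The class name DigitAccumulator and its methods are hypothetical, invented for this illustration; it handles only unsigned decimal integers and ignores signs, fractions and scientific notation. It is meant to show both sides of the argument: the accumulation itself works incrementally, but no usable value exists until a non-digit terminator arrives.

```python
class DigitAccumulator:
    """Hypothetical helper: builds an unsigned decimal integer from text chunks."""

    def __init__(self):
        self.value = 0           # running result so far
        self.seen_digit = False  # at least one digit consumed
        self.done = False        # a terminating non-digit was seen

    def feed(self, chunk):
        """Consume leading digits of chunk; return the unconsumed remainder."""
        if self.done:
            return chunk
        for i, ch in enumerate(chunk):
            if "0" <= ch <= "9":
                # Multiply the accumulated result by 10 and add the next digit.
                self.value = self.value * 10 + (ord(ch) - ord("0"))
                self.seen_digit = True
            else:
                # Only now is the magnitude of the number known.
                self.done = True
                return chunk[i:]
        return ""

    def result(self):
        """Return the number, or None while its end has not been seen."""
        if self.seen_digit and self.done:
            return self.value
        return None


acc = DigitAccumulator()
acc.feed("12")          # more digits may still follow
print(acc.result())     # None: magnitude not known yet
rest = acc.feed("34,")  # the comma terminates the number
print(acc.result(), repr(rest))
```

Memory use here grows with the number of digits, since the integer itself grows, which is the "60 billion digits" case mentioned in the quoted text.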