Path: ...!fu-berlin.de!uni-berlin.de!not-for-mail
From: Left Right <olegsivokon@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60
 GB) from Kenna API
Date: Mon, 30 Sep 2024 10:41:44 +0200
Lines: 88
Message-ID: <mailman.3.1727704162.3018.python-list@python.org>
References: <CA+hg4RiGjXw3am1s=zVLDpcA-VGS+cWNp_YEyzvS+j2MyDE2Cg@mail.gmail.com>
 <CADrxXXmHUwsQbWqNrwzyKWLyTK0J3Hf0z8hAhGwKYoF2PwK7QA@mail.gmail.com>
 <CA+hg4Rhn8iX7rp0uC=MbOi+8g73wQ4y4=uV0dU0jHdDUz3jk4w@mail.gmail.com>
 <CAJQBtgk122sHzs+=MumYM1HW2DwKm1+i02bqgBKh4oUJYievCg@mail.gmail.com>

Whether and to what degree you can stream JSON depends on the structure
of the JSON document. In the general case JSON cannot be streamed,
although in practice it usually can be.

Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still valid JSON (the format places no limit on how many digits a
number can have). And you cannot process this number in a streaming way,
because the whole document is a single token: its value is not known
until the last digit has been read, so there is nothing to hand over to
the consumer before the input ends.
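
A scaled-down sketch of that pathological case, using only the standard
json module. The 1000-digit size is my stand-in for the 60 GB of digits,
chosen to stay small. It shows that every prefix of such a document is
itself valid JSON, but a different number, so a reader cannot commit to
a value until the input ends.

```python
import json

# A 1 followed by 1000 zeros: a small stand-in for the 60 GB number.
doc = "1" + "0" * 1000
value = json.loads(doc)

# A truncated copy of the document still parses, but to another number.
prefix_value = json.loads(doc[:500])

print(value == 10 ** 1000)    # the full document is one huge integer
print(prefix_value == value)  # the prefix is valid JSON, but not the same value
```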

Typically, however, JSON can be parsed incrementally. The format is
conceptually very simple to write a parser for. There are plenty of
parsers that do that, for example this one:
https://pypi.org/project/json-stream/ . But I'd encourage you to do
it yourself. It's fun, and the resulting parser should end up under
some 50 LoC. It also lets you integrate your desired output more
closely into the parser.
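
To give an idea of the size of such a thing, here is a minimal sketch,
not a full hand-written parser: it leans on the standard library's
json.JSONDecoder.raw_decode and only streams the elements of one
top-level array. That the export is a single top-level array is an
assumption on my part (the thread does not show the structure), and the
name iter_array_items is made up for this example.

```python
import io
import json

WS = " \t\n\r"  # the whitespace characters JSON allows between tokens


def iter_array_items(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array, one at a time.

    `stream` is any text-mode file-like object. Only the element being
    decoded is buffered. Commas are treated leniently (this is a sketch,
    not a validator), and an element that never completes is only
    reported once the stream is exhausted.
    """
    decoder = json.JSONDecoder()
    buf = ""
    eof = False

    def fill():
        # Append the next chunk to the buffer, or record end of input.
        nonlocal buf, eof
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True

    # Find the opening bracket of the array.
    while True:
        buf = buf.lstrip(WS)
        if buf:
            break
        if eof:
            raise ValueError("empty input")
        fill()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        buf = buf.lstrip(WS)
        if not buf:
            if eof:
                raise ValueError("unexpected end of input inside array")
            fill()
            continue
        if buf[0] == "]":
            return
        if buf[0] == ",":
            buf = buf[1:]
            continue
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            if eof:
                raise
            fill()  # the element is probably just incomplete: read more
            continue
        rest = buf[end:].lstrip(WS)
        # Only trust the element once a delimiter follows it; otherwise a
        # number such as 1.5 split across two chunks would be cut short.
        if not rest or rest[0] not in ",]":
            if eof:
                raise ValueError("expected ',' or ']' after array element")
            fill()
            continue
        yield item
        buf = rest


# Tiny demonstration with a deliberately small chunk size.
sample = io.StringIO('[{"id": 1}, {"id": 2}, 1.5e2, "a,]b"]')
for item in iter_array_items(sample, chunk_size=4):
    print(item)
```

For a gzip-compressed file on disk, passing gzip.open(path, "rt") as the
stream should work the same way. Re-decoding the buffer after every
failed attempt is wasteful for very large elements; a real parser would
keep its state between chunks instead.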

On Mon, Sep 30, 2024 at 8:44 AM Asif Ali Hirekumbi via Python-list
<python-list@python.org> wrote:
>
> Thanks Abdur Rahmaan.
> I will give it a try !
>
> Thanks
> Asif
>
> On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer <
> arj.python@gmail.com> wrote:
>
> > Idk if you tried Polars, but it seems to work well with JSON data
> >
> > import polars as pl
> > pl.read_json("file.json")
> >
> > Kind Regards,
> >
> > Abdur-Rahmaan Janhangeer
> > about <https://compileralchemy.github.io/> | blog
> > <https://www.pythonkitchen.com>
> > github <https://github.com/Abdur-RahmaanJ>
> > Mauritius
> >
> >
> > On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list <
> > python-list@python.org> wrote:
> >
> >> Dear Python Experts,
> >>
> >> I am working with the Kenna Application's API to retrieve vulnerability
> >> data. The API endpoint provides a single, massive JSON file in gzip
> >> format, approximately 60 GB in size. Handling such a large dataset in
> >> one go is proving to be quite challenging, especially in terms of
> >> memory management.
> >>
> >> I am looking for guidance on how to efficiently stream this data and
> >> process it in chunks using Python. Specifically, I am wondering if
> >> there’s a way to use the requests library or any other libraries that
> >> would allow us to pull data from the API endpoint in a memory-efficient
> >> manner.
> >>
> >> Here are the relevant API endpoints from Kenna:
> >>
> >>    - Kenna API Documentation
> >>    <https://apidocs.kennasecurity.com/reference/welcome>
> >>    - Kenna Vulnerabilities Export
> >>    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>
> >>
> >> If anyone has experience with similar use cases or can offer any advice,
> >> it would be greatly appreciated.
> >>
> >> Thank you in advance for your help!
> >>
> >> Best regards
========== REMAINDER OF ARTICLE TRUNCATED ==========