Article <slrnvtilds.1usp.anthk@openbsd.home>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anthk <anthk@openbsd.home>
Newsgroups: sci.misc
Subject: Re: difficulty extracting data from PDFs
Date: Tue, 18 Mar 2025 11:23:39 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <slrnvtilds.1usp.anthk@openbsd.home>
References: <67d0deeb$1$19$882e4bbb@reader.netnews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Mar 2025 12:23:39 +0100 (CET)
Injection-Info: dont-email.me; posting-host="b35ebebce37c0d7a1e2bb60388585270";
	logging-data="2524736"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX184jFPmAvivy5rWbl9pxJOp"
User-Agent: slrn/1.0.3 (OpenBSD)
Cancel-Lock: sha1:5+5dO7noiBh1tb9Ftio80RYopQ4=
Bytes: 3400

On 2025-03-12, Retrograde <fungus@amongus.com.invalid> wrote:
> From the «cry me a river, AI» department:
> Title: Why Extracting Data from PDFs Remains a Nightmare for Data Experts
> Author: feedback@slashdot.org
> Date: Tue, 11 Mar 2025 17:26:00 +0000
> Link: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0mainlinkanon&utm_medium=feed
>
> Businesses, governments, and researchers continue to struggle with extracting
> usable data from PDF files, despite AI advances. These digital documents
> contain valuable information for everything from scientific research to
> government records, but their rigid formats make extraction difficult. "PDFs
> are a creature of a time when print layout was a big influence on publishing
> software," Derek Willis, a lecturer in Data and Computational Journalism at the
> University of Maryland, told ArsTechnica. This print-oriented design means many
> PDFs are essentially "pictures of information" requiring optical character
> recognition (OCR) technology. Traditional OCR systems have existed since the
> 1970s but struggle with complex layouts and poor-quality scans. New AI language
> models from companies like Google and Mistral now attempt to process documents
> more holistically, with varying success. "Right now, the clear leader is
> Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's
> recent OCR solution "performed poorly" in tests.
>
> Read more of this story[5] at Slashdot.
>
> Links:
> [5]: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0moreanon&utm_medium=feed (link)

Why not use Recoll? It runs on Linux, Unix, macOS and Windows.

https://www.recoll.org/index.html

Recoll, not Recall.