Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anthk <anthk@openbsd.home>
Newsgroups: sci.misc
Subject: Re: difficulty extracting data from PDFs
Date: Tue, 18 Mar 2025 11:23:39 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <slrnvtilds.1usp.anthk@openbsd.home>
References: <67d0deeb$1$19$882e4bbb@reader.netnews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Mar 2025 12:23:39 +0100 (CET)
Injection-Info: dont-email.me; posting-host="b35ebebce37c0d7a1e2bb60388585270"; logging-data="2524736"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX184jFPmAvivy5rWbl9pxJOp"
User-Agent: slrn/1.0.3 (OpenBSD)
Cancel-Lock: sha1:5+5dO7noiBh1tb9Ftio80RYopQ4=
Bytes: 3400

On 2025-03-12, Retrograde <fungus@amongus.com.invalid> wrote:
> From the «cry me a river, AI» department:
> Title: Why Extracting Data from PDFs Remains a Nightmare for Data Experts
> Author: feedback@slashdot.org
> Date: Tue, 11 Mar 2025 17:26:00 +0000
> Link: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0mainlinkanon&utm_medium=feed
>
> Businesses, governments, and researchers continue to struggle with extracting
> usable data from PDF files, despite AI advances. These digital documents
> contain valuable information for everything from scientific research to
> government records, but their rigid formats make extraction difficult. "PDFs
> are a creature of a time when print layout was a big influence on publishing
> software," Derek Willis, a lecturer in Data and Computational Journalism at the
> University of Maryland, told ArsTechnica. This print-oriented design means many
> PDFs are essentially "pictures of information" requiring optical character
> recognition (OCR) technology. Traditional OCR systems have existed since the
> 1970s but struggle with complex layouts and poor-quality scans. New AI language
> models from companies like Google and Mistral now attempt to process documents
> more holistically, with varying success. "Right now, the clear leader is
> Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's
> recent OCR solution "performed poorly" in tests.
>
> [image 2][2][image 4][4]
>
> Read more of this story[5] at Slashdot.
>
> Links:
> [1]: http://twitter.com/home?status=Why+Extracting+Data+from+PDFs+Remains+a+Nightmare+for+Data+Experts%3A+https%3A%2F%2Fit.slashdot.org%2Fstory%2F25%2F03%2F11%2F1726218%2F%3Futm_source%3Dtwitter%26utm_medium%3Dtwitter (link)
> [2]: https://a.fsdn.com/sd/twitter_icon_large.png (image)
> [3]: http://www.facebook.com/sharer.php?u=https%3A%2F%2Fit.slashdot.org%2Fstory%2F25%2F03%2F11%2F1726218%2Fwhy-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts%3Futm_source%3Dslashdot%26utm_medium%3Dfacebook (link)
> [4]: https://a.fsdn.com/sd/facebook_icon_large.png (image)
> [5]: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0moreanon&utm_medium=feed (link)

Why not Recoll under Linux/Unix/Mac/Windows?

https://www.recoll.org/index.html

Recoll, not Recall.