difficulty extracting data from PDFs

Subject: difficulty extracting data from PDFs
From: fungus (at) *nospam* amongus.com.invalid (Retrograde)
Newsgroups: sci.misc
Date: 12 Mar 2025, 02:10:03
Message-ID: <67d0deeb$1$19$882e4bbb@reader.netnews.com>
From the «cry me a river, AI» department:
Title: Why Extracting Data from PDFs Remains a Nightmare for Data Experts
Author: feedback@slashdot.org
Date: Tue, 11 Mar 2025 17:26:00 +0000
Link: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0mainlinkanon&utm_medium=feed

Businesses, governments, and researchers continue to struggle with extracting
usable data from PDF files, despite AI advances. These digital documents
contain valuable information for everything from scientific research to
government records, but their rigid formats make extraction difficult. "PDFs
are a creature of a time when print layout was a big influence on publishing
software," Derek Willis, a lecturer in Data and Computational Journalism at the
University of Maryland, told Ars Technica. This print-oriented design means many
PDFs are essentially "pictures of information" requiring optical character
recognition (OCR) technology. Traditional OCR systems have existed since the
1970s but struggle with complex layouts and poor-quality scans. New AI language
models from companies like Google and Mistral now attempt to process documents
more holistically, with varying success. "Right now, the clear leader is
Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's
recent OCR solution "performed poorly" in tests.

Read more of this story[5] at Slashdot.

Links:
[5]: https://it.slashdot.org/story/25/03/11/1726218/why-extracting-data-from-pdfs-remains-a-nightmare-for-data-experts?utm_source=rss1.0moreanon&utm_medium=feed (link)
