Comment on “Why extracting data from PDFs is still a nightmare for data experts”
GenderNeutralBro@lemmy.sdf.org, 2 weeks ago:

Well I’m sorry, but most PDF distillers since the 90s have come with OCR software that can extract text from the images and store it in a way that preserves the layout AND the meaning.
The accuracy rate of even the best OCR software is far, far too low for a wide array of potential use cases.
Let’s say I have an archive of a few thousand scientific papers. These are neatly formatted, born-digital documents, not even scanned images (though scanned images would also be within scope here and shouldn’t be ignored). Even for that, there’s nothing out there that can produce reliably accurate results. Everything requires painstaking validation and correction if you really care about accuracy.
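To be concrete about what “just extract the text” means here, this is a minimal sketch using PyMuPDF on a placeholder file, not anyone’s production pipeline, and even on born-digital papers the output still needs human checking:

```python
# Rough sketch: dump the text layer of a born-digital PDF with PyMuPDF
# (pip install pymupdf). The file name is a placeholder.
import fitz  # PyMuPDF

doc = fitz.open("some_paper.pdf")
pages = [page.get_text("text") for page in doc]  # text in the order PyMuPDF reconstructs it
doc.close()

text = "\n\n".join(pages)

# Even with a clean digital PDF, multi-column reading order, ligatures,
# hyphenation at line breaks, footnotes, and math symbols routinely come
# out mangled, which is why the validation step never goes away.
print(text[:2000])
```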
Even arXiv can’t do a perfect job of this. They launched their “beta” HTML converter a couple of years ago, and improving its accuracy and reliability is still an ongoing challenge. And that’s with the help of LaTeX source material! It would naturally be much, much harder if they had to rely solely on the PDFs generated from that LaTeX. See: info.arxiv.org/about/accessible_HTML.html
As for solving this problem with “AI”…uh…well, it’s not like “OCR” and “AI” are mutually exclusive terms. OCR tools have been using neural networks for a very long time already; it just wasn’t a buzzword back then, so nobody called it “AI”. However, in the current landscape of “AI” in 2025, “accuracy” is usually just a happy accident. It doesn’t need to be that way, and I’m sure the folks behind commercial and open-source OCR tools are hard at work implementing new technology in a way that Doesn’t Suck.
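For example, Tesseract has shipped an LSTM-based (i.e. neural-net) recognizer since version 4, and you invoke it the same way as the old engine. A minimal sketch, assuming Tesseract plus the pytesseract and Pillow packages are installed; the image path is a placeholder:

```python
# Tesseract's default engine has been an LSTM neural net since v4,
# so "OCR" was already "AI" long before the buzzword. Placeholder image path.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(
    Image.open("scanned_page.png"),
    config="--oem 1 --psm 3",  # --oem 1 = LSTM engine, --psm 3 = automatic page segmentation
)
print(text)
```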
I’ve played around with various vision-language (VL) models, and they still seem to be in the “proof of concept” phase.