Idea : merge digital extraction result and OCR result. #24

Shadow-Alex · 2023-06-17T08:58:12Z

Hi, I'm currently developing a PDF parser specialised for math PDFs. Non-OCR solutions give very accurate text because the text is extracted directly rather than recognized optically. So, is it possible to merge the non-OCR results with the Pix2Text results to improve accuracy?

breezedeus · 2023-06-17T09:26:54Z

Can't the text of a PDF be parsed out directly? Where would the accuracy improvement come from?

Shadow-Alex · 2023-06-17T13:53:51Z

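If the PDF contains formulas, direct parsing preserves the text but breaks the formulas.
To extract the formulas correctly, you need OCR, which means converting the PDF to images. But running OCR over the whole page means the plain text goes through OCR as well. That text can instead be replaced with the directly extracted result, which is 100% accurate and therefore better than the OCR output.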

breezedeus · 2023-07-02T15:41:09Z

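So, as I understand it, this could work as follows: convert the PDF to images, use P2T's MFD to detect where the math formulas are, and then, in the original PDF, replace the text at those positions with the recognized LaTeX.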
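
The replacement step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Pix2Text's API: the `Box` type, `center_inside`, and `replace_formula_regions` are hypothetical names, the word boxes are assumed to come from some PDF text extractor in reading order, and the formula boxes (with their LaTeX) are assumed to come from a formula detector and to share the same coordinate system as the words.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical box: an extracted word, or a detected formula with its LaTeX."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def center_inside(word: Box, region: Box) -> bool:
    """True if the center of `word` lies inside `region`."""
    cx = (word.x0 + word.x1) / 2
    cy = (word.y0 + word.y1) / 2
    return region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1


def replace_formula_regions(words: list[Box], formulas: list[Box]) -> str:
    """Keep extracted words outside formula regions; emit each formula's
    LaTeX once, at the position of the first word it covers.

    Assumes `words` is already in reading order. A formula region that
    covers no extracted word is not emitted by this sketch.
    """
    out = []
    emitted = set()
    for w in words:
        hit = next((i for i, f in enumerate(formulas) if center_inside(w, f)), None)
        if hit is None:
            out.append(w.text)          # plain text: trust the extraction
        elif hit not in emitted:
            out.append(f"${formulas[hit].text}$")  # formula: trust the OCR
            emitted.add(hit)
    return " ".join(out)


# Made-up coordinates for a single line of text.
words = [
    Box(0, 0, 30, 10, "Here"), Box(35, 0, 55, 10, "the"),
    Box(60, 0, 110, 10, "manager"), Box(115, 0, 140, 10, "will"),
    Box(145, 0, 180, 10, "mock"), Box(185, 0, 200, 10, "Ct"),
    Box(205, 0, 212, 10, "="), Box(217, 0, 280, 10, "blablabla"),
]
formulas = [Box(183, -2, 282, 12, r"C_t = \sum(blablabla)")]
print(replace_formula_regions(words, formulas))
```

Testing the word center rather than requiring full containment tolerates detector boxes that are slightly tighter or looser than the extracted glyph boxes.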

Shadow-Alex · 2023-07-02T18:34:50Z

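Exactly. For example:
OCR result: "Here the manager will moc $C_t = \sum(blablabla)$". Here the text part misreads "mock" as "moc", but the formula is correct.
Non-OCR extraction result: "Here the manager will mock Ct = blablabla". Here all of the text is correct, but the formula is garbled.
One simple idea is to fuzzy-match the corresponding parts of the two results, and then either replace the garbled formula in the non-OCR result or replace the text in the OCR result.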
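
A minimal sketch of the fuzzy-matching idea, using only Python's standard `difflib`. It takes the second route (start from the OCR result and replace its text with the extracted text). The function names `tokenize` and `merge`, the `cutoff` value, and the assumption that OCR formulas are delimited by `$...$` are illustrative choices, not part of Pix2Text.

```python
import re
from difflib import SequenceMatcher, get_close_matches

FORMULA = re.compile(r"\$[^$]+\$")


def tokenize(text: str) -> list[str]:
    """Split on whitespace, keeping each $...$ formula as a single token."""
    return re.findall(r"\$[^$]+\$|\S+", text)


def merge(ocr_text: str, extracted_text: str, cutoff: float = 0.6) -> str:
    """Keep formulas from the OCR result and text from the extraction.

    Tokens are aligned with SequenceMatcher. Inside each non-matching
    block, an OCR formula is kept as-is, and an OCR word is replaced by
    its closest extracted word in that block, if one is similar enough.
    Extracted tokens with no OCR counterpart are dropped by this sketch.
    """
    ocr = tokenize(ocr_text)
    ext = tokenize(extracted_text)
    merged = []
    matcher = SequenceMatcher(a=ocr, b=ext, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(ocr[i1:i2])
            continue
        candidates = ext[j1:j2]
        for token in ocr[i1:i2]:
            if FORMULA.fullmatch(token):
                merged.append(token)  # formula: trust the OCR
            else:
                close = get_close_matches(token, candidates, n=1, cutoff=cutoff)
                merged.append(close[0] if close else token)
    return " ".join(merged)


ocr_result = r"Here the manager will moc $C_t = \sum(blablabla)$"
extracted = "Here the manager will mock Ct = blablabla"
print(merge(ocr_result, extracted))
```

Aligning whole tokens first keeps the fuzzy lookup local to each mismatched block, so an OCR error is only ever corrected with a word from the same neighbourhood of the extracted text.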

Jzhnakui · 2023-11-22T08:14:21Z

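> Exactly. For example: OCR result: "Here the manager will moc $C_t = \sum(blablabla)$". Here the text part misreads "mock" as "moc", but the formula is correct. Non-OCR extraction result: "Here the manager will mock Ct = blablabla". Here all of the text is correct, but the formula is garbled. One simple idea is to fuzzy-match the corresponding parts of the two results, and then either replace the garbled formula in the non-OCR result or replace the text in the OCR result.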

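Yes. For born-digital PDFs, layout recovery can be combined with the Layout-Parser project (Layout-Parser/layout-parser: A Unified Toolkit for Deep Learning Based Document Image Analysis: https://github.com/Layout-Parser/layout-parser).
That project can extract the PDF's native character blocks directly, and it retrieves them by coordinates. In my tests, breezedeus's layout analysis is far ahead of the models that layout-parser provides, but breezedeus's model does not yet support parsing text characters directly from PDF coordinates.

The two could be combined.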
