PDF OCR Searching

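Introduction

Some scanned PDF books contain only page images, so ordinary text-based search does not work on them. This script uses OCR to search such files for a keyword.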

Usage

Install dependencies

pip3 install pillow pytesseract PyMuPDF
sudo apt-get install tesseract-ocr

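Alternatively, follow the instructions in ref to install the Windows build of tesseract-ocr.

You also need to download the Chinese recognition model from tessdata-fast or tessdata.

The name of the Chinese model can be looked up on the tesseract website.

For now, the absolute path of the PDF file to search and the keyword to look for can only be set inside pdf-ocrsearch.py: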

SEARCHING_TARGET = "软件"  # keyword to search for
PDF_file = Path(r"test.pdf")  # PDF file to search; replace with the absolute path of your file
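The internals of pdf-ocrsearch.py are not shown in this README. As a rough sketch of how a per-page OCR search could be built from the dependencies listed above, the following renders one page with PyMuPDF, recognizes it with pytesseract, and checks for the keyword. The function names, the `chi_sim` model name, and the 200 dpi setting are assumptions made for illustration, not the script's actual code.

```python
import io
from pathlib import Path


def normalize(text: str) -> str:
    """Remove all whitespace. OCR output for Chinese can contain spaces
    or line breaks between characters, which would break a plain match."""
    return "".join(text.split())


def page_matches(ocr_text: str, keyword: str) -> bool:
    """Return True if the keyword occurs in the recognized page text."""
    return normalize(keyword) in normalize(ocr_text)


def ocr_page(pdf_path: Path, page_number: int, lang: str = "chi_sim", dpi: int = 200) -> str:
    """Render one page to an image and return the text Tesseract recognizes.

    Assumes the Tesseract binary and the model named by `lang` are installed.
    """
    # Third-party imports stay local so the helpers above work without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(str(pdf_path)) as doc:
        pix = doc[page_number].get_pixmap(dpi=dpi)
        png_bytes = pix.tobytes("png")
    image = Image.open(io.BytesIO(png_bytes))
    return pytesseract.image_to_string(image, lang=lang)
```

A call such as `page_matches(ocr_page(PDF_file, 0), SEARCHING_TARGET)` would then report whether the first page contains the keyword.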

Results

With the multithreading optimization, the script averages about 2 s per page, half the time it takes without the optimization.
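One way to parallelize across pages is a thread pool. pytesseract runs the Tesseract binary as a separate process, so threads spend most of their time waiting on it and can overlap. The sketch below is an assumption about how this could be structured, not the script's actual implementation. The `ocr` callable is a hypothetical stand-in for whatever function returns the recognized text of a page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def search_pages(ocr: Callable[[int], str], page_count: int,
                 keyword: str, workers: int = 4) -> List[int]:
    """Run OCR on all pages in a thread pool and return the 0-based
    numbers of the pages whose text contains the keyword."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in page order even though the pages
        # are processed concurrently.
        texts = pool.map(ocr, range(page_count))
        return [n for n, text in enumerate(texts) if keyword in text]
```

The `workers` value is a tuning knob. A number close to the count of CPU cores is a reasonable starting point, since each OCR call keeps one core busy.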

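OCR time is the biggest bottleneck. The only remaining option is to cache the intermediate OCR results, but that would be of limited value.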
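For reference, caching the recognized text could look like the sketch below. It stores one JSON file per page, keyed by a hash of the PDF contents. The function name, the cache layout, and the `ocr` callable are all hypothetical. Such a cache only pays off when the same file is searched more than once, and it does nothing for the first search.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_ocr(pdf_path: Path, page_number: int,
               ocr: Callable[[Path, int], str], cache_dir: Path) -> str:
    """Return the OCR text of a page, running `ocr` only on a cache miss."""
    # Hash the file contents so a renamed or edited PDF gets its own entries.
    # A real tool would compute this digest once per file, not once per page.
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()[:16]
    cache_file = cache_dir / f"{digest}-{page_number}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = ocr(pdf_path, page_number)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": text}, ensure_ascii=False),
                          encoding="utf-8")
    return text
```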

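Future work

Improve search speed, with a target of 100 pages per second
Print the context surrounding each search hit on the command line
Support mixed Chinese and English text, such as computer science books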
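The last two items above could be approached as in the following sketch, which is only an illustration and not part of the current script. `keyword_contexts` is a hypothetical helper that cuts a window of characters around each hit. For mixed text, Tesseract accepts several models joined with `+`, so a language string such as `chi_sim+eng` can be passed to pytesseract, assuming both models are installed.

```python
from typing import List

# Tesseract language spec combining simplified Chinese and English.
MIXED_LANG = "chi_sim+eng"


def keyword_contexts(text: str, keyword: str, width: int = 20) -> List[str]:
    """Return a snippet of up to `width` characters on each side of
    every occurrence of the keyword in the text."""
    if not keyword:
        return []
    snippets = []
    start = text.find(keyword)
    while start != -1:
        left = max(0, start - width)
        right = min(len(text), start + len(keyword) + width)
        # Flatten line breaks so each snippet prints on one line.
        snippets.append(text[left:right].replace("\n", " "))
        start = text.find(keyword, start + len(keyword))
    return snippets
```

Each snippet could be printed next to its page number so that a hit can be judged without opening the PDF.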
