Configurable Tika OCR strategies for PDFs #100

wombat94 · 2020-03-30T12:54:33Z

OCR of PDFs in Tika can take a long time. This is unnecessary if the PDF has already been OCRed.

I would like to see an option to define the OCR strategy used by Tika in the lodestone front end.

Ideally, this would be multi-pass: a first pass with no_ocr, and, if the size of the returned text is below a threshold (perhaps 500 bytes), a second pass with text_and_ocr to recognize the document.
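
To make the multi-pass idea concrete, here is a rough sketch in Python. It is not lodestone code. It assumes a Tika server reachable at a hypothetical local URL and that the server honours the `X-Tika-PDFOcrStrategy` request header. The second pass uses the strategy Tika's PDF parser calls `ocr_and_text_extraction`, which I take to be what text_and_ocr refers to above. The function names, the URL, and the default threshold are all placeholders.

```python
import urllib.request
from typing import Callable

# Assumption: a Tika server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"
# Threshold suggested in this issue; would ideally be configurable.
MIN_TEXT_BYTES = 500


def tika_extract(pdf_bytes: bytes, ocr_strategy: str) -> str:
    """Send a PDF to the Tika server and return the extracted plain text."""
    request = urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",
            # Assumption: the server maps this header to the PDF parser's OCR strategy.
            "X-Tika-PDFOcrStrategy": ocr_strategy,
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_two_pass(
    pdf_bytes: bytes,
    extract: Callable[[bytes, str], str] = tika_extract,
    min_text_bytes: int = MIN_TEXT_BYTES,
) -> str:
    """Try the cheap no-OCR pass first; fall back to OCR only if little text comes back."""
    text = extract(pdf_bytes, "no_ocr")
    if len(text.strip().encode("utf-8")) >= min_text_bytes:
        # The PDF already has a usable text layer, so OCR is skipped.
        return text
    # Too little text: probably a scanned document, so re-process with OCR.
    return extract(pdf_bytes, "ocr_and_text_extraction")
```

Passing the extractor in as a parameter keeps the threshold logic separate from the HTTP call, so the fallback behaviour can be exercised without a running Tika server.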

dskaggs added type/enhancement New feature or request area/tika labels Feb 3, 2021
