This simply comes down to the spacing between the words. In the absence of a template, page structuring has to rely on a set of defaults, and those are optimized for printed text, not typewritten text. On top of that, the large line spacing means that in too many places the lines don't cover each other's gaps; in the few places where they do, block detection seems to work as intended.
In other words, this is a very atypical document that defies the usual catches and defaults. "Merge Blocks" should be an easy fix, though. If we run into more typewriter documents, we might have to consider a dedicated set of defaults and somehow tell page structure detection to use those instead of the usual print-oriented defaults.
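To make the "lines cover each other's gaps" idea concrete, here is a minimal Python sketch. It is not GGI's actual page structure detection; the function `uncovered_gaps`, the `min_gap` threshold, and the word coordinates are all hypothetical and only illustrate the general principle: a block is split wherever a horizontal gap of at least `min_gap` runs through every line without being covered by a word.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (left, right) x-extent of one word


def uncovered_gaps(lines: List[List[Interval]], min_gap: float) -> List[Interval]:
    """Return horizontal gaps at least min_gap wide that no word in any line covers.

    Illustrative only: real page structuring also considers line spacing,
    fonts, and more. Here every line is assumed to belong to one candidate block.
    """
    # Pool the words of all lines and sweep them from left to right.
    words = sorted(w for line in lines for w in line)
    if not words:
        return []
    gaps = []
    covered_to = words[0][1]  # rightmost x covered by any word so far
    for left, right in words[1:]:
        if left - covered_to >= min_gap:
            gaps.append((covered_to, left))
        covered_to = max(covered_to, right)
    return gaps


# Two typewriter-style lines whose wide word gaps happen to line up.
sparse = [[(0, 40), (52, 90)], [(0, 38), (51, 95)]]
print(uncovered_gaps(sparse, min_gap=8))    # print-oriented threshold (assumed value)
print(uncovered_gaps(sparse, min_gap=14))   # typewriter-oriented threshold (assumed value)

# A third line whose words overlap the gap "covers" it.
print(uncovered_gaps(sparse + [[(0, 45), (57, 95)]], min_gap=8))
```

With the smaller threshold, the aligned word gaps read as a block boundary. Raising the threshold, or having one more line whose words cover the gap, removes the split. That is the reasoning behind a dedicated set of typewriter defaults.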
myrmoteras changed the title from "text flow split" to "text flow split in OCR-ed documents" on Aug 17, 2023
Why does GGI break OCRed text into chunks?
Fosberg and Bullock 1971 plantsOCR.pdf