Enable specification of a config file, and generate hocr output if option set #92

jhosteny · 2013-08-28T02:31:00Z

Guys,

Sorry it took so long to get to this. I've opened it as a new pull request because it is substantially different from the prior one (#81). The most important thing to note: this change adds a dependency on nokogiri.

First, I took the suggestion to turn the command line option into one that sets a config file. If a config file is set, we look in the file itself for the variable that enables hocr generation. One thing to note is that tesseract allows you to use "+[config]", in which case it looks for the config file in a well-known location. I decided to require an explicit full path whenever this option is specified, to avoid install issues.
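To make the config-file check concrete, here is a minimal sketch of how the hocr setting could be detected. It is not the code from this PR. It assumes the config file uses tesseract's plain "name value" line format with "#" comment lines, that the variable is tessedit_create_hocr, and that a value of 1 means enabled; the method name hocr_enabled? is hypothetical.

```ruby
# Hypothetical helper: returns true when the given tesseract config text
# turns on hOCR output. Assumes "name value" lines and "#" comments.
def hocr_enabled?(config_text)
  enabled = false
  config_text.each_line do |line|
    name, value = line.strip.split(/\s+/, 2)
    next if name.nil? || name.start_with?('#')
    next unless name == 'tessedit_create_hocr'
    # Assumption: only "1" is treated as enabled; a later line overrides an earlier one.
    enabled = (value.to_s.strip == '1')
  end
  enabled
end
```

With an explicit full path, as described above, a caller might use it as hocr_enabled?(File.read(config_path)), where config_path is whatever the command line option supplied.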

When hocr generation is on, tesseract appears to produce only html output, not text, so I've added some routines to generate a text file as well. Additionally, the original hocr output is annotated so that each word tag has two new data attributes, "data-start" and "data-stop", which are the character start and stop positions of that word.
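The text extraction and back-annotation described above can be sketched roughly as follows. This is an illustration, not the PR's implementation: the PR depends on nokogiri, while this sketch swaps in REXML so it runs without extra gems. It assumes words appear as span elements with class ocrx_word, that words are joined by a single space, and that data-stop is an exclusive end offset into the generated text. The method name annotate_hocr is hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: returns [plain_text, annotated_hocr].
# Each ocrx_word span gains data-start / data-stop attributes holding
# the word's character offsets in the returned plain text.
def annotate_hocr(hocr)
  doc = REXML::Document.new(hocr)
  text = +''
  REXML::XPath.each(doc, "//span[@class='ocrx_word']") do |word|
    # word.text is the span's first text node; spans with no direct text are skipped.
    token = word.text.to_s.strip
    next if token.empty?
    text << ' ' unless text.empty?   # assumption: single-space separator
    start = text.length
    text << token
    word.add_attribute('data-start', start.to_s)
    word.add_attribute('data-stop', text.length.to_s) # exclusive end (assumption)
  end
  [text, doc.to_s]
end
```

A caller could write the first return value to the text file and the second to the hocr file, so that both outputs stay in sync.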

Lastly, note that the cleaning is a little simplified. I run the text from each xml node through the method that checks for a garbage word, but I didn't do anything fancy to look for excess whitespace. I figured this was good enough for now.
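As a rough picture of that simplified cleaning step, the sketch below filters word text through a garbage-word predicate and does nothing about whitespace beyond stripping each word. The predicate here is a hypothetical stand-in with made-up thresholds; it is not the existing garbage-word method this PR calls, and the names garbage_word? and clean_words are illustrative only.

```ruby
# Hypothetical stand-in for the garbage-word check; the real rules differ.
def garbage_word?(word)
  return true if word.length > 40          # assumption: very long tokens are noise
  alnum = word.count('a-zA-Z0-9')
  alnum * 2 < word.length                  # assumption: mostly-punctuation tokens are noise
end

# Keep only non-empty words that pass the garbage check.
def clean_words(words)
  words.map(&:strip).reject { |w| w.empty? || garbage_word?(w) }
end
```

For example, clean_words(['Hello', '~~#@!', 'world']) keeps 'Hello' and 'world' and drops the punctuation run.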


knowtheory · 2014-01-18T17:32:13Z

bump for the PDF Liberation Hackathon

tpendragon · 2014-06-09T16:36:00Z

What's the progress on this? Generating hOCR would be really useful to me.

CaseKey · 2015-05-17T20:41:42Z

I could really use this as well. Using the outdated LegalSifter hocr branch gives errors regarding OpenOffice in the production version.

be42day · 2020-06-22T09:55:38Z

Use tesseract directly. At the command prompt, type:
tesseract "image_file" "hocr_file" -c tessedit_create_hocr=1

This was referenced Aug 28, 2013

Add option to generate hOCR output instead of raw text when performing OCR via tesseract #81

Closed

Add option to generate hOCR output from tesseract #80

Closed

Enable specifying a config file. If a config file is specified, see i…

560a3ea

…f hocr output is enabled. If so, we also generate the text file from the hocr output, and back annotate the hocr output with word positions in HTML data attributes.
