Integrate OCRing where needed #163

konklone · 2014-08-29T06:00:01Z

I'm not sure of the best way to detect reports that need OCRing (perhaps a flag set by the scraper), but we should use tesseract to OCR some reports.

I'm motivated by this report by the FBI, where only the cover sheet has text. The FBI, at least, has a clear practice of image-izing redacted documents:

http://www.justice.gov/oig/reports/2014/s140827.pdf

It's a great report, and has been getting news coverage. I did some very brief experimentation with OCR parameters for another project, and the 300 dpi, 8-bit approach seemed good enough to me.
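
A rough sketch of what that two-step pass could look like, assuming poppler's `pdftoppm` is used for rasterizing and reading "8 bit" as 8-bit grayscale (both are assumptions, not something settled in this thread). The helper names `render_command` and `ocr_command` are hypothetical; they only build the command lines and do not run anything.

```python
import os


def render_command(pdf_path, output_prefix, dpi=300):
    """Build a pdftoppm command that rasterizes each page of a PDF.

    Assumption: "8 bit" is taken to mean 8-bit grayscale, hence -gray.
    pdftoppm writes one PNG per page, named from output_prefix.
    """
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", pdf_path, output_prefix]


def ocr_command(image_path):
    """Build a tesseract command for one page image.

    tesseract takes an output base name and appends ".txt" itself,
    so the image extension is stripped to form that base.
    """
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`, with the per-page `.txt` files concatenated afterwards into the report's text.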


divergentdave · 2016-04-07T03:08:49Z

As seen in 18F's blog today, 18F/doc_processing_toolkit handles both text extraction and OCRing. This could work for our purposes, though we should make it configurable, for those who don't want to set up Apache Tika and the like.

divergentdave · 2016-08-23T02:44:05Z

Case in point: https://www.si.edu/Content/OIG/Misc/FY16_CSA.pdf has two accessible words in it, "Appendix A".
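
One possible detection heuristic for cases like this, as a sketch only: compare the number of words that normal text extraction recovered against the page count, and flag the report when the average is very low. The function name `needs_ocr` and the threshold of 20 words per page are made up for illustration and would need tuning against real reports.

```python
def needs_ocr(extracted_text, page_count, min_words_per_page=20):
    """Guess whether a report's extracted text is too sparse to be real.

    extracted_text: whatever the normal text extraction step produced.
    page_count: number of pages in the PDF.
    min_words_per_page: hypothetical cutoff, not tuned on real data.
    """
    if page_count <= 0:
        # Nothing to judge; leave the decision to the caller.
        return False
    word_count = len(extracted_text.split())
    return word_count / page_count < min_words_per_page
```

A report whose only extractable text is "Appendix A" would be flagged for any page count, while a fully text-based report would pass through untouched. A scraper-set flag could still override the heuristic in either direction.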

konklone added this to the Hack Day 2016 milestone Jan 16, 2016
