Integrate OCRing where needed #163

Open
konklone opened this issue Aug 29, 2014 · 2 comments

@konklone
Member

I'm not sure of the best way to detect reports that need OCRing (perhaps through a flag set by the scraper), but we should have tesseract available for OCRing the reports that need it.

I'm motivated by this report by the FBI, where only the cover sheet has text. The FBI, at least, has a clear practice of image-izing redacted documents:

http://www.justice.gov/oig/reports/2014/s140827.pdf

It's a great report, and has been getting news coverage. I did some very brief experimentation with OCR parameters for another project, and the 300 dpi, 8-bit approach seemed good enough to me.
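
A minimal sketch of the 300 dpi, 8-bit approach mentioned above, assuming "8 bit" means 8-bit grayscale and that pages are rasterized with poppler's `pdftoppm` before being handed to the `tesseract` command-line tool. The rasterizer choice, the helper names, and the file layout are all assumptions for illustration, not something settled in this issue.

```python
import glob
import os
import subprocess


def rasterize_command(pdf_path, output_prefix, dpi=300):
    """Build a pdftoppm call that renders each page as a grayscale PNG."""
    # -r sets the resolution in DPI; -gray asks for grayscale output
    # (assumed here to be what "8 bit" refers to).
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", pdf_path, output_prefix]


def ocr_command(image_path, output_base):
    """Build a tesseract call; tesseract writes text to output_base + '.txt'."""
    return ["tesseract", image_path, output_base]


def ocr_pdf(pdf_path, work_dir):
    """Rasterize a PDF, OCR every page image, and return the combined text."""
    prefix = os.path.join(work_dir, "page")
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
    for image_path in sorted(glob.glob(prefix + "-*.png")):
        output_base = os.path.splitext(image_path)[0]
        subprocess.run(ocr_command(image_path, output_base), check=True)
        with open(output_base + ".txt", encoding="utf-8") as f:
            pages.append(f.read())
    return "\n".join(pages)
```

Calling `ocr_pdf("s140827.pdf", "/tmp/ocr")` would need both tools installed and an existing work directory. Keeping the command construction in separate functions makes the parameters easy to tweak and to test without running either tool.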

@konklone konklone added this to the Hack Day 2016 milestone Jan 16, 2016
@divergentdave
Contributor

As seen in 18F's blog today, 18F/doc_processing_toolkit handles both text extraction and OCRing. This could work for our purposes, though we should make it configurable, for those who don't want to set up Apache Tika and the like.
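
One way the configurability suggested above might look, as a rough sketch only: the `ocr_backend` option name and the backend labels are hypothetical, and nothing here calls into doc_processing_toolkit or tesseract. It only shows picking a backend from options, with OCR off by default so that nobody is forced to install extra dependencies.

```python
# Hypothetical backend labels; "none" means skip OCR entirely.
KNOWN_BACKENDS = ("none", "tesseract", "doc_processing_toolkit")
DEFAULT_BACKEND = "none"


def choose_backend(options):
    """Return the OCR backend named in options, defaulting to no OCR."""
    backend = options.get("ocr_backend", DEFAULT_BACKEND)
    if backend not in KNOWN_BACKENDS:
        raise ValueError("unknown OCR backend: %r" % (backend,))
    return backend
```

With this shape, `choose_backend({})` returns `"none"`, so a default install would not need Apache Tika or tesseract at all.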

@divergentdave
Contributor

Case in point: https://www.si.edu/Content/OIG/Misc/FY16_CSA.pdf has two accessible words in it, "Appendix A".
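
Cases like this suggest a simple detection heuristic: if the extracted text layer has very few words per page, treat the report as needing OCR. The sketch below assumes the text has already been extracted page by page. The function names, the optional scraper flag argument, and the threshold of 10 words per page are all hypothetical and would need tuning against real reports.

```python
def count_words(text):
    """Count whitespace-separated words in a page's extracted text."""
    return len(text.split())


def needs_ocr(page_texts, scraper_flag=None, min_words_per_page=10):
    """Guess whether a report needs OCR.

    page_texts is a list with one extracted-text string per page.
    scraper_flag, when not None, is an explicit decision made by the
    scraper and overrides the heuristic.
    """
    if scraper_flag is not None:
        return scraper_flag
    if not page_texts:
        # No text could be extracted at all, so assume OCR is needed.
        return True
    total_words = sum(count_words(text) for text in page_texts)
    return total_words / len(page_texts) < min_words_per_page
```

For example, `needs_ocr(["Appendix A", "", ""])` returns `True`, since two words across three pages is well under the threshold. A report where only the cover sheet has text is caught the same way as long as the cover is short relative to the page count; a per-page check would be a more precise alternative.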
