I'm not sure of the best path for detecting reports that need OCRing (perhaps a flag set by the scraper), but we should have tesseract for OCRing of some reports.
I'm motivated by this report by the FBI, where only the cover sheet has text. The FBI, at least, has a clear practice of image-izing redacted documents:
http://www.justice.gov/oig/reports/2014/s140827.pdf
It's a great report, and has been getting news coverage. I did some very brief experimentation with OCR parameters for another project, and the 300 dpi, 8-bit approach seemed good enough to me.
As seen in 18F's blog today, 18F/doc_processing_toolkit handles both text extraction and OCRing. This could work for our purposes, though we should make it configurable, for those who don't want to set up Apache Tika and the like.
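For discussion, here is a rough sketch of how detection and OCR might fit together. It is not how this project or doc_processing_toolkit actually does it. It assumes Poppler's pdftotext and pdftoppm are installed alongside tesseract, and every function name and threshold below (needs_ocr, the 50-character minimum, the 0.5 cutoff) is hypothetical and would need tuning against real reports. Pages are rasterized at 300 dpi in 8-bit grayscale, matching the parameters mentioned above.

```python
import subprocess
from pathlib import Path

# Assumed thresholds, not measured values: a page "has text" if it yields
# at least this many characters, and a report needs OCR if fewer than
# half of its pages have text.
MIN_CHARS_PER_PAGE = 50
MAX_TEXT_FRACTION = 0.5


def split_pages(text):
    """Split pdftotext output into pages (it ends each page with a form feed)."""
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty chunk after the final form feed
    return pages


def needs_ocr(page_texts, min_chars=MIN_CHARS_PER_PAGE,
              max_text_fraction=MAX_TEXT_FRACTION):
    """Return True when too few pages carry an extractable text layer."""
    if not page_texts:
        return True
    with_text = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    return with_text / len(page_texts) < max_text_fraction


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm call that writes one grayscale PNG per page."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png",
            str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, out_base):
    """Build a tesseract call; the output goes to out_base + '.txt'."""
    return ["tesseract", str(image_path), str(out_base)]


def extract_page_texts(pdf_path):
    """Run pdftotext and return the embedded text of each page."""
    result = subprocess.run(["pdftotext", str(pdf_path), "-"],
                            check=True, capture_output=True, text=True)
    return split_pages(result.stdout)


def ocr_pdf(pdf_path, work_dir):
    """Rasterize every page, OCR each image, and return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)
    texts = []
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(tesseract_command(image, out_base), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

A scraper could still set an explicit flag for agencies known to publish image-only PDFs, and fall back to something like needs_ocr(extract_page_texts(path)) otherwise. For a report like the one linked above, where only the cover sheet has text, the fraction of pages with text would be well under the cutoff, so it would be sent to ocr_pdf.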