Scan a folder of document files of all types and extract the text into a CSV suitable for import into Overview. Currently supports TXT, PDF, JPG, HTML, MHTML, RTF, and Microsoft Word, PowerPoint and Excel.
PDFs are OCR'd if -o is set and they contain no text, or always if -f is set. JPGs are OCR'd if -o is set.
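The OCR rules above can be sketched as a small predicate. This is only an illustration of the decision logic, not the code docs2csv actually uses; the method name `needs_ocr?` and its keyword arguments are hypothetical.

```ruby
# Hypothetical helper: decides whether a file should be OCR'd under the rules above.
#   ext       - file extension, e.g. ".pdf" (compared case-insensitively)
#   has_text  - true if text could be extracted from the PDF directly
#   ocr       - true if -o was given
#   force_ocr - true if -f was given
def needs_ocr?(ext, has_text:, ocr:, force_ocr:)
  case ext.downcase
  when ".pdf"
    # -f forces OCR on every PDF; -o only OCRs PDFs that have no text
    force_ocr || (ocr && !has_text)
  when ".jpg"
    # JPGs are OCR'd only when -o is given
    ocr
  else
    false
  end
end

puts needs_ocr?(".pdf", has_text: false, ocr: true, force_ocr: false)  # prints true
puts needs_ocr?(".pdf", has_text: true,  ocr: true, force_ocr: false)  # prints false
```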
First you will need to install:
- Poppler, for pdfimages (and pdftotext on some systems). On Linux, use aptitude, apt-get or yum:
aptitude install poppler-utils poppler-data
On the Mac, you can install from source or use MacPorts or Homebrew:
sudo port install poppler
brew install poppler
- Tesseract, for OCR. The package is named tesseract-ocr under aptitude and tesseract under MacPorts and Homebrew:
[aptitude | port | brew] install [tesseract | tesseract-ocr]
Without Tesseract installed, you'll still be able to extract text from documents, but you won't be able to automatically OCR them.
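To check which of these tools are installed before a run, something like the following may help. It is a minimal sketch that assumes a POSIX-style PATH and the usual binary names (`pdfimages`, `pdftotext`, `tesseract`); the helper `tool_available?` is hypothetical and is not part of docs2csv.

```ruby
# Returns true if an executable file called `name` exists in a directory on PATH.
def tool_available?(name)
  dirs = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).reject(&:empty?)
  dirs.any? do |dir|
    path = File.join(dir, name)
    File.file?(path) && File.executable?(path)
  end
end

# Report which of the external tools can be found.
%w[pdfimages pdftotext tesseract].each do |tool|
  status = tool_available?(tool) ? "found" : "missing"
  puts "#{tool}: #{status}"
end
```

If tesseract is reported as missing, text extraction should still work, but the -o and -f options will have nothing to run.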
Typical usage
ruby docs2csv.rb -r -o directory-to-scan [outputfile]
If outputfile is omitted, docs2csv will write the CSV to stdout.
The command above scans the directory recursively and OCRs any JPGs, plus any PDFs that contain no text. All options:
-l, --list        Only list files, do not process
-r, --recurse     Scan directory recursively
-o, --ocr         OCR jpgs and pdfs that do not contain text
-f, --force-ocr   Force OCR on all pdfs
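For readers who want to see how flags like these are typically handled in Ruby, here is a sketch using the standard library's OptionParser. It mirrors the options listed above, but the method name `parse_args` and the option hash keys are assumptions; the actual parsing code in docs2csv.rb may look different.

```ruby
require "optparse"

# Parses docs2csv-style flags and returns [options, remaining_args].
# Illustrative only: key names and structure are assumptions.
def parse_args(argv)
  options = { list: false, recurse: false, ocr: false, force_ocr: false }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby docs2csv.rb [options] directory-to-scan [outputfile]"
    opts.on("-l", "--list", "Only list files, do not process") { options[:list] = true }
    opts.on("-r", "--recurse", "Scan directory recursively") { options[:recurse] = true }
    opts.on("-o", "--ocr", "OCR jpgs and pdfs that do not contain text") { options[:ocr] = true }
    opts.on("-f", "--force-ocr", "Force OCR on all pdfs") { options[:force_ocr] = true }
  end
  # parse (without the bang) leaves argv untouched and returns the non-option arguments
  remaining = parser.parse(argv)
  [options, remaining]
end

options, args = parse_args(%w[-r -o directory-to-scan out.csv])
directory, outputfile = args
puts "scan #{directory}, recurse=#{options[:recurse]}, ocr=#{options[:ocr]}"
puts outputfile ? "write CSV to #{outputfile}" : "write CSV to stdout"
```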
Viewing the original files from Overview
The extracted text will be shown in the Overview document viewer, but not the original document pages. You can view the original files in your browser via Overview's "source file" links, if you start up a simple web server like this:
python -m SimpleHTTPServer
On Python 3, the equivalent command is:
python3 -m http.server
The "source file" links use the URL column that docs2csv writes, which has addresses of the form http://localhost:8000/[filename]. You need to run this server from the same directory where you originally ran docs2csv, because the file paths in these URLs are relative to that directory.
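As an illustration of how a relative file path maps to one of these addresses, the sketch below builds a URL of the form described above. The method name `source_url` and the percent-encoding of each path segment are assumptions made for this example; docs2csv may construct its URL column differently.

```ruby
require "erb"

# Base address of the local web server (port 8000 is the default for the commands above).
BASE_URL = "http://localhost:8000"

# Hypothetical helper: turns a path relative to the directory where docs2csv
# was run into a URL served by the local web server.
def source_url(relative_path, base = BASE_URL)
  # Encode each path segment separately so the "/" separators are preserved.
  escaped = relative_path.split("/").map { |segment| ERB::Util.url_encode(segment) }
  "#{base}/#{escaped.join('/')}"
end

puts source_url("reports/annual report.pdf")
# prints http://localhost:8000/reports/annual%20report.pdf
```

Because the path is resolved against the server's working directory, starting the server anywhere else would make these links return "not found".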