This project extracts text and data from documents such as invoices, book pages, and tables using OpenCV and Tesseract OCR.
The document scanner implements:
- Preprocessing images to improve OCR accuracy
- Contour detection and perspective transforms to isolate ROIs
- Text extraction using PyTesseract
It can handle multiple document types:
- Invoices
- Book pages
- Tables
The dataextractor.py module contains the core implementation.
The scanner requires:
- OpenCV
- PyTesseract (a Python wrapper; the Tesseract OCR engine itself must be installed separately)
- NumPy
Install requirements using:
pip install -r requirements.txt
The static/samples/ folder contains example images of different documents.