PDF parser using pdfminer and pytesseract for OCR support

annacprice/pdf-scraper

PDFscraper

PDFscraper uses PDFMiner and Python Tesseract (pytesseract) to text-mine PDFs, with Tesseract providing OCR support.
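
The README does not say how the two libraries are combined. One plausible arrangement, sketched below under that assumption, is to try the embedded text layer first and fall back to OCR when little or no text comes back. The function names, the `min_chars` threshold, and the fallback rule are all hypothetical and are not taken from pdfscraper.py. The library calls used (`pdfminer.high_level.extract_text`, `pdf2image.convert_from_path`, `pytesseract.image_to_string`) are imported lazily so the fallback logic can be exercised without them installed.

```python
from typing import Callable


def pdfminer_text(pdf_path: str) -> str:
    """Read the embedded text layer with pdfminer.six."""
    from pdfminer.high_level import extract_text  # lazy import

    return extract_text(pdf_path)


def ocr_text(pdf_path: str) -> str:
    """Rasterize each page with pdf2image (needs Poppler), then OCR it."""
    from pdf2image import convert_from_path  # lazy import
    import pytesseract  # lazy import; needs the Tesseract binary

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_with_fallback(
    pdf_path: str,
    text_layer: Callable[[str], str] = pdfminer_text,
    ocr: Callable[[str], str] = ocr_text,
    min_chars: int = 20,  # hypothetical threshold, not from the source
) -> str:
    """Return the text layer if it looks usable, otherwise the OCR result."""
    text = text_layer(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(pdf_path)
```

Passing the extractors as parameters keeps the decision rule separate from the heavy dependencies, which makes it easy to swap in a different OCR backend or to test with stand-in functions.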

Requirements

PDFscraper requires Python 3.x.

The following Python packages are prerequisites:

  • pdfminer.six
  • pytesseract
  • chardet
  • Python Imaging Library (PIL) or Pillow
  • pdf2image

Other requirements: installations of Google Tesseract OCR and Poppler.
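
Before running the tool, it can help to confirm that the prerequisites are visible from the current environment. The following is a minimal sketch, not part of PDFscraper itself. It assumes the usual import names for the packages listed above (`pdfminer` for pdfminer.six, `PIL` for Pillow) and the usual executable names (`tesseract` for Tesseract OCR, `pdftoppm` as one of the Poppler utilities). `check_prerequisites` is a hypothetical helper name.

```python
import importlib.util
import shutil

# Import names assumed for the packages listed in the requirements.
PYTHON_MODULES = ["pdfminer", "pytesseract", "chardet", "PIL", "pdf2image"]
# Executables assumed to be on PATH for Tesseract OCR and Poppler.
EXECUTABLES = ["tesseract", "pdftoppm"]


def check_prerequisites(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the modules and executables that could not be found."""
    missing = {"modules": [], "executables": []}
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing["modules"].append(name)
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            missing["executables"].append(exe)
    return missing
```

Calling `check_prerequisites()` with no arguments returns two empty lists when everything is found.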

Usage

usage: pdfscraper.py [-h] -i INPDF -o OUTTXT [-t]

optional arguments:
  -h, --help            show this help message and exit
  -i INPDF, --input-dir INPDF
                        Path to the input pdf files
  -o OUTTXT, --output-dir OUTTXT
                        Path for the output txt files
  -t, --token-gen       Use flag to generate tokenized output
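
For readers who want to see how such an interface is declared, the help text above can be reproduced with `argparse`. This is a reconstruction from the usage message only and may not match the parser in pdfscraper.py. The `dest` names (`inpdf`, `outtxt`) and the `build_parser` function are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage message shown above."""
    parser = argparse.ArgumentParser(prog="pdfscraper.py")
    parser.add_argument(
        "-i", "--input-dir", dest="inpdf", metavar="INPDF", required=True,
        help="Path to the input pdf files",
    )
    parser.add_argument(
        "-o", "--output-dir", dest="outtxt", metavar="OUTTXT", required=True,
        help="Path for the output txt files",
    )
    # store_true makes -t a plain on/off switch that defaults to False.
    parser.add_argument(
        "-t", "--token-gen", action="store_true",
        help="Use flag to generate tokenized output",
    )
    return parser
```

Both `-i` and `-o` are marked required because the usage line shows them outside square brackets, while `-t` is optional.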

For example, to run:

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory

PDFscraper also has an optional flag, -t, which produces tokenized text for use in natural language processing (NLP) tasks. For example, to produce tokenized output:

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory -t
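
The README does not say which tokenizer the -t flag uses or what the tokenized output looks like. As a rough illustration of what tokenizing extracted text can involve, here is a minimal regex-based sketch. The pattern, the lowercasing, and the `tokenize` name are all assumptions and are not PDFscraper's actual behaviour.

```python
import re
from typing import List

# Runs of letters/digits, optionally followed by an apostrophe suffix
# (so "doesn't" stays one token). Punctuation is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]
```

For instance, `tokenize("PDFscraper doesn't OCR, page 2!")` yields `['pdfscraper', "doesn't", 'ocr', 'page', '2']`.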

Docker

Alternatively, the accompanying Dockerfile can be used to run the program in a Docker container.

For example, to run:

docker run -v "/path/to/input/pdfs:/data" --rm pdfscraper:latest -i /data -o /data
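
When the container is launched from a script, the same command can be assembled as an argument list and handed to `subprocess.run`. The sketch below only builds the list. `docker_run_command` is a hypothetical helper, and it assumes, as the example above suggests, that arguments after the image name are passed through to pdfscraper.py.

```python
import os
from typing import List


def docker_run_command(
    host_dir: str,
    image: str = "pdfscraper:latest",
    tokenize: bool = False,
) -> List[str]:
    """Build the docker run argument list for a host directory of PDFs."""
    # Docker bind mounts need an absolute host path.
    host = os.path.abspath(host_dir)
    cmd = [
        "docker", "run",
        "-v", f"{host}:/data",  # mount the host directory at /data
        "--rm",                 # remove the container when it exits
        image,
        "-i", "/data",
        "-o", "/data",
    ]
    if tokenize:
        cmd.append("-t")  # assumed to be forwarded to pdfscraper.py
    return cmd
```

A list of arguments avoids shell-quoting problems with paths that contain spaces. It could be run with `subprocess.run(docker_run_command("/path/to/input/pdfs"), check=True)`.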
