This page lists all of the dependencies of Parsr and what they are used for.
The following required dependencies need to be installed for Parsr to work properly:
node.js
: The underlying framework upon which the platform is built.
qpdf
: For reading password-protected PDFs.
imagemagick
: For converting between file formats.
Depending on the type of documents to be processed by the platform, one or more of the following dependencies should be installed.
If simple PDFs containing digital (or selectable) textual elements are to be fed into the system, the pdfminer library needs to be installed.
If images (jpg, png, tiff, etc.) are to be processed, the tool supports the following two OCR-based solutions as the underlying extraction module:
tesseract
: Google's Tesseract is a free, open-source, on-premise OCR solution with support for over 100 languages. However, it does not detect text formatting or tabular data.
ABBYY FineReader Server
: Proprietary OCR solution with extremely high recognition accuracy, formatting recognition and tabular data extraction. It is an optional dependency.
The following optional dependencies may be installed:
mupdf-tools
: For error-correcting corrupt PDFs at input.
pandoc
: For generating PDF files from an intermediate Markdown output after the cleaning operation in the pipeline.
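
As a quick sanity check before running Parsr, the command-line dependencies listed on this page can be looked up on the PATH. The following is a minimal sketch and not part of Parsr itself: the executable names (node, magick or convert, mutool, and so on) are assumptions about how each package is typically installed and may differ on your platform, and the helper find_missing is a hypothetical name. pdfminer is a library rather than a single executable, so it is not covered here.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed executable names for each dependency; adjust for your platform.
# A dependency counts as present if ANY of its candidate names is found.
REQUIRED: Dict[str, List[str]] = {
    "node.js": ["node"],
    "qpdf": ["qpdf"],
    "imagemagick": ["magick", "convert"],  # name varies by ImageMagick version
}

OPTIONAL: Dict[str, List[str]] = {
    "tesseract": ["tesseract"],
    "mupdf-tools": ["mutool"],
    "pandoc": ["pandoc"],
}


def find_missing(
    deps: Dict[str, List[str]],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the dependencies for which no candidate executable is on the PATH."""
    return [
        name
        for name, candidates in deps.items()
        if not any(which(candidate) for candidate in candidates)
    ]


if __name__ == "__main__":
    print("Missing required dependencies:", find_missing(REQUIRED) or "none")
    print("Missing optional dependencies:", find_missing(OPTIONAL) or "none")
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real PATH. A dependency reported as missing may still be installed under a different executable name.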