Input Modules

The input modules in Parsr perform the initial step of importing raw data from the input files. Each module handles a particular type of input file and generates different results. A module may expose a set of configurable parameters; these, along with usage documentation, are described in the per-module documentation pages listed below. Every module returns a valid Document object with an array of Words for each parsed Page.

The Modules

Pdfminer
PDF.js
Tesseract
Google Vision
Amazon Textract
MS Cognitive Services
ABBYY
JSON
MS Word
Email

Supported input formats

Parsr currently supports the following input file formats:

Input format	Pdfminer	pdf.js	ABBYY	Tesseract	JSON Extractor	Google Vision	Amazon Textract	MS Cognitive Services
.pdf	✓	✓	✓	✓	✗	✗	✗	✗
.docx	✓	✓	✓	✓	✗	✗	✗	✗
.eml	✓	✓	✓	✓	✗	✗	✗	✗
.tiff	✗	✗	✓	✓	✗	✓	✓	✓
.png	✗	✗	✓	✓	✗	✓	✓	✓
.jpeg	✗	✗	✓	✓	✗	✓	✓	✓
.json	✗	✗	✗	✗	✓	✗	✗	✗
.xml	✗	✗	✓	✗	✗	✗	✗	✗

This means that, to process a .pdf file, you can choose among four extractors: Pdfminer, pdf.js, ABBYY, or Tesseract.
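
As a rough illustration of how to read the table, the support matrix can be written as a lookup from file extension to module names. This is only a sketch and not part of Parsr: the `SUPPORTED` dictionary and the `extractors_for` helper are hypothetical names, and the module names are the column labels from the table, not necessarily the identifiers Parsr expects in its configuration.

```python
from pathlib import Path

# Support matrix transcribed from the table above.
# Keys are file extensions; values are the table's column labels
# (illustrative display names, not Parsr configuration values).
OCR_MODULES = [
    "ABBYY",
    "Tesseract",
    "Google Vision",
    "Amazon Textract",
    "MS Cognitive Services",
]
DOCUMENT_MODULES = ["Pdfminer", "pdf.js", "ABBYY", "Tesseract"]

SUPPORTED = {
    ".pdf": DOCUMENT_MODULES,
    ".docx": DOCUMENT_MODULES,
    ".eml": DOCUMENT_MODULES,
    ".tiff": OCR_MODULES,
    ".png": OCR_MODULES,
    ".jpeg": OCR_MODULES,
    ".json": ["JSON Extractor"],
    ".xml": ["ABBYY"],
}


def extractors_for(filename: str) -> list[str]:
    """Return the modules the table marks as supporting this file's extension.

    Hypothetical helper: unknown extensions yield an empty list.
    """
    extension = Path(filename).suffix.lower()
    return list(SUPPORTED.get(extension, []))


if __name__ == "__main__":
    for name in ("report.pdf", "scan.png", "export.json", "notes.txt"):
        print(name, "->", extractors_for(name))
```

A format with a single entry, such as .json or .xml, leaves nothing to choose, while .pdf and the image formats each offer several candidates to compare.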

Note: not all extractors offer the same functionality or return the same information, so check which extractor best fits your use case.

Note: when a .json or .xml file is used as input, the extractor configuration is ignored, because there is currently only one extractor for each of these formats.
