The input modules in Parsr handle the first step of the pipeline: importing the raw data from the input files.
Each module works on particular types of input file and produces different results.
A module may expose a set of configurable parameters; these, along with usage documentation, are described in the per-module documentation pages below.
Each module returns a valid `Document` object with an array of `Words` for each parsed `Page`.
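As a rough illustration of that output shape, the sketch below models a document that holds pages, each carrying an array of words. The class and field names (`Word.text`, `Page.page_number`, `Document.word_count`) are hypothetical and chosen only for this example; they are not Parsr's actual class definitions, which carry more information than shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Hypothetical field: the text content of a single extracted word.
    text: str


@dataclass
class Page:
    # Hypothetical fields: a page number plus the words found on that page.
    page_number: int
    words: List[Word] = field(default_factory=list)


@dataclass
class Document:
    # A document is an ordered collection of parsed pages.
    pages: List[Page] = field(default_factory=list)

    def word_count(self) -> int:
        # Total number of words across every page.
        return sum(len(page.words) for page in self.pages)


# Build a two-page document by hand, as an input module conceptually would.
doc = Document(pages=[
    Page(page_number=1, words=[Word("Hello"), Word("world")]),
    Page(page_number=2, words=[Word("Parsr")]),
])
print(doc.word_count())
```

Whatever the input module, downstream steps can rely on this same nesting, which is why the choice of extractor changes the quality and richness of the result rather than its overall shape.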
- Pdfminer
- PDF.js
- Tesseract
- Google Vision
- Amazon Textract
- MS Cognitive Services
- ABBYY
- JSON
- MS Word
Parsr currently supports the following input file formats, each handled by the input modules marked in the table:
Input format | Pdfminer | pdf.js | ABBYY | Tesseract | JSON Extractor | Google Vision | Amazon Textract | MS Cognitive Services |
---|---|---|---|---|---|---|---|---|
.pdf | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
.docx | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
.eml | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
.tiff | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
.png | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
.jpeg | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
.json | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
.xml | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
For example, a PDF file can be processed by any of four extractors: pdfminer, pdf.js, ABBYY, or Tesseract.
Note: not all extractors offer the same functionality or return the same information, so check which extractor best fits your use case.
Note: when using a json or xml file as input, the extractor configuration is ignored, as there is currently only one extractor for each of these formats.
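The compatibility table above can also be expressed as a simple lookup, which may be handy when checking a file before choosing an extractor. This is a minimal sketch and not part of Parsr: the `SUPPORT` mapping and the `extractors_for` helper are hypothetical, and the module names are the display names from the table, not necessarily the identifiers used in a Parsr configuration file.

```python
from pathlib import Path
from typing import Dict, List

# Module groups transcribed from the compatibility table above.
_PDF_LIKE = ["Pdfminer", "pdf.js", "ABBYY", "Tesseract"]
_IMAGE = ["ABBYY", "Tesseract", "Google Vision", "Amazon Textract",
          "MS Cognitive Services"]

# Hypothetical mapping: file extension -> input modules marked with a check.
SUPPORT: Dict[str, List[str]] = {
    ".pdf": _PDF_LIKE,
    ".docx": _PDF_LIKE,
    ".eml": _PDF_LIKE,
    ".tiff": _IMAGE,
    ".png": _IMAGE,
    ".jpeg": _IMAGE,
    ".json": ["JSON Extractor"],
    ".xml": ["ABBYY"],
}


def extractors_for(filename: str) -> List[str]:
    """Return the input modules listed for this file's extension.

    Unknown extensions yield an empty list.
    """
    suffix = Path(filename).suffix.lower()
    # Return a copy so callers cannot mutate the shared table.
    return list(SUPPORT.get(suffix, []))


print(extractors_for("report.pdf"))
print(extractors_for("scan.png"))
print(extractors_for("data.json"))
```

The `.json` and `.xml` entries each list a single module, which matches the note above: with only one candidate, there is nothing for the extractor configuration to select.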