The processing modules in Parsr play a central role in cleaning and enriching the raw extracted output. Each module performs one operation on a document representation, generates a new valid Document, and passes it to the next module in the pipeline. Each module can expose a set of configurable parameters, which are described in the per-module documentation pages below:
- Drawing Detection
- Header and Footer Detection
- Heading Detection
- Hierarchy Detection
- Image Detection
- Key-Value Pair Detection
- Lines to Paragraph
- Link Detection
- List Detection
- Number Correction
- Out of Page Removal
- Page Number Detection
- Reading Order Detection
- Redundancy Detection
- Regex Matcher
- Remote Module
- Separate Words
- Table Detection
- Table of Contents Detection
- Whitespace Removal
- Words To Line
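The chaining described above, where each module receives a document and hands a new one to the next module, can be sketched as follows. This is a minimal illustration only: the `Doc` and `Module` types and the two toy modules are simplified stand-ins invented for the example, not Parsr's actual classes.

```typescript
// Hypothetical, simplified types: NOT Parsr's real Document or Module classes.
interface Doc {
  pages: string[][]; // each page is a list of text lines
}

interface Module {
  name: string;
  run(doc: Doc): Doc; // returns a new, valid Doc and leaves the input untouched
}

// Toy module: drops lines that are empty or contain only whitespace.
const whitespaceRemoval: Module = {
  name: "whitespace-removal",
  run: (doc) => ({
    pages: doc.pages.map((lines) => lines.filter((l) => l.trim() !== "")),
  }),
};

// Toy module: joins the remaining lines of each page into one paragraph.
const linesToParagraph: Module = {
  name: "lines-to-paragraph",
  run: (doc) => ({
    pages: doc.pages.map((lines) => (lines.length > 0 ? [lines.join(" ")] : [])),
  }),
};

// Feed the output of each module into the next one, in order.
function runPipeline(doc: Doc, modules: Module[]): Doc {
  return modules.reduce((current, mod) => mod.run(current), doc);
}

const input: Doc = { pages: [["Hello", "  ", "world"]] };
const output = runPipeline(input, [whitespaceRemoval, linesToParagraph]);
console.log(JSON.stringify(output)); // {"pages":[["Hello world"]]}
```

Because every module returns a fresh document, the order of the modules in the list fully determines the result.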
A custom module is useful when you need to apply your own processing to the document.
There are two ways to create one:
- Use the Remote Module, which sends the document JSON over HTTP and expects the modified JSON in response
- Create a TypeScript module and add it to the pipeline
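For the first option, the service on the other end only has to turn one JSON payload into another. The sketch below shows that transformation step in isolation, without the HTTP transport. The `RemoteDoc` shape, the `heading` element type, and the function name are assumptions made for illustration; the JSON that Parsr actually sends is richer and should be checked against the Remote Module documentation.

```typescript
// Hypothetical, simplified document shape; the real Parsr JSON has more fields.
interface RemoteDoc {
  pages: { elements: { type: string; content: string }[] }[];
}

// Takes the request body (document JSON) and returns the response body
// (modified document JSON). As an example, it upper-cases every heading.
function handleRemoteRequest(requestBody: string): string {
  const doc = JSON.parse(requestBody) as RemoteDoc;
  for (const page of doc.pages) {
    for (const element of page.elements) {
      if (element.type === "heading") {
        element.content = element.content.toUpperCase();
      }
    }
  }
  return JSON.stringify(doc);
}

// Example round trip with a one-page document.
const requestBody = JSON.stringify({
  pages: [
    {
      elements: [
        { type: "heading", content: "Intro" },
        { type: "paragraph", content: "Some text." },
      ],
    },
  ],
});
const responseBody = handleRemoteRequest(requestBody);
console.log(responseBody);
```

A function like this can be wrapped in any HTTP framework, which is why the remote option suits modules written outside the Parsr code base.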
The template module folder shows how a module tree needs to be structured. The folder name, the module's filename and the class's name need to follow the PascalCase naming convention.
You can copy the entire folder to use as boilerplate. The template code also contains comments to help you get started.
To add your newly created module to the register, open the Cleaner file /server/src/Cleaner.ts and add your module class to the Cleaner.cleaningToolRegister attribute.
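As a rough sketch, the registration step might look like the following. The classes here are self-contained stand-ins so that the example compiles on its own. In the real project you would import your module class into Cleaner.ts, and the exact type of the register, as well as the `moduleName` property and `MyCustomModule` used here, are assumptions.

```typescript
// Stand-ins for Parsr's real classes, defined here only so the sketch is self-contained.
class Module {
  public static moduleName = "";
}

class WhitespaceRemovalModule extends Module {
  public static moduleName = "whitespace-removal";
}

// Hypothetical custom module, following the PascalCase naming convention.
class MyCustomModule extends Module {
  public static moduleName = "my-custom";
}

class Cleaner {
  // The register lists every module class the cleaner is allowed to run.
  public static cleaningToolRegister: Array<typeof Module> = [
    WhitespaceRemovalModule,
    MyCustomModule, // newly added entry
  ];
}

// Looking a module up by name, as a pipeline builder might do.
const found = Cleaner.cleaningToolRegister.find(
  (m) => m.moduleName === "my-custom",
);
console.log(found === MyCustomModule); // true
```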
For your module to run, you need to enable it in your configuration: add an entry to the cleaner array with the name of your module and any options it takes.
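The snippet below sketches what such a cleaner array could look like, written as a TypeScript object so that its shape is explicit. It assumes that an entry is either a bare module name or a name paired with an options object. The module name `my-custom` and its `threshold` option are hypothetical, and the other names and options should be checked against your own configuration file.

```typescript
// Assumed entry shape: a module name, or a [name, options] pair.
type CleanerEntry = string | [string, Record<string, unknown>];

const config: { cleaner: CleanerEntry[] } = {
  cleaner: [
    "out-of-page-removal",
    ["whitespace-removal", { minWidth: 0 }],
    ["my-custom", { threshold: 0.5 }], // hypothetical module and option
  ],
};

// Extract the module names in the order they will run.
const enabledModules = config.cleaner.map((entry) =>
  typeof entry === "string" ? entry : entry[0],
);
console.log(enabledModules.join(", "));
// out-of-page-removal, whitespace-removal, my-custom
```

The position of the entry in the array matters, since modules run in the order they are listed.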
That's it! Your new awesome processing module should run and modify the document according to your needs!