ocrdocgen (OCR)

Overview

A simple tool for generating synthetic datasets to train both the detection and recognition parts of an OCR pipeline.

Sample of synthetic document

Sample of synthetic document with bounding boxes visualization

Installation

python3 -m pip install -r requirements.txt
python3 -m pip install flit && flit install --symlink --python /usr/bin/python3

Install the provided WeasyPrint port, which supports dumping the required bounding boxes:

cd WeasyPrint && python3 -m pip install .
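After installation, a quick importability check can catch a broken setup before the generator is run. The following is a minimal sketch, assuming the port installs under the usual `weasyprint` module name and that the templates are rendered with the `jinja2` package; `missing_modules` is a hypothetical helper and not part of this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust them if the port installs differently.
    missing = missing_modules(["weasyprint", "jinja2"])
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located; it does not confirm that the installed WeasyPrint is the bundled port rather than the upstream release.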

How to use:

Just run the main.py file.

python3 ./main.py

"texts/example.txt" is the input text. "templates/" is the folder that holds the Jinja2 templates. "images/" is the folder of images used in the templates. "fonts/" is the fonts folder; to populate it with some fonts, run "dataset_generator.py".
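The inputs listed above can be checked before running main.py. This is a minimal sketch, assuming the four paths are resolved relative to the repository root exactly as named in the text; the actual layout may differ, and `missing_inputs` is a hypothetical helper that does not exist in the project.

```python
from pathlib import Path

# Paths as named in the text above, relative to the repository root (assumed).
EXPECTED = {
    "texts/example.txt": "file",
    "templates": "dir",
    "images": "dir",
    "fonts": "dir",
}


def missing_inputs(root="."):
    """Return the expected paths that are absent under `root`."""
    root = Path(root)
    missing = []
    for rel, kind in EXPECTED.items():
        path = root / rel
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_inputs():
        print("missing input:", rel)
```

Calling `missing_inputs()` from the repository root returns an empty list when every expected input is in place.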

Docker image

DOCKER_BUILDKIT=1 docker build .

Updates

[2023/08/19] Made the code publicly available.

