GitHub - huridocs/pdf-labeled-data

PDF labeled data

Labeled data for training machine learning models that process PDFs: token types, paragraph extraction, and reading order.

Dependencies

Docker Desktop 4.25.0 install link

Quick Start

Start the labeling tool:

make start

Once the tool is running, open the web interface at:

http://localhost:8080
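
The containers may take a moment to come up after `make start`. The following is a minimal sketch, not part of this repository, of how one might wait for the web interface before opening it. The helper names `is_up` and `wait_until_ready` are hypothetical, and the sketch assumes the tool answers plain HTTP requests at the URL above.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str) -> bool:
    """Return True if anything answers an HTTP request at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except urllib.error.HTTPError:
        # The server replied, even if with an error status, so it is up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def wait_until_ready(url, timeout=60.0, interval=1.0, probe=is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout` seconds have passed.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready("http://localhost:8080")` returns `True` as soon as the server responds and `False` if nothing answers within the timeout.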

To stop the server:

make stop

Labeled data

Token Type: Labels each word that appears in a PDF. See https://github.com/huridocs/pdf-tokens-type-labeler
Reading Order: Sorts the content of a PDF into reading order. See https://github.com/huridocs/pdf-reading-order
Paragraph Extraction: Segments a PDF into paragraphs. See https://github.com/huridocs/pdf_paragraphs_extraction
Table of Contents: Extracts the table of contents. See https://github.com/huridocs/pdf_paragraphs_extraction

About

This is a fork, supported by HURIDOCS, of the Allen AI project PAWLS: https://github.com/allenai/pawls

