PDF extraction pipelines and benchmarks agenda

Caution

Parts of the text in this repo were written by ChatGPT. Also, I haven't yet run all of the pipelines because of a lack of compute power.

This repository provides an overview of notable pipelines and benchmarks related to PDF/OCR document processing. Each entry includes a brief description and useful data.

Comparison

Important

Open README.md on a separate page, not in the repository preview! It will look better.

Pipeline	OmniDocBench Overall ↓	OmniDocBench New ↑	olmOCR Overall ↑	dp-bench NID ↑
MinerU	0.133 ^[2] ⚠️	90.67 ^[1] ⚠️	61.5	91.18
MonkeyOCR	0.138	88.85 ^[2]	75.8 ^[3] ⚠️
PP-StructureV3	0.145	86.73
Marker	0.296	71.3	70.1
Pix2Text	0.32
olmOCR	0.326	81.79	78.5 ^[2] ⚠️
Unstructured	0.586
DocLing	0.589
Open-Parse	0.646
MarkItDown
Zerox
Markdrop
Vision Parse
↓ Specialized VLMs
dots.ocr	0.125 ^[1] ⚠️	88.41 ^[3]	79.1 ^[1] ⚠️
POINTS-Reader	0.133 ^[3] ⚠️	80.98
Dolphin	0.205	74.67
OCRFlux	0.238	74.82
Nanonets-OCR	0.283	85.59	64.5
GOT-OCR	0.287		48.3
Nougat	0.452
SmolDocling	0.493
↓ Proprietary pipelines
Mathpix	0.191
Mistral OCR	0.268	78.83	72.0
Google Document AI				90.86
Azure OCR				87.69
Amazon Textract				96.71 ^[2]
LlamaParse				92.82 ^[3]
Upstage AI				97.02 ^[1] ⚠️
doc2x
↓ General VLMs
Gemini-2.5 Pro	0.148	88.03
Gemini-2.0 Flash	0.191		63.8
Qwen2.5-VL-72B	0.214	87.02	65.5
InternVL3-78B	0.218	80.33
GPT4o	0.233	75.02	69.9
Qwen2-VL-72B	0.252		31.5

Bold indicates the best result for a given metric; ^[1], ^[2], and ^[3] mark 1st, 2nd, and 3rd place in that benchmark.
An empty cell means the pipeline was not evaluated in that benchmark.
⚠️ means the results were reported by the pipeline's own authors.
↑ in a column name means a higher value is better; ↓ means a lower value is better.
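
The legend above can be made concrete with a small sketch. The Python below ranks a few pipelines on one metric while honoring the arrow direction. The scores are copied from the table; the `rank` helper and its parameter names are illustrative assumptions, not part of this repository.

```python
def rank(scores, higher_is_better):
    """Return pipeline names ordered from best to worst for one metric.

    Pipelines with no score (None) are skipped: they were not evaluated.
    """
    evaluated = {name: s for name, s in scores.items() if s is not None}
    return sorted(evaluated, key=evaluated.get, reverse=higher_is_better)


# OmniDocBench Overall (lower is better), values from the table above.
omnidocbench_overall = {
    "MinerU": 0.133,
    "MonkeyOCR": 0.138,
    "dots.ocr": 0.125,
    "MarkItDown": None,  # empty cell: not evaluated
}

# dp-bench NID (higher is better), values from the table above.
dp_bench_nid = {
    "MinerU": 91.18,
    "Amazon Textract": 96.71,
    "Upstage AI": 97.02,
}

print(rank(omnidocbench_overall, higher_is_better=False))
print(rank(dp_bench_nid, higher_is_better=True))
```

Passing the direction explicitly keeps one helper usable for both the ↓ and ↑ columns.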

Pipelines

MinerU

Primary Language: Python

License: AGPL-3.0

Description: MinerU is an open-source tool designed to convert PDFs into machine-readable formats, such as Markdown and JSON, facilitating seamless data extraction and further processing. Developed during the pre-training phase of InternLM, MinerU addresses symbol conversion challenges in scientific literature, making it invaluable for research and development in large language models. Key features include:

Content Cleaning: Removes headers, footers, footnotes, and page numbers to ensure semantic coherence.
Structure Preservation: Maintains the original document structure, including titles, paragraphs, and lists.
Multimodal Extraction: Accurately extracts images, image descriptions, tables, and table captions.
Formula Recognition: Converts recognized formulas into LaTeX format.
Table Conversion: Transforms tables into LaTeX or HTML formats.
OCR Capabilities: Detects scanned or corrupted PDFs and enables OCR functionality, supporting text recognition in 84 languages.
Cross-Platform Compatibility: Operates on Windows, Linux, and Mac platforms, supporting both CPU and GPU environments.

Marker

Primary Language: Python

License: GPL-3.0

Description: Marker “converts PDFs and images to markdown, JSON, and HTML quickly and accurately.” It is designed to handle a wide range of document types in all languages and produce structured outputs.

Benchmark Results: https://github.com/VikParuchuri/marker?tab=readme-ov-file#performance

API Details:

API URL: https://www.datalab.to/
Pricing: https://www.datalab.to/plans
Average Price: $3 per 1000 pages, at least $25 per month

Additional Notes: Demo available after registration on https://www.datalab.to/

dots.ocr

License: MIT

Description: dots.ocr is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance on text, tables, and reading order tasks. The model supports over 100 languages and handles PDFs and images containing text, tables, and formulas. It offers a significantly more streamlined architecture than conventional methods that rely on complex, multi-model pipelines, allowing users to switch between tasks simply by altering the input prompt.

Benchmark Results: https://huggingface.co/rednote-hilab/dots.ocr#benchmark-results

API Details:

API URL: https://replicate.com/sljeff/dots.ocr

Additional Notes:

Built on 1.7B parameters, providing faster inference speeds than larger models
Supports both layout detection and content recognition in a unified architecture
Multiple deployment options including Docker, vLLM, Hugging Face Transformers, and cloud APIs
Strong multilingual capabilities with particular strength in low-resource languages
Can output structured data in JSON, Markdown, and HTML formats
Includes specialized prompts for different use cases: layout detection, OCR-only, and grounding OCR
4-bit quantized version available for consumer-grade GPUs

OCRFlux

License: Apache-2.0

Description: OCRFlux is a toolkit based on a multimodal large language model, designed to convert PDFs and images into clean, readable, plain Markdown text. It excels in complex layout handling, including multi-column layouts, figures, insets, complicated tables, and equations. The system also provides automated removal of headers and footers, alongside native support for cross-page table and paragraph merging, a pioneering feature among open-source OCR tools. Built on a 3 billion parameter vision-language model, it can run efficiently on GPUs such as the RTX 3090. OCRFlux provides batch inference support for whole documents, and its benchmarks show significant improvements in parsing quality over several leading OCR models.

Additional Notes:

Recommended GPU: 24GB or more VRAM for best performance, but supports tensor parallelism to divide workload across multiple smaller GPUs
Includes Docker container support for easy deployment
Supports various command-line options for customizing inference, GPU memory utilization, page merging behavior, and data type selection
Outputs results as JSONL files convertible into Markdown documents
Developed and maintained by ChatDOC team
Has 2.3k stars on GitHub

Dolphin

License: MIT

Description: Dolphin is a novel multimodal document image parsing model (0.3B parameters) that follows an analyze-then-parse paradigm. It addresses complex document understanding challenges through a two-stage approach: Stage 1 performs comprehensive page-level layout analysis by generating element sequences in natural reading order, while Stage 2 enables efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts. The model handles intertwined elements such as text paragraphs, figures, formulas, and tables while maintaining superior efficiency through its lightweight architecture and parallel parsing mechanism. Built on a vision-encoder-decoder architecture using Swin Transformer for visual encoding and MBart for text decoding, Dolphin supports both page-level and element-level parsing tasks.

API Details:

API URL: https://replicate.com/bytedance/dolphin
Average Price: Approximately $17 per 1000 pages (based on Replicate pricing of $0.017 per run)

Additional Notes:

Compact 0.3B parameter model optimized for efficiency
Supports both original config-based framework and Hugging Face integration
Multi-page PDF document parsing capability added in June 2025
TensorRT-LLM and vLLM support for accelerated inference
Two parsing granularities: page-level (entire document) and element-level (individual components)
Element-decoupled parsing strategy allows for easier data collection and training
Natural language prompt-based interface for controlling parsing tasks
Supports various document elements including text paragraphs, tables, formulas, and figures
Open-source with active development and community support (7.4k GitHub stars)
Published research paper accepted at ACL 2025 conference

Nanonets-OCR

License: Other (not specified)

Description: Nanonets-OCR-s is a powerful open-source OCR model that converts images or documents into richly structured markdown with intelligent content recognition and semantic tags. Key features include automatic LaTeX equation recognition, intelligent image description, signature detection, watermark extraction, smart checkbox handling, and complex table extraction. It is designed for downstream processing by large language models for tasks like document understanding and parsing.

API Details:

API URL: https://nanonets.com/ocr-api
Pricing: https://nanonets.com/pricing

Additional Notes:

The open-source model supports inference via Hugging Face transformers and vLLM server.
It can be fine-tuned and adapted for custom datasets.
Useful for research, experimentation, and building customized OCR pipelines without commercial restrictions.

PP-StructureV3

License: Apache-2.0

Description: PP-StructureV3 is a multi-model pipeline for document image parsing that converts document images or PDFs into structured JSON and Markdown files. It integrates several key modules: preprocessing for image quality improvements, an OCR engine (PP-OCRv5), layout detection via PP-DocLayout-plus, document item recognition (tables, formulas, charts, seals), and post-processing to reconstruct element relationships and reading order. The pipeline is designed for high accuracy in complex layouts including multi-column texts, magazines, handwritten documents, and vertically typeset languages.

It supports comprehensive recognition with specialized models for tables (PP-TableMagic), formulas (PP-FormulaNet_plus), charts (PP-Chart2Table), and seals (PP-OCRv4_seal). It achieves state-of-the-art results on benchmarks like OmniDocBench, especially for Chinese and English documents, competing well with expert and general vision-language models.

Additional Notes:

PP-StructureV3 uses PP-OCRv5 as the OCR backbone, which includes improvements in network architecture and training, supporting vertical text, handwriting, and rare Chinese characters.
Preprocessing includes document orientation classification and text unwarping.
Layout analysis uses PP-DocLayout-plus and a region detection model to handle multiple articles per page.
Table recognition with PP-TableMagic outputs HTML formatted structures.
Formula recognition with PP-FormulaNet_plus outputs LaTeX.
Chart parsing converts charts into markdown tables.
Seal recognition handles curved text and round/oval seals.
Post-processing enhances reading order reconstruction especially for complex document layouts (e.g., multi-column magazines, vertical typesetting).
Performance is tested on NVIDIA V100/A100 GPUs with detailed resource usage statistics available.
The system can process PDFs and images and can save results in JSON and Markdown formats.

MarkItDown

Primary Language: Python

License: MIT

Description: MarkItDown is a Python-based utility developed by Microsoft for converting various file formats into Markdown. It supports a wide range of file types, including:

Office Documents: Word (.docx), PowerPoint (.pptx), Excel (.xlsx)
Media Files: Images (with EXIF metadata and OCR capabilities), Audio (with speech transcription)
Web and Data Formats: HTML, CSV, JSON, XML
Archives: ZIP files (with recursive content parsing)
URLs: YouTube links

This versatility makes MarkItDown a valuable tool for tasks such as indexing, text analysis, and preparing content for Large Language Model (LLM) training. The utility offers both command-line and Python API interfaces, providing flexibility for various use cases. Additionally, MarkItDown features a plugin-based architecture, allowing for easy integration of third-party extensions to enhance its functionality.

olmOCR

Primary Language: Python

License: Apache-2.0

Description: olmOCR is an open-source toolkit developed by the Allen Institute for AI, designed to convert PDFs and document images into clean, plain text suitable for large language model (LLM) training and other applications. Key features include:

High Accuracy: Preserves reading order and supports complex elements such as tables, equations, and handwriting.
Document Anchoring: Combines text and visual information to enhance extraction accuracy.
Structured Content Representation: Utilizes Markdown to represent structured content, including sections, lists, equations, and tables.
Optimized Pipeline: Compatible with SGLang and vLLM inference engines, enabling efficient scaling from single to multiple GPUs.

MonkeyOCR

License: Apache-2.0

Description: MonkeyOCR is an open‑source, layout‑aware document parsing system developed by Yuliang‑Liu and collaborators that implements a novel Structure‑Recognition‑Relation (SRR) triplet paradigm. It decomposes document analysis into three phases—block structure detection (“Where is it?”), content recognition (“What is it?”), and reading‑order relation modeling (“How is it organized?”)—delivering both high accuracy and inference speed by avoiding heavy end‑to‑end models or brittle modular pipelines. Trained on the extensive MonkeyDoc dataset (nearly 3.9 million instances across English and Chinese, covering 10+ document types), MonkeyOCR achieves state‑of‑the‑art performance, including significant gains in table (+8.6%) and formula (+15.0%) recognition, and outperforms much larger models like Qwen2.5‑VL (72B) and Gemini 2.5 Pro. Remarkably, the 3B‑parameter variant runs efficiently—approximately 0.84 pages per second on multi‑page input using a single NVIDIA 3090 GPU—making it practical for real‑world document workloads.
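
As a rough sense of scale, the quoted throughput translates into wall-clock time as follows. This is a back-of-the-envelope sketch: the 0.84 pages per second figure comes from the description above, while the `estimated_hours` helper and the 10,000-page workload are illustrative assumptions. Real throughput will vary with document content and hardware.

```python
def estimated_hours(pages, pages_per_second=0.84):
    """Estimate wall-clock hours to process `pages` at a steady throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second / 3600


# Hypothetical workload of 10,000 pages on a single GPU.
hours = estimated_hours(10_000)
print(f"{hours:.1f} hours")
```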

POINTS-Reader

License: MIT

Description: POINTS-Reader is a powerful, distillation-free vision-language model for end-to-end document conversion developed by Tencent's WeChat AI team. It supports English and Chinese document extraction with a streamlined model architecture based on the POINTS1.5 structure, replacing Qwen2.5-7B-Instruct with the more efficient Qwen2.5-3B-Instruct. The input is a fixed prompt with a document image, and the output is a text string representing the extracted content without post-processing. The model excels in handling complex documents including tables, formulas, and multi-column layouts and is designed for high throughput in production environments with support for inference frameworks like SGLang and upcoming vLLM. It employs a novel two-stage data augmentation strategy: first training on large-scale synthetic data, then iteratively self-improving through annotation filtering and retraining on real-world documents. This approach achieves state-of-the-art performance, notably 0.133 overall score for English and 0.212 for Chinese on the OmniDocBench benchmark.

Additional Notes:

Built on a compact yet effective architecture prioritizing throughput and efficiency.
Offers a distillation-free two-stage training approach for continuous self-improvement without teacher model reliance.
Unified output format with consistent representation for text, tables (HTML), and formulas (LaTeX).
High scalability with support for large-scale synthetic data and real-world document adaptation.
Currently supports English and Chinese; multilingual support is limited compared to dots.ocr.
Designed to work well in production environments with mainstream inference frameworks (SGLang, upcoming vLLM support).
Open-source with source code and model weights available on GitHub and Hugging Face.
Suitable for complex layout parsing involving tables, multi-column text, and formulas with minimal post-processing needed.

Benchmark Results: https://github.com/Yuliang-Liu/MonkeyOCR?tab=readme-ov-file#benchmark-results

Mistral OCR

License: Proprietary

API Details:

API URL: https://docs.mistral.ai/capabilities/document/
Pricing: https://mistral.ai/products/la-plateforme#pricing
Average Price: $1 per 1000 pages

Google Document AI

License: Proprietary

Description: Google Document AI is a cloud-based document processing service that uses machine learning to automatically extract structured data from documents. It supports various document types, including invoices, receipts, forms, and identity documents. Key features include:

Optical Character Recognition (OCR): Converts scanned images and PDFs into editable text.
Data Extraction: Identifies and extracts key-value pairs, tables, and other structured data.
Document Understanding: Classifies and understands the content of documents.
Customization: Allows users to train custom models for specific document types.

API Details:

API URL: https://cloud.google.com/document-ai/docs/reference/rest
Pricing: https://cloud.google.com/document-ai/pricing
Average Price: $1.50 per 1000 pages

Azure OCR

License: Proprietary

Description: Azure AI Vision OCR is a cloud-based service that employs advanced machine-learning algorithms to extract printed and handwritten text from images and documents. It supports a wide array of languages and can process various content types, including posters, street signs, product labels, and business documents. The service is designed to detect text lines, words, and paragraphs, providing structured output suitable for integration into applications requiring text extraction capabilities.

API Details:

API URL: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/ocr
Pricing: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/
Average Price: $1 per 1,000 transactions

Amazon Textract

License: Proprietary

Description: Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) by also identifying the contents of fields in forms, information stored in tables, and the presence of selection elements such as checkboxes. This enables the conversion of unstructured content into structured data, facilitating integration into various applications and workflows.

API Details:

API URL: https://docs.aws.amazon.com/textract/latest/dg/API_Reference.html
Pricing: https://aws.amazon.com/textract/pricing/
Average Price: $1.50 per 1000 pages

LlamaParse

Primary Language: Python

License: Proprietary

Description: LlamaParse is a GenAI-native document parsing platform developed by LlamaIndex. It transforms complex documents—including PDFs, PowerPoint presentations, Word documents, and spreadsheets—into structured, LLM-ready formats. LlamaParse excels in accurately extracting and formatting tables, images, and other non-standard layouts, ensuring high-quality data for downstream applications such as Retrieval-Augmented Generation (RAG) and data processing. The platform supports over 10 file types and offers features like natural language parsing instructions, JSON output, and multilingual support.

API Details:

API URL: https://api.cloud.llamaindex.ai/api/parsing/upload
Pricing: https://docs.cloud.llamaindex.ai/llamaparse/usage_data
Average Price: Free Plan: 1,000 pages per day; Paid Plan: 7,000 pages per week, with additional pages at $3 per 1,000 pages

Mathpix

Primary Language: Not publicly available

License: Proprietary

Description: Mathpix offers advanced Optical Character Recognition (OCR) technology tailored for STEM content. Their services include the Convert API, which accurately digitizes images and PDFs containing complex elements such as mathematical equations, chemical diagrams, tables, and handwritten notes. The platform supports multiple output formats, including LaTeX, MathML, HTML, and Markdown, facilitating seamless integration into various applications and workflows. Additionally, Mathpix provides the Snipping Tool, a desktop application that allows users to capture and convert content from their screens into editable formats with a single keyboard shortcut.

API Details:

API URL: https://docs.mathpix.com/
Pricing: https://mathpix.com/pricing
Average Price: $5 per 1000 pages

Upstage AI

License: Proprietary

Description: Upstage AI is a comprehensive suite of artificial intelligence solutions designed to enhance business operations across various industries. It encompasses advanced large language models (LLMs) and document processing engines to streamline workflows and improve efficiency.

Benchmark Results: https://www.upstage.ai/blog/en/icdar-win-interview

API Details:

API URL: https://console.upstage.ai/docs/getting-started
Pricing: https://upstage.ai/pricing
Average Price: $10 per 1000 pages

Nougat

Primary Language: Python

License: MIT

Description: Nougat (Neural Optical Understanding for Academic Documents) is an open-source Visual Transformer model developed by Meta AI Research. It is designed to perform Optical Character Recognition (OCR) on scientific documents, converting PDFs into a machine-readable markup language. Nougat simplifies the extraction of complex elements such as mathematical expressions and tables, enhancing the accessibility of scientific knowledge. The model processes raw pixel data from document images and outputs structured markdown text, bridging the gap between human-readable content and machine-readable formats.

GOT-OCR

Primary Language: Python

License: Apache-2.0

Description: GOT-OCR (General OCR Theory) is an open-source, unified end-to-end model designed to advance OCR to version 2.0. It supports a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas, and sheet music. The model is highly versatile, supporting various input types and producing structured outputs, making it well-suited for complex OCR tasks.

Benchmark Results: https://github.com/Ucas-HaoranWei/GOT-OCR2.0#benchmarks

DocLing

Primary Language: Python

License: MIT

Description: DocLing is an open-source document processing pipeline developed by IBM Research. It simplifies the parsing of diverse document formats—including PDF, DOCX, PPTX, HTML, and images—and provides seamless integrations with the generative AI ecosystem. Key features include advanced PDF understanding, optical character recognition (OCR) support, and plug-and-play integrations with frameworks like LangChain and LlamaIndex.

SmolDocling

License: Apache-2.0

Description: SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion, developed by the Docling team. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

Zerox

Primary Language: TypeScript

License: MIT

Description: Zerox is an OCR and document extraction tool that leverages vision models to convert PDFs and images into structured Markdown format. It excels in handling complex layouts, including tables and charts, making it ideal for AI ingestion and further text analysis.

Benchmark Results: https://getomni.ai/ocr-benchmark

API Details:

API URL: https://getomni.ai/
Pricing: https://getomni.ai/pricing
Average Price: Extract structured data: 'Startup' plan at $225 per month with 5000 pages included, after that $2 per 1000 pages
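
The pricing above combines a flat fee with an overage rate, which is easy to misread. Here is a minimal sketch of the arithmetic, assuming the overage is billed pro rata per page (an assumption; the pricing page may bill in whole blocks). The function name and parameters are illustrative only.

```python
def monthly_cost(pages, base_fee=225.0, included_pages=5000, overage_per_1000=2.0):
    """Monthly cost in dollars: flat fee plus pro-rata overage beyond the included pages."""
    extra_pages = max(0, pages - included_pages)
    return base_fee + extra_pages / 1000 * overage_per_1000


# The flat fee dominates until volume is well past the included pages.
for pages in (1_000, 5_000, 50_000):
    print(pages, monthly_cost(pages))
```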

Unstructured

Primary Language: Python

License: Apache-2.0

Description: Unstructured is an open-source library that provides components for ingesting and pre-processing unstructured data, including images and text documents such as PDFs, HTML, and Word documents. It transforms complex data into structured formats suitable for large language models and AI applications. The platform offers enterprise-grade connectors to seamlessly integrate various data sources, making it easier to extract and transform data for analysis and processing.

API Details:

API URL: https://docs.unstructured.io/platform-api/overview
Pricing: https://unstructured.io/developers
Average Price: Basic Strategy: $2 per 1,000 pages, suitable for simple, text-only documents. Advanced Strategy: $20 per 1,000 pages, ideal for PDFs, images, and complex file types. Platinum/VLM Strategy: $30 per 1,000 pages, designed for challenging documents, including scanned and handwritten content with VLM API integration.

Pix2Text

Primary Language: Python

License: MIT

Description: Pix2Text (P2T) is an open-source Python3 tool designed to recognize layouts, tables, mathematical formulas (LaTeX), and text in images, converting them into Markdown format. It serves as a free alternative to Mathpix, supporting over 80 languages, including English, Simplified Chinese, Traditional Chinese, and Vietnamese. P2T can also process entire PDF files, extracting content into structured Markdown, facilitating seamless conversion of visual content into text-based representations.

Open-Parse

Primary Language: Python

License: MIT

Description: Open Parse is a flexible, open-source library designed to enhance document chunking for Retrieval-Augmented Generation (RAG) systems. It visually analyzes document layouts to effectively group related content, surpassing traditional text-splitting methods. Key features include:

Visually-Driven Analysis: Understands complex layouts for superior chunking.
Markdown Support: Extracts headings, bold, and italic text into Markdown format.
High-Precision Table Extraction: Converts tables into clean Markdown with high accuracy.
Extensibility: Allows implementation of custom post-processing steps.
Intuitive Design: Offers robust editor support for seamless integration.

Extractous

Primary Language: Rust

License: Apache-2.0

Description: Extractous is a high-performance, open-source library designed for efficient extraction of content and metadata from various document types, including PDF, Word, HTML, and more. Developed in Rust, it offers bindings for multiple programming languages, starting with Python. Extractous aims to provide a comprehensive solution for unstructured data extraction, enabling local and efficient processing without relying on external services or APIs. Key features include:

High Performance: Leveraging Rust's capabilities, Extractous achieves faster processing speeds and lower memory utilization compared to traditional extraction libraries.
Multi-Language Support: While the core is written in Rust, bindings are available for Python, with plans to support additional languages like JavaScript/TypeScript.
Extensive Format Support: Through integration with Apache Tika, Extractous supports a wide range of file formats, ensuring versatility in data extraction tasks.
OCR Integration: Incorporates Tesseract OCR to extract text from images and scanned documents, enhancing its ability to handle diverse content types.

Benchmark Results: https://github.com/yobix-ai/extractous-benchmarks

Markdrop

Primary Language: Python

License: GPL-3.0

Description: Markdrop is a Python package for converting PDFs to Markdown while extracting images and tables. It can also generate text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Vision Parse

Primary Language: Python

License: MIT

Description: Parse PDFs into markdown using Vision LLMs

doc2x

License: Proprietary

Description: NoEdgeAI is an open‑source technology initiative focused on enhancing document processing in Retrieval-Augmented Generation (RAG) workflows. Their flagship library, pdfdeal, is a Python wrapper for the Doc2X API that facilitates high‑fidelity PDF-to-text conversion. It extends Doc2X’s capabilities by offering local text preprocessing, Markdown and LaTeX extraction, file splitting, image uploading, and enhancements for better recall when integrating PDFs into knowledge‑base tools like Graphrag, Dify, or FastGPT.

API Details:

API URL: https://noedgeai.github.io/pdfdeal-docs/

Benchmarks

OmniDocBench

OmniDocBench is “a benchmark for evaluating diverse document parsing in real-world scenarios” by MinerU devs. It establishes a comprehensive evaluation standard for document content extraction methods.

[2025/09/25] Major update: Updated from v1.0 to v1.5

Evaluation code: (1) Updated the hybrid matching algorithm, allowing formulas and text to be matched with each other, which alleviates score errors caused by models outputting formulas as unicode; (2) Integrated CDM calculation directly into the metric section, so users with a CDM environment can compute the metric directly by calling CDM in the config file. The previous interface for outputting formula matching pairs as a JSON file is still retained, now named CDM_plain in the config file.
Benchmark dataset: (1) Increased the image resolution for newspaper and note types from 72 DPI to 200 DPI; (2) Added 374 new pages, balanced the number of Chinese and English pages, and increased the proportion of pages containing formulas; (3) Updated the language attributes of formulas; (4) Fixed typos in some text and table annotations.
Leaderboard: (1) Removed the Chinese/English grouping, now calculating the average score across all pages; (2) The Overall metric is now calculated as ((1 - text Edit distance) * 100 + table TEDS + formula CDM) / 3.
Note: The main branch of the evaluation code (this repo) and the dataset on HuggingFace and OpenDataLab are now updated to v1.5. If you still want to evaluate your model on v1.0, please check out the v1_0 branch.
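
The v1.5 Overall formula above can be written out directly. This is a minimal sketch, assuming the text edit distance is on a 0 to 1 scale and that TEDS and CDM are on a 0 to 100 scale, as the formula implies. The function name and the sample inputs are made up for illustration and are not leaderboard values.

```python
def omnidocbench_overall_v15(text_edit_distance, table_teds, formula_cdm):
    """Overall = ((1 - text edit distance) * 100 + table TEDS + formula CDM) / 3."""
    if not 0.0 <= text_edit_distance <= 1.0:
        raise ValueError("text_edit_distance must be in [0, 1]")
    return ((1.0 - text_edit_distance) * 100.0 + table_teds + formula_cdm) / 3.0


# Hypothetical inputs, chosen only to show the arithmetic.
score = omnidocbench_overall_v15(text_edit_distance=0.05, table_teds=85.0, formula_cdm=80.0)
print(round(score, 2))
```

Because the edit distance is inverted before averaging, the v1.5 Overall score is a higher-is-better number, unlike the edit-distance-based Overall of v1.0.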

Notable features: OmniDocBench covers a wide variety of document types and layouts, comprising 981 PDF pages across 9 document types, 4 layout styles, and 3 languages. It provides rich annotations: over 20k block-level elements (paragraphs, headings, tables, etc.) and 80k+ span-level elements (lines, formulas, etc.), including reading order and various attribute tags for pages, text, and tables. The dataset undergoes strict quality control (combining manual annotation, intelligent assistance, and expert review for high accuracy). OmniDocBench also comes with evaluation code for fair, end-to-end comparisons of document parsing methods. It supports multiple evaluation tasks (overall extraction, layout detection, table recognition, formula recognition, OCR text recognition) and standard metrics (Normalized Edit Distance, BLEU, METEOR, TEDS, COCO mAP/mAR, etc.) to benchmark performance across different aspects of document parsing.
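
Several of the metrics above are built on edit distance. As a minimal sketch, the snippet below computes one common definition of normalized edit distance: the Levenshtein distance divided by the length of the longer string. This is an illustrative assumption, not OmniDocBench's evaluation code, which may normalize or preprocess text differently.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, reference):
    """Edit distance scaled to [0, 1]; 0 means identical, lower is better."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance("kitten", "sitting"))
```

A score of 0 means the prediction matches the reference exactly, which is why the edit-distance columns are marked with ↓.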

End-to-End Evaluation

End-to-end evaluation assesses the model's accuracy in parsing PDF page content. The evaluation uses the model's Markdown output of the entire PDF page parsing results as the prediction.

Method Type	Methods	Overall^Edit↓		Text^Edit↓		Formula^Edit↓		Formula^CDM↑		Table^TEDS↑		Table^Edit↓		Read Order^Edit↓
Method Type	Methods	EN	ZH	EN	ZH	EN	ZH	EN	ZH	EN	ZH	EN	ZH	EN	ZH
Pipeline Tools	MinerU-0.9.3	0.15	0.357	0.061	0.215	0.278	0.577	57.3	42.9	78.6	62.1	0.18	0.344	0.079	0.292
	Marker-1.2.3	0.336	0.556	0.08	0.315	0.53	0.883	17.6	11.7	67.6	49.2	0.619	0.685	0.114	0.34
	Mathpix	0.191	0.365	0.105	0.384	0.306	0.454	62.7	62.1	77.0	67.1	0.243	0.32	0.108	0.304
	Docling-2.14.0	0.589	0.909	0.416	0.987	0.999	1	-	-	61.3	25.0	0.627	0.810	0.313	0.837
	Pix2Text-1.1.2.3	0.32	0.528	0.138	0.356	0.276	0.611	78.4	39.6	73.6	66.2	0.584	0.645	0.281	0.499
	Unstructured-0.17.2	0.586	0.716	0.198	0.481	0.999	1	-	-	0	0.064	1	0.998	0.145	0.387
	OpenParse-0.7.0	0.646	0.814	0.681	0.974	0.996	1	0.106	0	64.8	27.5	0.284	0.639	0.595	0.641
Expert VLMs	GOT-OCR	0.287	0.411	0.189	0.315	0.360	0.528	74.3	45.3	53.2	47.2	0.459	0.52	0.141	0.28
	Nougat	0.452	0.973	0.365	0.998	0.488	0.941	15.1	16.8	39.9	0.0	0.572	1.000	0.382	0.954
	Mistral OCR	0.268	0.439	0.072	0.325	0.318	0.495	64.6	45.9	75.8	63.6	0.6	0.65	0.083	0.284
	OLMOCR-sglang	0.326	0.469	0.097	0.293	0.455	0.655	74.3	43.2	68.1	61.3	0.608	0.652	0.145	0.277
	SmolDocling-256M_transformer	0.493	0.816	0.262	0.838	0.753	0.997	32.1	0.551	44.9	16.5	0.729	0.907	0.227	0.522
General VLMs	Gemini2.0-flash	0.191	0.264	0.091	0.139	0.389	0.584	77.6	43.6	79.7	78.9	0.193	0.206	0.092	0.128
	Gemini2.5-Pro	0.148	0.212	0.055	0.168	0.356	0.439	80.0	69.4	85.8	86.4	0.13	0.119	0.049	0.121
	GPT4o	0.233	0.399	0.144	0.409	0.425	0.606	72.8	42.8	72.0	62.9	0.234	0.329	0.128	0.251
	Qwen2-VL-72B	0.252	0.327	0.096	0.218	0.404	0.487	82.2	61.2	76.8	76.4	0.387	0.408	0.119	0.193
	Qwen2.5-VL-72B	0.214	0.261	0.092	0.18	0.315	0.434	68.8	62.5	82.9	83.9	0.341	0.262	0.106	0.168
	InternVL2-76B	0.44	0.443	0.353	0.290	0.543	0.701	67.4	44.1	63.0	60.2	0.547	0.555	0.317	0.228

Comprehensive evaluation of document parsing algorithms on OmniDocBench: performance metrics for text, formula, table, and reading order extraction, with overall scores derived from ground truth comparisons.

olmOCR eval

olmOCR-Bench works by testing various "facts" about document pages at the PDF-level. Our intention is that each "fact" is very simple, unambiguous, and machine-checkable, similar to a unit test. For example, once your document has been OCRed, we may check that a particular sentence appears exactly somewhere on the page.
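To make the unit-test analogy concrete, here is a minimal sketch of a presence check and an absence check on a page's extracted text. The helper names and the whitespace normalization are assumptions made for illustration; they are not the benchmark's actual test types or matching rules.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs so line breaks in the output do not matter."""
    return re.sub(r"\s+", " ", text).strip()


def text_present(page_text: str, expected: str) -> bool:
    """Pass if the expected sentence appears somewhere on the page."""
    return normalize(expected) in normalize(page_text)


def text_absent(page_text: str, unwanted: str) -> bool:
    """Pass if unwanted text (for example, a running header) was dropped."""
    return normalize(unwanted) not in normalize(page_text)


page = "Attention Is All\nYou Need\n\nThe dominant sequence transduction models..."
checks = [
    text_present(page, "Attention Is All You Need"),
    text_absent(page, "Page 3 of 15"),
]
score = 100 * sum(checks) / len(checks)  # share of passed checks, in percent
```

Each check returns a plain pass or fail, so a score for a document category can be read as the percentage of checks passed.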

Dataset Link: https://huggingface.co/datasets/allenai/olmOCR-bench

Model	ArXiv	Old Scans Math	Tables	Old Scans	Headers and Footers	Multi column	Long tiny text	Base	Overall
GOT OCR	52.7	52.0	0.20	22.1	93.6	42.0	29.9	94.0	48.3 ± 1.1
Marker v1.7.5 (base, force_ocr)	76.0	57.9	57.6	27.8	84.9	72.9	84.6	99.1	70.1 ± 1.1
MinerU v1.3.10	75.4	47.4	60.9	17.3	96.6	59.0	39.1	96.6	61.5 ± 1.1
Mistral OCR API	77.2	67.5	60.6	29.3	93.6	71.3	77.1	99.4	72.0 ± 1.1
Nanonets OCR	67.0	68.6	77.7	39.5	40.7	69.9	53.4	99.3	64.5 ± 1.1
GPT-4o (No Anchor)	51.5	75.5	69.1	40.9	94.2	68.9	54.1	96.7	68.9 ± 1.1
GPT-4o (Anchored)	53.5	74.5	70.0	40.7	93.8	69.3	60.6	96.8	69.9 ± 1.1
Gemini Flash 2 (No Anchor)	32.1	56.3	61.4	27.8	48.0	58.7	84.4	94.0	57.8 ± 1.1
Gemini Flash 2 (Anchored)	54.5	56.1	72.1	34.2	64.7	61.5	71.5	95.6	63.8 ± 1.2
Qwen 2 VL (No Anchor)	19.7	31.7	24.2	17.1	88.9	8.3	6.8	55.5	31.5 ± 0.9
Qwen 2.5 VL (No Anchor)	63.1	65.7	67.3	38.6	73.6	68.3	49.1	98.3	65.5 ± 1.2
olmOCR v0.1.75 (No Anchor)	71.5	71.4	71.4	42.8	94.1	77.7	71.0	97.8	74.7 ± 1.1
olmOCR v0.1.75 (Anchored)	74.9	71.2	71.0	42.2	94.5	78.3	73.3	98.3	75.5 ± 1.0

The olmOCR project also provides an evaluation toolkit (runeval.py) for side-by-side comparison of PDF conversion pipeline outputs. It lets researchers compare text extraction results from different pipeline versions directly against a gold-standard reference. The olmOCR authors also report a human evaluation in their technical report:

We then sampled 2,000 comparison pairs (same PDF, different tool). We asked 11 data researchers and engineers at Ai2 to assess which output was the higher quality representation of the original PDF, focusing on reading order, comprehensiveness of content and representation of structured information. The user interface used is similar to that in Figure 5. Exact participant instructions are listed in Appendix B.

Bootstrapped Elo Ratings (95% CI)

Model	Elo Rating ± CI	95% CI Range
olmOCR	1813.0 ± 84.9	[1605.9, 1930.0]
MinerU	1545.2 ± 99.7	[1336.7, 1714.1]
Marker	1429.1 ± 100.7	[1267.6, 1645.5]
GOT-OCR	1212.7 ± 82.0	[1097.3, 1408.3]

Table 7: Pairwise Win/Loss Statistics Between Models

Model Pair	Wins	Win Rate (%)
olmOCR vs. Marker	49/31	61.3
olmOCR vs. GOT-OCR	41/29	58.6
olmOCR vs. MinerU	55/22	71.4
Marker vs. MinerU	53/26	67.1
Marker vs. GOT-OCR	45/26	63.4
GOT-OCR vs. MinerU	38/37	50.7
Total	452
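The win rates above are wins divided by decided comparisons, and a bootstrapped Elo rating can be derived from the same pairwise outcomes. The sketch below shows one way to do that, assuming a standard Elo update with K = 32, a base rating of 1500, and 200 bootstrap resamples. These settings and all function names are assumptions, not the report's procedure, so the output will not reproduce the published ratings.

```python
import random
from statistics import mean

# (first model, second model, wins of first, wins of second), from the table above
PAIRWISE = [
    ("olmOCR", "Marker", 49, 31),
    ("olmOCR", "GOT-OCR", 41, 29),
    ("olmOCR", "MinerU", 55, 22),
    ("Marker", "MinerU", 53, 26),
    ("Marker", "GOT-OCR", 45, 26),
    ("GOT-OCR", "MinerU", 38, 37),
]


def win_rate(wins: int, losses: int) -> float:
    """Win rate in percent over decided comparisons."""
    return 100 * wins / (wins + losses)


def expand(pairwise):
    """Turn aggregate counts into a flat list of (winner, loser) matches."""
    matches = []
    for first, second, first_wins, second_wins in pairwise:
        matches += [(first, second)] * first_wins
        matches += [(second, first)] * second_wins
    return matches


def elo(matches, k=32.0, base=1500.0):
    """Sequential Elo updates; k and base are assumed values."""
    ratings = {}
    for winner, loser in matches:
        r_win = ratings.setdefault(winner, base)
        r_lose = ratings.setdefault(loser, base)
        expected = 1 / (1 + 10 ** ((r_lose - r_win) / 400))
        ratings[winner] = r_win + k * (1 - expected)
        ratings[loser] = r_lose - k * (1 - expected)
    return ratings


def bootstrap_elo(matches, rounds=200, seed=0):
    """Resample matches with replacement; return (mean, low, high) per model."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resample = rng.choices(matches, k=len(matches))
        for name, rating in elo(resample).items():
            samples.setdefault(name, []).append(rating)
    summary = {}
    for name, values in samples.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]   # ~2.5th percentile
        high = values[int(0.975 * (len(values) - 1))]  # ~97.5th percentile
        summary[name] = (mean(values), low, high)
    return summary
```

Calling `bootstrap_elo(expand(PAIRWISE))` returns a mean rating and an approximate 95% interval per model. Resampling is what gives the interval: each resample produces a slightly different rating, and the spread of those ratings reflects how sensitive the ranking is to the particular comparisons collected.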

Marker benchmarks

The Marker repository provides benchmark results comparing various PDF processing methods, scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.

Method	Avg Time	Heuristic Score	LLM Score
marker	2.83837	95.6709	4.23916
llamaparse	23.348	84.2442	3.97619
mathpix	6.36223	86.4281	4.15626
docling	3.69949	86.7073	3.70429

READoc

Methods	Text (Concat)	Text (Vocab)	Heading (Concat)	Heading (Tree)	Formula (Embed)	Formula (Isolate)	Table (Concat)	Table (Tree)	Reading Order (Block)	Reading Order (Token)	Average
Baselines
PyMuPDF4LLM	66.66	74.27	27.86	20.77	0.07	0.02	23.27	15.83	87.70	89.09	40.55
Tesseract OCR	78.85	76.51	1.26	0.30	0.12	0.00	0.00	0.00	96.70	97.59	35.13
Pipeline Tools
MinerU	84.15	84.76	62.89	39.15	62.97	71.02	0.00	0.00	98.64	97.72	60.17
Pix2Text	85.85	83.72	63.23	34.53	43.18	37.45	54.08	47.35	97.68	96.78	64.39
Marker	83.58	81.36	68.78	54.82	5.07	56.26	47.12	43.35	98.08	97.26	63.57
Expert Visual Models
Nougat-small	87.35	92.00	86.40	87.88	76.52	79.39	55.63	52.35	97.97	98.36	81.38
Nougat-base	88.03	92.29	86.60	88.50	76.19	79.47	54.40	52.30	97.98	98.41	81.42
Vision-Language Models
DeepSeek-VL-7B-Chat	31.89	39.96	23.66	12.53	17.01	16.94	22.96	16.47	88.76	66.75	33.69
MiniCPM-Llama3-V2.5	58.91	70.87	26.33	7.68	16.70	17.90	27.89	24.91	95.26	93.02	43.95
LLaVa-1.6-Vicuna-13B	27.51	37.09	8.92	6.27	17.80	11.68	23.78	16.23	76.63	51.68	27.76
InternVL-Chat-V1.5	53.06	68.44	25.03	13.57	33.13	24.37	40.44	34.35	94.61	91.31	47.83
GPT-4o-mini	79.44	84.37	31.77	18.65	42.23	41.67	47.81	39.85	97.69	96.35	57.98

Table 3: Evaluation of various Document Structured Extraction systems on READOC-arXiv.

Mistral-OCR benchmarks

Model	Overall	Math	Multilingual	Scanned	Tables
Google Document AI	83.42	80.29	86.42	92.77	78.16
Azure OCR	89.52	85.72	87.52	94.65	89.52
Gemini-1.5-Flash-002	90.23	89.11	86.76	94.87	90.48
Gemini-1.5-Pro-002	89.92	88.48	86.33	96.15	89.71
Gemini-2.0-Flash-001	88.69	84.18	85.80	95.11	91.46
GPT-4o-2024-11-20	89.77	87.55	86.00	94.58	91.70
Mistral OCR 2503	94.89	94.29	89.55	98.96	96.12

dp-bench

Source	Request date	TEDS ↑	TEDS-S ↑	NID ↑	Avg. Time (secs) ↓
upstage	2024-10-24	93.48	94.16	97.02	3.79
aws	2024-10-24	88.05	90.79	96.71	14.47
llamaparse	2024-10-24	74.57	76.34	92.82	4.14
unstructured	2024-10-24	65.56	70.00	91.18	13.14
google	2024-10-24	66.13	71.58	90.86	5.85
microsoft	2024-10-24	87.19	89.75	87.69	4.44

Actualize pro

In the digital age, PDF documents remain a cornerstone for disseminating and archiving information. However, extracting meaningful data from these structured and unstructured formats continues to challenge modern AI systems. Our recent benchmarking study evaluated seven prominent PDF extraction tools to determine their capabilities across diverse document types and applications.

PDF Parser	Overall Score (out of 10)	Text Extraction Accuracy (Score out of 10)	Table Extraction Accuracy (Score out of 10)	Reading Order Accuracy (Score out of 10)	Markdown Conversion Accuracy (Score out of 10)	Code and Math Equations Extraction (Score out of 10)	Image Extraction Accuracy (Score out of 10)
MinerU	8	9.3	7.3	8.7	8.3	6.5	7
Zerox	7.9	8.7	7.7	9	8.7	7	6
MarkItDown	7.78	9	6.83	9	7.67	7.83	5.83
Docling	7.3	8.7	6.3	9	8	6.5	5
LlamaParse	7.1	7.3	7.7	8.7	7.3	6	5.3
Marker	6.5	7.3	5.7	7.3	6.7	4.5	6.7
Unstructured	6.2	7.3	5	8.3	6.7	5	4.7

liduos.com

Function	MinerU	PaddleOCR	Marker	Unstructured	gptpdf	Zerox	Chunkr	pdf-extract-api	Sparrow	LlamaParse	DeepDoc	MegaParse
PDF and Image Parsing	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Parsing of Other Formats (PPT, Excel, DOCX, etc.)	✓	-	-	✓	-	✓	✓	-	✓	✓	✓	✓
Layout Analysis	✓	✓	✓	-	✓	-	✓	-	-	✓	✓	-
Text Recognition	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Image Recognition	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Simple (Vertical/Horizontal/Hierarchical) Tables	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Complex Tables	-	-	-	-	-	-	-	-	-	-	-	-
Formula Recognition	-	-	-	-	-	-	-	-	-	-	-	-
HTML Output	✓	-	✓	✓	-	-	✓	-	-	-	✓	-
Markdown Output	✓	✓	✓	-	✓	✓	✓	✓	✓	✓	-	✓
JSON Output	✓	-	✓	✓	-	-	✓	✓	-	✓	✓	-

Omni OCR Benchmark

JSON Accuracy

Model Provider	JSON Accuracy (%)
OmniAI	91.7%
Gemini 2.0 Flash	86.1%
Azure	85.1%
GPT-4o	75.5%
AWS Textract	74.3%
Claude Sonnet 3.5	69.3%
Google Document AI	67.8%
GPT-4o Mini	64.8%
Unstructured	50.8%

Cost per 1,000 Pages

Model Provider	Cost per 1,000 Pages ($)
GPT-4o Mini	0.97
Gemini 2.0 Flash	1.12
Google Document AI	1.50
AWS Textract	4.00
OmniAI	10.00
Azure	10.00
GPT-4o	18.37
Claude Sonnet 3.5	19.93
Unstructured	20.00

Processing Time per Page

Model Provider	Average Latency (seconds)
Google Document AI	3.19
Azure	4.40
AWS Textract	4.86
Unstructured	7.99
OmniAI	9.69
Gemini 2.0 Flash	10.71
Claude Sonnet 3.5	18.42
GPT-4o Mini	22.73
GPT-4o	24.85

Extractous benchmarks

extractous speedup relative to unstructured-io

extractous memory efficiency relative to unstructured-io
