Official Code Repository for the paper "KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents".
We propose Knowledge-Aware Preprocessing (KAP), a two-stage preprocessing framework tailored for Traditional Chinese non-narrative documents, designed to enhance retrieval accuracy in Hybrid Retrieval systems. Hybrid Retrieval, which integrates Sparse Retrieval (e.g., BM25) and Dense Retrieval (e.g., vector embeddings), has become a widely adopted approach for improving search effectiveness. However, its performance heavily depends on the quality of the input text, which is often degraded in non-narrative documents such as PDFs containing financial statements, contractual clauses, and tables. KAP addresses these challenges by integrating Multimodal Large Language Models (MLLMs) with LLM-driven post-OCR processing, refining the extracted text to reduce OCR noise, restore table structure, and optimize text format. By ensuring better compatibility with Hybrid Retrieval, KAP improves the accuracy of both Sparse and Dense Retrieval without modifying the retrieval architecture itself.

Clone the repository and navigate into the project directory:

```bash
git clone https://github.com/JustinHsu1019/KAP.git
cd KAP
```
Create and activate a virtual environment:
```bash
python3 -m venv kap_venv
source kap_venv/bin/activate
```
Install all required dependencies:
```bash
pip install -r requirements.txt
```
Additionally, install OCR and Docker-related dependencies:
```bash
./exp_src/preprocess/ocr/tessocr.sh
./exp_src/docker/docker_install.sh
```
Copy the example configuration file and set up your API keys:
```bash
cp config.ini config_real.ini
```

Edit `config_real.ini` and manually add your API keys:
- OpenAI API Key: Obtain from the OpenAI official website.
- Claude API Key: Obtain from the Claude official website.
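
For reference, a minimal sketch of what `config_real.ini` might look like; the section and key names below are hypothetical, so match them to the fields actually present in `config.ini`:

```ini
; Hypothetical layout -- check config.ini for the actual section and key names.
[OpenAI]
api_key = sk-...

[Claude]
api_key = sk-ant-...
```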
Navigate to the `docker` directory:

```bash
cd exp_src/docker
```

Modify the `docker-compose.yml` file:
- Replace the following line with your actual OpenAI API Key:
  `OPENAI_APIKEY: ${OPENAI_APIKEY}`
Start the Weaviate database using Docker Compose:
```bash
docker-compose up -d
```
- The dataset used in this study is privately provided by E.SUN Bank. You must obtain authorization from E.SUN Bank to access the dataset.
- If you want to reproduce our methodology, you can use any other dataset with a large number of tabular Chinese PDFs.
- Once obtained, place the dataset in the `data/` directory.
Generate augmented validation sets for evaluation:
```bash
python3 exp_src/auto_runall_pipeline/question_augment.py
```
Convert all PDFs into images for downstream processing:
```bash
python3 exp_src/convert_pdfs_to_images.py
```
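
For reference, a minimal sketch of such a conversion using the `pdf2image` library; the paths and library choice here are assumptions, and the repository script may differ:

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

# Hypothetical directories; the repository script may use different paths.
pdf_dir, img_dir = Path("data"), Path("data/images")
img_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in pdf_dir.glob("*.pdf"):
    # Render each page to an image for downstream OCR and MLLM input.
    for i, page in enumerate(convert_from_path(pdf_path, dpi=300)):
        page.save(img_dir / f"{pdf_path.stem}_p{i}.png", "PNG")
```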
Extract OCR text using the baseline Tesseract OCR:
```bash
python3 exp_src/rewrite.py --task Tess
```
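
A minimal sketch of what the Tesseract baseline does, using `pytesseract` with the Traditional Chinese model; the input path is hypothetical and the actual script may configure Tesseract differently:

```python
from PIL import Image
import pytesseract

# Hypothetical image path; requires the Tesseract chi_tra language data.
image = Image.open("data/images/sample_p0.png")
text = pytesseract.image_to_string(image, lang="chi_tra")
print(text)
```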
Run all text preprocessing pipelines, including the ablation studies and our proposed KAP framework:

```bash
python3 exp_src/auto_runall_pipeline/run_all_rewrite.py
```
Convert the processed text into vector representations and store them in the Weaviate vector database. This step includes:
- Text embedding using an OpenAI embedding model (for dense retrieval)
- Tokenization using Jieba (for BM25 retrieval)
- Storing the processed embeddings in the vector database
Run the following command to execute the full pipeline:
```bash
python3 exp_src/auto_runall_pipeline/run_all_db_insert.py
```
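
Conceptually, this step pairs each chunk's dense vector with a Jieba-tokenized copy for BM25. A minimal sketch under assumed names (the embedding model, Weaviate class, and property names are all hypothetical, and the v3 `weaviate` client is assumed):

```python
import jieba
import weaviate
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
db = weaviate.Client("http://localhost:8080")

def insert_chunk(text: str, doc_id: str) -> None:
    # Dense signal: embed the processed text with an OpenAI embedding model.
    vec = oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    # Sparse signal: pre-segment with Jieba so BM25 can match Chinese tokens.
    tokens = " ".join(jieba.cut(text))
    db.data_object.create(
        {"content": tokens, "doc_id": doc_id},  # hypothetical property names
        class_name="Document",                  # hypothetical class name
        vector=vec,
    )
```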
Execute retrieval experiments using pure sparse retrieval, dense retrieval, and hybrid retrieval:
```bash
python3 exp_src/auto_runall_pipeline/run_all_hybrid.py
```
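
Conceptually, hybrid retrieval fuses the two score lists; a common recipe is a weighted sum of min-max normalized scores. A minimal sketch of that fusion (the repository may instead rely on Weaviate's built-in hybrid query or a different weighting):

```python
def hybrid_scores(sparse: dict[str, float], dense: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """Fuse per-document scores: alpha weights dense, (1 - alpha) weights sparse."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        return {doc: (s - lo) / (hi - lo or 1.0) for doc, s in scores.items()}
    s, d = norm(sparse), norm(dense)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in s.keys() | d.keys()}
```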
To validate stability, the experiments in the paper were repeated three times. You may repeat steps 1-6 multiple times to reproduce and verify the results.
The core of our approach is MLLM-assisted Post-OCR enhancement.
To view or modify the prompts used for this step, navigate to:
```bash
cd exp_src/preprocess/ocr/
```
This directory contains all ablation experiments and our framework's prompt designs.
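
For orientation, a minimal sketch of what an MLLM-assisted post-OCR call looks like; the model choice and prompt wording below are illustrative only, so see the prompt files in this directory for the designs actually used in the paper:

```python
import base64
from openai import OpenAI

client = OpenAI()

def enhance_ocr(image_path: str, ocr_text: str) -> str:
    # Send both the page image and the noisy OCR text so the MLLM can
    # correct OCR errors, restore table structure, and normalize formatting.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Correct the OCR errors in the following "
                                         "text and restore any table structure:\n" + ocr_text},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```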
This study was supported by E.SUN Bank, which provided the dataset from the "AI CUP 2024 E.SUN Artificial Intelligence Open Competition." We sincerely appreciate E.SUN Bank for its generous data support, which has been invaluable to this research.
If you find the code provided with our paper useful, we kindly request that you cite our work:
```bibtex
@misc{hsu2025kapmllmassistedocrtext,
  title={KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents},
  author={Hsin-Ling Hsu and Ping-Sheng Lin and Jing-Di Lin and Jengnan Tzeng},
  year={2025},
  eprint={2503.08452},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2503.08452},
}
```