Tesseract OCR data trained for Chinese

This is an alternative trained Tesseract data pack for Chinese OCR that is more accurate than the official ones.

The training fonts include commonly used fonts in four styles: Song (宋体), Hei (黑体), Kai (楷体), and Fangsong (仿宋). Short English sentences are also part of the training data.

Currently there are data packs for:

chi_sim: Simplified Chinese (China)
chi_tra: Traditional Chinese (HK style, TW style, Traditional style)
chi_all: Combined Simplified and Traditional Chinese (CN, HK, TW, Traditional style)

The LSTM packs also support Pinyin (chi_sim) and Bopomofo (chi_tra) characters.

Usage

Download the data packs from Releases, and copy the *.traineddata files into the tessdata directory of your Tesseract installation, replacing any existing files of the same name.
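The copy step can also be scripted. The following is a minimal sketch, assuming the release files were already downloaded to a local folder and that the target tessdata directory exists. The function names `install_traineddata` and `tesseract_command` and all paths are hypothetical and not part of this repository; `-l` and `--tessdata-dir` are standard Tesseract command-line options.

```python
import shutil
from pathlib import Path


def install_traineddata(download_dir, tessdata_dir):
    """Copy every *.traineddata file from download_dir into tessdata_dir.

    Existing files with the same name are overwritten. Returns the sorted
    list of file names that were copied.
    """
    download_dir = Path(download_dir)
    tessdata_dir = Path(tessdata_dir)
    copied = []
    for src in sorted(download_dir.glob("*.traineddata")):
        shutil.copy2(src, tessdata_dir / src.name)
        copied.append(src.name)
    return copied


def tesseract_command(image, output_base, lang="chi_sim", tessdata_dir=None):
    """Build (but do not run) a tesseract command line for the given pack."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the packs are not in the default tessdata directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd
```

The returned list can be passed to `subprocess.run(..., check=True)`; with an output base of `page`, Tesseract writes the recognized text to `page.txt`.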

To train the data yourself, get the fonts listed in fontlist.txt and put them into the fonts folder, then run the following for each pack:

mkdir train_chi_sim
cd train_chi_sim
python3 ../configure.py chi_sim
make
cd ..

mkdir train_chi_tra
cd train_chi_tra
python3 ../configure.py chi_tra
make
cd ..
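The two command sequences above follow the same pattern, so they can be driven from one script. This is only a sketch under stated assumptions: it is run against a checkout of this repository with the fonts already in place, and `training_steps` and `run_training` are hypothetical helper names, not scripts shipped here.

```python
import subprocess
from pathlib import Path


def training_steps(repo_root, langs=("chi_sim", "chi_tra")):
    """Return (working_dir, command) pairs mirroring the manual steps."""
    repo_root = Path(repo_root)
    steps = []
    for lang in langs:
        workdir = repo_root / f"train_{lang}"
        # configure.py lives one level above the per-language directory.
        steps.append((workdir, ["python3", "../configure.py", lang]))
        steps.append((workdir, ["make"]))
    return steps


def run_training(repo_root, langs=("chi_sim", "chi_tra")):
    """Create each train_<lang> directory and run its steps in order."""
    for workdir, cmd in training_steps(repo_root, langs):
        workdir.mkdir(exist_ok=True)
        # check=True stops at the first failing command.
        subprocess.run(cmd, cwd=workdir, check=True)
```

Keeping the plan (`training_steps`) separate from its execution (`run_training`) makes it possible to print or inspect the commands before starting a long training run.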

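Tesseract models for Chinese that are more accurate than the official ones.

The training set includes the commonly used Song, Hei, Kai, and Fangsong typefaces, and short English sentences were trained as well. The LSTM models support Pinyin letters and Bopomofo symbols.

The following model packs are currently provided: chi_sim, chi_tra, and chi_all.

The character set covers common punctuation marks, 〇, the Table of General Standard Chinese Characters (《通用规范汉字表》), and the most frequently used part of other character tables (selected by character frequency, with extension characters outside the Unicode BMP removed), as well as other commonly used Chinese characters that are not in the standard table.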
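To illustrate the BMP restriction mentioned above: a character is in the Unicode Basic Multilingual Plane when its code point is at most U+FFFF. The sketch below shows only that check. It is not taken from the repository's scripts, and `is_bmp` and `filter_bmp` are hypothetical names.

```python
def is_bmp(ch):
    """True if the character lies in the Unicode Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


def filter_bmp(text):
    """Drop every character outside the BMP, keeping the original order."""
    return "".join(ch for ch in text if is_bmp(ch))


# 〇 (U+3007) is in the BMP; 𠀀 (U+20000, CJK Extension B) is not.
print(filter_bmp("汉字〇𠀀"))  # prints 汉字〇
```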

Download the model packs from Releases and copy the *.traineddata files into the tessdata directory used by Tesseract, replacing the existing files.
