Multilingual OCR Development Plan #1048

D-DanielYang · 2020-10-28T15:31:30Z

model name	description	model size	download	Update Date
ch	Chinese and English	3.71M	inference model / trained model	2020.9.22
ch_tra	Chinese (traditional)	5.63M	inference model / trained model	2021.1.21
en	English	2.56M	inference model / trained model	2020.9.22
fr	French	2.65M	inference model / trained model	2021.9.22
ar	Arabic	2.53M	inference model / trained model	2021.1.21
es	Spanish	2.53M	inference model / trained model	2021.1.21
pt	Portuguese	2.63M	inference model / trained model	2021.1.21
ru	Russian	2.63M	inference model / trained model	2021.1.21
ge	German	2.65M	inference model / trained model	2020.9.22
kr	Korean	3.9M	inference model / trained model	2020.9.22
jp	Japanese	4.23M	inference model / trained model	2020.9.22
it	Italian	2.53M	inference model / trained model	2021.1.21
hi	Hindi	2.63M	inference model / trained model	2021.1.21
ug	Uyghur	2.63M	inference model / trained model	2021.1.21
fa	Persian	2.63M	inference model / trained model	2021.1.21
ur	Urdu	2.63M	inference model / trained model	2021.1.21
oc	Occitan	2.53M	inference model / trained model	2021.1.21
mr	Marathi	2.63M	inference model / trained model	2021.1.21
ne	Nepali	2.63M	inference model / trained model	2021.1.21
rs_cyrillic	Serbian (Cyrillic)	2.63M	inference model / trained model	2021.1.21
rs_latin	Serbian (Latin)	2.53M	inference model / trained model	2021.1.21
bg	Bulgarian	2.63M	inference model / trained model	2021.1.21
uk	Ukrainian	2.63M	inference model / trained model	2021.1.21
be	Belarusian	2.63M	inference model / trained model	2021.1.21
te	Telugu	2.63M	inference model / trained model	2021.1.21
kn	Kannada	2.63M	inference model / trained model	2021.1.21
ta	Tamil	2.63M	inference model / trained model	2021.1.21
mg	Mongolian	--	Ongoing
bg	Bangla	--	Need dict and corpus
bm	Burmese	--	Need dict and corpus	call for contribution
ku_cent	Kurdish (Central)	--	PR8347	call for contribution
od	Odia	--	PR6348	call for contribution
th	Thai	--	PR6719 issue chat	call for contribution
	More		TBC

Guideline for new language requests

If you want to request support for a new language, please open a PR with the following two files:

1. In folder ppocr/utils/dict,
   submit the dictionary file to this path and name it {language}_dict.txt. It should contain a list of all characters in your language. Please see the other files in that folder for the format.
2. In folder ppocr/utils/corpus,
   submit the corpus file to this path and name it {language}_corpus.txt. It should contain a list of words in your language.
   At least 50,000 words per language are probably needed.
   Of course, the more, the better.

We are calling for contributions to add new language support to PaddleOCR.
For anyone who might be interested in training a new language model, guidance on training the model is provided.

If your language has unique elements, please let me know in advance in any way you can, for example with useful links or Wikipedia articles.
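
The dictionary file described in the guideline can be derived from the corpus itself. The following is a minimal sketch in Python, not an official PaddleOCR tool. It assumes the dictionary format is one character per line (please confirm against the existing files in ppocr/utils/dict) and that the corpus is UTF-8 text. The function names and the `xx_` file names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def collect_chars(lines: Iterable[str]) -> List[str]:
    """Return every distinct non-whitespace character, sorted by code point."""
    seen = set()
    for line in lines:
        seen.update(ch for ch in line if not ch.isspace())
    return sorted(seen)


def write_dict_from_corpus(corpus_path: str, dict_path: str) -> int:
    """Write one character per line to dict_path and return the character count.

    Assumption: one character per line is the expected dictionary format.
    """
    with open(corpus_path, encoding="utf-8") as corpus:
        chars = collect_chars(corpus)
    Path(dict_path).write_text("\n".join(chars) + "\n", encoding="utf-8")
    return len(chars)


# Hypothetical file names following the {language}_corpus.txt and
# {language}_dict.txt naming pattern from the guideline:
# write_dict_from_corpus("xx_corpus.txt", "xx_dict.txt")
```

This splits text by Unicode code point. Scripts that rely on combining marks or ligatures may need a different unit, which is the kind of "unique element" worth mentioning in the PR.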


saheya · 2020-11-02T03:41:31Z

Traditional Mongolian

omar16100 · 2020-11-08T07:41:48Z

I would love to work on "Bangla"

levanpon98 · 2020-11-10T08:58:05Z

I would be very happy if you did that with Vietnamese.

HusseinYoussef · 2020-11-10T22:02:49Z

How about Arabic? That would be great.

Hieung28 · 2020-11-18T08:05:22Z

I've found out that the PaddleOCR algorithm cannot recognize some special characters (such as the comma, semicolon, or dot) when the language is English. Is there any way I can fix this problem?
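
One possible cause, which is an assumption and not confirmed anywhere in this thread, is that those characters are missing from the recognition dictionary, since a recognizer can only output characters listed in its dictionary. A minimal Python sketch for checking this is shown below. It assumes a dictionary file with one character per line; the function names and the `en_dict.txt` file name are hypothetical.

```python
from typing import Iterable, List


def load_dict(path: str) -> List[str]:
    """Read a dictionary file, assuming one character per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.rstrip("\n")]


def missing_from_dict(text: str, dict_chars: Iterable[str]) -> List[str]:
    """Return the distinct non-whitespace characters of `text` that are
    absent from the dictionary, in the order they first appear."""
    known = set(dict_chars)
    missing: List[str] = []
    for ch in text:
        if ch.isspace() or ch in known or ch in missing:
            continue
        missing.append(ch)
    return missing


# Hypothetical usage with a local dictionary file:
# print(missing_from_dict("Hello, world; done.", load_dict("en_dict.txt")))
```

If the punctuation marks show up in the returned list, the model was presumably trained without them, and a dictionary that includes them (plus retraining or fine-tuning) would be needed.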

GmGniap · 2020-11-27T22:50:17Z

I would like to contribute by adding the Burmese language. Is it only necessary to submit two text files, dict and corpus? What further steps do we need to take?

xeron56 · 2020-11-28T02:31:07Z

Adding "Bangla" would be great for the people of South Asia.

giranntu · 2020-12-07T02:47:05Z

Adding "Traditional Chinese (zh-TW)" would be great support.

Ru-Van · 2020-12-07T10:50:52Z

Do you have a pretrained Russian recognition model?

SasiAravind · 2020-12-21T16:16:32Z

Hi, adding the "Tamil" language would be much appreciated.

Tamil_dict.txt
Tamil_corpus.txt

If you need more help, please refer to this issue:
JaidedAI/EasyOCR#39

fcakyon · 2020-12-24T07:19:49Z

I can help with Turkish language.

krzynio · 2021-01-03T20:26:02Z

I can help with the Polish language.

xmy0916 · 2021-01-26T05:29:26Z

@GmGniap Hello, can you provide the corpus file for the Burmese language?

xmy0916 · 2021-01-26T06:36:58Z

@shahidul56 Hello, can you provide the corpus file for the Bangla language?

azmat21 · 2021-01-26T10:08:41Z

None of the models updated on 2021.1.21 can be downloaded. They fail with the following error:
{ code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

D-DanielYang · 2021-01-27T08:49:54Z

None of the models updated on 2021.1.21 can be downloaded. They fail with the following error:
{ code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Sorry for the invalid links. All of them have been fixed now, so you can try again.

redcinelli · 2021-01-27T19:44:52Z

I would be very happy if you did that with Vietnamese.

#1847, seems to be ongoing.

xmy0916 · 2021-01-28T06:36:42Z

@redcinelli Thank you very much. The Vietnamese model is in training and will be available soon~

fcakyon · 2021-01-28T07:06:24Z

model name	description	model size	download	Update Date
ch	Chinese and English	3.71M	inference model / trained model	2020.9.22
cht	Chinese (traditional)	5.63M	inference model / trained model	2021.1.21
en	English	2.56M	inference model / trained model	2020.9.22
fr	French	2.65M	inference model / trained model	2021.9.22
ar	Arabic	2.53M	inference model / trained model	2021.1.21
xi	Spanish	2.53M	inference model / trained model	2021.1.21
pu	Portuguese	2.63M	inference model / trained model	2021.1.21
ru	Russian	2.63M	inference model / trained model	2021.1.21
ge	German	2.65M	inference model / trained model	2020.9.22
kr	Korean	3.9M	inference model / trained model	2020.9.22
jp	Japanese	4.23M	inference model / trained model	2020.9.22
it	Italian	2.53M	inference model / trained model	2021.1.21
hi	Hindi	2.63M	inference model / trained model	2021.1.21
ug	Uyghur	2.63M	inference model / trained model	2021.1.21
fa	Persian	2.63M	inference model / trained model	2021.1.21
ur	Urdu	2.63M	inference model / trained model	2021.1.21
rs	Serbian (Latin)	2.53M	inference model / trained model	2021.1.21
oc	Occitan	2.53M	inference model / trained model	2021.1.21
mr	Marathi	2.63M	inference model / trained model	2021.1.21
ne	Nepali	2.63M	inference model / trained model	2021.1.21
rsc	Serbian (Cyrillic)	2.63M	inference model / trained model	2021.1.21
bg	Bulgarian	2.63M	inference model / trained model	2021.1.21
uk	Ukrainian	2.63M	inference model / trained model	2021.1.21
be	Belarusian	2.63M	inference model / trained model	2021.1.21
te	Telugu	2.63M	inference model / trained model	2021.1.21
ka	Kannada	2.63M	inference model / trained model	2021.1.21
ta	Tamil	2.63M	inference model / trained model	2021.1.21
mg	Mongolian	--	Ongoing
bg	Bangla	--	Need dict and corpus
vi	Vietnamese	--	Need dict and corpus
bm	Burmese	--	Need dict and corpus
tk	Turkish	--	Need dict and corpus
po	Polish	--	Need dict and corpus
	More		TBC

Guideline for new language requests

If you want to request support for a new language, please open a PR with the following two files:
1. In folder [ppocr/utils/dict](./ppocr/utils/dict),
   submit the dictionary file to this path and name it `{language}_dict.txt`. It should contain a list of all characters in your language. Please see the other files in that folder for the format.

2. In folder [ppocr/utils/corpus](./ppocr/utils/corpus),
   submit the corpus file to this path and name it `{language}_corpus.txt`. It should contain a list of words in your language.
   At least 50,000 words per language are probably needed.
   Of course, the more, the better.

If your language has unique elements, please let me know in advance in any way you can, for example with useful links or Wikipedia articles.

@grasswolfs The model name for Turkish should be "tr" instead of "tk"; "tr" is the widely used abbreviation for Turkish.

fcakyon · 2021-01-28T07:07:24Z

I have also opened a PR for the Turkish dict and corpora: #1856

tink2123 · 2021-02-02T03:42:00Z

Thanks @habout632 for adding Southeast Asian languages via #1896

yumeliu · 2021-03-16T14:22:56Z

Here is a dictionary for Greek.
el_dict.txt

alenma04 · 2021-03-16T17:10:20Z

Hi, do we have a model that detects all English characters along with special characters like . , " ( )?

Jane-Ding · 2021-05-12T08:44:52Z

Hi, thank you for the great work! I just wonder whether you will add Traditional Chinese to the general model. Right now, the general model supports Simplified Chinese, English, and numbers.

manshulgoel · 2022-11-09T11:15:16Z

Hi, I tried to run PaddleOCR on an image containing ← → ↑ ↓.
Everything is recognized correctly except the arrows and the text in red.

Kindly advise how to handle this.

AzizDZH · 2023-01-25T10:27:37Z

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt

ssavi-ict · 2023-03-13T16:42:57Z

@fcakyon @D-DanielYang @xmy0916

I would like to contribute a Bangla dictionary and corpus. Can I do that?

Also, I have a few questions:

1. I could not clearly understand this line: "If your language has unique elements, please tell me in advance within any way, such as useful links, wikipedia and so on."
2. For the corpus of at least 50000 words, I am guessing all of them should be unique. Am I right?
3. Is there any particular category of corpus words, for example excluding stop words or something similar?
4. In the ppocr/utils path I cannot see any corpus directory.

Thanks in advance
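
The thread does not say whether the 50000 words in the guideline must be unique. A contributor can at least report both numbers with the PR. The following is a minimal Python sketch under the assumption that the corpus is UTF-8 text with whitespace-separated words; the function name and the `bangla_corpus.txt` file name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total word count, unique word count) for a corpus.

    Words are obtained by splitting each line on whitespace.
    """
    counts = Counter(word for line in lines for word in line.split())
    return sum(counts.values()), len(counts)


# Hypothetical usage with a local corpus file:
# with open("bangla_corpus.txt", encoding="utf-8") as f:
#     total, unique = corpus_stats(f)
#     print(f"{total} words in total, {unique} unique")
```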

ariefwijaya · 2023-04-12T03:14:02Z

Please add Indonesian (id) and English (en) together.

Truong-Thanh-Quang · 2023-04-30T00:44:46Z

Do you have any plans for a Vietnamese release?

ursfan · 2023-05-19T18:58:58Z

Is it sufficient to change the file german_dict.txt if one wants to detect Fraktur, a historic German script, instead of the current script? Should the dictionary that was learned for the German language stay the same? For Tesseract, there is a trained file for Fraktur for running OCR on historic documents.

runachan19 · 2023-06-03T14:26:29Z

Need Indonesian language support, please.

zahamed · 2023-07-23T21:55:34Z

Hi, please add Bangla and English support. I have attached both files for Bangla.
bangla_dict.txt

bangla_corpus.txt

EdwardYGLi · 2023-07-27T19:50:47Z

Hi team. Great work on Paddle, it's an amazing OCR engine! Can we please have Hebrew support in multilanguage models ?

Thanks !

zahamed · 2023-07-29T13:34:02Z

Dear Team, thanks for your reply. I am from Bangladesh. I have already submitted both files, dict and corpus, for Bangla. I would appreciate it if you could add Bangla support. Thank you. Zahir


shreyaahh · 2023-08-11T08:58:54Z

Can you provide support for any ancient scripts?

hungtrieu07 · 2023-08-31T08:50:19Z

Truong

I'm trying with my private data, but the results are very poor.

SUPERustam · 2023-09-21T07:45:09Z

Sorry for my stupid question, I am a novice at DL: what is the difference between the inference model and the trained model?

github-actions · 2024-01-03T02:34:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

taeefnajib · 2024-01-11T20:28:00Z

I created a PR for Bangla

juvebogdan · 2024-01-22T19:57:58Z

Does this list contain the latest models? If I want to fine-tune, for example, the German model, do I use the link on this page to download the pretrained model? If so, which yml file should I use? How do I know the architecture of these models?

AzizDZH · 2024-03-01T06:00:32Z

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt

D-DanielYang added the "language requests" (Multilingual language requests) label Oct 29, 2020

D-DanielYang pinned this issue Jan 13, 2021

D-DanielYang unpinned this issue Jan 13, 2021

D-DanielYang pinned this issue Jan 25, 2021

D-DanielYang changed the title ~~Multilingual OCR development Plan~~ Multilingual OCR Development Plan Jan 25, 2021

D-DanielYang mentioned this issue Mar 16, 2021

How can I submit a new dictionary? #2240

Closed

JonhSilver mentioned this issue Oct 25, 2022

Training own language #8096

Closed

rama298 mentioned this issue Nov 28, 2022

[Feature] Text Recognition for Vietnamese Text open-mmlab/mmocr#1574

Open

andupotorac mentioned this issue Jan 14, 2023

Non english model neonbjb/tortoise-tts#164

Open

the-ge mentioned this issue Apr 13, 2023

Romanian Corpus and Character set #9456

Closed

This was referenced Jul 19, 2023

新增需求征集（Collect Feature Request） #10334

Closed

Slovak language request #8347

Closed

savikko mentioned this issue Sep 7, 2023

add finnish language files #10850

Merged

github-actions bot added the stale label Jan 3, 2024

gitapii mentioned this issue Jan 20, 2024

Would it be possible to train a German model? sdcb/PaddleSharp#82

Open

github-actions bot removed the stale label Jan 24, 2024

Ligoml unpinned this issue Feb 28, 2024

PaddlePaddle locked and limited conversation to collaborators Jun 5, 2024

SWHL converted this issue into discussion #12734 Jun 5, 2024

This issue was moved to a discussion.
