
Bilingual Text Encoding is not Working for Kannada-English Output Hocr File #176

Open
vaibhavsanil opened this issue Jul 14, 2022 · 0 comments


I am facing an issue with hOCR-to-PDF conversion: bilingual English-Kannada text is not encoded correctly in the text layer of the PDF file.

I have an image in the Kannada language:
(https://drive.google.com/file/d/11P2XMFWjmc0S6rzfOX58UtZZJkG2StNI/view?usp=sharing)

The following is the corresponding hOCR output for the file:
https://drive.google.com/file/d/1wm-40rCN_rSE4cqT499kZAjAs5y6A3xl/view?usp=sharing

The following is the GCV OCR output for this file, in JSON:
OCR Output in JSON

The output of the hocr-pdf conversion is as follows:
Hocr-PDF output

As you can see, searching for English words highlights them correctly, but searching for Kannada text gives gibberish results in the output file generated by the hocr-pdf conversion.
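To help narrow this down, here is a minimal sketch that checks whether the Kannada text is still intact in the hOCR input, by counting characters in the Unicode Kannada block (U+0C80 to U+0CFF) in the text content of the markup. The function names and the file name in the usage note are hypothetical, and the sketch uses only the Python standard library, not hocr-pdf itself. It does not identify the cause; if the counts look right, that would only suggest the problem arises later, when the PDF text layer is written, which is an assumption on my part.

```python
from html.parser import HTMLParser

# Unicode Kannada block boundaries.
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF


class _TextCollector(HTMLParser):
    """Collects only text nodes, ignoring tags and attributes such as bbox titles."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def hocr_text(hocr_markup: str) -> str:
    """Return the plain text content of hOCR (HTML) markup."""
    collector = _TextCollector()
    collector.feed(hocr_markup)
    collector.close()  # flush any buffered text
    return "".join(collector.chunks)


def script_counts(text: str) -> dict:
    """Count Kannada-block code points and ASCII letters in the text."""
    kannada = sum(1 for ch in text if KANNADA_START <= ord(ch) <= KANNADA_END)
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return {"kannada": kannada, "latin": latin}
```

For example, reading a hypothetical `page.hocr` with `open("page.hocr", encoding="utf-8").read()` and passing the result through `script_counts(hocr_text(...))` should report a non-zero `kannada` count if the hOCR file itself carries the Kannada text correctly.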

Any guidance in this regard is appreciated.
