Fixed the way right to left language (Hebrew) text is inserted to pdf #8

Merged 2 commits on Feb 5, 2021

@smijo149 smijo149 commented Feb 4, 2021

Issue Description:

The PDF file generated by hocr-pdf has the Hebrew text printed in reverse order.

Steps I followed:

  1. Ran OCR on the image with Google Cloud Vision.
  2. Used gcv2hocr to generate the hOCR file.
  3. Ran hocr-pdf --savefile output.pdf actual-file.jpg to generate the PDF file.

The Hebrew text is inserted into the PDF file, but in reverse order.

Actual image:

[Screenshot: Screen Shot 2021-02-01 at 6 48 35 PM]

This is how hocr file looks:

[Screenshot: Screen Shot 2021-02-01 at 7 01 04 PM]

Text in pdf file: (I have set text visibility mode to 0 so that the inserted text is visible)

[Screenshot: Screen Shot 2021-02-01 at 6 48 56 PM]

Solution:

Use the python-bidi package, which implements the bidirectional (bidi) algorithm, to transform the text before drawing it into the PDF file.

The solution is based on the suggestion in ocropus#163
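
As a rough sketch of the approach described above (not the actual diff from this PR), each logical-order string can be converted to visual order with get_display from python-bidi before it is handed to the PDF text-drawing call. The helper name to_visual_order and the fallback branch are assumptions made for illustration. The fallback is a crude stand-in used only when python-bidi is not installed; it is not a real bidi implementation.

```python
import unicodedata

try:
    # Import form shown in the python-bidi documentation.
    from bidi.algorithm import get_display
except ImportError:
    get_display = None  # python-bidi is not installed


def to_visual_order(text: str) -> str:
    """Return text reordered for left-to-right drawing (hypothetical helper)."""
    if get_display is not None:
        return get_display(text)
    # Crude fallback (assumption, illustration only): reverse the string if it
    # contains any strong right-to-left character. This is NOT the full bidi
    # algorithm and mishandles mixed-direction text and numbers.
    if any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text):
        return text[::-1]
    return text


# "shalom" in Hebrew, written with escapes in logical (reading) order.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
visual = to_visual_order(hebrew)      # characters reordered for LTR drawing
english = to_visual_order("hello")    # Latin-only text passes through unchanged
```

Applied to each word's text just before it is drawn, a transform like this leaves Latin-only input untouched, which is consistent with the English test image still giving the expected result.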

Tests:

Tested with both a Hebrew text image and an English text image; both give the expected results.

For Issue:

https://github.com/StarfishCo/api.raven.com/issues/3740

@RedEnchilada RedEnchilada left a comment

I think technically, the proper solution for Python would be something like from bidi import algorithm and then it would just be referenced as algorithm, but I don't know that for sure and if this works then it's good to me 👍

@smijo149 smijo149 (author) commented Feb 5, 2021

@RedEnchilada That's a good suggestion. I originally took https://github.com/tesseract-ocr/tesstrain/blob/master/generate_wordstr_box.py as a reference, but the package documentation at https://pypi.org/project/python-bidi/ uses from bidi.algorithm import get_display in its example, so I made the change accordingly.
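
For reference, the two import styles discussed here should bind the same function, so the choice between them is stylistic. A small check, assuming python-bidi is installed (the helper name import_styles_agree is made up for this example, and it returns None when the package is missing):

```python
def import_styles_agree():
    """Compare the two import styles; return None if python-bidi is unavailable."""
    try:
        from bidi import algorithm              # style suggested in the review
        from bidi.algorithm import get_display  # style from the PyPI example
    except ImportError:
        return None
    # Both names refer to the same function object in the same module.
    return algorithm.get_display is get_display


result = import_styles_agree()
```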

@elgranchuy elgranchuy left a comment

LGTM

@elgranchuy
I'm surprised; I am afraid that hocr-pdf was never tested with RTL text.

@smijo149 smijo149 merged commit 3f3284b into master Feb 5, 2021