Fixed the way right to left language (Hebrew) text is inserted to pdf #8

smijo149 · 2021-02-04T23:30:15Z

Issue Description:

The pdf file generated using hocr-pdf has Hebrew text printed in the opposite direction.

Steps I followed:

Used Google Cloud Vision to run OCR on the image.
Used gcv2hocr to generate the hOCR file.
Ran hocr-pdf --savefile output.pdf actual-file.jpg to generate the PDF file.

The pdf file has Hebrew text inserted in it but in the reverse order.

Actual image:

This is how hocr file looks:

Text in pdf file: (I have set text visibility mode to 0 so that the inserted text is visible)

Solution:

Use the python-bidi package, which implements the Unicode bidirectional (bidi) algorithm, to transform the text before drawing it into the PDF file.

Solution is based on the suggestion: ocropus#163
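As a rough illustration of the idea (not the exact change in this PR), the sketch below converts a line of OCR text from logical order to visual order before it would be handed to the PDF drawing call. The helper names contains_rtl and to_visual_order are hypothetical; get_display is the function from python-bidi mentioned in the discussion. If python-bidi is not installed, the sketch falls back to a plain character reversal, which is only a stand-in and is only correct for lines made up entirely of RTL text.

```python
import unicodedata

try:
    # Provided by the python-bidi package (pip install python-bidi).
    from bidi.algorithm import get_display
except ImportError:
    get_display = None

# Unicode bidi classes for strong right-to-left characters
# (R: Hebrew and similar scripts, AL: Arabic letters).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def to_visual_order(text):
    """Hypothetical helper: reorder text for left-to-right drawing.

    Left-to-right text is returned unchanged. Text containing RTL
    characters is passed through the bidi algorithm when python-bidi is
    available.
    """
    if not contains_rtl(text):
        return text
    if get_display is not None:
        return get_display(text)
    # Assumption: naive fallback, correct only for pure RTL lines.
    return text[::-1]


if __name__ == "__main__":
    for line in ["Hello world", "שלום"]:
        # The returned string is what would be drawn into the PDF.
        print(repr(to_visual_order(line)))
```

English lines pass through untouched, which matches the observation in the tests below that English images still give the expected result.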

Tests:

Tested with both a Hebrew text image and an English text image; both give the expected results.

For Issue:

https://github.com/StarfishCo/api.raven.com/issues/3740

RedEnchilada

I think, technically, the proper approach in Python would be something like from bidi import algorithm, after which it would just be referenced as algorithm. I don't know that for sure, though, and if this works then it's good to me 👍

smijo149 · 2021-02-05T00:03:45Z

@RedEnchilada That's a good suggestion. I originally took this as a reference: https://github.com/tesseract-ocr/tesstrain/blob/master/generate_wordstr_box.py. Then, in the actual package documentation at https://pypi.org/project/python-bidi/, I found that their example uses from bidi.algorithm import get_display, so I made the change accordingly.

elgranchuy

LGTM

elgranchuy · 2021-02-05T17:19:31Z

I'm surprised. I am afraid that hocr-pdf was never tested with RTL text.

Fixed the way right to left language (Hebrew) text is inserted into the document.

77ad804

smijo149 requested review from elgranchuy and RedEnchilada February 4, 2021 23:31

RedEnchilada approved these changes Feb 4, 2021

Updated code based on Tim's suggestion.

ad1d5ed

RedEnchilada approved these changes Feb 5, 2021

elgranchuy approved these changes Feb 5, 2021

smijo149 merged commit 3f3284b into master Feb 5, 2021
