Add multiline Japanese strings support to HocrVisualParser() to fix #534 #537

Closed
wants to merge 10 commits into from


YasushiMiyata
Contributor

@YasushiMiyata YasushiMiyata commented Apr 14, 2021

Description of the problems or issues

Is your pull request related to a problem? Please describe.
See #534

Does your pull request fix any issue.
Fix #534

Description of the proposed changes

For a multiline Japanese string such as 'AAAA\nBBBB', spacy[ja] sometimes produces a token that spans the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This PR defines the bbox of 'AB' as a multiline word: left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B'.
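The bbox rule for a token that spans a line break can be sketched as follows. This is only an illustration, not the code in this PR: `Bbox` and `merge_multiline_bbox` are hypothetical names, and it assumes hOCR-style coordinates where y grows downward and the character boxes are given in reading order.

```python
from typing import NamedTuple, Sequence


class Bbox(NamedTuple):
    """Hypothetical box type; coordinates follow hOCR (origin top-left)."""
    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_boxes: Sequence[Bbox]) -> Bbox:
    """Merge per-character boxes of one token that may span several lines.

    Assumes char_boxes is in reading order, so the first box is on the
    upper line and the last box is on the lower line.
    """
    if not char_boxes:
        raise ValueError("a token needs at least one character box")
    first, last = char_boxes[0], char_boxes[-1]
    return Bbox(
        left=min(b.left for b in char_boxes),    # min left of all characters
        top=first.top,                           # top of the first character
        right=max(b.right for b in char_boxes),  # max right of all characters
        bottom=last.bottom,                      # bottom of the last character
    )


# 'A' ends the first line, 'B' starts the second line (made-up coordinates).
a = Bbox(left=90, top=10, right=110, bottom=30)
b = Bbox(left=10, top=40, right=30, bottom=60)
merged = merge_multiline_bbox([a, b])
```

With these made-up coordinates, the merged box covers both lines across their full horizontal extent, which matches the definition above for the token 'AB'.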

Test plan

The problem stems from Japanese morphological analysis, so I added Japanese test data in 'tests/data/hocr_simple/japan.hocr' and test code in 'tests/parser/test_parser.py::test_parse_hocr'.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@YasushiMiyata
Contributor Author

pytest passes locally, but the tests fail on GitHub.
On GitHub, tests/test_postgres.py::test_cand_gen_cascading_delete fails with a SQLAlchemy-related error.
I am still investigating.

@YasushiMiyata YasushiMiyata marked this pull request as ready for review April 15, 2021 01:38
@YasushiMiyata
Contributor Author

I am closing this pull request for now because the failure is a SQLAlchemy-related error that is unrelated to this issue.

lukehsiao pushed a commit that referenced this pull request May 11, 2021
…ser to fix #534 and redo #537 (#542)

In case of multi line Japanese strings 'AAAA\nBBBB', spacy[ja] sometimes generates tokens ['AAA', 'AB', 'B', 'BB']. This commit defines bbox of 'AB' as a multi line word (i.e. left is min left of ['A','B'], top is the top of 'A', right is max right of ['A','B'] and bottom is the bottom of 'B').

See #537.
Development

Successfully merging this pull request may close these issues.

HOCRParser fails to multiline Japanese strings