Add multiline Japanese strings support to HocrVisualParser() to fix #534 #537

YasushiMiyata · 2021-04-14T07:26:02Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.
See #534

Does your pull request fix any issue.
Fix #534

Description of the proposed changes

For a multi-line Japanese string such as 'AAAA\nBBBB', spacy[ja] sometimes generates the tokens ['AAA', 'AB', 'B', 'BB'], where 'AB' spans the line break. This proposal defines the bbox of 'AB' as that of a multi-line word: left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B'.
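The bbox rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the code added to HocrVisualParser(): the names `Bbox` and `merge_multiline_bbox` are hypothetical, the coordinates are made up, and the boxes are assumed to be given per character in reading order.

```python
from typing import List, NamedTuple


class Bbox(NamedTuple):
    """Hypothetical bounding box in page coordinates (not a Fonduer type)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_bboxes: List[Bbox]) -> Bbox:
    """Merge per-character boxes, given in reading order, into one token box.

    left/right take the extremes over all characters, top comes from the
    first character and bottom from the last, as in the proposal.
    """
    if not char_bboxes:
        raise ValueError("a token needs at least one character bbox")
    return Bbox(
        left=min(b.left for b in char_bboxes),
        top=char_bboxes[0].top,
        right=max(b.right for b in char_bboxes),
        bottom=char_bboxes[-1].bottom,
    )


# Made-up coordinates: 'A' ends the first line, 'B' starts the second line.
a_box = Bbox(left=70, top=10, right=90, bottom=30)
b_box = Bbox(left=10, top=40, right=30, bottom=60)
ab_box = merge_multiline_bbox([a_box, b_box])
```

With these sample values the token 'AB' gets a box that covers the end of the first line and the start of the second, while a token that stays on one line keeps an ordinary single-line box.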

Test plan

The problem is caused by Japanese morphological analysis, so I have added Japanese test data at 'tests/data/hocr_simple/japan.hocr' and test code in 'tests/parser/test_parser.py::test_parse_hocr'.

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.rst accordingly.

YasushiMiyata · 2021-04-15T01:38:05Z

I ran pytest successfully locally, but the tests fail on GitHub.
On GitHub, pytest tests/test_postgres.py::test_cand_gen_cascading_delete fails with an sqlalchemy-related error.
I am still investigating.

YasushiMiyata · 2021-04-26T05:31:54Z

I am closing this pull request for now because the failure is an sqlalchemy-related error that is unrelated to this issue.

YasushiMiyata added 10 commits July 31, 2020 16:31

Enable RegexMatchSpan with concatenates words by sep="(separator)" option

3050707

Fix #270 Enable RegexMatchSpan with sep="(separator)" option. It concatenates mention spans to one word and does RegexMatch without consideration of the separator.

Merge remote-tracking branch 'upstream/master'

49d24f0

Add Change Log

a84621d

Change the _RegexMatch initial value of sep="(space)" to sep="" in matcher.py.

bd04acd

Merge branch 'master' of https://github.com/HazyResearch/fonduer

41504e9

Merge branch 'master' of https://github.com/HazyResearch/fonduer

4c5356e

Merge branch 'master' of https://github.com/HazyResearch/fonduer

e87b390

Add multiline Japanese strings support to HocrVisualParser() (fix #534)

66e36f5

Merge branch 'master' of https://github.com/YasushiMiyata/fonduer

7d35b52

Update CHANGELOG.rst

9ec9cd3

YasushiMiyata marked this pull request as ready for review April 15, 2021 01:38

YasushiMiyata closed this Apr 26, 2021

YasushiMiyata mentioned this pull request May 5, 2021

Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #540

Closed

4 tasks

lukehsiao pushed a commit that referenced this pull request May 9, 2021

Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537

54fcfd3

YasushiMiyata mentioned this pull request May 11, 2021

Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #542

Merged

4 tasks
