Unwrap "ocrx_line" as well as "ocr_line" as Fonduer has no data model #526

Merged 4 commits on Oct 21, 2020

@HiromuHota HiromuHota commented Oct 20, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

Currently, the parser unwraps "ocr_line" elements because Fonduer has no data model for lines.
However, hOCR can also contain "ocrx_line" elements for lines.
Fonduer should unwrap these elements as well, for the same reason.

Does your pull request fix any issue?

N/A

Description of the proposed changes

Unwrap "ocrx_line" as well as "ocr_line", as Fonduer has no data model for lines.
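
For readers unfamiliar with "unwrapping", the idea can be sketched as follows: each line-level element is removed and its children (the words) are attached directly to the line's parent. This is only an illustration that uses the standard library's xml.etree.ElementTree, not Fonduer's actual preprocessor code; the names LINE_CLASSES, unwrap_lines, and the sample markup are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: both class names mark line-level elements.
LINE_CLASSES = {"ocr_line", "ocrx_line"}


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_lines(root):
    """Replace every line element under root with its children, in place."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter()):
        classes = (elem.get("class") or "").split()
        if not LINE_CLASSES.intersection(classes) or elem not in parent_of:
            continue
        parent = parent_of[elem]
        index = list(parent).index(elem)
        children = list(elem)
        parent.remove(elem)
        # Preserve any text that sat directly inside the line element.
        _append_text(parent, index, elem.text)
        for offset, child in enumerate(children):
            parent.insert(index + offset, child)
            parent_of[child] = parent  # keep the map valid for nested lines
        # Preserve any text that followed the line element.
        _append_text(parent, index + len(children), elem.tail)


# Hypothetical hOCR fragment containing one line of each kind.
sample = (
    "<div class='ocr_page'><p class='ocr_par'>"
    "<span class='ocr_line'><span class='ocrx_word'>Hello</span></span>"
    "<span class='ocrx_line'><span class='ocrx_word'>world</span></span>"
    "</p></div>"
)
root = ET.fromstring(sample)
unwrap_lines(root)
```

After the call, the two word spans are direct children of the paragraph, and no element with class "ocr_line" or "ocrx_line" remains.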

Test plan

Use pdftotree (HazyResearch/pdftotree#95) to convert md.pdf to md.hocr, which contains "ocrx_line" elements,
and check that md.hocr is parsed correctly.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@HiromuHota HiromuHota marked this pull request as draft October 20, 2020 21:20
@HiromuHota (Contributor, Author) commented

mypy complains:

src/fonduer/utils/utils_table.py:27: error: Value of type variable "_LT" of "min" cannot be "_T"
src/fonduer/utils/utils_table.py:29: error: Value of type variable "_LT" of "min" cannot be "_T"

python/mypy#9582 could be related.

I will pin mypy at 0.782 for now and address this in a separate issue.

codecov-io commented Oct 21, 2020

Codecov Report

Merging #526 into master will decrease coverage by 0.00%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master     #526      +/-   ##
==========================================
- Coverage   86.03%   86.02%   -0.01%     
==========================================
  Files          92       92              
  Lines        4769     4774       +5     
  Branches      896      899       +3     
==========================================
+ Hits         4103     4107       +4     
- Misses        475      476       +1     
  Partials      191      191              
Flag          Coverage Δ
#unittests    86.02% <57.14%> (-0.01%) ⬇️

Impacted Files                                           Coverage Δ
...duer/parser/preprocessors/hocr_doc_preprocessor.py    86.31% <57.14%> (-0.36%) ⬇️

@HiromuHota HiromuHota marked this pull request as ready for review October 21, 2020 03:21
@lukehsiao lukehsiao merged commit a2d097b into HazyResearch:master Oct 21, 2020
@HiromuHota HiromuHota deleted the fix/ocrx_line branch October 21, 2020 16:03