Unwrap "ocrx_line" as well as "ocr_line" as Fonduer has no data model #526
Description of the problems or issues
Is your pull request related to a problem? Please describe.
Currently, the parser unwraps "ocr_line" elements because Fonduer has no data model for lines.
However, hOCR can also contain "ocrx_line" elements for lines.
Fonduer should unwrap these elements as well, for the same reason.
Does your pull request fix any issue?
N/A
Description of the proposed changes
Unwrap "ocrx_line" as well as "ocr_line", as Fonduer has no data model for lines.
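
For reviewers unfamiliar with what "unwrap" means here, a minimal sketch follows. It is not the code in this PR and not Fonduer's parser: it uses only the standard library's `xml.etree.ElementTree`, and the names `LINE_CLASSES`, `_append_text`, and `unwrap_lines` are hypothetical. It assumes well-formed XHTML-style hOCR and does not handle the case where the root element is itself a line.

```python
import xml.etree.ElementTree as ET

# Hypothetical constant: the hOCR classes treated as "line" wrappers.
LINE_CLASSES = {"ocr_line", "ocrx_line"}


def _append_text(parent, index, text):
    """Attach `text` immediately before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_lines(parent):
    """Replace each line element under `parent` with its own content, in place."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if LINE_CLASSES.isdisjoint(child.get("class", "").split()):
            unwrap_lines(child)  # not a line: keep it and look inside
            i += 1
            continue
        # Splice the line's text, children, and tail into the parent.
        parent.remove(child)
        _append_text(parent, i, child.text)
        grandchildren = list(child)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        _append_text(parent, i + len(grandchildren), child.tail)
        # Do not advance i, so the spliced-in elements are examined next.


# Illustrative input: one "ocrx_line" and one "ocr_line" inside a paragraph.
sample = ET.fromstring(
    '<p class="ocr_par">'
    '<span class="ocrx_line"><span class="ocrx_word">Hello</span> '
    '<span class="ocrx_word">world</span></span>'
    '<span class="ocr_line"><span class="ocrx_word">again</span></span>'
    "</p>"
)
unwrap_lines(sample)
```

After the call, the words hang directly off the paragraph, which is the shape a data model without a line level would expect.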
Test plan
Use pdftotree (HazyResearch/pdftotree#95) to convert md.pdf to md.hocr, which contains "ocrx_line" elements,
and check that md.hocr is parsed correctly.
Checklist