Incorrect table cell word and line order #369

Open
wessens opened this issue May 27, 2024 · 3 comments
Labels: bug, enhancement

Comments

wessens commented May 27, 2024

Hello, this issue seems very similar to #136, but I can't get it to work: the word and line order inside table cells is not preserved when calling the get_text method.

The attached JSON is the result of running Textract start_document_analysis with the features [TextractFeatures.TABLES, TextractFeatures.LAYOUT].

When running

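import json

import textractor
from textractor.entities.document import Document

# Load the saved Textract response (TABLES + LAYOUT features)
with open('../data/processed/6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json') as f:
    j = json.load(f)

# Build a Document from the parsed response and print the text of the second table
doc = Document.open(j)
print(doc.tables[1].get_text())

print(textractor.__version__)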

I get output such as:

...
of adolescent and girls

6.1.2.4 the Ensure
...

But the actual lines are "of adolescent girls and" and "6.1.2.4 Ensure the", and the line order is different as well.

The blocks seem fine, and the child order in "Relationships" also seems correct.
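
For anyone who wants to repeat that check, here is a minimal sketch that reads the words of each table cell straight from the raw response, in the order given by the CHILD relationships. It assumes the standard Textract block layout (`Id`, `BlockType`, `Text`, `Relationships`); the function name `cell_text_in_child_order` is made up for illustration and is not part of Textractor. It also assumes you pass in a flat list of blocks, so a multi-part job result would need its `Blocks` lists concatenated first.

```python
def cell_text_in_child_order(blocks):
    """Map each CELL block Id to its words, joined in CHILD relationship order.

    `blocks` is assumed to be a flat list of Textract block dicts.
    """
    by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block.get("BlockType") != "CELL":
            continue
        words = []
        # "Relationships" may be missing (or None) for empty cells
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "CHILD":
                continue
            for child_id in rel.get("Ids", []):
                child = by_id.get(child_id)
                # Skip non-word children such as selection elements
                if child is not None and child.get("BlockType") == "WORD":
                    words.append(child.get("Text", ""))
        cells[block["Id"]] = " ".join(words)
    return cells
```

If the strings returned here read correctly while get_text does not, the reordering happens after parsing rather than in the Textract response itself.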

What am I doing wrong?

6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json

wessens (Author) commented May 27, 2024

Sorry, I am using Textractor version 1.7.11

Belval (Contributor) commented May 27, 2024

I'll try to reproduce the issue on our side and get back to you on this. Thanks!

Belval self-assigned this May 27, 2024

Belval (Contributor) commented Aug 20, 2024

I was able to find the asset online. The issue is that the page is rotated, and Textractor has a significant amount of logic that orders elements by their x and y coordinates without first rectifying that rotation.

This will be an enhancement in a future version.
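
Until then, one possible workaround is to order the words yourself in a rectified coordinate frame. The sketch below is not Textractor's implementation and uses no Textractor API: `estimate_rotation`, `reading_order`, and the `line_tolerance` parameter are hypothetical names, and each word is assumed to be a plain dict with a `text` string and a `polygon` of four (x, y) points whose first two points are the top-left and top-right corners in the text's own orientation (as in the Textract `Geometry.Polygon` field, to the best of my understanding). It estimates the rotation from the top edge of each word, rotates the word centers back, and then sorts by line and by position within the line.

```python
import math


def estimate_rotation(words):
    """Circular mean of the top-edge angle (radians) over all words."""
    sin_sum = cos_sum = 0.0
    for word in words:
        (x0, y0), (x1, y1) = word["polygon"][0], word["polygon"][1]
        angle = math.atan2(y1 - y0, x1 - x0)
        sin_sum += math.sin(angle)
        cos_sum += math.cos(angle)
    return math.atan2(sin_sum, cos_sum)


def reading_order(words, line_tolerance=0.01):
    """Return lines of text ordered in a frame where the text is upright.

    `line_tolerance` is the largest gap (in rectified y units) between
    consecutive word centers that still counts as the same line.
    """
    angle = estimate_rotation(words)
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    rectified = []
    for word in words:
        points = word["polygon"]
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
        # Rotate the center by -angle so the reading direction becomes +x
        rx = cx * cos_a - cy * sin_a
        ry = cx * sin_a + cy * cos_a
        rectified.append((ry, rx, word["text"]))

    rectified.sort()  # top to bottom in the rectified frame

    lines, current, last_y = [], [], None
    for ry, rx, text in rectified:
        if last_y is not None and ry - last_y > line_tolerance:
            lines.append(current)
            current = []
        current.append((rx, text))
        last_y = ry
    if current:
        lines.append(current)

    # Left to right within each line
    return [" ".join(text for _, text in sorted(line)) for line in lines]
```

A single global angle only makes sense when the whole page (or cell) shares one rotation, and the tolerance has to be tuned to the line spacing of the document, so treat this as a starting point rather than a fix.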
