Incorrect order of text layouts due to compare_bounding_box() used in group_elements_horizontally() #389

Open
keitaf opened this issue Aug 20, 2024 · 3 comments

keitaf commented Aug 20, 2024

When I send a PDF with the following paragraph (which is a bit tilted, part of this PDF file)
image
and use Document.get_text(), I get the following text, where the order of the lines is shuffled.

Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter. 

I debugged the code, and it looks like it's due to text_util.compare_bounding_box(), which is called from layout.group_elements_horizontally().

group_elements_horizontally() receives a list of elements, which are layout texts for this paragraph.

The first element has BoundingBox as x: 0.08591524511575699, y: 0.4836207926273346, width: 0.6273355484008789, height: 0.03193599358201027 and text as 'The Center for Devices and Radiological Health (CDRH) of the Food and Drug'.

The second element has BoundingBox as x: 0.08505144715309143, y: 0.5002045631408691, width: 0.6902255415916443, height: 0.03553390130400658 and text as 'Administration (FDA) has completed its review of your premarket approval application the'.

group_elements_horizontally() sorts the elements by using compare_bounding_box(), and due to the following block, compare_bounding_box() sorts the elements by x axis instead of y axis.

    if abs(ay_mid - by_mid) < delta:
        if a.bbox.x > b.bbox.x:
            return 1
        else:
            return -1

Because of that, the second element comes before the first element after the sort.
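To make the effect concrete, here is a minimal sketch that replays the comparison on the two bounding boxes quoted above. The `BBox` and `Element` classes are stand-ins rather than the library's types, `delta` is passed in as a parameter because its real value is not shown in this issue (0.02 and 0.01 are assumed values chosen for illustration), and the fallback ordering by vertical midpoint is also an assumption.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Element:
    bbox: BBox
    text: str


def compare_bounding_box(a, b, delta):
    # Vertical midpoints of the two boxes.
    ay_mid = a.bbox.y + a.bbox.height / 2
    by_mid = b.bbox.y + b.bbox.height / 2
    # Block quoted in the issue: midpoints closer than delta are
    # treated as the same line and ordered by x.
    if abs(ay_mid - by_mid) < delta:
        if a.bbox.x > b.bbox.x:
            return 1
        else:
            return -1
    # Assumed fallback (not shown in the issue): order by midpoint.
    return 1 if ay_mid > by_mid else -1


first = Element(
    BBox(0.08591524511575699, 0.4836207926273346,
         0.6273355484008789, 0.03193599358201027),
    "The Center for Devices and Radiological Health (CDRH) of the Food and Drug",
)
second = Element(
    BBox(0.08505144715309143, 0.5002045631408691,
         0.6902255415916443, 0.03553390130400658),
    "Administration (FDA) has completed its review of your premarket approval application the",
)


def sort_elements(elements, delta):
    return sorted(
        elements,
        key=cmp_to_key(lambda a, b: compare_bounding_box(a, b, delta)),
    )


# The midpoints are about 0.018 apart, so an assumed delta of 0.02
# triggers the x comparison, while an assumed delta of 0.01 does not.
shuffled = sort_elements([first, second], delta=0.02)
ordered = sort_elements([first, second], delta=0.01)
```

Because the tilted page gives the lower line a slightly smaller x, the x comparison puts it first whenever the midpoint gap falls under the threshold.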

compare_bounding_box() was introduced in this commit, but it's unclear to me what the heuristic behind the logic was.

Could you please improve / fix the logic of compare_bounding_box(), and/or add an option to not use the heuristic and simply order the elements by y axis?
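For illustration, the "simply order by y axis" option could look roughly like the sketch below. `sort_by_y` and the `BBox` and `Element` classes are hypothetical names, not part of the library, and using x as a tie-breaker is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Element:
    bbox: BBox
    text: str


def sort_by_y(elements):
    # Hypothetical helper: order by vertical midpoint only, using x
    # just to break exact ties.
    return sorted(
        elements,
        key=lambda e: (e.bbox.y + e.bbox.height / 2, e.bbox.x),
    )


line_1 = Element(BBox(0.0859, 0.4836, 0.6273, 0.0319), "The Center for Devices ...")
line_2 = Element(BBox(0.0851, 0.5002, 0.6902, 0.0355), "Administration (FDA) ...")

result = [e.text for e in sort_by_y([line_2, line_1])]
```

This only suits elements that are already whole lines. Applied to individual words, a pure y ordering would scramble words within a tilted line, which is presumably why the library uses a tolerance in the first place.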

Contributor

Belval commented Aug 20, 2024

Thank you for sharing the problematic sample. I will need to reproduce the issue first. The code snippet that you highlighted is used to reconcile lines and ensure that the words within a given line are ordered by their x coordinate, so it should not result in what you are seeing, even though compare_bounding_box is indeed the culprit.

This line https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/utils/text_utils.py#L20 creates new lines if the y distance of the center of word a to the center of word b is too high. This is likely what is happening here.
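As a rough sketch of the behaviour described here (not the library's actual code), line reconciliation by y-center distance might look like this. `Word`, `group_into_lines`, and `max_gap` are hypothetical names, and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float
    y: float
    height: float


def y_center(word):
    return word.y + word.height / 2


def group_into_lines(words, max_gap):
    # Walk the words from top to bottom and start a new line whenever
    # the y-center jumps by more than max_gap from the previous word.
    lines = []
    for word in sorted(words, key=y_center):
        if lines and abs(y_center(word) - y_center(lines[-1][-1])) <= max_gap:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


words = [
    Word("are", 0.17, 0.141, 0.02),
    Word("Mr.", 0.20, 0.101, 0.02),
    Word("We", 0.12, 0.140, 0.02),
    Word("Dear", 0.10, 0.100, 0.02),
]

separate = [[w.text for w in line] for line in group_into_lines(words, max_gap=0.01)]
merged = [[w.text for w in line] for line in group_into_lines(words, max_gap=0.05)]
```

With the tighter threshold the two lines stay apart. With the looser one they collapse into a single line whose words are interleaved by x, which resembles the shuffled output reported above.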

Belval self-assigned this Aug 20, 2024
Contributor

Belval commented Aug 21, 2024

Same issue as #369

Author

keitaf commented Aug 22, 2024

Here is the Textract response JSON generated from this PDF file.

I can reproduce it by running the following code.

    from textractor.entities.document import Document

    document = Document.open('P010032A.pdf.json')
    text = document.get_text()
    print(text)

Output:

DEPARTMENT OF HEALTH & HUMAN SERVICES 

Public Health Service 
...
Dear Mr. Johnson: 

Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter. 
...
