When I send a PDF with the following paragraph (which is a bit tilted, part of this PDF file)
and use Document.get_text(), I get the following text, where the order of the lines is shuffled.
Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter.
I debugged the code, and it looks like it's due to text_util.compare_bounding_box(), which is called from layout.group_elements_horizontally().
group_elements_horizontally() receives a list of elements, which are layout texts for this paragraph.
The first element has BoundingBox as x: 0.08591524511575699, y: 0.4836207926273346, width: 0.6273355484008789, height: 0.03193599358201027 and text as 'The Center for Devices and Radiological Health (CDRH) of the Food and Drug'.
The second element has BoundingBox as x: 0.08505144715309143, y: 0.5002045631408691, width: 0.6902255415916443, height: 0.03553390130400658 and text as 'Administration (FDA) has completed its review of your premarket approval application the'.
group_elements_horizontally() sorts the elements by using compare_bounding_box(), and due to the following block, compare_bounding_box() sorts the elements by x axis instead of y axis.
if abs(ay_mid - by_mid) < delta:
    if a.bbox.x > b.bbox.x:
        return 1
    else:
        return -1
Because of that, the second element comes before the first element after the sort.
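To make the failure concrete, here is a minimal sketch that replays the comparison on the two bounding boxes quoted above. It is not the library's code: the Box class, the y_mid helper, the fallback that orders by vertical midpoint, and the delta values 0.02 and 0.01 are all assumptions for illustration, since the actual delta used by compare_bounding_box() is not shown in this report.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Box:
    """Hypothetical stand-in for a layout text element and its bounding box."""
    x: float
    y: float
    width: float
    height: float
    text: str


def y_mid(box: Box) -> float:
    # Vertical midpoint of the box (assumed definition of ay_mid / by_mid).
    return box.y + box.height / 2


def make_comparator(delta: float):
    """Build a comparator shaped like the quoted block; delta is an assumed threshold."""
    def compare(a: Box, b: Box) -> int:
        if abs(y_mid(a) - y_mid(b)) < delta:
            # Treated as the same line: order by x only.
            return 1 if a.x > b.x else -1
        # Assumed fallback: otherwise order top to bottom.
        return 1 if y_mid(a) > y_mid(b) else -1
    return compare


first = Box(0.08591524511575699, 0.4836207926273346,
            0.6273355484008789, 0.03193599358201027,
            "The Center for Devices and Radiological Health (CDRH) of the Food and Drug")
second = Box(0.08505144715309143, 0.5002045631408691,
             0.6902255415916443, 0.03553390130400658,
             "Administration (FDA) has completed its review of your premarket approval application the")

# Distance between the two vertical midpoints.
mid_gap = abs(y_mid(first) - y_mid(second))

# A threshold larger than the gap merges both lines into one "row".
sorted_loose = sorted([first, second], key=cmp_to_key(make_comparator(0.02)))
# A threshold smaller than the gap keeps the reading order.
sorted_tight = sorted([first, second], key=cmp_to_key(make_comparator(0.01)))

print(f"midpoint gap: {mid_gap:.4f}")
print("delta=0.02:", [b.text[:14] for b in sorted_loose])
print("delta=0.01:", [b.text[:14] for b in sorted_tight])
```

In this sketch, any threshold larger than the midpoint gap makes the comparator treat the two lines as one row. The tiny difference in x caused by the tilt then decides the order.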
compare_bounding_box() was introduced in this commit, but it's unclear to me what heuristic the logic is based on.
Could you please improve / fix the logic of compare_bounding_box(), and/or add an option to not use the heuristic and simply order the elements by y axis?
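For reference, the "simply order by y axis" behavior I have in mind could look roughly like the sketch below. The Line tuple and order_by_y function are hypothetical names for illustration, not part of the library, and using x as a tie-breaker is my own assumption.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """Hypothetical minimal view of a layout text line."""
    x: float
    y: float
    text: str


def order_by_y(lines: List[Line]) -> List[Line]:
    # Sort strictly top to bottom; x only breaks exact ties in y.
    return sorted(lines, key=lambda line: (line.y, line.x))


lines = [
    Line(0.08505144715309143, 0.5002045631408691,
         "Administration (FDA) has completed its review of your premarket approval application the"),
    Line(0.08591524511575699, 0.4836207926273346,
         "The Center for Devices and Radiological Health (CDRH) of the Food and Drug"),
]

ordered = order_by_y(lines)
for line in ordered:
    print(line.text)
```

This would not handle multi-column layouts or several elements on the same row, which is presumably what the heuristic is for, so it would make more sense as an opt-in option than as a replacement.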
Thank you for sharing the problematic sample. I will need to reproduce the issue first. The code snippet that you highlighted is used to reconcile lines and ensure that the words within a given line are ordered by their x coordinate, so it should not result in what you are seeing, even though compare_bounding_box is indeed the culprit.
from textractor.entities.document import Document
document = Document.open('P010032A.pdf.json')
text = document.get_text()
print(text)
DEPARTMENT OF HEALTH & HUMAN SERVICES
Public Health Service
...
Dear Mr. Johnson:
Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter.
...