`get_text_from_layout_json` throws `'NoneType' object is not subscriptable` for a specific PDF #411

neil-sola · 2024-12-02T18:35:10Z

`get_text_from_layout_json` throws `'NoneType' object is not subscriptable` for a specific PDF.

Unfortunately, I can't share the specific PDF for privacy reasons, but this line seems to be the cause:

amazon-textract-textractor/prettyprinter/textractprettyprinter/t_pretty_print_layout.py

Line 173 in 9fb7d22

children = [(x, depth + 1) for x in relationships[0]['Ids']]

This might also be an issue with Textract's output itself, rather than this library's parsing. The issue seems isolated to one specific PDF; other PDFs work fine. Notes: it seems to be related to the metadata or structure of the file itself. Running it multiple times, changing the orientation, and deleting pages do not fix the issue.

Is this an error that anyone else has encountered or figured out a resolution for?

neil-sola · 2024-12-03T23:43:34Z

Found the specific issue: it is possible for a `LAYOUT_FIGURE` block to have `"Relationships": null`, which breaks this function because `relationships[0]` then tries to index `None`.

Example:

{"BlockType":"LAYOUT_FIGURE","ColumnIndex":null,"ColumnSpan":null,"Confidence":94.62890625,"EntityTypes":null,"Geometry":{"BoundingBox":{"Height":0.04673086851835251,"Left":0.06788529455661774,"Top":0.8822278380393982,"Width":0.4904918074607849},"Polygon":[{"X":0.06790152192115784,"Y":0.8822278380393982},{"X":0.5583770871162415,"Y":0.8828750252723694},{"X":0.558368444442749,"Y":0.9289587140083313},{"X":0.06788529455661774,"Y":0.9283040761947632}]},"Hint":null,"Id":"4859938e-4c4a-46bb-b40c-34d93486b824","Page":1,"PageClassification":null,"Query":null,"Relationships":null,"RowIndex":null,"RowSpan":null,"SelectionStatus":null,"Text":null,"TextType":null},
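
A possible guard for this case, as a minimal sketch rather than the library's actual code: `child_entries` is a hypothetical helper name, and it assumes each block is a plain dict shaped like the example above. It mirrors the failing list comprehension but treats a missing or `null` `Relationships` (or `Ids`) as "no children".

```python
from typing import Any, Dict, List, Tuple


def child_entries(block: Dict[str, Any], depth: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (child_id, depth + 1) pairs for a block.

    Returns an empty list when "Relationships" is absent, null, or empty,
    instead of raising "'NoneType' object is not subscriptable".
    """
    # `or []` covers both a missing key and an explicit null (None) value.
    relationships = block.get("Relationships") or []
    if not relationships:
        return []
    # Same logic as the original line, but also tolerant of a null "Ids".
    ids = relationships[0].get("Ids") or []
    return [(x, depth + 1) for x in ids]


# A block like the LAYOUT_FIGURE above has no children to visit.
figure = {"BlockType": "LAYOUT_FIGURE", "Relationships": None}
no_children = child_entries(figure, depth=0)

# A block with a relationship still yields its children one level deeper.
parent = {"Relationships": [{"Type": "CHILD", "Ids": ["a", "b"]}]}
children = child_entries(parent, depth=1)
```

With this guard, a `LAYOUT_FIGURE` without children is simply skipped when descending, while blocks that do have relationships behave as before.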

neil-sola mentioned this issue Dec 4, 2024

[textract-pretty-printer] add null check to get_text_from_layout_json parsing #412

Open
