[textractprettyprinter] List contents are duplicated when generating text output using get_text_from_layout_json #391

Open
adityachandak287 opened this issue Aug 27, 2024 · 4 comments

adityachandak287 commented Aug 27, 2024

Current Behavior

While trying to create markdown or text files from AWS Textract JSON output using the get_text_from_layout_json function, the contents of ALL the list items are duplicated in the output.

Expected Behavior

Each list item's contents should be included in the output only once.

Related Issues

Possible Solution

The AWS docs on Textract Layout Response Objects mention that in the case of LAYOUT_LIST elements, their children can point to LAYOUT_TEXT elements, which is the case here.

Layout elements can also point to different objects, such as TABLE objects, Key-Value pairs, or LAYOUT_TEXT elements in the case of LAYOUT_LIST

Because of this, when all layouts are collected from the Textract JSON output (LinearizeLayout._get_layout_blocks), both the LAYOUT_LIST element and its child LAYOUT_TEXT elements are included, so the same text is emitted twice in the output.

The get_text_from_layout_json function is a wrapper over the LinearizeLayout.get_text function, which loops over all layouts (blocks with a LAYOUT.* type) from the Textract JSON output and collects the text contents of their child blocks.
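
To make the mechanism concrete, here is a simplified sketch of how visiting every LAYOUT_* block can emit list text twice. It is not the library's code: the helper names (child_ids, collect_text, naive_linearize) and the sample blocks are hypothetical, and only the Id / BlockType / Relationships shape follows the Textract response format.

```python
from typing import Dict, List

# Hand-written blocks shaped like a Textract response: one LAYOUT_LIST whose
# CHILD relationship points at a LAYOUT_TEXT, which in turn points at a LINE.
BLOCKS: List[dict] = [
    {"Id": "list-1", "BlockType": "LAYOUT_LIST",
     "Relationships": [{"Type": "CHILD", "Ids": ["text-1"]}]},
    {"Id": "text-1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
    {"Id": "line-1", "BlockType": "LINE", "Text": "- first item"},
]


def child_ids(block: dict) -> List[str]:
    """Return the Ids listed under the block's CHILD relationships."""
    return [
        child_id
        for rel in block.get("Relationships") or []
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def collect_text(block: dict, by_id: Dict[str, dict]) -> List[str]:
    """Gather LINE text under a block, descending through nested layouts."""
    lines: List[str] = []
    for cid in child_ids(block):
        child = by_id[cid]
        if child["BlockType"] == "LINE":
            lines.append(child["Text"])
        else:
            lines.extend(collect_text(child, by_id))
    return lines


def naive_linearize(blocks: List[dict]) -> List[str]:
    """Visit every LAYOUT_* block, including ones nested inside a list."""
    by_id = {b["Id"]: b for b in blocks}
    out: List[str] = []
    for block in blocks:
        if block["BlockType"].startswith("LAYOUT_"):
            out.extend(collect_text(block, by_id))
    return out


# The line is reached once via list-1 and once via text-1.
print(naive_linearize(BLOCKS))  # ['- first item', '- first item']
```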

The fix lies in the LinearizeLayout._get_layout_blocks function where we can exclude the LAYOUT_TEXT elements which are children of LAYOUT_LIST elements.
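
As a rough illustration of that idea, the sketch below filters a flat list of Textract blocks so that LAYOUT_TEXT blocks referenced as children of a LAYOUT_LIST are dropped. The function name layout_blocks_without_list_text and the sample data are assumptions for illustration; the actual change inside LinearizeLayout._get_layout_blocks may look different.

```python
from typing import List, Set


def layout_blocks_without_list_text(blocks: List[dict]) -> List[dict]:
    """Keep LAYOUT_* blocks, minus LAYOUT_TEXT children of a LAYOUT_LIST."""
    # First pass: collect the Ids that any LAYOUT_LIST names as a CHILD.
    list_child_ids: Set[str] = set()
    for block in blocks:
        if block["BlockType"] != "LAYOUT_LIST":
            continue
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                list_child_ids.update(rel["Ids"])

    # Second pass: keep layouts that are not list-owned LAYOUT_TEXT blocks.
    return [
        b for b in blocks
        if b["BlockType"].startswith("LAYOUT_")
        and not (b["BlockType"] == "LAYOUT_TEXT" and b["Id"] in list_child_ids)
    ]


# Hypothetical sample: text-1 belongs to the list, text-2 stands alone.
SAMPLE = [
    {"Id": "list-1", "BlockType": "LAYOUT_LIST",
     "Relationships": [{"Type": "CHILD", "Ids": ["text-1"]}]},
    {"Id": "text-1", "BlockType": "LAYOUT_TEXT"},
    {"Id": "text-2", "BlockType": "LAYOUT_TEXT"},
    {"Id": "line-1", "BlockType": "LINE", "Text": "unrelated line"},
]

print([b["Id"] for b in layout_blocks_without_list_text(SAMPLE)])
# ['list-1', 'text-2']
```

The list's text is then emitted only through the LAYOUT_LIST block itself, while standalone LAYOUT_TEXT blocks are unaffected.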

Steps to Reproduce

Minimal reproduction repo: adityachandak287/textractprettyprinter-list-duplication-bug-repro

The repository contains the following for reference:

Environment
amazon-textract-caller==0.2.4
amazon-textract-prettyprinter==0.1.10
amazon-textract-response-parser==0.1.48
boto3==1.35.6
botocore==1.35.6

Edit: Added related issues section.

@adityachandak287
Author

I've made the proposed changes here. I'd be happy to create a PR!

I also updated the minimal repro repo to use this latest version. The comparison between the fixed branch and main shows the change in the sample output: list content is no longer duplicated.

@adityachandak287
Author

I found a couple of other cases where a LAYOUT_LIST had LAYOUT_SECTION_HEADER and LAYOUT_TITLE elements as its children. I created a PR with a fix that excludes all LAYOUT* elements which are children of LAYOUT_LIST elements.
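
The difference between excluding only LAYOUT_TEXT children and excluding every LAYOUT* child can be sketched as below. This is an illustrative comparison, not the code from the PR: top_level_layout_ids, its text_only flag, and the sample blocks are hypothetical.

```python
from typing import List, Set


def top_level_layout_ids(blocks: List[dict], text_only: bool) -> List[str]:
    """Return Ids of LAYOUT_* blocks that should be visited directly.

    text_only=True  drops only LAYOUT_TEXT children of a LAYOUT_LIST.
    text_only=False drops every LAYOUT_* child of a LAYOUT_LIST.
    """
    nested: Set[str] = set()
    for block in blocks:
        if block["BlockType"] == "LAYOUT_LIST":
            for rel in block.get("Relationships") or []:
                if rel["Type"] == "CHILD":
                    nested.update(rel["Ids"])

    kept: List[str] = []
    for block in blocks:
        kind = block["BlockType"]
        if not kind.startswith("LAYOUT_"):
            continue
        is_nested = block["Id"] in nested
        if is_nested and (kind == "LAYOUT_TEXT" or not text_only):
            continue  # already reachable through its parent list
        kept.append(block["Id"])
    return kept


# Hypothetical list that owns both a section header and a text block.
BLOCKS = [
    {"Id": "list-1", "BlockType": "LAYOUT_LIST",
     "Relationships": [{"Type": "CHILD", "Ids": ["header-1", "text-1"]}]},
    {"Id": "header-1", "BlockType": "LAYOUT_SECTION_HEADER"},
    {"Id": "text-1", "BlockType": "LAYOUT_TEXT"},
]

print(top_level_layout_ids(BLOCKS, text_only=True))   # ['list-1', 'header-1']
print(top_level_layout_ids(BLOCKS, text_only=False))  # ['list-1']
```

With the text-only rule the nested section header is still visited on its own, so its text would appear a second time; the broader rule leaves only the list.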

@Belval
Contributor

Belval commented Sep 17, 2024

Regarding that last message, do you have an example of a LAYOUT_LIST containing LAYOUT_SECTION_HEADER or LAYOUT_TITLE?

@adityachandak287
Author

Sure! I can't share the original document, but here's another sample document which recreates the LAYOUT_LIST -> LAYOUT_SECTION_HEADER scenario. JSON output for reference.

You can check this comparison to see the difference between ignoring only LAYOUT_TEXT children versus ignoring all layout children.
