Issue with ordering in extractions with the to_markdown and get_text methods #388
Comments
Please also look at the output for the attached PDF, where the same issue occurs: on the 1st page the tables are printed at the bottom, and as for the second page complete text file: |
The ordering is correct, however, in this text file, which was extracted using get_text_from_layout_json(textract_json=textract_json). Text file for reference: Egypt_EG01_Credit Agricole_using_gettextfromlayout_json.txt. I suspect this is a bug in the .to_markdown() and get_text() methods, because get_text_from_layout_json() returns the output in the correct order. The final goal is an extraction like the one from get_text_from_layout_json(), but with markdown table borders and no duplication. So I believe it would be better to get the extraction right using .to_markdown() alone: that method already produces the markdown borders, and its only issue is ordering, which I guess can be debugged by comparing how get_text_from_layout_json() and to_markdown() traverse the JSON dict. |
I will test it first but this looks like a known issue that happens when the LAYOUT predictions do not match the TABLE predictions, causing the reading order to be wrong. |
What version of
Which does not match what you are reporting. |
@Belval, I am attaching the input PDF. When tested on a single page, like the one I attached in the first thread (which you tested), it gives the same output you got; the issue only appears when the PDF is processed as a whole. I am using amazon-textract-textractor version 1.8.2. |
Thank you for clarifying and sharing the file, I will attempt to reproduce the issue. |
Hello @Belval, were you able to reproduce this issue? |
I have noticed this a few times myself. If order is important, I usually get the bbox of each entity and sort by the x or y axis. Combining page ordering with entity bboxes keeps the output in order. Of course, you will need to know the format of your input PDF beforehand to do this. |
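To make the bbox suggestion concrete, here is a minimal sketch of that kind of sort. It is only an illustration: `Block` and `reading_order` are hypothetical names, not part of Textractor, and it assumes a single-column layout with normalized coordinates where the origin is the top-left corner of the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for an extracted entity (not a Textractor class)."""
    page: int   # 1-based page number
    x: float    # left edge of the bounding box, normalized to 0..1
    y: float    # top edge of the bounding box, normalized to 0..1
    text: str


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort by page first, then top-to-bottom, then left-to-right.

    Assumes a single-column page; multi-column layouts would need the
    columns to be separated before sorting.
    """
    return sorted(blocks, key=lambda b: (b.page, b.y, b.x))


# Blocks arrive out of order, e.g. with the table pushed to the end of page 1.
blocks = [
    Block(page=1, x=0.1, y=0.10, text="intro text"),
    Block(page=1, x=0.1, y=0.80, text="closing text"),
    Block(page=2, x=0.1, y=0.10, text="page 2 text"),
    Block(page=1, x=0.1, y=0.40, text="table"),
]
ordered = [b.text for b in reading_order(blocks)]
```

With real Textract output you would fill `page`, `x` and `y` from each entity's page number and bounding box before sorting; how you read those values depends on the library version, so treat that part as something to check against your own response.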
The attached input document contains text, then a table, followed by more text. We want the text file to follow the same order as the input PDF file.
I tried extraction using different methods:
For 1.) and 2.), this is the code I am using:
textract_json = extractor.start_document_analysis(
    file_source="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    save_image=False,
)
response_textract_async = extractor.get_result(job_id=textract_json.job_id, api=Textract_API.ANALYZE)
markdown_text = response_textract_async.to_markdown()
1.) .to_markdown() method
The issue here is that the two tables end up at the bottom.
2.) .get_text() method
In this case as well, the two tables are at the bottom and, as we know, without the config parameter we won't get markdown output.
now the third is interesting
the code used for this is:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json
textract_json = call_textract(
    input_document="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
    features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
)
3.) get_text_from_layout_json(textract_json=textract_json)
I also tried get_text_from_layout_json(textract_json=textract_json, generate_markdown=True); both cases give the same output.
The issue with this method, as you can see, is that the data is repeated twice, and there is no markdown formatting.
@Belval or anyone, can you please suggest anything we can do to prevent this and get the text in the correct order, as in the PDF file?
Thanks.