issue with ordering in extractions, markdown and gettext methods #388

red-sky17 · 2024-08-17T12:02:57Z

The attached input document contains text, then a table, followed by more text. We want the extracted text file to preserve the same order as the input PDF.

I tried extraction using different methods:

For 1.) and 2.), this is the code I am using:
textract_json = extractor.start_document_analysis( file_source="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf", features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES], save_image=False, )
response_textract_async = extractor.get_result(job_id=textract_json.job_id, api=Textract_API.ANALYZE)
markdown_text = response_textract_async.to_markdown()
1.) .to_markdown() method

The issue here is that the two tables end up at the bottom of the output.

2.) .get_text() method

In this case as well, the two tables are at the bottom. As expected, without a config parameter we do not get markdown output.

The third method is interesting.
The code used for it is:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(input_document="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf", features=[Textract_Features.LAYOUT,Textract_Features.TABLES],)
3.) get_text_from_layout_json(textract_json=textract_json)
I also tried get_text_from_layout_json(textract_json=textract_json, generate_markdown = True). Both calls produce the same output.

The issue with this method is that, as you can see, the data is repeated twice, and no markdown formatting is present.

@Belval or anyone, can you please suggest whether there is anything we can do to prevent this and get the text in the correct order, as it appears in the PDF file?

Thanks.

red-sky17 · 2024-08-17T13:37:20Z

Please also look at the output for the PDF attached below. The same issue is observed here: on the 1st page the tables are printed at the bottom; the output for the 2nd page follows.
Egypt_EG01_Credit Agricole.pdf

This is for the 2nd page:

Complete text file:
Egypt_EG01_Credit Agricole_using_markdown.txt

red-sky17 · 2024-08-17T14:04:18Z

In contrast, the ordering is correct in this text file, which was extracted using get_text_from_layout_json(textract_json=textract_json).
The remaining issue is the same as the one described under 3.) in the first comment.

text file for reference:

Egypt_EG01_Credit Agricole_using_gettextfromlayout_json.txt

I am wondering whether this is a bug in the .to_markdown() and get_text() methods, because with get_text_from_layout_json() we get the output in the correct order.

Ultimately, the goal is to get the extraction in the same order as get_text_from_layout_json, but with markdown table borders and no duplication.

So I believe it would be better if we could get a correct extraction using the .to_markdown() method alone, because it already produces markdown table borders and its only issue is ordering. That could probably be debugged by comparing how get_text_from_layout_json and to_markdown traverse the JSON dict.

Belval · 2024-08-20T18:08:57Z

I will test it first but this looks like a known issue that happens when the LAYOUT predictions do not match the TABLE predictions, causing the reading order to be wrong.
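To picture what such a mismatch could look like, here is a minimal sketch that flags tables whose bounding box is not mostly covered by any layout region. It is purely illustrative and not Textractor's internal logic: the `(left, top, width, height)` box tuples, the function names, and the 0.5 threshold are all assumptions.

```python
def overlap_ratio(inner, outer):
    """Return the fraction of `inner`'s area that lies inside `outer`.

    Both boxes are hypothetical (left, top, width, height) tuples in
    normalized page coordinates.
    """
    il, it, iw, ih = inner
    ol, ot, ow, oh = outer
    # Width and height of the intersection rectangle (negative if disjoint).
    w = min(il + iw, ol + ow) - max(il, ol)
    h = min(it + ih, ot + oh) - max(it, ot)
    area = iw * ih
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def unmatched_tables(table_boxes, layout_boxes, threshold=0.5):
    """Return the table boxes that no layout box covers by at least `threshold`."""
    return [
        t
        for t in table_boxes
        if all(overlap_ratio(t, l) < threshold for l in layout_boxes)
    ]


# Example: the first table coincides with a layout region, the second does not.
layout = [(0.1, 0.1, 0.8, 0.1), (0.1, 0.3, 0.8, 0.2)]
tables = [(0.1, 0.3, 0.8, 0.2), (0.1, 0.6, 0.8, 0.2)]
orphans = unmatched_tables(tables, layout)
```

A table that ends up in `orphans` has no layout region to anchor it, which is the kind of situation where it could plausibly be emitted out of place.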

Belval · 2024-08-20T18:14:22Z

What version of amazon-textract-textractor are you using? With 1.8.2 I get:

Page 2 of 10


Schneider Electric South East Asia (HQ) Pte. Ltd. Schneider Electric Overseas Asia Pte Ltd Schneider Electric Singapore Pte. Ltd. Schneider Electric IT Singapore Pte. Ltd. (formerly known as MGE Asia Pte Ltd) Schneider Electric IT Logistics Asia Pacific Pte. Ltd. Schneider Electric Logistics Asia Pte Ltd Schneider Electric Systems Singapore Pte. Ltd. (formerly known as Invensys Process Systems (S) Pte. Ltd.) 1 March 2017 

Previous Facility Letters. In the event that this Facility Letter is not accepted or lapses and is not extended by the Bank, the terms and conditions in the Previous Facility Letters shall continue to apply, save for any revision or amendments to the Interest Rate and any reduction in the amount of the Lines of Credit as stated herein. 

## A. LINE(S) OF CREDIT 



| AMOUNT          | TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| SGD20,000,000/- | Multi-currency Banker's Guarantee [including but not limited to Performance Guarantee or Payment Guarantee (for up to 60 months or such other tenor as may be agreed by the Bank from time to time) or to finance any other transactions acceptable to the Bank on a case-by-case subject to such conditions as may be determined by the Bank in its sole and absolute discretion] and/or Sight & Usance Letters of Credit (for up to 12 months) (with/ without control of goods) and/or Shipping Guarantee & Acceptance Under Usance Letters of Credit. |



## 1. PURPOSE 

The Facilities shall be used solely to finance the Borrower's working capital requirements. However, without prejudice to the Borrower's obligations, the Bank shall not be obliged to check that the Borrower does so or that the Facilities or any part thereof is utilized in such a manner. 

## 2. INTEREST RATE/COMMISSION/FEE 

(a) Commission on Banker's Guarantee shall be calculated on the face amount of the Banker's Guarantee for the period from the date of issuance upto the expiry date of the Banker's Guarantee, payable upfront as follows :- 
 
(b) Non-refundable Commission / Interest on the Trade Facilities shall be payable at the following rates and in the following manner:- 
(i) Letters of Credit 0.125% per month, minimum 2 months 



| Tenor                    | Commission    |
|--------------------------|---------------|
| Less than 3 years,       | 0.2%pa        |
| 3 years and upto 5 years | 0.25%pa       |

Which does not match what you are reporting.

red-sky17 · 2024-08-21T03:53:56Z

@Belval, I am attaching the input PDF. When tested on the single page that I attached in the first comment (the one you tested), it gives the same output that you got, but when tested on the whole PDF, I run into the issue.

I am using amazon-textract-textractor version 1.8.2.

this_pdf.pdf

Belval · 2024-08-21T16:26:06Z

Thank you for clarifying and sharing the file, I will attempt to reproduce the issue.

red-sky17 · 2024-09-12T09:44:00Z

Hello @Belval, were you able to reproduce this issue?

Chuukwudi · 2024-11-17T23:09:34Z

I have noticed this a few times myself.

If order is important, I usually get the bbox of each entity and sort by the x or y axis.

Combining page order with entity bboxes guarantees that order is maintained in the output.

Of course, you will need to know the format of your input PDF beforehand to do this.
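A minimal sketch of that idea, assuming a single-column page where top-to-bottom order is what matters. The `Block` class and `reading_order` function are hypothetical stand-ins for illustration, not part of Textractor; in practice the page number and bbox coordinates would come from whatever entities you extracted.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for an extracted entity (text block or table)."""
    page: int   # 1-based page number
    y: float    # top edge of the bounding box, normalized to 0..1
    x: float    # left edge of the bounding box, normalized to 0..1
    text: str


def reading_order(blocks):
    """Sort by page first, then top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.page, b.y, b.x))


# Example: tables were emitted after the text, as in the reported output.
blocks = [
    Block(page=1, y=0.10, x=0.1, text="intro paragraph"),
    Block(page=1, y=0.80, x=0.1, text="closing paragraph"),
    Block(page=2, y=0.10, x=0.1, text="page 2 paragraph"),
    Block(page=1, y=0.50, x=0.1, text="table"),
]
ordered = [b.text for b in reading_order(blocks)]
```

Sorting purely on the top edge will interleave columns on a multi-column page, which is why the layout has to be known in advance.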

red-sky17 changed the title ~~issue with ordering after extraction, in the final text file.~~ issue with ordering in extractions, markdown and gettext methods Aug 17, 2024
