Bounding box is incorrect for text converted from Markdown. #410

dharnieshraja · 2024-11-26T14:34:01Z

Hi Devs,

I've been using amazon-textract-textractor for a while and have run into an issue: the bounding boxes for text converted to Markdown are inaccurate when I retrieve and draw them. I suspect the problem arises during the conversion from regular text to Markdown, where the boxes might not be handled correctly (or maybe I'm the only one who needs bounding boxes for Markdown text!).

From what I can tell, the bounding boxes are not adjusted for the Markdown text that the config generates. For example, when a prefix or suffix (set as a parameter in the config) is added, the bounding box needs to be updated accordingly. Similarly, with settings like same_layout_element_separator or add_prefixes_and_suffixes_as_words, vertical adjustments are needed as well.
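To make the expectation concrete, here is a minimal sketch of the behaviour I would expect for a layout element whose text gained a prefix. It is not the library's implementation: `Box`, `Token`, and `union_box` are hypothetical names made up for illustration, and the coordinates are invented. The idea is that tokens added by the config carry no geometry and should be skipped, so the element's box is the union of the real word boxes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (0..1)."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a linearized word.

    box is None for synthetic tokens (prefixes, suffixes, separators)
    that were added by the config and do not exist on the page.
    """
    text: str
    box: Optional[Box]


def union_box(tokens: Iterable[Token]) -> Optional[Box]:
    """Return the smallest box enclosing every token that has real geometry."""
    boxes = [t.box for t in tokens if t.box is not None]
    if not boxes:
        return None  # only synthetic tokens, nothing to draw
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    return Box(left, top, right - left, bottom - top)


# Illustrative data only: a section header with a "## " prefix.
header = [
    Token("## ", None),  # added by section_header_prefix, no geometry
    Token("Annual", Box(0.10, 0.20, 0.12, 0.03)),
    Token("Report", Box(0.24, 0.20, 0.11, 0.03)),
]
header_box = union_box(header)
```

With this approach the prefix never widens or shifts the box; whether the library should instead assign the prefix some geometry of its own is a separate design question.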

Is there a known solution to this issue? Here are the Textractor function and config I'm using.

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.markdown_linearization_config import MarkdownLinearizationConfig


def textractor_func(s3_path):
    extractor = Textractor(region_name='eu-west-2')
    document = extractor.start_document_analysis(
        file_source=s3_path,  # S3 URL
        features=[
            TextractFeatures.LAYOUT,
            TextractFeatures.TABLES,
            TextractFeatures.SIGNATURES,
            TextractFeatures.FORMS,
        ],
        save_image=False
    )

    config = MarkdownLinearizationConfig(
        hide_figure_layout=False,
        title_prefix="# ",
        section_header_prefix="## ",
        table_remove_column_headers=True,
        table_flatten_headers=True,
        hide_page_num_layout=True,
        hide_footer_layout=False,
        hide_header_layout=False,
        hide_table_layout=False,
        table_tabulate_format="github",
        hide_key_value_layout=False,
        figure_layout_prefix="<image>",
        figure_layout_suffix="</image>",
        signature_token="[SIGNATURE]",
        table_linearization_format="markdown",
        same_paragraph_separator="\n",
        same_layout_element_separator="\n\n",
        add_prefixes_and_suffixes_as_words=True
    )

    # Text output is generated from Textractor's response with this config;
    # this function itself returns the document as-is.
    return document
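
For context on the drawing step, this is roughly how I turn a box into pixel coordinates. It is a minimal sketch that assumes the box is expressed as fractions of the page width and height, which is how Textract reports geometry; `to_pixel_rect` is my own helper name, not part of the library, and the numbers are made up.

```python
from typing import Tuple


def to_pixel_rect(
    x: float,
    y: float,
    width: float,
    height: float,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, width, height) box to pixel corners.

    Returns (left, top, right, bottom), the form most drawing APIs expect.
    """
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + width) * image_width)
    bottom = round((y + height) * image_height)
    return left, top, right, bottom


# Illustrative values: a box on a 1000 x 2000 pixel page image.
rect = to_pixel_rect(0.10, 0.20, 0.25, 0.03, image_width=1000, image_height=2000)
```

The resulting tuple can be passed to a rectangle-drawing call in whichever imaging library is in use. Any error in the normalized box carries straight through this scaling, which is how the inaccurate boxes show up when drawn.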
