Bounding box is incorrect for text converted from Markdown. #410

Open
dharnieshraja opened this issue Nov 26, 2024 · 0 comments

dharnieshraja commented Nov 26, 2024

Hi Devs,

I've been using amazon-textract-textractor for a while and have run into an issue: the bounding boxes for Markdown-converted text are inaccurate when I retrieve and draw them. I suspect the problem arises during the conversion from regular text to Markdown, where the boxes may not be handled correctly (or maybe I'm the only one who needs bounding boxes for Markdown text!).

From what I can tell, the bounding boxes are not adjusted for the text generated by the Markdown linearization config. For example, when a prefix or suffix (set as a parameter in the config) is added, the bounding box needs to be updated accordingly. Similarly, when settings such as same_layout_element_separator or add_prefixes_and_suffixes_as_words are used, vertical adjustments are needed as well.
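
The prefix case described above can be sketched in isolation. This is a minimal illustration and not the library's implementation: the `Box` type and the `box_for_prefix` and `union` helpers are hypothetical names, and it assumes boxes are axis-aligned rectangles in normalized page coordinates. The idea is that an injected token such as "# " receives a zero-width box anchored on the first real word, so that merging the boxes of a line does not distort the result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def union(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in the list."""
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    return Box(left, top, right - left, bottom - top)


def box_for_prefix(first_word: Box) -> Box:
    """Give an injected prefix a zero-width box at the left edge of the first word.

    Assumption: the prefix has no position on the page, so it borrows the
    vertical extent of the word it precedes and occupies no horizontal space.
    """
    return Box(first_word.x, first_word.y, 0.0, first_word.height)


# Two words of a title, then a "# " prefix injected as an extra word.
title_words = [Box(0.25, 0.125, 0.25, 0.0625), Box(0.5625, 0.125, 0.1875, 0.0625)]
prefix_box = box_for_prefix(title_words[0])
line_box = union([prefix_box] + title_words)
```

With this choice, the box of the whole line is the same whether or not the prefix is included, which is the behaviour one would want when drawing boxes over the original page.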

Is there a known solution to this issue? Here are the Textractor function and config I'm using.

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.markdown_linearization_config import MarkdownLinearizationConfig


def textractor_func(s3_path):
    # Run an asynchronous Textract analysis on the document at the given S3 URL
    extractor = Textractor(region_name="eu-west-2")
    document = extractor.start_document_analysis(
        file_source=s3_path,  # S3 URL
        features=[
            TextractFeatures.LAYOUT,
            TextractFeatures.TABLES,
            TextractFeatures.SIGNATURES,
            TextractFeatures.FORMS,
        ],
        save_image=False,
    )

    # Markdown linearization settings: prefixes, suffixes and separators
    config = MarkdownLinearizationConfig(
        hide_figure_layout=False,
        title_prefix="# ",
        section_header_prefix="## ",
        table_remove_column_headers=True,
        table_flatten_headers=True,
        hide_page_num_layout=True,
        hide_footer_layout=False,
        hide_header_layout=False,
        hide_table_layout=False,
        table_tabulate_format="github",
        hide_key_value_layout=False,
        figure_layout_prefix="<image>",
        figure_layout_suffix="</image>",
        signature_token="[SIGNATURE]",
        table_linearization_format="markdown",
        same_paragraph_separator="\n",
        same_layout_element_separator="\n\n",
        add_prefixes_and_suffixes_as_words=True,
    )

    # Note: config is built but not applied in this snippet; the text output
    # is generated from the returned document (e.g. document.get_text(config=config))
    return document