fix: disable table_as_cells output by default #3093

Merged
merged 4 commits into from
May 24, 2024

@badGarnet badGarnet commented May 23, 2024

This PR changes the output of table elements: by default, a table element's metadata.table_as_cells is now None. The field is populated only when the environment variable EXTRACT_TABLE_AS_CELLS is set to true.

table_as_cells was originally designed for evaluating table extraction performance. The format is less readable than the table_as_html metadata, both for humans and for RAG consumption, so this data is not needed by default.

Since this output is meant for evaluation use, this PR uses an environment variable to control whether it is present in the partitioned results. This approach avoids adding a parameter to the partition function call. A new parameter would increase the complexity of the partition interface and add maintenance cost, since it would have to be passed down a long chain of function calls to where it is needed.
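To illustrate the trade-off described above, here is a minimal sketch of gating an optional metadata field on an environment variable. It is not the code from this PR: `env_flag` and `build_table_metadata` are hypothetical names introduced for illustration, and treating only the string "true" (case-insensitive) as enabled is an assumption.

```python
from __future__ import annotations

import os


def env_flag(name: str, default: bool = False) -> bool:
    """Return True when the environment variable `name` is set to "true".

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def build_table_metadata(html: str, cells: list[dict]) -> dict:
    """Hypothetical stand-in for the code that assembles table metadata.

    The flag is read at the point of use, so no caller up the chain
    needs an extra parameter.
    """
    return {
        "table_as_html": html,
        # Only keep the cell-level structure when explicitly requested.
        "table_as_cells": cells if env_flag("EXTRACT_TABLE_AS_CELLS") else None,
    }
```

Because the flag is looked up where the metadata is built, none of the intermediate functions in the call chain need to know about it.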

test

Run the following code snippet on main and on this PR:

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]

On the main branch, table_cells contains cell-structured data; on this branch, it is a list of None values.

However, if we first set the following in the terminal:

export EXTRACT_TABLE_AS_CELLS=true

and then run the same code again with this PR, table_cells contains actual data, the same as on the main branch.
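For evaluation scripts that prefer not to export the variable in the shell, the same toggle could be applied from Python for the duration of one call. The sketch below is an assumption-laden illustration: `temporary_env` is a hypothetical helper, and it only has the intended effect if the library reads EXTRACT_TABLE_AS_CELLS when partition runs rather than once at import time, which this PR description does not state.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable inside a `with` block, then restore it.

    Hypothetical helper, not part of unstructured.
    """
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state, removing the variable if it was unset.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended usage (requires unstructured to be installed):
#
# with temporary_env("EXTRACT_TABLE_AS_CELLS", "true"):
#     elements = partition(
#         "example-docs/layout-parser-paper-with-table.pdf",
#         strategy="hi_res",
#         skip_infer_table_types=[],
#     )
```

Restoring the previous value keeps an evaluation run from leaking the setting into later code in the same process.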

- now requires env EXTRACT_TABLE_AS_CELLS to be true to output
  table_as_cells in Table elements' metadata
… update (#3094)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: badGarnet <[email protected]>
@badGarnet badGarnet added this pull request to the merge queue May 24, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch May 24, 2024
@badGarnet badGarnet enabled auto-merge May 24, 2024 16:07
@badGarnet badGarnet added this pull request to the merge queue May 24, 2024
Merged via the queue into main with commit 32df4ee May 24, 2024
46 checks passed
@badGarnet badGarnet deleted the feat/default-to-not-output-table-cell-structure branch May 24, 2024 17:17