bug/chunk_by_title disregarding combine_text_under_n_chars #2699

Closed
georearl opened this issue Mar 27, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@georearl

Describe the bug
I am partitioning and then chunking an HTML file. The HTML has 12357 characters including spaces, but even with very large values for max_characters, combine_text_under_n_chars and new_after_n_chars, chunk_by_title still gives me 9 chunks.

To Reproduce
Provide a code snippet that reproduces the issue. Use an HTML based document, roughly 1 page in length with numerous titles, then partition and chunk as follows.

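from unstructured.chunking.title import chunk_by_title
from unstructured.partition.html import partition_html
elements = partition_html(file=bytes_io)  # bytes_io: binary file-like object holding the HTML
chunks = chunk_by_title(elements, multipage_sections=True, new_after_n_chars=100000, combine_text_under_n_chars=75000, max_characters=100000)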

Expected behavior
Given the parameter values I would expect a single chunk. This is just exploratory to understand an issue. In the real scenario I wouldn't use these values or desire a single chunk.

Screenshots
If applicable, add screenshots to help explain your problem.
image

Environment Info
Please run python scripts/collect_env.py and paste the output here.
This will help us understand more about the environment in which the bug occurred.

Additional context
Add any other context about the problem here.
We are using version 0.12.6

@georearl added the bug label Mar 27, 2024
@scanny
Collaborator

scanny commented Mar 27, 2024

@georearl A Table element is never combined with any other element (even another Table) to form a chunk. So where a Table appears in the element stream, the prior chunk will close, the table chunk will appear, and a new chunk will start after the table.

This is what we're seeing in the chunk stream; chunk, table, chunk, table, etc. Note that Table is a chunk-type, along with CompositeElement and TableChunk. So you could think of this as text-chunk, table-chunk, text-chunk, table-chunk, ...
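The alternating pattern described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the library's implementation: `Element` and `toy_chunk` are hypothetical names, and the model ignores title boundaries, `combine_text_under_n_chars` and `new_after_n_chars`. It only shows why a large `max_characters` cannot merge text across a table.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    kind: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def toy_chunk(elements, max_characters):
    """Toy chunker: text elements accumulate, every table stands alone."""
    chunks = []
    current = []

    def flush():
        # Close the open text chunk, if there is one.
        if current:
            chunks.append(("text-chunk", "\n\n".join(e.text for e in current)))
            current.clear()

    for el in elements:
        if el.kind == "Table":
            # A table never joins a neighbour, not even another table.
            flush()
            chunks.append(("table-chunk", el.text))
            continue
        size = sum(len(e.text) for e in current) + len(el.text)
        if current and size > max_characters:
            flush()
        current.append(el)
    flush()
    return chunks


stream = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Some text."),
    Element("Table", "a b c"),
    Element("NarrativeText", "More text."),
    Element("Table", "d e f"),
    Element("Table", "g h i"),
    Element("NarrativeText", "The end."),
]
chunks = toy_chunk(stream, max_characters=100000)
print([kind for kind, _ in chunks])
```

Even with a limit far larger than the document, the toy stream yields six chunks, because each of the three tables closes the chunk before it and starts a fresh one after it.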

Closing for now as no bug is evident here but feel free to ask any other questions about this if you need more clarification :)

@scanny closed this as completed Mar 27, 2024
@georearl
Author

Thank you. Really appreciate it. A possible follow-up question: if a file, say a Word file, contains a table that spans multiple pages, say 20 pages, and the headings aren't repeated on each page, do you chunk the table but pull forward the headings from page 1?

@scanny
Collaborator

scanny commented Mar 27, 2024

This is likely to be somewhat file-format dependent, but in the DOCX case you mentioned, the Table element would include the entire table, regardless of any page breaks Word placed to fit it on printable pages.

@LucasOliveira44

> @georearl A Table element is never combined with any other element (even another Table) to form a chunk. So where a Table appears in the element stream, the prior chunk will close, the table chunk will appear, and a new chunk will start after the table.
>
> This is what we're seeing in the chunk stream; chunk, table, chunk, table, etc. Note that Table is a chunk-type, along with CompositeElement and TableChunk. So you could think of this as text-chunk, table-chunk, text-chunk, table-chunk, ...
>
> Closing for now as no bug is evident here but feel free to ask any other questions about this if you need more clarification :)

Hi @scanny, is there a way to avoid this behaviour? I tried modifying the element type before chunking and removing text_as_html and table_as_cells from the metadata, but this didn't work. I guess there is some underlying table identifier in the metadata that I cannot see.

@scanny
Collaborator

scanny commented May 8, 2024

@LucasOliveira44 it would be better to open a new issue for this as it's not strictly related to the original question. If you add a new issue I'll see it and be happy to respond.

The short answer though is you'll need to change the element type. An element that returns True for isinstance(element, Table) will never "fit" in the prior pre-chunk so will never be combined:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/chunking/base.py#L875

You could potentially filter the element stream before passing it along to chunking and convert Table elements to, say, Text. I'm not sure whether that would produce the results you're after.
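A minimal sketch of that filtering step follows. The `Text` and `Table` classes here are hypothetical stand-ins so the example runs on its own; in real use they would presumably be the classes of the same names imported from `unstructured.documents.elements`, and `tables_to_text` is an illustrative helper, not a library function. The point it shows is that the replacement has to be a new element whose type is not `Table`, since the chunker's check is `isinstance(element, Table)`. The sketch assumes it is acceptable to drop the table's metadata, and the thread does not confirm that the resulting chunks are what you want.

```python
class Text:
    """Hypothetical stand-in for a plain text element."""

    def __init__(self, text):
        self.text = text


class Table(Text):
    """Hypothetical stand-in for a table element (a Text subclass here)."""


def tables_to_text(elements):
    """Return a new list in which every Table is rebuilt as a plain Text.

    Rebuilding matters: changing attributes on a Table instance would
    leave isinstance(element, Table) true.
    """
    converted = []
    for el in elements:
        if isinstance(el, Table):
            # Assumption: only the text is carried over, metadata is dropped.
            el = Text(text=el.text)
        converted.append(el)
    return converted


elements = [Text("Intro"), Table("a b c"), Text("Outro")]
filtered = tables_to_text(elements)
print([type(el).__name__ for el in filtered])
```

The filtered list would then be passed to chunk_by_title in place of the original element stream.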

But I'd like to discuss further because this isn't the first time it's come up. If you'll open a fresh issue I can say more.

@LucasOliveira44

Done @scanny you can find it here, #2990 (comment)
