
bug/chunk_by_title disregarding combine_text_under_n_chars #2699

Closed
@georearl

Describe the bug
I am partitioning and then chunking an HTML file. The HTML has 12357 characters including spaces, but even with very large values for
max_characters, combine_text_under_n_chars and new_after_n_chars, chunk_by_title still gives me 9 chunks.

To Reproduce
Use an HTML-based document, roughly one page in length with numerous titles, then partition and chunk as follows.

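from unstructured.chunking.title import chunk_by_title
from unstructured.partition.html import partition_html

elements = partition_html(file=bytes_io)  # bytes_io: file-like object containing the HTML
chunks = chunk_by_title(elements, multipage_sections=True, new_after_n_chars=100000, combine_text_under_n_chars=75000, max_characters=100000)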

Expected behavior
Given these parameter values, I would expect a single chunk. This is purely exploratory, to understand the issue; in the real scenario I wouldn't use these values or want a single chunk.
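
To make the expectation checkable, here is a minimal sketch of a helper that reports how many chunks came back, how long each one is, and whether the combined text fits under the combine threshold. The name summarize_chunks and the FakeChunk stand-in are hypothetical and not part of the library; the only assumption about real chunk objects is that they expose a `text` attribute. The summed chunk length may not match the raw HTML character count exactly, so treat the result as a rough check.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class HasText(Protocol):
    """Anything with a `text` attribute (assumed shape of a chunk object)."""
    text: str


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a chunk, used only for this demo."""
    text: str


def summarize_chunks(chunks: Iterable[HasText], combine_text_under_n_chars: int) -> dict:
    """Report chunk count, per-chunk lengths, and the single-chunk expectation."""
    lengths = [len(chunk.text) for chunk in chunks]
    total = sum(lengths)
    return {
        "count": len(lengths),
        "lengths": lengths,
        "total_chars": total,
        # Expectation stated in this report: if all text fits under the
        # combine threshold, a single chunk would be expected.
        "single_chunk_expected": total < combine_text_under_n_chars,
    }


if __name__ == "__main__":
    demo = [FakeChunk("Title A\n\nbody"), FakeChunk("Title B\n\nmore body")]
    print(summarize_chunks(demo, combine_text_under_n_chars=75000))
```

With the real output, something like summarize_chunks(chunks, 75000) would show a count of 9 alongside single_chunk_expected being True, which is the mismatch described here.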

Screenshots
image

Additional context
We are using unstructured version 0.12.6.

Labels: bug (Something isn't working)
