
Commit b6cf510

feat(chunking): repeat table headers on continuation chunks (#4298)
## Behavior summary

### Before

- Oversized table chunks only preserved headers in the first chunk; continuation chunks could lose column context.
- Table header semantics (`<thead>` / `<th>`) were not retained as explicit row-level metadata after compactification.

### After

- Added `repeat_table_headers` (default `True`) to chunking APIs and strategy plumbing:
  - `chunk_elements(..., repeat_table_headers=...)`
  - `chunk_by_title(..., repeat_table_headers=...)`
  - `add_chunking_strategy(...)` forwarded args/docs
- `_TableChunker` now detects contiguous leading header rows and repeats them on non-initial continuation chunks.
- Repeated header rows are prepended to both continuation chunk text and `text_as_html`.
- First chunk behavior remains unchanged relative to legacy output.
- Added a guardrail: if a repeated header row would consume more than half the chunk window, the splitter falls back to legacy non-repeating behavior.

## Invariants

- No body-row drop, duplication, or reordering across emitted continuation chunks.
- Opt-out behavior (`repeat_table_headers=False`) matches legacy table splitting behavior.
- Chunk windows still respect max-size constraints, including near-boundary continuation windows.
- Only contiguous leading header rows are repeated; later non-leading header-like rows are not promoted.

## Edge cases covered

- No headers, single leading header row, multiple leading header rows.
- Header detection from both `<thead>` and `<th>` rows.
- Exact-fit and near-boundary continuation sizing.
- Cascading repetition across 3+ continuation chunks.
- Pathologically large header rows trigger safe fallback to non-repeating behavior.
- Strategy-path forwarding validated through `partition_html(..., chunking_strategy="by_title")`.
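
The header-repetition rule and its guardrail can be illustrated with a small standalone sketch. This is not the code in `_TableChunker`: the function `split_table_rows`, its parameters, and the use of plain character counts for row size are all assumptions made for illustration, and rows are modeled as plain strings rather than HTML.

```python
from typing import List


def split_table_rows(
    rows: List[str],
    header_count: int,
    max_chars: int,
    repeat_table_headers: bool = True,
) -> List[List[str]]:
    """Hypothetical greedy splitter; row size is len(row), an assumption."""
    headers = rows[:header_count]  # contiguous leading header rows only
    body = rows[header_count:]
    header_size = sum(len(r) for r in headers)

    # Guardrail: if the headers would take more than half the window,
    # fall back to the legacy non-repeating behavior.
    repeat = bool(repeat_table_headers and headers and header_size * 2 <= max_chars)

    chunks: List[List[str]] = []
    current = list(headers)  # the first chunk always keeps its headers
    size = header_size
    has_body = False  # avoid emitting a chunk that holds headers only
    for row in body:
        if has_body and size + len(row) > max_chars:
            chunks.append(current)
            current = list(headers) if repeat else []
            size = header_size if repeat else 0
            has_body = False
        # A single oversized row is kept whole; this sketch never splits a row.
        current.append(row)
        size += len(row)
        has_body = True
    if current:
        chunks.append(current)
    return chunks


rows = ["Name|Qty", "a|1", "b|2", "c|3", "d|4"]
repeated = split_table_rows(rows, header_count=1, max_chars=16)
legacy = split_table_rows(rows, header_count=1, max_chars=16, repeat_table_headers=False)
fallback = split_table_rows(rows, header_count=1, max_chars=12)
```

With a 16-character window the 8-character header fits within half the window, so `repeated` carries the header into the second chunk, while `legacy` does not. At 12 characters the header exceeds half the window, so `fallback` behaves like the legacy path. In all three cases the body rows appear once and in order.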
## Test evidence

- `uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py` (6 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker"` (26 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers` (1 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers"` (5 passed)
- `uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers"` (4 passed)
- `uv run --no-sync pytest -q test_unstructured/common/test_html_table.py` (26 passed)

Authored by codex
1 parent 6360ef7 commit b6cf510

13 files changed

Lines changed: 1067 additions & 38 deletions


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.22.10
+### Enhancements
+- **Repeat table headers across continuation chunks**: Add `repeat_table_headers` to basic/title chunking options and table chunking internals so leading header rows are detected once and carried forward when large tables spill across multiple chunks.
+
 ## 0.22.9
 
 ### Enhancements
