Commit b6cf510
feat(chunking): repeat table headers on continuation chunks (#4298)
## Behavior summary
### Before
- Oversized table chunks only preserved headers in the first chunk;
continuation chunks could lose column context.
- Table header semantics (`<thead>` / `<th>`) were not retained as
explicit row-level metadata after compactification.
### After
- Added `repeat_table_headers` (default `True`) to chunking APIs and
strategy plumbing:
  - `chunk_elements(..., repeat_table_headers=...)`
  - `chunk_by_title(..., repeat_table_headers=...)`
  - `add_chunking_strategy(...)` forwarded args/docs
- `_TableChunker` now detects contiguous leading header rows and repeats
them on non-initial continuation chunks.
- Repeated header rows are prepended to both continuation chunk text and
`text_as_html`.
- First chunk behavior remains unchanged relative to legacy output.
- Added a guardrail: if the repeated header rows would consume more than
half the chunk window, the splitter falls back to legacy non-repeating
behavior.
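
The repetition and guardrail rules above can be sketched in plain Python. This is a toy model, not the `_TableChunker` implementation: rows are plain strings, chunk size is the character length of the rows joined by newlines, rows are never split mid-row, and `split_rows` with its parameters is a hypothetical name used only for illustration.

```python
from typing import List


def split_rows(
    header_rows: List[str],
    body_rows: List[str],
    max_characters: int,
    repeat_table_headers: bool = True,
) -> List[str]:
    """Toy splitter: pack rows into chunks, repeating headers on continuations.

    Assumption: a chunk is its rows joined by newlines, and its size is the
    character length of that text. A single oversized row is emitted as-is.
    """
    header_text = "\n".join(header_rows)
    # Guardrail: repeat headers only when they take at most half the window.
    repeat = (
        repeat_table_headers
        and bool(header_rows)
        and len(header_text) * 2 <= max_characters
    )

    chunks: List[str] = []
    current: List[str] = list(header_rows)  # the first chunk always keeps headers
    has_body = False
    for row in body_rows:
        candidate = "\n".join(current + [row])
        if len(candidate) > max_characters and has_body:
            # Close the current chunk and start a continuation chunk.
            chunks.append("\n".join(current))
            current = list(header_rows) if repeat else []
            has_body = False
        current.append(row)
        has_body = True
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With an 8-character header and a 16-character window the header is exactly half the window, so it is still repeated; shrinking the window to 14 characters trips the guardrail and continuation chunks carry body rows only.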
## Invariants
- No body-row drop, duplication, or reordering across emitted
continuation chunks.
- Opt-out behavior (`repeat_table_headers=False`) matches legacy table
splitting behavior.
- Chunk windows still respect max-size constraints, including
near-boundary continuation windows.
- Only contiguous leading header rows are repeated; later non-leading
header-like rows are not promoted.
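
The first invariant lends itself to a small property-style check. The sketch below is hypothetical (`body_rows_preserved` is not a helper from this commit): it models each chunk as a list of row strings, strips one leading copy of the header block where present, and compares the concatenated remainder with the original body rows. It assumes no body row sequence at the start of a chunk is identical to the header block.

```python
from typing import List, Sequence


def body_rows_preserved(
    chunks: Sequence[Sequence[str]],
    header_rows: Sequence[str],
    body_rows: Sequence[str],
) -> bool:
    """Return True when the chunks carry every body row exactly once, in order."""
    header = list(header_rows)
    recovered: List[str] = []
    for chunk in chunks:
        rows = list(chunk)
        # Drop one leading copy of the header block, if this chunk starts with it.
        if header and rows[: len(header)] == header:
            rows = rows[len(header):]
        recovered.extend(rows)
    # Any drop, duplication, or reordering makes the lists differ.
    return recovered == list(body_rows)
```

Because the header block is stripped only when present, the same check accepts both repeating output and the opt-out (`repeat_table_headers=False`) shape.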
## Edge cases covered
- No headers, single leading header row, multiple leading header rows.
- Header detection from both `<thead>` and `<th>` rows.
- Exact-fit and near-boundary continuation sizing.
- Cascading repetition across 3+ continuation chunks.
- Pathologically large header rows trigger safe fallback to
non-repeating behavior.
- Strategy-path forwarding validated through
`partition_html(..., chunking_strategy="by_title")`.
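
To make the header cases concrete, here is a minimal sketch of counting contiguous leading header rows with the standard-library `html.parser`. It is not the detection code from this commit. The rule it applies (a row is a header row if it sits inside `<thead>` or all of its cells are `<th>`) is an assumption about what the bullets above mean, and `count_leading_header_rows` is a hypothetical name. It also assumes explicit `</tr>` closing tags and no nested tables.

```python
from html.parser import HTMLParser
from typing import List


class _RowScanner(HTMLParser):
    """Records, for each <tr>, whether it looks like a header row."""

    def __init__(self) -> None:
        super().__init__()
        self.row_is_header: List[bool] = []
        self._in_thead = False
        self._in_row = False
        self._row_in_thead = False
        self._cell_tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr":
            self._in_row = True
            self._row_in_thead = self._in_thead
            self._cell_tags = []
        elif tag in ("th", "td") and self._in_row:
            self._cell_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "tr" and self._in_row:
            all_th = bool(self._cell_tags) and all(t == "th" for t in self._cell_tags)
            self.row_is_header.append(self._row_in_thead or all_th)
            self._in_row = False


def count_leading_header_rows(table_html: str) -> int:
    """Count header rows at the top of the table, stopping at the first body row."""
    scanner = _RowScanner()
    scanner.feed(table_html)
    scanner.close()
    count = 0
    for is_header in scanner.row_is_header:
        if not is_header:
            break  # later header-like rows are not promoted
        count += 1
    return count
```

The early `break` is what keeps a header-like row further down the table from being counted.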
## Test evidence
- `uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py` (6 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker"` (26 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers` (1 passed)
- `uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers"` (5 passed)
- `uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers"` (4 passed)
- `uv run --no-sync pytest -q test_unstructured/common/test_html_table.py` (26 passed)
authored by codex1 parent 6360ef7 commit b6cf510
13 files changed
Lines changed: 1067 additions & 38 deletions
File tree
- test_unstructured
  - chunking
  - common
- unstructured
  - chunking
  - common
  - documents
  - staging