feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) #4325

Open
statxc wants to merge 2 commits into Unstructured-IO:main from statxc:statxc/feat-pdf-heading-hierarchy

statxc commented Apr 7, 2026

Summary

  • Add two-strategy heading level inference for PDF Title elements via category_depth metadata
    • Outline extraction (primary): walks PDF bookmark tree and matches entries to Title elements by page number + text similarity
    • Font-size analysis (fallback): clusters distinct font sizes from pdfminer LTChar data, ranks largest-first to assign depth 0-5
  • Integrates as a post-processing step in partition_pdf_or_image() and works with all strategies (fast, hi_res, ocr_only)
  • Skipped for image partitioning, where neither a bookmark tree nor pdfminer font data is available
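
For readers skimming the thread, here is a minimal sketch of the two strategies described above. It is not the code in this PR: the `Title` class, the function names, the 0.8 similarity threshold, rounding font sizes to 0.1 pt as a stand-in for clustering, and the per-title fallback are all assumptions made for illustration. The outline is assumed to be already flattened into `(depth, page_number, text)` tuples.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

MAX_DEPTH = 5  # depths 0-5 correspond to H1-H6


@dataclass
class Title:
    """Hypothetical stand-in for a partitioned PDF Title element."""
    text: str
    page_number: int
    font_size: float
    category_depth: Optional[int] = None


def depth_from_outline(title, outline, min_ratio=0.8):
    """Return the depth of the best-matching outline entry on the same page, or None.

    `outline` is a list of (depth, page_number, text) tuples (assumed format).
    """
    best_depth, best_ratio = None, min_ratio
    for depth, page, text in outline:
        if page != title.page_number:
            continue
        ratio = SequenceMatcher(None, text.lower(), title.text.lower()).ratio()
        if ratio >= best_ratio:
            best_depth, best_ratio = min(depth, MAX_DEPTH), ratio
    return best_depth


def depths_from_font_sizes(titles):
    """Rank distinct font sizes largest-first and map each size to a depth 0-5."""
    sizes = sorted({round(t.font_size, 1) for t in titles}, reverse=True)
    return {size: min(rank, MAX_DEPTH) for rank, size in enumerate(sizes)}


def infer_heading_depths(titles, outline):
    """Use the outline where it matches; otherwise fall back to the font-size rank."""
    size_to_depth = depths_from_font_sizes(titles)
    for t in titles:
        depth = depth_from_outline(t, outline) if outline else None
        if depth is None:
            depth = size_to_depth[round(t.font_size, 1)]
        t.category_depth = depth
    return titles


titles = [
    Title("Introduction", 1, 18.0),
    Title("Background", 1, 14.0),
    Title("Details", 2, 12.0),
]
infer_heading_depths(titles, outline=[(0, 1, "Introduction")])
print([t.category_depth for t in titles])  # [0, 1, 2]
```

In this sketch only "Introduction" matches an outline entry; the other two titles receive their depth from the font-size ranking. Whether the real implementation falls back per title or per document is not stated in the summary.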

Test plan

  • Existing test_document_to_element_list_sets_category_depth_titles passes unchanged
  • 154 existing PDF tests pass (1 pre-existing, unrelated OCR language failure)

Closes #4204

statxc commented Apr 10, 2026

Hi @PastelStorm @cragwolfe,
Could you please review my PR?
Please let me know if anything else needs updating.
Thanks.

@codebymikey

There are currently some merge conflicts that need resolving.

statxc commented Apr 22, 2026

@PastelStorm @codebymikey
Could you review this PR when you have time? I'd appreciate any feedback. Thanks.

@codebymikey

Hi @statxc,

I'm not sure I'll be able to review this effectively, as I'm not a maintainer of the project and not familiar enough with it to comment. I'm merely a potential end user who pointed out the missing functionality.

Did you implement this yourself or with help from AI? One problem with the previous PR that tried to address this was that it was hard to review: its author wasn't fully aware of what the code actually did and just opened PRs based on the tool's output.

Also, based on her GitHub activity, I believe @PastelStorm might be away for a while.

If you feel your PR addresses the issue to a sufficiently high standard, then feel free to ping some of the more recently active maintainers/committers for review.

statxc commented Apr 23, 2026

@badGarnet @cragwolfe Hi, sorry for pinging so many people, but this PR has been open for quite a while. I'd appreciate any feedback you can give me.

Development

Successfully merging this pull request may close these issues.

feat/Infer the hierarchical heading/title levels such as H1, H2, H3, H4 for PDFs
