feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) #4325
statxc wants to merge 2 commits into Unstructured-IO:main
Conversation
Hi, @PastelStorm @cragwolfe
There are currently some merge conflicts that need resolving.
@PastelStorm @codebymikey
Hi @statxc, I'm not sure I'll be able to review this effectively, as I'm neither a maintainer of the project nor familiar enough with it to comment. I'm merely a potential end user who pointed out the missing functionality. Did you implement this yourself or with help from AI? One of the issues with the previous PR that tried to address this was that it was hard to review: the original author wasn't fully aware of what the code actually did and just made PRs based on the output of the tool. Also, based on her GitHub activity, I believe @PastelStorm might be away for a while. If you feel your PR addresses the issue to a sufficiently high standard, feel free to ping some of the more recently active maintainers/committers for review.
@badGarnet @cragwolfe Hi, sorry for pinging so many people, but this PR has been open for quite a while. I'd appreciate any feedback you can give.
Summary
- Infers heading levels for `Title` elements via `category_depth` metadata
- Uses `LTChar` data, ranks largest-first to assign depth 0-5
- Hooks into `partition_pdf_or_image()`, works with all strategies (`fast`, `hi_res`, `ocr_only`)

Test plan
- `test_document_to_element_list_sets_category_depth_titles` passes unchanged

Closes #4204
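
For reviewers, the largest-first ranking mentioned in the summary could be sketched roughly as below. This is a minimal illustration and not the code in this PR: the function name `assign_heading_depths` and the `MAX_DEPTH` constant are hypothetical, and it assumes the ranked quantity is one representative font size per title (such as could be read from pdfminer `LTChar` objects).

```python
from typing import List

# Assumption: depth 0-5 corresponds to H1-H6, so deeper ranks are clamped to 5.
MAX_DEPTH = 5


def assign_heading_depths(title_font_sizes: List[float]) -> List[int]:
    """Return one depth per title; the largest font size gets depth 0.

    Hypothetical helper: ranks the distinct sizes in descending order and
    maps each title's size to the rank of that size, capped at MAX_DEPTH.
    """
    # Distinct sizes, largest first, e.g. [24.0, 18.0, 12.0].
    distinct_sizes = sorted(set(title_font_sizes), reverse=True)
    # Rank 0 for the largest size, 1 for the next, and so on, capped at 5.
    depth_by_size = {
        size: min(rank, MAX_DEPTH) for rank, size in enumerate(distinct_sizes)
    }
    # Preserve document order: one depth per input title.
    return [depth_by_size[size] for size in title_font_sizes]


if __name__ == "__main__":
    # Four titles in document order with their (assumed) font sizes.
    print(assign_heading_depths([24.0, 18.0, 24.0, 12.0]))
```

A real implementation would likely need to tolerate small floating-point differences between nominally equal sizes (for example by rounding before ranking); the sketch skips that for brevity.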