Support for marked content section IDs #961

Merged
merged 6 commits into from
Aug 19, 2023


dhdaines
Contributor

@dhdaines dhdaines commented Aug 9, 2023

As requested, this is the MCID part of #937 split out. Structure tree support (using pdfminer.six) will be a separate PR.

@codecov

codecov bot commented Aug 9, 2023

Codecov Report

Merging #961 (8b5b6a3) into develop (d8b9c15) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##           develop      #961   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1613   +25     
=========================================
+ Hits          1588      1613   +25     

Files Changed         Coverage Δ
pdfplumber/page.py    100.00% <100.00%> (ø)

@dhdaines
Contributor Author

dhdaines commented Aug 9, 2023

Note! This PR only extracts marked-content identifiers for sequences of objects. There is one kind of marked content that exists in PDF which it doesn't handle:

  • marked-content sequences with tags (and no identifiers) - DONE! These are now supported by this PR.
  • marked-content points - These are marked points (with a tag and possible attributes) in the content stream which don't correspond to any given object. It isn't clear how this could be supported in pdfplumber.
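
To make the distinction above concrete, here is a toy sketch of how marked-content operators in a content stream relate to the objects drawn between them. It is not the code in this PR or in pdfminer.six: the `annotate` function and the tuple-based operator list are hypothetical, and nesting is simplified so that the innermost open sequence wins. The operator names (`BMC`, `BDC`, `EMC`, `MP`, `DP`) are the ones defined by the PDF specification.

```python
def annotate(ops):
    """Attach the enclosing marked-content tag/MCID to each drawn object.

    `ops` is a simplified, hypothetical content stream: a list of tuples
    of the form (operator, *operands).
    """
    stack = []    # open marked-content sequences, innermost last
    objects = []  # drawing operations, annotated with tag and mcid
    points = []   # marked-content points, which enclose no object

    for op, *args in ops:
        if op == "BMC":
            # Sequence with a tag but no property list (so no MCID).
            stack.append({"tag": args[0], "mcid": None})
        elif op == "BDC":
            # Sequence with a tag and a property dict that may hold an MCID.
            tag, props = args
            stack.append({"tag": tag, "mcid": props.get("MCID")})
        elif op == "EMC":
            stack.pop()
        elif op in ("MP", "DP"):
            # A point marks a position in the stream, not an object.
            points.append(args[0])
        else:
            current = stack[-1] if stack else {"tag": None, "mcid": None}
            objects.append(
                {"operator": op, "operands": args,
                 "tag": current["tag"], "mcid": current["mcid"]}
            )
    return objects, points


example_ops = [
    ("BDC", "P", {"MCID": 0}), ("Tj", "Hello"), ("EMC",),
    ("BMC", "Artifact"), ("Tj", "Page 1"), ("EMC",),
    ("MP", "Marker"),
    ("Tj", "untagged"),
]
example_objects, example_points = annotate(example_ops)
```

In this model every drawn object inherits a tag and MCID from the sequence around it, while a point ends up in a separate list with nothing to attach to, which is why points are hard to expose as attributes on objects.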

@jsvine jsvine merged commit 142fc90 into jsvine:develop Aug 19, 2023
7 checks passed
@jsvine
Owner

jsvine commented Aug 19, 2023

Many thanks for this, @dhdaines! It's a clever solution, and it adds what looks to be a powerful feature for people working with PDFs that have marked content.

For now, I'm going to mark mcid and tag in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible prove stable.
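
As a rough illustration of how the `mcid` and `tag` attributes might be used, the sketch below groups character dicts by marked-content section. The helper name `text_by_mcid` is hypothetical, and the input is assumed to look like pdfplumber's `page.chars` (dicts with `text`, `mcid` and `tag` keys, where `mcid` is `None` for characters outside any identified section). Since both attributes are experimental, treat this as a sketch rather than a stable recipe.

```python
from collections import defaultdict


def text_by_mcid(chars):
    """Join character text per (mcid, tag) pair, skipping chars without an MCID."""
    groups = defaultdict(list)
    for ch in chars:
        mcid = ch.get("mcid")
        if mcid is None:
            continue  # not inside a marked-content section with an identifier
        groups[(mcid, ch.get("tag"))].append(ch["text"])
    return {key: "".join(parts) for key, parts in groups.items()}


# With a real tagged PDF (the file name here is a placeholder), the input
# would come from pdfplumber:
#
#     import pdfplumber
#     with pdfplumber.open("tagged.pdf") as pdf:
#         sections = text_by_mcid(pdf.pages[0].chars)
```

Keying on the `(mcid, tag)` pair keeps the tag visible in the result without a second lookup. MCIDs are only unique within a page, so results from different pages should not be merged.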

@dhdaines
Contributor Author

Thank you! I will submit another PR soon to add the tag attributes, as these are useful for identifying headers and footers.
