Support for marked content section IDs #961

dhdaines · 2023-08-09T19:59:10Z

As requested, this is the MCID part of #937 split out. Structure tree support (using pdfminer.six) will be a separate PR.

codecov · 2023-08-09T20:04:18Z

Codecov Report

Merging #961 (8b5b6a3) into develop (d8b9c15) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##           develop      #961   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1613   +25     
=========================================
+ Hits          1588      1613   +25

| Files Changed | Coverage Δ |
|---|---|
| pdfplumber/page.py | `100.00% <100.00%> (ø)` |

dhdaines · 2023-08-09T23:09:36Z

Note! This PR only extracts marked-content identifiers for sequences of objects. There ~~are a few other kinds~~ is one kind of marked content that exists in PDF which it doesn't handle:

- ~~marked-content sequences with tags (and no identifiers) - This PR will be fixed ASAP to support this~~ DONE!
- marked-content points - These are marked points (with a tag and possible attributes) in the content stream which don't correspond to any given object. It isn't clear how this could be supported in pdfplumber.
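To illustrate what the identifiers are useful for, here is a minimal sketch that groups objects by the marked-content sequence they belong to. It assumes each object is a dict carrying the `mcid` and `tag` keys this PR adds; `group_by_mcid` and the sample data are hypothetical and not part of pdfplumber.

```python
from collections import defaultdict


def group_by_mcid(objects):
    """Group object dicts by (tag, mcid).

    Objects whose mcid is missing or None (for example, content that is
    not inside an identified marked-content sequence) are skipped.
    """
    groups = defaultdict(list)
    for obj in objects:
        mcid = obj.get("mcid")
        if mcid is None:
            continue
        groups[(obj.get("tag"), mcid)].append(obj)
    return dict(groups)


# Hypothetical sample data shaped like pdfplumber char dicts.
chars = [
    {"text": "H", "mcid": 0, "tag": "H1"},
    {"text": "i", "mcid": 0, "tag": "H1"},
    {"text": "x", "mcid": 1, "tag": "P"},
    {"text": "1", "mcid": None, "tag": "Artifact"},
]

grouped = group_by_mcid(chars)
# Join the characters of each sequence back into a string.
texts = {key: "".join(c["text"] for c in objs) for key, objs in grouped.items()}
```

With a real tagged PDF, `chars` would presumably come from something like `pdfplumber.open(path).pages[0].chars` instead of the hand-written list.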

jsvine · 2023-08-19T15:31:52Z

Many thanks for this, @dhdaines! It's a clever solution, and adds what seems like it will be a powerful feature for people working with PDFs that have marked content.

For now, I'm going to mark `mcid` and `tag` in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible prove stable.

dhdaines · 2023-08-19T15:38:24Z

> Many thanks for this, @dhdaines! It's a clever solution, and adds what seems like it will be a powerful feature for people working with PDFs that have marked content.
>
> For now, I'm going to mark `mcid` and `tag` in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible prove stable.

Thank you! I will submit another PR soon to add the tag attributes, as these are useful for identifying headers and footers.
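Even before tag attributes are available, the `tag` value alone may help separate page furniture from body content. The sketch below assumes that such content is wrapped in `Artifact` marked-content sequences and that objects are dicts with a `tag` key; `split_artifacts` and the sample data are hypothetical, and telling headers from footers specifically would still need the attributes.

```python
def split_artifacts(objects):
    """Split object dicts into (content, artifacts) based on the tag key."""
    content, artifacts = [], []
    for obj in objects:
        if obj.get("tag") == "Artifact":
            artifacts.append(obj)
        else:
            content.append(obj)
    return content, artifacts


# Hypothetical sample data shaped like pdfplumber char dicts.
sample = [
    {"text": "A", "mcid": 3, "tag": "P"},
    {"text": "7", "mcid": None, "tag": "Artifact"},
    {"text": "B", "mcid": 3, "tag": "P"},
]

content, artifacts = split_artifacts(sample)
body_text = "".join(obj["text"] for obj in content)
```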

feat: Extract marked content IDs for all objects

c0b03ee

dhdaines mentioned this pull request Aug 9, 2023

Add support for structure tree and marked content sections #937

Closed

dhdaines added 3 commits August 9, 2023 19:45

feat: add tag names as well as mcids

bb9b3db

docs: update README

885fe4f

test: check that lines/curves are in Figure

1c5c0d0

This was referenced Aug 10, 2023

Support for PDF 1.3 logical structure #963

Merged

Add --structure-text flag to CLI (like pdfinfo -struct-text but better) #967

Closed

dhdaines and others added 2 commits August 17, 2023 16:38

fix: handle tags without attributes/MCIDs (e.g. Artifact)

c47995c

Merge branch 'develop' into add_mcids

8b5b6a3

jsvine merged commit 142fc90 into jsvine:develop Aug 19, 2023
7 checks passed

dhdaines deleted the add_mcids branch September 5, 2023 20:37

dhdaines mentioned this pull request Oct 13, 2023

Future Road Map pdfminer/pdfminer.six#154

Open

jsvine mentioned this pull request Nov 9, 2023

Accessibility tagging #909

Open
