
Commit 6ada488

qued and claude authored
fix: pdfminer drops extractable text (#4310)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Changes pdfminer integration to override CID font/CMap handling and introduces custom stream decoding/parsing, which can affect text extraction behavior and performance on diverse PDFs (mitigated by size/mapping caps).
>
> **Overview**
> Fixes PDFs where **body text was silently dropped** because CIDFonts used an *embedded Encoding CMap stream* that `pdfminer.six` doesn’t resolve.
>
> Adds a bounded embedded-CMap decoder/parser and wires it in via `CustomPDFCIDFont` + `CustomPDFResourceManager` so `init_pdfminer()` constructs CID fonts with a parsed CMap (including `WMode`), with DoS-oriented caps on decompression and total mappings.
>
> Updates tests with a new fixture-driven regression for both `FAST` and `HI_RES` strategies plus targeted unit tests for CMap parsing/stream decoding, and bumps version to `0.22.12` with a changelog entry.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 4326b15. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
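The "DoS-oriented caps on decompression" mentioned in the overview can be illustrated with a minimal sketch. This is not the code from this commit: `bounded_inflate` and `MAX_DECOMPRESSED_BYTES` are hypothetical names, the cap value is an assumption, and only FlateDecode (zlib) streams are considered.

```python
import zlib
from typing import Optional

# Hypothetical cap; the value actually used in the commit is not shown on this page.
MAX_DECOMPRESSED_BYTES = 1_000_000


def bounded_inflate(data: bytes, limit: int = MAX_DECOMPRESSED_BYTES) -> Optional[bytes]:
    """Inflate a zlib stream, returning None if it is corrupt or exceeds `limit` bytes."""
    decompressor = zlib.decompressobj()
    try:
        # Ask for at most limit + 1 bytes so an oversized stream is detected
        # without ever materializing more than one byte past the cap.
        out = decompressor.decompress(data, limit + 1)
    except zlib.error:
        return None
    if len(out) > limit:
        return None
    return out
```

Returning `None` instead of raising lets a caller fall back to the library's default behavior when the embedded stream is unusable; whether the commit takes that approach is likewise an assumption.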
1 parent a3172f8 commit 6ada488

6 files changed

Lines changed: 652 additions & 4 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+## 0.22.12
+
+### Fixes
+- **Fix fast strategy silently skipping text in some PDFs**: Certain PDF generators (e.g. Prince XML) embed font encoding data in a non-standard way that pdfminer.six does not handle, causing body text to be silently dropped while headings still extract correctly. Added a workaround that reads the embedded encoding data directly.
+
 ## 0.22.11
 
 ### Enhancements
2.08 KB
Binary file not shown.

test_unstructured/partition/pdf_image/test_pdf.py

Lines changed: 38 additions & 0 deletions
@@ -477,6 +477,44 @@ def test_partition_pdf_with_fast_strategy_deduplicates_fake_bold(monkeypatch):
     )
 
 
+def test_partition_pdf_with_fast_strategy_extracts_embedded_cmap_text():
+    """Test that fast strategy extracts text from CIDFonts with embedded CMap streams.
+
+    Some PDF generators (e.g. Prince XML) embed custom Encoding CMaps as PDF streams
+    rather than using predefined CMap names. Without handling this, pdfminer.six silently
+    falls back to an empty CMap and all text using those fonts is lost.
+
+    The test fixture has two fonts: a simple Type1 font (Helvetica) that pdfminer handles
+    fine, and a Type0/CIDFont with an embedded CMap named "Test-Identity-H" that triggers
+    the bug.
+    """
+    filename = example_doc_path("pdf/embedded-cmap-cidfont.pdf")
+    elements = pdf.partition_pdf(filename=filename, url=None, strategy=PartitionStrategy.FAST)
+
+    all_text = " ".join(e.text for e in elements)
+
+    # The Helvetica heading should always be extracted
+    assert "Heading in Helvetica" in all_text
+
+    # These strings are rendered with the CIDFont using the embedded CMap.
+    # Without the fix, they would be silently dropped.
+    assert "This text uses an embedded CMap" in all_text
+    assert "and should be extractable" in all_text
+
+    assert len(elements) == 3
+
+
+def test_partition_pdf_with_hi_res_strategy_extracts_embedded_cmap_text():
+    """Same as the fast strategy test but through hi_res, since both strategies use pdfminer."""
+    filename = example_doc_path("pdf/embedded-cmap-cidfont.pdf")
+    elements = pdf.partition_pdf(filename=filename, url=None, strategy=PartitionStrategy.HI_RES)
+
+    all_text = " ".join(e.text for e in elements)
+
+    assert "This text uses an embedded CMap" in all_text
+    assert "and should be extractable" in all_text
+
+
 def test_partition_pdf_raises_with_bad_strategy():
     filename = example_doc_path("pdf/layout-parser-paper-fast.pdf")
     with pytest.raises(ValueError):
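
The test docstring above refers to an embedded Encoding CMap stream. As a rough illustration of what parsing one involves, here is a sketch that reads `begincidchar`/`begincidrange` blocks into a code-to-CID table under a mapping cap. It is not the parser added in this commit: `parse_encoding_cmap` and `MAX_MAPPINGS` are hypothetical names, the cap value is an assumption, and the sketch ignores `usecmap`, codespace ranges, and `WMode`.

```python
import re

# Hypothetical cap on total mappings; the commit's real limit is not shown here.
MAX_MAPPINGS = 65_536

_BLOCK = re.compile(rb"begin(cidchar|cidrange)(.*?)end\1", re.DOTALL)
_CIDCHAR = re.compile(rb"<([0-9A-Fa-f]+)>\s+(\d+)")
_CIDRANGE = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s+(\d+)")


def parse_encoding_cmap(data: bytes, max_mappings: int = MAX_MAPPINGS) -> dict[bytes, int]:
    """Map character-code bytes to CIDs, raising ValueError past `max_mappings`."""
    code_to_cid: dict[bytes, int] = {}
    for kind, body in _BLOCK.findall(data):
        if kind == b"cidchar":
            # Each entry is a single <code> followed by its CID.
            entries = [(lo, lo, cid) for lo, cid in _CIDCHAR.findall(body)]
        else:
            # Each entry is <lo> <hi> start_cid, covering a contiguous run of codes.
            entries = _CIDRANGE.findall(body)
        for lo_hex, hi_hex, cid in entries:
            lo, hi, start = int(lo_hex, 16), int(hi_hex, 16), int(cid)
            # Reject oversized or inverted ranges before looping over them.
            if hi < lo or hi - lo + 1 > max_mappings - len(code_to_cid):
                raise ValueError("CMap exceeds mapping cap or has an invalid range")
            width = len(lo_hex) // 2  # assumes an even number of hex digits
            for offset in range(hi - lo + 1):
                code_to_cid[(lo + offset).to_bytes(width, "big")] = start + offset
    return code_to_cid
```

Checking the range size before expanding it keeps a malicious `<0000> <FFFFFFFF>` entry from forcing a multi-billion-iteration loop, which is the kind of input a cap on total mappings is meant to stop.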
