Commit 6ada488
fix: pdfminer drops extractable text (#4310)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Changes pdfminer integration to override CID font/CMap handling and
introduces custom stream decoding/parsing, which can affect text
extraction behavior and performance on diverse PDFs (mitigated by
size/mapping caps).
>
> **Overview**
> Fixes PDFs where **body text was silently dropped** because CIDFonts
used an *embedded Encoding CMap stream* that `pdfminer.six` doesn’t
resolve.
>
> Adds a bounded embedded-CMap decoder/parser and wires it in via
`CustomPDFCIDFont` + `CustomPDFResourceManager` so `init_pdfminer()`
constructs CID fonts with a parsed CMap (including `WMode`), with
DoS-oriented caps on decompression and total mappings.
>
> Updates tests with a new fixture-driven regression for both `FAST` and
`HI_RES` strategies plus targeted unit tests for CMap parsing/stream
decoding, and bumps version to `0.22.12` with a changelog entry.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
4326b15. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
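
The summary's "DoS-oriented caps on decompression and total mappings" can be illustrated with a minimal sketch. This is not the code from this commit: the function names, the cap values, and the restriction to `begincidrange` blocks are all assumptions made for illustration, and only the standard library is used.

```python
from __future__ import annotations

import re
import zlib

# Hypothetical caps; the limits used by the commit are not shown in this page.
MAX_DECOMPRESSED_BYTES = 1_000_000
MAX_MAPPINGS = 100_000


def bounded_inflate(data: bytes, limit: int = MAX_DECOMPRESSED_BYTES) -> bytes:
    """Inflate a zlib (FlateDecode-style) stream, refusing output above `limit`."""
    decomp = zlib.decompressobj()
    # Ask for at most limit + 1 bytes so an oversized stream is detected
    # without ever materialising the whole output.
    out = decomp.decompress(data, limit + 1)
    if len(out) > limit or decomp.unconsumed_tail:
        raise ValueError("decompressed CMap exceeds size cap")
    return out


_RANGE_BLOCK = re.compile(rb"begincidrange(.*?)endcidrange", re.S)
_RANGE_ENTRY = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(\d+)")


def parse_cid_ranges(cmap: bytes, max_mappings: int = MAX_MAPPINGS) -> dict[int, int]:
    """Map character codes to CIDs from `begincidrange` blocks, with a total cap."""
    mapping: dict[int, int] = {}
    for block in _RANGE_BLOCK.finditer(cmap):
        for lo, hi, cid in _RANGE_ENTRY.findall(block.group(1)):
            start, end, base = int(lo, 16), int(hi, 16), int(cid)
            if end < start:
                continue  # skip malformed ranges
            # Check the cap before expanding, so a huge range is rejected cheaply.
            if len(mapping) + (end - start + 1) > max_mappings:
                raise ValueError("CMap exceeds mapping cap")
            for code in range(start, end + 1):
                mapping[code] = base + (code - start)
    return mapping
```

Checking the range size before expanding it keeps the rejection cost constant even when a malicious CMap declares a very wide range.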
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a3172f8 commit 6ada488
6 files changed
Lines changed: 652 additions & 4 deletions
File tree
- example-docs/pdf
- test_unstructured/partition/pdf_image
- unstructured
- partition/pdf_image
Binary file not shown.