Working with PDFPlumber and Multiprocessing #1007
Replies: 3 comments 1 reply
-
Hi @wadeflash12, and thanks for your interest in pdfplumber.
-
Thanks for the reply @jsvine, will update what I find. I have a couple of questions on the pdfplumber side. When we instantiate the … Just trying to understand how pdfplumber accesses the PDF.
-
@wadeflash12 - How did your experiments with multiprocessing go? I'm looking to do the same, so I wanted to ask if you have any observations to share. Cheers.
-
Hi! Thanks for the great library, @jsvine.
I have a use case that involves processing hundreds of PDFs for downstream tasks, and to speed up my text/table extraction and fully utilize my available compute I am using multiprocessing. The files will reside on S3 or similar.
This question revolves around I/O and understanding how pdfplumber and its PDF class load the PDF data. The suggestions to clear the page cache and LRU cache are well received and will be used.
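For what it's worth, a minimal single-process sketch of that cache-clearing pattern, assuming a pdfplumber version that provides `Page.flush_cache()` and that `path` points at a local copy of the file (the S3 download step is out of scope):

```python
def extract_all_text(path):
    """Sequential extraction with per-page cache clearing (a sketch)."""
    import pdfplumber  # assumes a recent pdfplumber with Page.flush_cache()
    chunks = []
    with pdfplumber.open(path) as pdf:  # one open() per file
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            page.flush_cache()  # drop this page's cached layout objects
    return "\n".join(chunks)
```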
I have a fairly novice question. Is it better to:

1. Instantiate `pdfplumber.open("file.pdf")` for every single page of a PDF (each page gets processed on a separate core), or
2. Process all the pages of a PDF within one core and call `pdfplumber.open("file.pdf")` once per PDF (I process N = available CPU cores PDFs at once)?

In my view, if pdfplumber loads the PDF data on every `pdfplumber.open("file.pdf")` call, then it is better to process an entire PDF within a single core and avoid the memory overhead. This especially comes into play when large PDFs are being processed.