Working with PDFPlumber and Multiprocessing #1007
Replies: 3 comments 1 reply
-
Hi @wadeflash12, and thanks for your interest in pdfplumber.
-
Thanks for the reply @jsvine, will update what I find. I have a couple of questions on the pdfplumber side. When we instantiate the … Just trying to understand how pdfplumber accesses the PDF.
-
@wadeflash12 - How did your experiments with multiprocessing go? I'm looking to do the same, so I wanted to ask if you have any observations to share. Cheers.
-
Hi! Thanks for the great library, @jsvine.
I have a use case that involves processing hundreds of PDFs for downstream tasks, and to speed up my text/table extraction and fully utilize my available compute I am using multiprocessing. The files will reside on S3 or similar.
This question revolves around I/O and understanding how pdfplumber and its PDF class load the PDF data. The suggestions to clear the page cache and LRU cache are well received and will be used.
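For what it's worth, a minimal single-process sketch of that cache-clearing pattern, assuming a pdfplumber version that provides `Page.flush_cache()` and that `path` points at a local copy of the file (the S3 download step is out of scope):

```python
def extract_all_text(path):
    """Sequential extraction with per-page cache clearing (a sketch)."""
    import pdfplumber  # assumes a recent pdfplumber with Page.flush_cache()
    chunks = []
    with pdfplumber.open(path) as pdf:  # one open() per file
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            page.flush_cache()  # drop this page's cached layout objects
    return "\n".join(chunks)
```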
I have a fairly novice question. Is it better to:

1. Instantiate `pdfplumber.open("file.pdf")` for every single page of a PDF (each page gets processed on a separate core), or
2. Process all the pages of a PDF within one core and call `pdfplumber.open("file.pdf")` once per PDF (I process N = available CPU cores PDFs at once)?

In my view, if pdfplumber loads the PDF data on every `pdfplumber.open("file.pdf")` call, then it is better to process an entire PDF within a single core and avoid the memory overhead. This especially comes into play when large PDFs are being processed.