Functions that can be multi-threaded - Enhancement to documentation #995
Comments
Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.
I was able to use multi-threading no problem :) You need to use
Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?
Here's a small example I put together. It may not run off-the-bat, but should provide a general idea:

from asyncio import gather, ensure_future, get_event_loop, run

import pdfplumber

async def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed

async def main():
    pdf = pdfplumber.open("test.pdf")
    loop = get_event_loop()
    futures = []
    for pg_idx in range(len(pdf.pages)):
        page = pdf.pages[pg_idx]
        futures.append(ensure_future(process_page(page), loop=loop))
    await gather(*futures)

if __name__ == "__main__":
    run(main())

I found this approach to be much faster than using a

from concurrent.futures import ThreadPoolExecutor, as_completed
from asyncio import run

import pdfplumber

def process_page(page):
    # Plain function: executor.submit() calls it in a worker thread. If this
    # were "async def", submit() would only create an un-awaited coroutine.
    processed = page.extract_tables()
    # do other stuff with page
    return processed

async def main():
    pdf = pdfplumber.open("test.pdf")
    futures = []
    with ThreadPoolExecutor() as executor:
        for pg_idx in range(len(pdf.pages)):
            page = pdf.pages[pg_idx]
            futures.append(executor.submit(process_page, page))
        for res in as_completed(futures):
            processed = res.result()
            # do something with processed

if __name__ == "__main__":
    run(main())
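
A note on the two snippets above, as a minimal sketch rather than a statement about pdfplumber itself: a coroutine that never awaits anything runs to completion on the event loop, so gathering such coroutines does not by itself run them in parallel, and handing an "async def" function to a thread pool only produces a coroutine object. The example below uses only the standard library. `blocking_work` is a hypothetical stand-in for a blocking call such as `page.extract_tables()`, and all function names here are illustrative assumptions.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_work(n):
    """Hypothetical stand-in for a blocking call such as page.extract_tables()."""
    time.sleep(0.05)
    return n * n, threading.current_thread().name


async def as_coroutine(n):
    # No await inside, so this holds the event loop until blocking_work returns.
    return blocking_work(n)


async def run_on_event_loop(items):
    # gather() schedules the coroutines, but they still execute one at a time.
    return await asyncio.gather(*(as_coroutine(n) for n in items))


async def run_in_threads(items):
    # asyncio.to_thread (Python 3.9+) hands each call to the default thread pool.
    return await asyncio.gather(*(asyncio.to_thread(blocking_work, n) for n in items))


def submit_coroutine_function(n):
    # Submitting an "async def" function only creates a coroutine object.
    with ThreadPoolExecutor() as executor:
        result = executor.submit(as_coroutine, n).result()
    is_coroutine = asyncio.iscoroutine(result)
    result.close()  # discard it without running, avoiding a "never awaited" warning
    return is_coroutine
```

Calling `asyncio.run(run_on_event_loop(range(4)))` reports the main thread for every item, while `asyncio.run(run_in_threads(range(4)))` reports worker threads. Whether threads actually speed up PDF extraction depends on how much of the work releases the GIL, which this sketch does not measure.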
With reference to #91

Is extract_tables the only function with this issue? I am using multiprocessing with extract_words and haven't faced this issue so far. I wonder if this is just luck or if extract_words doesn't depend on the document-wide ._tokens issue that @jsvine mentioned in #91.

It will be very helpful if this aspect is mentioned in the documentation.
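
For readers trying the multiprocessing route described here, one possible arrangement is sketched below, assuming each worker process opens the PDF itself so that no page or document object is passed between processes. This does not settle whether extract_words depends on document-wide state; it only avoids sharing objects across workers. The helper names `extract_words_from_page` and `extract_words_parallel` are hypothetical, and reopening the file once per page is a simplification that may be slow for large documents.

```python
from concurrent.futures import ProcessPoolExecutor


def extract_words_from_page(task):
    """Open the PDF inside the worker and extract words from one page."""
    path, page_index = task
    import pdfplumber  # imported here so each worker process loads it itself

    with pdfplumber.open(path) as pdf:
        return page_index, pdf.pages[page_index].extract_words()


def extract_words_parallel(path, max_workers=None):
    """Return {page_index: words}, submitting one task per page."""
    import pdfplumber

    # Open once in the parent only to count the pages.
    with pdfplumber.open(path) as pdf:
        page_count = len(pdf.pages)

    tasks = [(path, index) for index in range(page_count)]
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() yields (page_index, words) pairs in task order.
        return dict(executor.map(extract_words_from_page, tasks))
```

A call such as `extract_words_parallel("test.pdf")` should sit under an `if __name__ == "__main__":` guard, since process pools on some platforms re-import the main module in each worker.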