Missing Pages #939
-
Hi @ZainAfr, and thanks for your interest in Also, does the issue persist even after repairing the PDF? To try repairing it, you can follow these instructions or use the new, experimental feature (as of |
-
I'm running into this issue when parallelizing the load of my PDF pages:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import pdfplumber
    from pdfplumber.page import Page

    # num_tokens_from_string() and parallel_results are not shown in this snippet.

    def process_page(page: Page):
        try:
            temp = page.extract_text()
            token_count = num_tokens_from_string(temp)
            print(f"--------- PAGE {page.page_number} ({token_count}) ----------")
            # Release cached page data once the text has been extracted.
            page.flush_cache()
            page.get_textmap.cache_clear()
            parallel_results[page.page_number] = temp
        except Exception as e:
            print(f"Error processing page {page.page_number}: {e}")

    def parse_pdf_parallel(file_path):
        start = time.time()
        with pdfplumber.open(file_path) as pdf:
            with ThreadPoolExecutor(max_workers=5) as executor:
                executor.map(process_page, pdf.pages)
        end = time.time()
        print(f"Total time taken: {end - start} seconds")

I get a few It's happening inconsistently, but frequently enough that I cannot use any multi-threading when loading PDF pages. I can share the file that it's happening with, if required. Also, I can't install |
-
... aaand problem solved:

    lock = threading.Lock()
    ...
    def process_page(page: Page):
        try:
            with lock:
                temp = page.extract_text()
            page.flush_cache()
            page.get_textmap.cache_clear()

Adding a lock around the |
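
For anyone who wants to try the lock-based workaround outside the poster's codebase, here is a minimal, self-contained sketch of the pattern. It is not pdfplumber's recommended approach, only an illustration under stated assumptions: `extract_page_text`, `extract_all`, and `FakePage` are hypothetical names introduced here, and `FakePage` is a stand-in that only mimics the `page_number` attribute and `extract_text()` method used in the snippets above, so the example runs without a PDF file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One shared lock, so only one thread at a time calls extract_text().
_extract_lock = threading.Lock()


def extract_page_text(page):
    """Return (page_number, text) for one page, extracting under the lock."""
    with _extract_lock:
        text = page.extract_text()
    return page.page_number, text


def extract_all(pages, max_workers=5):
    """Extract text from every page and return {page_number: text}."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # dict() consumes the lazy map, so any exception raised in a
        # worker is re-raised here instead of being silently dropped.
        return dict(executor.map(extract_page_text, pages))


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, for demonstration only."""

    def __init__(self, page_number, text):
        self.page_number = page_number
        self._text = text

    def extract_text(self):
        return self._text


if __name__ == "__main__":
    demo_pages = [FakePage(i, f"text of page {i}") for i in range(1, 6)]
    print(extract_all(demo_pages))
```

With a real file, the same helper would presumably be called as `extract_all(pdf.pages)` inside a `with pdfplumber.open(file_path) as pdf:` block. Because the lock serializes the `extract_text()` calls themselves, any speedup would come only from work done outside the lock, such as token counting. Returning results from the workers, rather than writing into a shared dictionary, also keeps each result tied to its page number.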
-
Hi 👋,
I have been using pdfplumber for extracting tables and text with no issues so far.
Recently, however, I came across an issue with multiple PDF files (all from the same source), where most pages in the file are not detected/returned when opening them with pdfplumber.
Apart from manually and programmatically verifying that pages were, in fact, not being detected, I used the to_image method with debug_tablefinder to see which parts of the text were being detected.
It turned out that the page the text was extracted from was not the same page that debug_tablefinder displayed. That is, debug_tablefinder would show the first page of the PDF, whereas extract_text run on the first page would return results from the 3rd page, the 5th page, or some other page than the actual first page.
Is this a known issue with PDFs encoded in a certain way, and is there possibly a workaround? Thanks 🙏