Missing Pages #939

ZainAfr · 2023-07-20T09:56:47Z


Hi 👋,

I have been using pdfplumber for extracting tables and text with no issues so far.

Recently, however, I came across an issue with multiple PDF files (all from the same source), where most pages in the file are not detected/returned when I open them with pdfplumber.

Apart from verifying manually and programmatically that pages were, in fact, not being detected, I used the to_image method with debug_tablefinder to see which parts of the text were being detected.

It turned out that the page the text was extracted from was not the same page that showed up with debug_tablefinder. That is, debug_tablefinder would show the first page of the PDF, whereas extract_text run on the first page would return results from the 3rd page, the 5th page, or some other page that was not the actual first page.

Is this a known issue with PDFs encoded in a certain way, and is there possibly a workaround? Thanks 🙏

jsvine · 2023-07-20T13:39:17Z

Maintainer

Hi @ZainAfr, and thanks for your interest in pdfplumber. I've never heard of something matching that description, although it's hard to say for sure without seeing the PDF itself. Could you attach a copy of the file here? (And also a short snippet of code that reproduces the problem you're noticing?)

Also, does the issue persist even after repairing the PDF? To try repairing it, you can follow these instructions or use the new, experimental feature (as of pdfplumber v0.10.0): pdf = pdfplumber.open(path, repair=True)

1 reply

ZainAfr Jul 22, 2023
Author

Hi @jsvine,

Thanks for getting back. It seems that repair=True solved the issue and now all pages are being detected across the board. Thanks!

With reference to the strange behaviour of image.debug_tablefinder and extract_text recognizing different pages at the same index, unfortunately, I cannot provide further information as it seems repair=True has made it so that I can't render any of the pages anymore.

P.S. I am doing all of this in Google Colab, if that makes any difference.

Thanks again!

EDIT: It seems that although all pages are now recognized, the pages that were initially missing are still unavailable: I get the following message when using .extract_text() on them:

WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

I will try and see if I can share a copy of the pdf file(s) in question, along with the code that produces the unexpected behavior.
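To tell which pages trigger the "Data-loss" message, one option is to capture log records while each page is processed. The following is a minimal sketch using only the standard logging module; the logger name comes from the warning line above, while capture_pdfminer_warnings and _Collector are hypothetical helper names, not part of pdfplumber or pdfminer.

```python
import logging
from contextlib import contextmanager


class _Collector(logging.Handler):
    """Hypothetical helper: stores the text of every WARNING-or-higher record."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def capture_pdfminer_warnings(logger_name="pdfminer"):
    """Yield a list that fills with warnings logged under `logger_name`.

    Records from child loggers such as "pdfminer.pdftypes" propagate up
    to this logger, so attaching one handler here is enough.
    """
    logger = logging.getLogger(logger_name)
    handler = _Collector()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        # Always detach, so repeated use does not stack handlers
        logger.removeHandler(handler)
```

Wrapping each page's .extract_text() call in `with capture_pdfminer_warnings() as msgs:` and checking whether msgs is non-empty afterwards would give a list of affected page numbers to include in a bug report.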

aronweiler · 2024-05-08T05:43:12Z


I'm running into this issue when parallelizing the load of my PDF pages:

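import time
from concurrent.futures import ThreadPoolExecutor

import pdfplumber
from pdfplumber.page import Page

# Page number -> extracted text (assumed to be a dict; its definition is not shown in the post)
parallel_results = {}

def process_page(page: Page):
    try:
        temp = page.extract_text()
        # num_tokens_from_string: token-counting helper defined elsewhere (not shown)
        token_count = num_tokens_from_string(temp)
        print(f"--------- PAGE {page.page_number} ({token_count}) ----------")
        # Drop cached page data once the text has been extracted
        page.flush_cache()
        page.get_textmap.cache_clear()

        parallel_results[page.page_number] = temp
    except Exception as e:
        print(f"Error processing page {page.page_number}: {e}")

def parse_pdf_parallel(file_path):
    start = time.time()
    with pdfplumber.open(file_path) as pdf:
        # All five workers read pages from the same open PDF
        with ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(process_page, pdf.pages)
    end = time.time()
    print(f"Total time taken: {end - start} seconds")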

I get a few "Data-loss while decompressing corrupted data" warnings in my output, and when I compare the results with those from loading synchronously, I see differences.

It's happening inconsistently, but frequently enough that I cannot use any multi-threading when loading PDF pages.

I can share the file that it's happening with, if required. Also, I can't install Ghostscript due to licensing issues, and therefore can't use the repair=True flag (even though that is probably not the issue here).
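For anyone wanting to reproduce the comparison between a sequential and a parallel run, a small helper along these lines may be enough. It is only a sketch: diff_page_texts is a hypothetical name, and it assumes both runs store their output as a dict mapping page number to extracted text, as parallel_results appears to do above.

```python
def diff_page_texts(sequential, parallel):
    """Return the sorted page numbers whose text differs between two runs.

    A page that is present in only one of the two dicts also counts
    as a difference.
    """
    all_pages = set(sequential) | set(parallel)
    return sorted(
        page for page in all_pages
        if sequential.get(page) != parallel.get(page)
    )
```

An empty list means the two runs agree on every page; any page numbers returned are the ones worth inspecting by hand.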


aronweiler · 2024-05-08T05:53:09Z


... aaand problem solved:

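import threading

lock = threading.Lock()
...
def process_page(page: Page):
    try:
        # Only one thread at a time may extract text or clear the caches
        with lock:
            temp = page.extract_text()
            page.flush_cache()
            page.get_textmap.cache_clear()

        # Remainder as in the earlier version
        parallel_results[page.page_number] = temp
    except Exception as e:
        print(f"Error processing page {page.page_number}: {e}")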

Adding a lock around the .extract_text() and cache clears resolves the issue with barely any slowdown.
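A self-contained version of the same idea is sketched below. It assumes only that each page object exposes page_number and extract_text(), as pdfplumber pages do; extract_all and _pdf_lock are hypothetical names. A plausible explanation for why the lock helps, not confirmed in this thread, is that all pages read from the same underlying file, so concurrent reads can interfere with each other.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock shared by all workers: only one extract_text() call runs at a time
_pdf_lock = threading.Lock()


def extract_all(pages, max_workers=5):
    """Return {page_number: text} for page-like objects, using a thread pool.

    The extraction itself is serialized by the lock; any per-page work
    placed outside the `with` block could still run concurrently.
    """
    def work(page):
        with _pdf_lock:
            text = page.extract_text()
        return page.page_number, text

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map re-raises worker exceptions when its results are consumed
        return dict(executor.map(work, pages))
```

Returning results from the workers, instead of writing to a shared dict, also means an exception in any worker surfaces in the caller rather than being silently dropped.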

