
Following dependencies are missing: pikepdf. Please install them using pip install pikepdf. #2984

Closed
OLR-Nadia opened this issue May 8, 2024 · 10 comments

Comments

@OLR-Nadia

I'm currently developing a RAG application using langchain and unstructured to read PDF files, but I've hit an error I cannot solve, and it blocks me from moving on to the next part of the code.

from unstructured.partition.pdf import partition_pdf
fpath = "files/"
fname = "darlie-brief.pdf"

# Extract elements from PDF
def extract_pdf_elements(path, fname):
    """
    Extract images, tables, and chunk text from a PDF file.
    path: File path, which is used to dump images (.jpg)
    fname: File name
    """
    return partition_pdf(
        filename=path + fname,
        # extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        strategy="fast",
        # image_output_dir_path=path,
    )
# Get elements
raw_pdf_elements = extract_pdf_elements(fpath, fname)

printed error message:

Following dependencies are missing: pikepdf. Please install them using pip install pikepdf.
PDF text extraction failed, skip text extraction...

current version I'm using:

  • python 3.10.11
  • unstructured==0.13.6
  • unstructured-client==0.18.0
  • unstructured-inference==0.7.29
  • unstructured.pytesseract==0.3.12

all libraries installed in venv

@scanny
Collaborator

scanny commented May 8, 2024

@OLR-Nadia please copy and paste in the exact error message with full stack trace so we can see where this message is arising.

@scanny
Collaborator

scanny commented May 8, 2024

Also, please mention how you installed unstructured and what OS you're on (Linux, macOS, Windows).

@OLR-Nadia
Author

@scanny Hi Steve, the error message is just the printed output shown earlier (see the image with the green mark).
I installed the unstructured library inside my virtual environment as shown in the image with the blue mark. I'm using Windows 11.
image

@scanny
Collaborator

scanny commented May 10, 2024

All of these commands should be run after activating your virtualenv.

First, you'll need to install unstructured with some extras, minimally:

> pip install unstructured[pdf]

Then you should be able to see pikepdf in the list when you run:

> pip list

And this command should produce no error messages:

> python -c "import pikepdf"

If that all works try again and let us know how you go.
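The same check can be done from Python itself, which also shows which interpreter is actually running. This is only a minimal sketch: `missing_modules` is a hypothetical helper name, and `unstructured_inference` is assumed to be the import name of the unstructured-inference package.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # If this path is not inside your virtualenv, pip installed into a
    # different environment than the one running your script.
    print("interpreter:", sys.executable)
    # "unstructured_inference" is an assumed import name, not confirmed here.
    print("missing:", missing_modules(["pikepdf", "unstructured_inference"]))
```

Run it with the same interpreter that runs your script. An empty "missing" list means the imports should resolve.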

@OLR-Nadia
Author

I already followed your instructions and still got the same error message.

image
image
image

@scanny
Collaborator

scanny commented May 10, 2024

Ok, next thing to try is running this outside of VSCode, like in a terminal/command-line application. I don't completely understand why, but we've had other reports like this where the VSCode terminal was doing something unexpected.

It's possible it's related to the VSCode environment setting: https://code.visualstudio.com/docs/python/environments#_creating-environments

Anyway, we'll want to eliminate VSCode subtleties for diagnostic purposes and running it directly from the command-line in the activated virtualenv should do that.
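One way to compare the two terminals is to run the same small script in each and see whether a virtualenv is really active. This is a rough sketch, not something unstructured requires; `in_virtualenv` is a hypothetical helper name, and the check only covers environments created with `venv` or `virtualenv`, not conda.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    print("in venv:    ", in_virtualenv())
```

If the VSCode terminal and the plain terminal print different interpreters, packages installed from one will not be visible to the other.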

Btw are you using conda for Python and have you checked the installation docs here?:
https://docs.unstructured.io/open-source/installation/full-installation#installation-with-conda-on-windows

@OLR-Nadia
Author

Hi Steve,

I wanted to update you that the issue has been solved. I initially tried installing and uninstalling the library via the command line, outside of VSCode, and encountered a LongPathEnabled issue on my Windows system, which prevented the installation of the unstructured-inference library.

After addressing this, I successfully installed the unstructured-inference and pikepdf libraries separately.

Thank you for your assistance.

@scanny
Collaborator

scanny commented May 13, 2024

@OLR-Nadia glad you got it working :)

Can you say any more about the LongPathEnabled problem like how it presented itself and how you resolved it? I expect others will face the same problem so it would be a big help here to have an idea how to resolve the problem :)

@OLR-Nadia
Author

Previously, while I was installing the unstructured-inference library, one of its dependencies failed to install because the path was too long (see the warning below).
image

So I googled it and found that the issue was the long-path setting on my Windows machine. You can change it by opening the Registry Editor, searching for LongPathsEnabled (under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem), and changing the value from 0 to 1. If it still can't be changed, try running the Registry Editor as Administrator. After that, I re-installed the libraries and it worked.
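For anyone who wants to check the setting without opening the Registry Editor, here is a small read-only sketch using the standard-library `winreg` module. `long_paths_enabled` is a hypothetical helper name. The script only reads the value; changing it still needs administrator rights.

```python
import sys


def long_paths_enabled():
    """Return True/False on Windows, or None if it cannot be determined."""
    if sys.platform != "win32":
        return None  # the setting only exists on Windows
    import winreg  # standard library, Windows only

    key_path = r"SYSTEM\CurrentControlSet\Control\FileSystem"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except OSError:
        return None  # key or value missing, or access denied
    return value == 1


if __name__ == "__main__":
    print("long paths enabled:", long_paths_enabled())
```

A result of `False` or `None` on Windows suggests pip may still fail on deeply nested package paths.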

Hope this helps!

@scanny
Collaborator

scanny commented May 14, 2024

@OLR-Nadia This is terrific, thanks so much for this Nadia! :)
