PDF parsing error handling #14

ricomnl · 2023-04-26T20:41:56Z

Hi, it would be useful to add error handling for the case where a PDF fails to parse. I got the error below after parsing thousands of PDFs and had to restart from scratch (not a big deal, since I used a small model for embedding, but it would have been annoying with a large OpenAI model).

(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf 
test.pdf:   0%|  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rico/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
    documents[fn] = process(
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

ricomnl · 2023-04-26T20:42:28Z

Never mind, I just realized it caches everything. Error handling would still be nice to have, though.

freedmand · 2023-04-27T04:44:58Z

I should probably make the cache handling clearer in the docs so folks are reassured.

Great point re: error handling. Logging an error message and continuing is the way to go here. Also, if there's a PDF that fails to parse but should parse fine (and you're comfortable sharing it), let me know!
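
For illustration, here is a minimal sketch of the log-and-continue idea. It is not semantra's actual code: `process_all` and `fake_loader` are hypothetical names, and the `process_one` callable stands in for whatever function extracts text from a single file. In practice the `except` clause would probably be narrowed to the parser's own exception type, such as the `PdfiumError` seen in the traceback.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("pdf_batch")  # hypothetical logger name


def process_all(
    filenames: Iterable[str],
    process_one: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run process_one on every file; log and skip failures instead of aborting."""
    results: Dict[str, str] = {}
    failed: List[Tuple[str, str]] = []
    for fn in filenames:
        try:
            results[fn] = process_one(fn)
        except Exception as exc:  # assumption: narrow this to the parser's error type
            logger.error("Skipping %s: %s", fn, exc)
            failed.append((fn, str(exc)))
    return results, failed


def fake_loader(fn: str) -> str:
    """Stand-in for a real PDF text extractor; fails on one file on purpose."""
    if fn.endswith("empty.pdf"):
        raise ValueError("Failed to load document")
    return "text of " + fn


docs, failed = process_all(["a.pdf", "empty.pdf", "b.pdf"], fake_loader)
```

With this shape, one unreadable file produces a log line and an entry in `failed`, while the remaining files are still processed and returned in `docs`.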

ricomnl · 2023-04-28T16:08:18Z

It was a fault on my end; the PDF was empty for some reason.

ricomnl · 2023-04-28T16:09:18Z

I also noticed that search is quite slow with thousands of PDFs. Is this because I'm using a relatively big model, or just because the files are in PDF format? Would it be faster with raw text, or with a smaller model?

freedmand added the enhancement label (New feature or request) on Apr 27, 2023
