Feature Request: OCR support for digitally signed documents. #603

Open
ShakataGaNai opened this issue Mar 1, 2024 · 3 comments
Labels
enhancement New feature or request feature request

Comments

@ShakataGaNai

I'm running v3.1 from Docker containers for testing (per https://docs.papermerge.io/3.1/setup/docker-compose/ ). When you upload a digitally signed document and attempt to OCR it, the process fails silently. The worker logs show a clear error message:

[2024-03-02 00:20:56,933: ERROR/ForkPoolWorker-8] Task papermerge.core.tasks.ocr_document_task[77be2d59-9703-42df-a3cc-bf920a61eab4] raised unexpected: DigitalSignatureError()
Traceback (most recent call last):
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "/core_app/papermerge/core/tasks.py", line 79, in ocr_document_task
    ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 86, in ocr_document
    _ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 54, in _ocr_document
    ocrmypdf.ocr(
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/api.py", line 337, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 388, in run_pipeline
    validate_pdfinfo_options(context)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_pipeline.py", line 204, in validate_pdfinfo_options
    raise DigitalSignatureError()
ocrmypdf.exceptions.DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,
invalidating the signature.

I can't find any mention of this anywhere, but supporting OCR for digitally signed documents would be nice. Perhaps the version dropdown could indicate something like "Version X w/ OCR and w/o digital signature". Honestly, I don't even care about accessing a version of the document with OCR'd text, so long as the text is there for full-text search, especially when dealing with a large number of signed legal documents.
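
One possible shape for this, as a minimal sketch rather than Papermerge's actual code: catch the signature error, then retry the OCR into a separate output file with signature invalidation switched on, so the signed original stays untouched and only the new version loses its signature. The function name `ocr_keeping_original` and the version labels are hypothetical. The `invalidate_digital_signatures` keyword is assumed to be supported by the installed OCRmyPDF release and should be checked against its documentation.

```python
from typing import Callable, Type


def ocr_keeping_original(
    run_ocr: Callable[..., object],
    signature_error: Type[Exception],
    input_pdf: str,
    output_pdf: str,
) -> str:
    """Run OCR; if the input is digitally signed, retry with invalidation.

    Returns a label for the resulting document version. The label strings
    are hypothetical, not taken from Papermerge.
    """
    try:
        run_ocr(input_pdf, output_pdf)
        return "with OCR"
    except signature_error:
        # The input file is never modified: the OCR output is written to
        # output_pdf, so only that new version loses the signature.
        # Assumption: run_ocr accepts invalidate_digital_signatures.
        run_ocr(input_pdf, output_pdf, invalidate_digital_signatures=True)
        return "with OCR, without digital signature"
```

With OCRmyPDF this would be called roughly as `ocr_keeping_original(ocrmypdf.ocr, ocrmypdf.exceptions.DigitalSignatureError, src, dst)`. The exception class is the one shown in the traceback above. Passing both in as arguments keeps the fallback logic testable without OCRmyPDF installed.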

@ciur
Owner

ciur commented Mar 3, 2024

Thank you for opening this ticket.

Would you mind uploading a digitally signed document that I can experiment with? Of course, I mean a document without sensitive information. A one-page, digitally signed document with a couple of words would do the job just fine.

This will help me understand your request better and, of course, validate the feature while developing it.

@ShakataGaNai
Author

Attaching three. One is a digital document pushed straight through DocuSign. One is the same document printed, then scanned, and run through DocuSign. The third is the same printed-and-scanned document signed with Adobe Acrobat (which I'm least confident will work, because Adobe...)

Lipsum scan - adobe signed.pdf
Lipsum scan - docusign.pdf
lipsum - docusign.pdf

@bluekitedreamer

Possibly a simple issue to fix; see another recently filed issue with a suggested solution (#614 (comment)).
