Feature Request: OCR support for digitally signed documents. #603

ShakataGaNai · 2024-03-01T23:30:04Z

I'm running v3.1 in Docker containers for testing (per https://docs.papermerge.io/3.1/setup/docker-compose/ ). When you upload a digitally signed document and attempt to OCR it, the process fails silently. The worker logs show a perfectly logical error message:

[2024-03-02 00:20:56,933: ERROR/ForkPoolWorker-8] Task papermerge.core.tasks.ocr_document_task[77be2d59-9703-42df-a3cc-bf920a61eab4] raised unexpected: DigitalSignatureError()
Traceback (most recent call last):
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "/core_app/papermerge/core/tasks.py", line 79, in ocr_document_task
    ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 86, in ocr_document
    _ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 54, in _ocr_document
    ocrmypdf.ocr(
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/api.py", line 337, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 388, in run_pipeline
    validate_pdfinfo_options(context)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_pipeline.py", line 204, in validate_pdfinfo_options
    raise DigitalSignatureError()
ocrmypdf.exceptions.DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,
invalidating the signature.

I can't find any mention of this anywhere, but supporting OCR for digitally signed documents would be nice. Perhaps the version dropdown could indicate something like "Version X with OCR, without digital signature". Honestly, I don't even care about accessing a version of the document with OCR'd text, so long as the text is there for full-text search. This matters especially when dealing with a large number of signed legal documents.

ciur · 2024-03-03T08:12:52Z

Thank you for opening this ticket.

Would you mind uploading a digitally signed document that I can experiment with? I mean a document without sensitive information, of course. A one-page, digitally signed document with a couple of words would do the job just fine.

This will help me understand your request better and, of course, validate the feature while developing it.

ShakataGaNai · 2024-03-03T22:58:07Z

Attaching three files. One is a born-digital document sent straight through DocuSign. One is the same document printed, scanned, and then sent through DocuSign. The third is the same printed and scanned document signed with Adobe Acrobat (this is the one I'm least confident will work, because Adobe...)

Lipsum scan - adobe signed.pdf
Lipsum scan - docusign.pdf
lipsum - docusign.pdf

bluekitedreamer · 2024-04-23T06:16:09Z

This is possibly a simple issue to fix. See the recently filed issue #614, which includes a suggested solution (#614 (comment)).
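
Going only by the title of #614 ("Invalidate digital signatures for uploaded PDF files"), a fallback in the worker might look roughly like the sketch below. This is not Papermerge's actual code. The helper name ocr_with_signature_fallback is hypothetical, and the sketch assumes the installed OCRmyPDF version accepts an invalidate_digital_signatures keyword on ocrmypdf.ocr, which should be checked against the version in the container. The OCR callable and the exception type are passed in so the logic can be tried without OCRmyPDF installed.

```python
from typing import Any, Callable, Type


def ocr_with_signature_fallback(
    ocr: Callable[..., Any],
    signature_error: Type[Exception],
    input_pdf: str,
    output_pdf: str,
    **ocr_kwargs: Any,
) -> bool:
    """Run OCR, retrying with signature invalidation if the PDF is signed.

    Returns True when the signature had to be invalidated, so the caller
    can label the resulting document version accordingly.
    """
    try:
        # First attempt: leave any digital signature untouched.
        ocr(input_pdf, output_pdf, **ocr_kwargs)
        return False
    except signature_error:
        # Assumption: the OCR callable accepts this keyword argument.
        retry_kwargs = {**ocr_kwargs, "invalidate_digital_signatures": True}
        ocr(input_pdf, output_pdf, **retry_kwargs)
        return True


# Hypothetical usage inside the worker (requires OCRmyPDF to be installed):
#   import ocrmypdf
#   from ocrmypdf.exceptions import DigitalSignatureError
#   invalidated = ocr_with_signature_fallback(
#       ocrmypdf.ocr, DigitalSignatureError, "in.pdf", "out.pdf"
#   )
```

The boolean return value is there so the UI could mark the OCR'd version as no longer carrying a valid signature, along the lines of the version dropdown idea above. The original signed file would still need to be kept as its own version.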

ShakataGaNai added the enhancement and feature request labels Mar 1, 2024

ShakataGaNai assigned ciur Mar 1, 2024

bluekitedreamer mentioned this issue Apr 23, 2024

Invalidate digital signatures for uploaded PDF files #614

Open
