The test cases aren't great. I OCRed these files fairly naively with PyPDFOCR (which, in turn, relies on Poppler and Tesseract), and in my experience the OCR quality is low. The biggest problem is that it randomly fails to recognize digits, even ones that are perfectly legible. So 123-45-6789 might be OCRed as 123-4-6789, which, of course, is not identified as an SSN.
Review at least a few dozen SSN-bearing PDFs, testing the candidate regular expression against real data, to fine-tune it.
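As a rough starting point for that review, here is a minimal sketch of checking a candidate pattern against extracted text. The pattern, the `find_ssn_candidates` helper, and the sample strings are all assumptions for illustration, not the regex this project actually uses. It shows how a dropped digit like the one described above slips past a strict pattern.

```python
import re

# Hypothetical candidate pattern (an assumption, not the project's regex):
# three digits, two digits, four digits, separated by a hyphen or a space,
# and not embedded in a longer run of digits.
SSN_CANDIDATE = re.compile(r"(?<!\d)\d{3}[- ]\d{2}[- ]\d{4}(?!\d)")


def find_ssn_candidates(text):
    """Return every substring of text that matches the candidate pattern."""
    return SSN_CANDIDATE.findall(text)


# Illustrative strings standing in for OCR output from real PDFs.
samples = {
    "clean": "SSN: 123-45-6789",
    "dropped_digit": "SSN: 123-4-6789",  # OCR lost a digit, so no match
    "phone_number": "Phone: 555-1234",   # should not be flagged
}

for name, text in samples.items():
    print(name, find_ssn_candidates(text))
```

Running this over the text of a few dozen real PDFs, and comparing the matches against a manual count, would show how many SSNs are missed because of OCR errors rather than because of the pattern itself.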