OCR software (notably Tesseract) sometimes gets a little too clever and decides that hyphens are actually em dashes, en dashes, or minus signs. (Possibly other characters too, I'm not sure.) For example, it produces 012—34—5678 instead of 012-34-5678. Our regex, of course, does not find these.
I suggest that we convert the character set down to ASCII (assuming that Ruby can do such a thing), so that all hyphen-like characters become, simply, hyphens.
I tried dealing with this within regular expressions, but it went badly. We're using a backreference for the second appearance of a hyphen, but that assumes that the same character is being used for each hyphen. If the first one is a hyphen and the second one is an em dash, then the SSN won't be found.
What we really need is to match a second hyphen-shaped thing if a first hyphen-shaped thing is found.
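To illustrate the backreference problem, here is a minimal sketch. The project's actual regex is not shown in this thread, so the patterns and the constant names `DASH`, `STRICT`, and `LOOSE` are assumptions made up for the example. It contrasts a backreference with a character class that accepts any hyphen-shaped character at each position.

```ruby
# Characters assumed to be "hyphen-shaped": ASCII hyphen, U+2010 hyphen,
# en dash (U+2013), em dash (U+2014), minus sign (U+2212), and tilde.
DASH = /[-\u2010\u2013\u2014\u2212~]/

# Backreference version (hypothetical): \1 forces the second separator
# to be the very same character as the first.
STRICT = /\b\d{3}(#{DASH})\d{2}\1\d{4}\b/

# Character-class version (hypothetical): each separator may independently
# be any hyphen-shaped character.
LOOSE = /\b\d{3}#{DASH}\d{2}#{DASH}\d{4}\b/

mixed = "012-34\u20145678" # ASCII hyphen first, em dash second

puts STRICT.match?(mixed) # false: the separators differ
puts LOOSE.match?(mixed)  # true
```

The trade-off is that the looser pattern also accepts mixed separators in text that was never OCR'd, which may or may not matter here.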
Or, of course, we can just convert all em dashes, en dashes, and minus signs to hyphens before running this regex, as proposed in the initial issue.
I think the proper solution is to do a conversion on the input stream. Replace −, –, —, ~, and ‐ with -. (There are other characters that look the same, but OCR software is vanishingly unlikely to use them, in my experience.)
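A minimal sketch of that conversion, assuming the input is a UTF-8 string and that the last character in the list above is U+2010. The names `HYPHEN_LIKE` and `normalize_hyphens` are hypothetical, not part of the existing code.

```ruby
# Hyphen-like characters to replace, written as escapes so the source
# stays ASCII: minus sign, en dash, em dash, tilde, and U+2010 hyphen
# (the last code point is an assumption).
HYPHEN_LIKE = "\u2212\u2013\u2014~\u2010"

# Hypothetical helper: returns a copy of the text with every hyphen-like
# character replaced by an ASCII hyphen. String#tr pads the shorter
# replacement set with its last character, so "-" covers all of them.
def normalize_hyphens(text)
  text.tr(HYPHEN_LIKE, "-")
end

puts normalize_hyphens("012\u201434\u20135678") # prints 012-34-5678
```

Running this on the text before the existing regex would let the backreference stay as it is. Note that it also rewrites legitimate tildes, which is probably acceptable for matching purposes but would matter if the normalized text were shown to users.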
I'm assigning this to @jazzido, since this is now out of my bailiwick.
waldoj changed the title from "Deal with em dashes, en dashes, and minus signs" to "Deal with em dashes, en dashes, etc." on Jun 22, 2014