Deal with em dashes, en dashes, etc. #3

waldoj · 2014-06-22T03:46:50Z

OCR software (including, notably, Tesseract) sometimes gets a little too clever, and believes that hyphens are actually em dashes, en dashes, or minus signs. (Possibly other characters too, I'm not sure.) For example, it may return 012—34—5678 instead of 012-34-5678. Our regex, of course, does not match these.

I suggest that we convert the character set down to ASCII (assuming that Ruby can do such a thing), so that all hyphen-like characters become, simply, hyphens.


waldoj · 2014-06-22T14:19:57Z

I tried dealing with this within regular expressions, but it went badly. We're using a backreference for the second appearance of a hyphen, but that assumes that the same character is being used for each hyphen. If the first one is a hyphen and the second one is an em dash, then the SSN won't be found.

What we really need is to match a second hyphen-shaped thing if a first hyphen-shaped thing is found.
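For illustration, here is a rough sketch of what "hyphen-shaped thing" matching might look like. It drops the backreference and uses the same character class in both separator positions, so a plain hyphen followed by an em dash still matches. The names `HYPHEN_LIKE`, `SSN_CANDIDATE`, and `find_ssn_candidates` are made up for this example, and the pattern is a simplified stand-in, not the project's actual regex.

```ruby
# Hypothetical character class: ASCII hyphen plus the look-alikes OCR tends
# to emit (U+2010 hyphen, U+2013 en dash, U+2014 em dash, U+2212 minus sign,
# and tilde). The leading "-" is literal because it comes first in the class.
HYPHEN_LIKE = "[-\u2010\u2013\u2014\u2212~]"

# Simplified SSN-shaped pattern (an assumption, not the project's regex):
# three digits, separator, two digits, separator, four digits.
# Each separator is matched independently, so mixed separators are accepted.
SSN_CANDIDATE = /\b\d{3}#{HYPHEN_LIKE}\d{2}#{HYPHEN_LIKE}\d{4}\b/

# Returns every SSN-shaped substring found in the text.
def find_ssn_candidates(text)
  text.scan(SSN_CANDIDATE)
end
```

The trade-off is that separators are no longer required to be consistent, which is exactly what mixed OCR output needs, but it only covers the hyphenated form.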

Or, of course, we can just convert all em dashes, en dashes, and minus signs to hyphens before running this regex, as proposed in the initial issue.

waldoj · 2014-06-22T16:26:30Z

I think the proper solution is to do a conversion on the input stream. Replace −, –, —, ~, and ‐ with -. (There are other characters that look the same, but OCR software is vanishingly unlikely to use them, in my experience.)
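A minimal sketch of that conversion, assuming the input is already a UTF-8 string. The names `HYPHEN_LOOKALIKES` and `normalize_hyphens` are invented for this example; only `String#gsub` is standard Ruby.

```ruby
# The five look-alike characters listed above, written as escapes so the
# source stays unambiguous: U+2212 minus sign, U+2013 en dash,
# U+2014 em dash, tilde, and U+2010 hyphen.
HYPHEN_LOOKALIKES = /[\u2212\u2013\u2014~\u2010]/

# Returns a copy of the text with every look-alike replaced by an ASCII
# hyphen, so the existing regex (backreference included) can run unchanged.
def normalize_hyphens(text)
  text.gsub(HYPHEN_LOOKALIKES, "-")
end
```

With this in place, `normalize_hyphens("012\u201434\u20135678")` returns `"012-34-5678"`. One caveat: every tilde in the input is rewritten too, which is fine for SSN scanning but would be lossy if the text were used for anything else.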

I'm assigning this to @jazzido, since this is now out of my bailiwick.

waldoj added the bug label Jun 22, 2014

waldoj assigned jazzido Jun 22, 2014

waldoj mentioned this issue Jun 22, 2014

Test candidate regex on actual PDFs #11

Closed

waldoj changed the title ~~Deal with em dashes, en dashes, and minus signs~~ Deal with em dashes, en dashes, etc. Jun 22, 2014
