Deal with em dashes, en dashes, etc. #3

Open
waldoj opened this issue Jun 22, 2014 · 2 comments

waldoj commented Jun 22, 2014

OCR software (including, notably, Tesseract) sometimes gets a little too clever, and believes that hyphens are actually em dashes, en dashes, or minus signs. (Possibly other characters too, I'm not sure.) e.g., 012—34—5678 instead of 012-34-5678. These, of course, are not found by our regex.

I suggest that we convert the character set down to ASCII (assuming that Ruby can do such a thing), so that all hyphen-like characters become, simply, hyphens.

waldoj added the bug label Jun 22, 2014

waldoj commented Jun 22, 2014

I tried dealing with this within regular expressions, but it went badly. We're using a backreference for the second appearance of a hyphen, but that assumes that the same character is being used for each hyphen. If the first one is a hyphen and the second one is an em dash, then the SSN won't be found.

What we really need is to match a second hyphen-shaped thing if a first hyphen-shaped thing is found.
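A minimal sketch of the difference, assuming a simplified SSN pattern (the project's actual regex is not shown in this thread, so both patterns and their constant names are hypothetical). The first pattern reuses the first separator through a backreference; the second matches each separator independently against a small set of hyphen-shaped characters (hyphen, en dash U+2013, em dash U+2014, minus sign U+2212).

```ruby
# Hypothetical patterns for illustration only; not the project's real regex.

# Backreference version: \1 must repeat whatever the first separator was,
# so a hyphen followed by an em dash is rejected.
BACKREF_SSN = /\b\d{3}([\-\u2013\u2014\u2212])\d{2}\1\d{4}\b/

# Character-class version: each separator is matched on its own,
# so any mix of hyphen-shaped characters is accepted.
ANY_DASH_SSN = /\b\d{3}[\-\u2013\u2014\u2212]\d{2}[\-\u2013\u2014\u2212]\d{4}\b/

# A hyphen for the first separator and an em dash (U+2014) for the second.
MIXED = "012-34\u20145678"

puts BACKREF_SSN.match?(MIXED)   # false
puts ANY_DASH_SSN.match?(MIXED)  # true
```

The character class keeps the fix inside the regex, at the cost of repeating the separator set in every place a hyphen may appear.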

Or, of course, we can just convert all em dashes, en dashes, and minus signs to hyphens before running this regex, as proposed in the initial issue.

waldoj commented Jun 22, 2014

I think the proper solution is to do a conversion on the input stream: replace the em dash, en dash, minus sign, and ~ with -. (There are other characters that look the same, but OCR software is vanishingly unlikely to use them, in my experience.)
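One way the conversion could look in Ruby, as a rough sketch rather than the project's implementation. The helper name `normalize_hyphens`, the constant names, the simplified SSN pattern, and the exact character set (en dash U+2013, em dash U+2014, minus sign U+2212, and the tilde) are all assumptions made for illustration.

```ruby
# Hypothetical helper; the character set is an assumption based on this thread.
# En dash, em dash, minus sign, and tilde, written as escapes to avoid
# depending on the source file's encoding.
HYPHEN_LIKE = "\u2013\u2014\u2212~"

# Simplified stand-in for the project's SSN regex (not shown in the thread).
SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/

# Replace every hyphen-like character with an ASCII hyphen.
# String#tr pads the shorter replacement string with its last character,
# so every character in HYPHEN_LIKE maps to "-".
def normalize_hyphens(text)
  text.tr(HYPHEN_LIKE, "-")
end

# Simulated OCR output: em dash first, minus sign second.
OCR_TEXT = "SSN: 012\u201434\u22125678"
CLEANED = normalize_hyphens(OCR_TEXT)

puts CLEANED                      # SSN: 012-34-5678
puts SSN_PATTERN.match?(CLEANED)  # true
```

Normalizing first means the existing regex, backreference included, can stay as it is. The trade-off is that a legitimate tilde elsewhere in the text is also rewritten, which only matters if the normalized text is used for anything beyond matching.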

I'm assigning this to @jazzido, since this is now out of my bailiwick.

waldoj changed the title from "Deal with em dashes, en dashes, and minus signs" to "Deal with em dashes, en dashes, etc." Jun 22, 2014