KeywordProcessor returns wrong span for text containing non-ASCII characters when case_sensitive=False #119
Comments
Hey Mauro, it doesn't look like the repo is being actively maintained these days. As a pet project, I was going to go through the codebase and give this a revamp, and given this issue is not exceptionally common, non-ascii character or otherwise, what I've done to address the issues amounts to the following:
In such instances, the onus is usually on the user to make sure the text is normalised. This is fundamentally a text cleanliness issue rather than an issue with calculating the spans, which so far looks to be behaving as it should in this case. If the length of the string changes part way through processing, I would consider it sensible to raise an error and stop the span from being calculated incorrectly.
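As a rough illustration of the "raise an error" idea, here is a minimal sketch in plain Python. The helper name `lower_or_raise` is hypothetical and is not part of the library's API; it only assumes that lowercasing is done with `str.lower()` and that spans are later reported against the original text.

```python
def lower_or_raise(text: str) -> str:
    """Lowercase text, refusing input whose length would change.

    Hypothetical helper for illustration only; not part of any library API.
    """
    for index, char in enumerate(text):
        # A character whose lowercase form is not exactly one code point
        # would shift every span that comes after it.
        if len(char.lower()) != 1:
            raise ValueError(
                f"character {char!r} at index {index} changes length when "
                "lowercased; normalise the text before extracting keywords"
            )
    return text.lower()


print(lower_or_raise("Hello World"))

try:
    lower_or_raise("\u0130stanbul")  # starts with U+0130, capital I with dot above
except ValueError as exc:
    print("rejected:", exc)
```

The trade-off is that the caller has to clean the text up front, but a span can no longer be silently shifted.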
Hi all, first thanks a lot for the great library you created, I really appreciate it!
When working with non-ASCII characters, I found a case where the span returned by the KeywordProcessor is wrong when case_sensitive=False. Please find a sample below that reproduces the error:
Output:
When looking into the error, I found that the length of "İ" changes from 1 (uppercase) to 2 (lowercase), which I believe causes the span shift (the span is only wrong when matching is not case-sensitive).
Could any of the authors comment on the issue and say whether they intend to do something about it or whether it is out of scope?
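The length change can be checked with plain Python, independently of the library. This is only a sketch of the underlying behaviour, not the original reproduction sample; the example sentence and variable names are made up for illustration.

```python
# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ").
upper = "\u0130"
lower = upper.lower()

# Lowercasing yields "i" followed by U+0307 COMBINING DOT ABOVE,
# so one code point becomes two.
print(len(upper), len(lower))
print([hex(ord(c)) for c in lower])

# Offsets found in the lowercased text no longer match the original text.
text = "\u0130stanbul is big"
lowered = text.lower()
start_in_original = text.find("big")
start_in_lowered = lowered.find("big")
print(start_in_original, start_in_lowered)

# Slicing the original text with the offset from the lowered text
# gives the wrong substring.
print(repr(text[start_in_lowered:start_in_lowered + 3]))
```

Any offset computed on the lowercased string after the "İ" is one position too large when applied to the original string.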
Thanks a lot!