Case-insensitive Unicode manipulation #49

Open
ashvardanian opened this issue Sep 23, 2023 · 1 comment
Labels
huge Large task potentially involving architectural or breaking changes

Comments

@ashvardanian
Owner

Python strings offer a lot of powerful methods, such as:

  • isalnum, isalpha, isascii, isdecimal, isdigit, isspace, islower, isupper, istitle, isnumeric for checks.
  • lower and upper, which return a converted copy of the string.
  • casefold, described in Section 3.13 of the Unicode Standard.

Very few C-level libraries provide such functionality, and most of those that do are not known for their speed. Covering a subset of that functionality in StringZilla makes sense.

@ashvardanian ashvardanian added the huge Large task potentially involving architectural or breaking changes label Feb 27, 2024
@ashvardanian
Owner Author

Starting with v3, part of this functionality is already available for ASCII strings. Implementing the same for UTF-8 would involve preparing huge dictionaries and potentially designing a SIMD-friendly trie or automaton, so we are not rushing those features for now.
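To illustrate the gap between the two cases, here is a rough Python sketch. It assumes nothing about StringZilla's actual implementation, and `ascii_lower` is a hypothetical helper. ASCII lowercasing is a per-byte range check that never changes the length, while Unicode case folding maps code points through a table and can change the number of UTF-8 bytes.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only the ASCII letters 'A'..'Z'; every other byte is kept."""
    # Setting bit 0x20 maps 'A'..'Z' (0x41..0x5A) onto 'a'..'z'.
    return bytes(b | 0x20 if 0x41 <= b <= 0x5A else b for b in data)


print(ascii_lower(b"Hello, WORLD!"))

# Full case folding is not a fixed-width, per-byte operation in UTF-8:
# the byte length of the folded text can differ from the input.
for ch in ("\u212a", "\ufb01", "\u1e9e"):  # Kelvin sign, fi ligature, capital sharp s
    folded = ch.casefold()
    print(repr(ch), len(ch.encode("utf-8")), "->", repr(folded), len(folded.encode("utf-8")))
```

Because the output length can shrink or grow, an in-place byte-wise transform like the ASCII one does not carry over directly, which is one reason the UTF-8 version needs lookup structures.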
