Regex: unicode property S doesn't match '🥰' emoji and others #8713
Comments
Further investigation uncovered a total of 432 emojis, out of 1902, not matched by the unicode property:
"full_emoji_list.txt"
|> File.read!()
|> String.split("\n", trim: true)
|> Enum.filter(&(:re.run(&1, ~c'\\p{S}', [:unicode]) === :nomatch))
|> length()
432
At least some of them match the unicode property
I think this is because the emojis in question come from a Unicode standard released after the last update of the underlying pcre library. There is nothing we can do about this except migrate to a different re backend, which we have planned to do for a long time but have not gotten around to yet because of backwards compatibility issues with other implementations.
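To see which pcre build a given installation ships, the version can be printed from a shell. This is a minimal sketch, assuming only the documented `:re.version/0` and `System.otp_release/0`; it reports the library version, not the Unicode version its tables were generated from, which has to be looked up separately.

```elixir
# :re.version/0 returns the version string of the bundled PCRE library
# as a binary.
pcre_version = :re.version()

# System.otp_release/0 returns the major OTP release as a string.
otp_release = System.otp_release()

IO.puts("OTP release: #{otp_release}")
IO.puts("Bundled PCRE: #{pcre_version}")
```

Emoji added to Unicode after that pcre release would be expected to fall outside `\p{S}`, which is consistent with the explanation above.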
I suspected pcre, so that makes sense. In the process I noticed an up-to-date emoji-data.txt, used for unicode-related code generation, and wondered: is the information on grapheme (cluster) categories, if any, exposed somewhere in the stdlib as an alternative to regular expressions? Essentially, I am looking for a way to answer the question: according to the Unicode standard shipped with a specific version of OTP, is X a symbol (or a member of any other Unicode category or subcategory)?
You can do it through an undocumented function:
Maybe @dgud knows of an official API that can be used?
No API is exposed today, unicode_util was made to support the module Some functionality could be exposed in the |
Closing this, as there is nothing we can do about it except change the re implementation, and that is tracked elsewhere.
Describe the bug
The regular expression "\\p{S}", matching codepoints categorized as symbols, doesn't match the string "🥰", unlike for other emoji.
To Reproduce
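A minimal reproduction sketch, assuming an Elixir shell on the affected OTP release. The module `SymbolCheck` and its helper `symbol?/1` are hypothetical names introduced for illustration; the `:re.run/3` call is the same one used elsewhere in this thread. Results for individual emoji depend on the pcre version bundled with the release, so no output is shown.

```elixir
defmodule SymbolCheck do
  # Hypothetical helper: returns true when the string contains at least one
  # codepoint that the bundled PCRE places in general category S (symbol).
  def symbol?(string) do
    :re.run(string, ~c'\\p{S}', [:unicode]) != :nomatch
  end
end

# Compare an emoji reported as matching with one reported as not matching.
for emoji <- ["😎", "🥰"] do
  IO.inspect({emoji, SymbolCheck.symbol?(emoji)})
end
```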
Expected behavior
What happens with other emoji:
Character properties for 😎, as well as for 🥰, both list their general category as 'Other symbol'.
Affected versions
OTP 27.0.1 on GNU/Linux