Regex: unicode property S doesn't match '🥰' emoji and others #8713

Closed
g-andrade opened this issue Aug 12, 2024 · 6 comments
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@g-andrade
Contributor

Describe the bug

The regular expression "\\p{S}", which matches code points categorized as symbols, doesn't match the string "🥰", although it does match other emoji.

To Reproduce

% re:run("🥰", "\\p{S}", [unicode]).
nomatch

Expected behavior

What happens with other emoji:

% re:run("😎", "\\p{S}", [unicode]).
{match,[{0,4}]}

The character properties of both 😎 and 🥰 list their general category as 'Other symbol'.

Affected versions

OTP 27.0.1 on GNU/Linux

@g-andrade g-andrade added the bug Issue is reported as a bug label Aug 12, 2024
@g-andrade g-andrade changed the title Regex: unicode property S doesn't match '🥰' emoji Regex: unicode property S doesn't match '🥰' emoji and others Aug 12, 2024
@g-andrade
Contributor Author

g-andrade commented Aug 12, 2024

Further investigation uncovered a total of 432 emojis, out of 1902, not matched by unicode property S:

# "full_emoji_list.txt" |> File.read!() |> String.split("\n", trim: true) |> Enum.filter(&(:re.run(&1, ~c'\\p{S}',  [:unicode]) === :nomatch)) |> length()
432

[Image: screenshot from 2024-08-12 19-58-50]

At least some of them match the Unicode property C instead (the "Other" category, which covers control characters as well as unassigned code points).
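To see which general category the regex engine assigns to a given emoji, the properties can be probed one at a time. The following is only a rough sketch: the module name `prop_probe` and the function `matching_props/1` are made up for illustration, and the list of properties is an arbitrary selection of PCRE property names.

```erlang
-module(prop_probe).
-export([matching_props/1]).

%% Return the Unicode property names (from a fixed, illustrative list)
%% whose \p{...} class matches somewhere in Str, according to re.
%% Str is expected to be a list of code points, e.g. "🥰" in the shell.
matching_props(Str) ->
    Props = ["L", "M", "N", "P", "S", "Z", "Cc", "Cf", "Cn", "Co", "Cs"],
    [P || P <- Props,
          re:run(Str, "\\p{" ++ P ++ "}", [unicode, {capture, none}]) =:= match].
```

Calling `prop_probe:matching_props("🥰")` would then show which of these properties, if any, the bundled regex backend associates with the emoji.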

@garazdawi
Contributor

I think that this is because the emojis in question were added in a Unicode standard released after the last update of the underlying PCRE library. There is nothing we can do about this except migrate to a different re backend, which is something that we have planned to do for a long time, but have not gotten around to yet due to backwards compatibility issues with other implementations.

@g-andrade
Contributor Author

g-andrade commented Aug 12, 2024

I suspected PCRE, so that makes sense. In the process I noticed an up-to-date emoji-data.txt, used for Unicode-related code generation, and wondered: is the info on grapheme (cluster) categories, if any, exposed somewhere in the stdlib, as an alternative to regular expressions?

Essentially, a way to answer the question: according to the unicode standard shipped with a specific version of OTP, is X a symbol? (Or any other unicode category/subcategory)

@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Aug 13, 2024
@garazdawi
Contributor

You can do it through an undocumented function:

1> unicode_util:lookup($🥰).
#{category => {symbol,other},
  canon => [],ccc => 0,compat => []}

maybe @dgud knows of an official API that can be used?
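Building on the lookup above, a symbol check could be sketched as follows. This relies on `unicode_util:lookup/1`, which is undocumented and may change between OTP releases; the module name `sym_check` and the function `is_symbol/1` are hypothetical, and the shape of the `category` value is assumed from the shell output shown above.

```erlang
-module(sym_check).
-export([is_symbol/1]).

%% True when the code point's general category, as reported by the
%% undocumented unicode_util:lookup/1, is one of the symbol classes
%% (e.g. {symbol,other} as in the output above).
is_symbol(CP) when is_integer(CP) ->
    case unicode_util:lookup(CP) of
        #{category := {symbol, _}} -> true;
        _ -> false
    end.
```

Because the answer comes from the Unicode tables shipped with OTP rather than from the regex backend, `sym_check:is_symbol($🥰)` should reflect the `{symbol,other}` category shown in the lookup result.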

@dgud
Contributor

dgud commented Aug 13, 2024

No API is exposed today; unicode_util was made to support the string module's "new" API.

Maybe some functionality could be exposed in the unicode module?
It is not something we have planned to work on, but a PR would be interesting.

@garazdawi
Contributor

Closing this, as there is nothing we can do about it except change the re implementation, and that is tracked elsewhere.
