Fix handling of non-matching surrogates in collation data. #147
Comments
@markusicu How important is it to test unpaired surrogate collation behavior?
From https://www.unicode.org/Public/UCA/latest/CollationTest.html: “These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance.”
A key problem here is that unpaired surrogates cannot be represented in UTF-8 (they can be in WTF-8). I'm not very interested in testing this corner of the collation conformance data, and I think we should limit our testing to strings that are valid in UTF-8.
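As a minimal sketch of the encoding problem described above, the following Python snippet tries to encode a lone surrogate. It assumes CPython's built-in `surrogatepass` error handler, which emits the generalized three-byte form; for an unpaired surrogate those bytes should match what WTF-8 produces. The variable names are illustrative only and are not taken from the conformance code.

```python
# A lone high surrogate (U+D800) with no following low surrogate.
lone = "\ud800"

# Strict UTF-8 rejects surrogate code points.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" handler writes the surrogate as a three-byte sequence,
# which is not valid UTF-8 but can round-trip with the same handler.
generalized = lone.encode("utf-8", errors="surrogatepass")
round_trip = generalized.decode("utf-8", errors="surrogatepass")

print(strict_ok)            # False
print(generalized)          # b'\xed\xa0\x80'
print(round_trip == lone)   # True
```

Any executor that exchanges test data as strict UTF-8 (for example, as JSON) would hit the first failure, which is why such strings cannot simply be passed through.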
That's fine. Did you see my reply from June 5?
Yes, I did, and it seems like this is the current behavior. But the conformance data presumably contains unpaired surrogates because, in environments that support them, they need to have a certain behavior, right? So it seems like unicode-org/conformance should pass them down to executors for implementations that can handle them. I propose keeping this issue open but lowering its priority.
The current test generator doesn't create collation tests when either of the test strings contains an incomplete surrogate. The skipped cases are recorded in the log files, but they are not stored in any test data or shown on any dashboard.
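The skipping behavior described above could look roughly like the following sketch. It is not the actual generator code: the function names are hypothetical, and it assumes each data line holds space-separated hexadecimal code points, optionally followed by a `;` or `#` comment, which may not match the generator's real input handling.

```python
def parse_code_points(line):
    """Parse a line of space-separated hex code points, ignoring comments."""
    data = line.split("#", 1)[0].split(";", 1)[0].strip()
    return [int(token, 16) for token in data.split()]


def has_unpaired_surrogate(code_points):
    """Return True if any surrogate is not part of a high+low pair."""
    i = 0
    n = len(code_points)
    while i < n:
        cp = code_points[i]
        if 0xD800 <= cp <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= code_points[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= cp <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


def partition_lines(lines):
    """Split data lines into (kept, skipped); skipped lines have unpaired surrogates."""
    kept, skipped = [], []
    for line in lines:
        code_points = parse_code_points(line)
        if not code_points:
            continue  # blank or comment-only line
        (skipped if has_unpaired_surrogate(code_points) else kept).append(line)
    return kept, skipped


sample = ["0061 0062", "D800 0061", "D83D DE00", "DC00", "# comment only"]
kept, skipped = partition_lines(sample)
print(kept)     # ['0061 0062', 'D83D DE00']
print(skipped)  # ['D800 0061', 'DC00']
```

Returning the skipped lines separately, rather than only logging them, would make it possible to report them on a dashboard or pass them to executors that can represent ill-formed strings.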