Fix handling of non-matching surrogates in collation data. #147

Open
sven-oly opened this issue Dec 18, 2023 · 5 comments

Comments

@sven-oly
Collaborator

The current test generator doesn't create collation tests when either of the test strings contains an unpaired surrogate. The skipped cases are recorded in the log files, but they are not stored in any test data or reported in any dashboard.
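
As an illustration of the skipping behavior described above, here is a minimal Python sketch. It is not the generator's actual code: the names `has_lone_surrogate` and `split_cases` are hypothetical, and it assumes test strings are already Python `str` values in which any code point in U+D800..U+DFFF counts as an unpaired surrogate.

```python
def has_lone_surrogate(text: str) -> bool:
    """Return True if text contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def split_cases(pairs):
    """Partition (s1, s2) string pairs into testable and skipped cases.

    Hypothetical helper: keeping the skipped list (rather than only logging it)
    would make the omitted cases available to a report or dashboard.
    """
    kept, skipped = [], []
    for s1, s2 in pairs:
        if has_lone_surrogate(s1) or has_lone_surrogate(s2):
            skipped.append((s1, s2))
        else:
            kept.append((s1, s2))
    return kept, skipped
```

Returning the skipped pairs alongside the kept ones is one possible way to make the omission visible instead of leaving it only in the logs.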

@sffc
Member

sffc commented Jun 3, 2024

@markusicu How important is it to test unpaired surrogate collation behavior?

@markusicu
Member

https://www.unicode.org/Public/UCA/latest/CollationTest.html

“These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance.”
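
The filtering that the quoted passage allows could look roughly like the following Python sketch. It assumes each data line begins with space-separated hexadecimal code points, optionally followed by `;` and a `#` comment; the function names are hypothetical and the exact file layout should be checked against the CollationTest documentation.

```python
def parse_code_points(line: str):
    """Parse the leading hex code points of a data line.

    Returns None for blank lines and comment-only lines.
    Assumed layout: "XXXX XXXX ...; # comment".
    """
    data = line.split("#", 1)[0].split(";", 1)[0].strip()
    if not data:
        return None
    return [int(token, 16) for token in data.split()]


def is_well_formed(code_points) -> bool:
    """True if no code point is a surrogate (U+D800..U+DFFF)."""
    return not any(0xD800 <= cp <= 0xDFFF for cp in code_points)


def filter_lines(lines):
    """Yield test strings only for data lines without surrogate code points."""
    for line in lines:
        code_points = parse_code_points(line)
        if code_points is not None and is_well_formed(code_points):
            yield "".join(chr(cp) for cp in code_points)
```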

@sffc
Member

sffc commented Aug 26, 2024

A key problem here is that unpaired surrogates cannot be represented in UTF-8 (they can be in WTF-8). I'm not especially interested in testing this corner of the collation conformance data, and I think we should limit our testing to strings that are valid in UTF-8.
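
The representation problem can be seen directly in Python, as a small sketch. The `surrogatepass` error handler is used here only to approximate WTF-8 for a single lone surrogate; it is not a full WTF-8 implementation, and `encodes_as_utf8` is a hypothetical helper name.

```python
lone = "\ud800"  # a single unpaired high surrogate


def encodes_as_utf8(text: str) -> bool:
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Strict UTF-8 rejects the lone surrogate, while "surrogatepass" emits the
# three-byte generalized form that WTF-8 also uses for this code point.
wtf8_like = lone.encode("utf-8", errors="surrogatepass")
round_trip = wtf8_like.decode("utf-8", errors="surrogatepass")
```

Any test case that has to travel through a strict UTF-8 channel (for example, a JSON file read by a UTF-8-only executor) would hit the first failure mode.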

@sffc added this to the Backlog ⟨P3⟩ milestone Aug 26, 2024

@markusicu
Member

That's fine. Did you see my reply from jun05?

@sffc
Member

sffc commented Aug 27, 2024

> That's fine. Did you see my reply from jun05?

Yes I did, and it seems like this is the current behavior.

But, the conformance data contains unpaired surrogates presumably because in environments that support them, they need to have a certain behavior, right? So it seems like unicode-org/conformance should pass them down to executors that represent implementations that handle them.

So, I propose keeping this issue open, but demoting the priority.
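
One way the proposal above might be sketched in Python is a per-executor capability flag. Everything here is hypothetical: the `Executor` class, the `supports_lone_surrogates` flag, and the executor names are illustrative and do not reflect the current unicode-org/conformance code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Executor:
    """Hypothetical description of an executor and what it can represent."""
    name: str
    supports_lone_surrogates: bool


def has_surrogate(text: str) -> bool:
    """True if text contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def cases_for(executor: Executor, cases):
    """Select the (s1, s2) cases that a given executor should receive."""
    return [
        (s1, s2)
        for s1, s2 in cases
        if executor.supports_lone_surrogates
        or not (has_surrogate(s1) or has_surrogate(s2))
    ]
```

With this shape, an executor for a UTF-16-based implementation would receive every case, while a UTF-8-only executor would receive just the well-formed ones.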

3 participants