Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 575, regression tests #582

Merged
merged 11 commits into from
Jun 13, 2023
Merged

Fix 575, regression tests #582

merged 11 commits into from
Jun 13, 2023

Conversation

kozross
Copy link
Contributor

@kozross kozross commented Apr 18, 2023

Sorry for the long wait on this one folks. This fixes #575, as well as incorporating @Bodigrim's regression tests (which all pass, of course).

The reason this happened was a subtle mistake on my part. In brief, the way the SIMDed (both SSE and AVX) validator used to work is roughly as so:

  1. Break the input up into 128 byte (64 byte on SSE) strides.
  2. For each stride:
    a. Check speculatively if the entire block is ASCII.
    b. If the entire block is ASCII, plough on.
    c. Otherwise, check if the block is UTF-8.
    d. Accumulate error
  3. Finish up the tail one byte at a time.

A key factor to making this work is the ability to 'preserve validation state' across strides: this means that, were we to have a stride where the last byte is the start of a multi-byte sequence, the validator is aware of this and can respond.

The issue I had to fix was that step 2a did two very naughty things:

alpha. Ignore the 'preserved state' from the previous stride entirely.
beta. Overwrite the 'preserved state' with a 'clean' one, indicating no carryover.

Thus, the issues observed previously. The checks for both SSE and AVX versions of speculative ASCII validation have been modified to take this into account. Near as I can tell, this has no impact on performance: the code was choked on memory movement (and the corresponding true dependencies), and register pressure was not increased by this change. I've added comments explaining exactly how both versions work in case someone else needs to maintain this code.

I also added an extra test to exclude the possibility that ASCII verification itself was broken (it wasn't).

@kozross
Copy link
Contributor Author

kozross commented Apr 18, 2023

@Bodigrim - what version of QuickCheck gets pulled into our unusual architecture jobs? I'm also seeing some tests regarding lazy ByteStrings which failed even before I made any changes, but not consistently.

@Bodigrim
Copy link
Contributor

@kozross it's QuickCheck-2.12. Feel free to bump uraimo/run-on-arch-action to v2.5 or whichever is the latest, and distro to ubuntu22.04, but I think it still is not enough for chooseEnum; use old plain choose instead.

@kozross
Copy link
Contributor Author

kozross commented Apr 19, 2023

I'm getting a bunch of failures for lazy bytestring tests on both Windows and 64-bit ARM. The NetBSD tests are just not starting at all. Not sure how to fix either issue.

@clyring
Copy link
Member

clyring commented Apr 19, 2023

I'm not sure what you mean by "failures for lazy bytestring tests on both Windows and 64-bit ARM." Are you seeing these locally? These are the only test failures I see in the CI logs:

windows
2023-04-19T01:24:48.8529648Z     Invalid UTF-8 ByteString:                              FAIL (0.45s)
2023-04-19T01:24:48.8545529Z       *** Failed! Falsified (after 20 tests and 188 shrinks):
2023-04-19T01:24:48.8570168Z       InvalidUtf8 {prefix = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL", invalid = "\244\176\142\169", suffix = "", asBS = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\244\176\142\169", length = 34}
2023-04-19T01:24:48.8583723Z       Use --quickcheck-replay=316071 to reproduce.
2023-04-19T01:24:48.8608578Z       Use -p '/Invalid UTF-8 ByteString/' to rerun this test only.
[...]
2023-04-19T01:24:48.8652669Z     Invalid UTF-8 ShortByteString:                         FAIL
2023-04-19T01:24:48.8653501Z       *** Failed! Falsified (after 47 tests and 35 shrinks):
2023-04-19T01:24:48.8654413Z       InvalidUtf8 {prefix = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL", invalid = "\240\131\183\168", suffix = "", asBS = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\240\131\183\168", length = 34}
2023-04-19T01:24:48.8655491Z       Use --quickcheck-replay=723812 to reproduce.
2023-04-19T01:24:48.8656096Z       Use -p '/Invalid UTF-8 ShortByteString/' to rerun this test only.
arm64
2023-04-19T01:22:59.1431991Z       Invalid byte between spaces:                         FAIL
2023-04-19T01:22:59.1432363Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1432794Z         Use --quickcheck-replay=96163 to reproduce.
2023-04-19T01:22:59.1433296Z         Use -p '/Invalid byte between spaces/' to rerun this test only.
2023-04-19T01:22:59.1433711Z       Two invalid bytes between spaces:                    FAIL
2023-04-19T01:22:59.1434074Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1434498Z         Use --quickcheck-replay=334360 to reproduce.
2023-04-19T01:22:59.1434998Z         Use -p '/Two invalid bytes between spaces/' to rerun this test only.
2023-04-19T01:22:59.1435716Z       Three invalid bytes between spaces:                  FAIL
2023-04-19T01:22:59.1436076Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1436513Z         Use --quickcheck-replay=72422 to reproduce.
2023-04-19T01:22:59.1437029Z         Use -p '/Three invalid bytes between spaces/' to rerun this test only.

The NetBSD job is indeed currently broken. I haven't seen this s390x failure before; I just tried restarting the job.

@kozross
Copy link
Contributor Author

kozross commented Apr 19, 2023

@clyring - thanks for pointing that out to me: I thought the laziness stuff indicated failures due to exceptions. I'll try and repro those locally, but it's quite odd that these don't show up anywhere else.

Edit: The ARM stuff isn't too surprising actually: I forgot to put the fix in the NEON code, but that shouldn't be difficult.

@Bodigrim
Copy link
Contributor

@kozross if you rebase atop the latest master, NetBSD failure should go away.

@kozross kozross force-pushed the koz/invalid-utf8 branch 2 times, most recently from 8ca5425 to 71cbb89 Compare April 24, 2023 01:22
@kozross
Copy link
Contributor Author

kozross commented Apr 24, 2023

@Bodigrim - the NEON side of things is fixed now, using a similar fix to the SSE/AVX ones. However, the Windows issue is weird, and I'm unable to diagnose it. Here's what I can say for sure:

  • It's definitely a Windows-only problem: it triggers basically every time, but none of its seeds or specific cases fail on anything else I've been able to test.
  • The failure doesn't happen in SIMD code: whichever version of SSE or AVX Windows ends up compiling to, neither will get called on the failing cases, as they're too small.
  • Every case has the same form: a 32-byte-ish stretch of ASCII, followed by 2-3 bytes of nearly-valid UTF-8.
  • This seems to happen with GHC 9.6.1, and whichever C compiler it's bundled with. While our test matrix doesn't include 9.6 for Linux, I checked locally, and the issue doesn't seem to happen there.

My best guess is that there's a subtle form of UB at play here which gets compiled differently on Windows than on Mac or Linux. However, this both seems unlikely, and is nearly impossible for me to verify: it's clearly not a Clang-versus-GCC issue, as 9.6 on Windows bundles Clang AFAICT, and the Mac build doesn't hit this problem. Thus, it means it's some kind of Clang version divergence in the non-SIMDed validation code, and as it currently stands, I don't even know what Clang version gets bundled with GHC 9.6.1 on Windows, and have no good way to check.

If someone has access to a Windows machine, and some expertise with C-on-Windows-with-Clang, I'd appreciate the assistance, as I have neither.

@Kleidukos
Copy link
Member

According to this comment, this might be Clang 15 haskell/cabal#8841 (comment)

@Bodigrim
Copy link
Contributor

@kozross
On Windows GHC 9.4 (shipped with Clang 13.0.0) works fine, but GHC 9.6 (shipped with Clang 14.0.6) fails in the same manner as the CI job. Running GHC 9.4 with Clang 14.0.6 also fails, so you are likely right that the issue is connected to Clang version.

@Bodigrim Bodigrim mentioned this pull request Apr 24, 2023
@Bodigrim
Copy link
Contributor

@kozross could you please rebase once more? And feel free to set fail-fast: false for Windows jobs in your branch. Let's see which of them pass.

@sergv
Copy link
Contributor

sergv commented Apr 25, 2023

Tests on Windows seem to fail because __builtin_ctzl accepts longs, which are 4 byte wide there. The argument to __builtin_ctzl is an 8-byte uint64_t which leads to truncation (or is it UB??). In any case it seems that __builtin_ctzll than accepts long long should be used instead to always process all 8 bytes. Cannot comment inline so will point thus.

@kozross
Copy link
Contributor Author

kozross commented Apr 25, 2023

@sergv - thanks, that's an excellent catch!

@Bodigrim
Copy link
Contributor

CI looks good so far, great stuff!

@kozross could you possibly rebase one last time atop of #578 please? There've been a bug in shrinking which I just fixed + added one more test case.

@Bodigrim Bodigrim added this to the 0.11.5.0 milestone Apr 25, 2023
@kozross
Copy link
Contributor Author

kozross commented Apr 25, 2023

@Bodigrim - done.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, many thanks!

@Bodigrim Bodigrim requested a review from clyring April 25, 2023 23:46
@clyring
Copy link
Member

clyring commented Apr 26, 2023

Gross. Why isn't -Wconversion on by default?

Copy link
Member

@clyring clyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To be clear, my previous comment was an expression of frustration rather than a request for this patch. Thanks @sergv for spotting that; I was not looking forward to diagnosing it.)

cbits/aarch64/is-valid-utf8.c Show resolved Hide resolved
cbits/is-valid-utf8.c Show resolved Hide resolved
cbits/is-valid-utf8.c Show resolved Hide resolved
cbits/is-valid-utf8.c Show resolved Hide resolved
@Bodigrim
Copy link
Contributor

@kozross @clyring what remains to be done here? Only enabling -fno-strict-aliasing? I'm keen to get this merged and released.

@kozross
Copy link
Contributor Author

kozross commented May 20, 2023

@Bodigrim - yep, -fno-strict-aliasing is the only thing left to go as far as I can tell.

@clyring
Copy link
Member

clyring commented May 22, 2023

I'd rather rewrite to avoid strict-aliasing violations to begin with, but it doesn't make sense to hold up this patch over that. Let's add the flag to cc-options and merge.

@kozross
Copy link
Contributor Author

kozross commented May 22, 2023

@clyring This flag is used fairly often, and a rewrite to fix this would be tedious and pointless. There's nothing per-se wrong with disabling this behaviour: the standard is not a goal in and of itself.

@Bodigrim
Copy link
Contributor

@clyring good to go now?

Copy link
Member

@clyring clyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clyring good to go now?

Yes. To quote myself:

Let's add the flag to cc-options and merge.

I'll create separate issues for the low-priority follow-up work.

@clyring clyring merged commit dac5675 into haskell:master Jun 13, 2023
@clyring
Copy link
Member

clyring commented Jun 13, 2023

@kozross @sergv @Bodigrim Thanks again!

clyring pushed a commit that referenced this pull request Jun 13, 2023
* Refactor isValidUtf8 tests to enable shrinking

* Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII

* Repair invalid UTF-8 issue, more tests

* Fix NEON ASCII check

* Refactor isValidUtf8 tests to enable shrinking

* Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII

* Use ctzll to ensure 8 bytes get processed

* Set -fno-strict-aliasing

---------

Co-authored-by: Bodigrim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid UTF-8 byte sequence is accepted as valid UTF-8 by text-2.0.1
5 participants