-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace volatiles with atomic for SMP tests #134
Conversation
c246e20
to
efbe56a
Compare
Probably need to add
Would also be good to at least have -O1 or -O2, currently it's generating ridiculous assembly for debug builds. Edit: Also need to manually set CMAKE_OBJCOPY, otherwise it will try to use the wrong one. |
Will set that up. |
void sched_context_0014_helper_fn(_Atomic int *state) | ||
{ | ||
while (1) { | ||
*state = *state + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want this or *state += 1
? The difference is whether the compiler generates either a load acquire and a separate store release or it generates an atomic increment I think (From some aarch64 compiler tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could really be *state = 1
here. I wouldn't expect the compiler to generate different code though, as += 1 is just a shorthand.
If doing the increment, calling seL4_Yield() first would make the test more robust I think. So I think I'll either change it to = 1 or add the yield.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect the compiler to generate different code though, as += 1 is just a shorthand.
I was looking at what _Atomic actually does to an int and an int* and on my aarch64 compiler (12.2.0), a = a + 1; generated different atomic intrinsic than a += 1. The first being a separate load acquire/store release, and the second was a looping load exclusive/store conditional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was that with the actual build? For debug build there is no optimisation at all enabled so the generated assembly is horrible. But for optimised builds I would expect the same code, though neither gcc nor clang do:
https://godbolt.org/z/qbzxqPYTz
There's some subtle difference between the two that eludes me. For some reason neither generate atomic instructions for *num = *num + 1;
.
Edit: I guess C gives stronger guarantees about ++
and += 1
than for separate reads and writes on atomics. That's nice to know, but very subtle...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your godbolt link is what I was seeing too. += and ++ with _Atomic are treated as a __atomic_fetch_add() whereas a = a + 1 is treated as a sequence of __atomic_load_n
-> add 1 -> __atomic_store_n
. So there's ordering but the fetch and add and write back isn't atomic.
ldar w8, [x0]
add w8, w8, #1
stlr w8, [x0]
ldar and stlr are "atomics" in that they're __atomic_load_n(num, __ATOMIC_ACQUIRE)
and __atomic_store_n(num, val, __ATOMIC_RELEASE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks, I missed that ldar
and stlr
have acquire/release semantics. As long as the store is release it shouldn't really matter for the test though.
It's already added for user level builds here: https://github.com/seL4/seL4_tools/blob/master/cmake-tool/helpers/environment_flags.cmake#L68 So you could change the condition to make it not conditional on a specific compiler version. |
|
Thanks, I'll add it there.
Then you get compile errors for unknown options, so we need some checks. |
efbe56a
to
d17d1c9
Compare
Made the loop exactly the same as for the other code (so following Kent's suggestion basically), let's see if that helps. |
Looks like we need to merge #135 first. @kent-mcleod, can that go in independently from the other ones? |
No, this seems to be testing on top of an old commit of the kernel that doesn't have the change in it yet:
Why is it targeting #871 and not seL4/master? |
I had a test-with in my comment, but removed it today. |
oh right, thanks for the explaination. I thought there was some extra magic that I didn't know about happening. |
Add the right TLS variable settings for the gnu-elf-abi used by our compilers. Signed-off-by: Kent McLeod <[email protected]>
|
After running the test multiple times, all failures are a combination of *_SMP_hyp_MCS_clang_64. That's too much of a coincidence to not be a real bug somewhere. Same as @lsf37's results before actually, but I missed that they were all HYP configs. The first run |
Signed-off-by: Indan Zupancic <[email protected]>
cdc596b
to
916ecab
Compare
How about we merge this one, since it does improve the other tests, and I enable those other tests in CI while we are looking more closely at I have not seen |
I did a lot of testing with tx2. Conclusion is that it's all caused by a task having MIN_BUDGET budget not making forward progress. The test succeeds if the helper task manages to increment the counter within its budget. If it doesn't, it will hang forever and so will the whole test. The reason it fails on some configs and not on others seems to be purely timing dependent: Some configs have more overhead and cache pressure than others. seL4_DebugDumpScheduler() confirms my theory, the helper thread is stuck in the release queue forever at the same instruction. However, the reason why it doesn't make progress is surprising: It's because the (Edit: tx1 -> tx2) |
Interesting. So that means that MIN_BUDGET is set too small for that config, right? |
start_helper would do unnecessary complicated thread startup code which may contain FPU instructions with some compilers and configs. Normally this would be harmless (though it would thwart the test's intention), but if the task runs out of budget before it finishes the FPU instruction, it will be put on the release queue and make no forward progress (unclear why). This is only a problem for tasks with budget equal to MIN_BUDGET, any time lost because of syscalls or traps will cause the task to hang. Signed-off-by: Indan Zupancic <[email protected]>
My theory above is wrong, lazy FPU loading probably isn't to blame, as it doesn't do a budget check or schedule. And if I add printfs to the FPU fault handling, it doesn't show up at all when the task hangs. However, my latest push without the All in all it still makes no sense and might be some subtle kernel bug after all, I don't know anymore.
Maybe, I'm not sure. The helper task doesn't do any syscalls at all. Biggest interrupt latency I measured was 2 us, not counting IPIs, But no IPIs are sent after the helper is started, except if the test finishes and everything works. I'm quite mystified at this point. I thought I knew what was going on yesterday, except I can't verify my theory and code review proofs me wrong.
With the latest commit it should work fine on all boards I think.
I've seen one or the other SERSERV tests fail here and there recently, both before and after my SMP commits. |
Sneaked in when style checking was temporarily broken. Signed-off-by: Indan Zupancic <[email protected]>
Almost: the Odroid C4 still has SCHED_CONTEXT_0014 failures, there rest is fine: |
Indeed, FAILED tests:
At least it's also non-hyp and gcc in there. Probably should rerun the tests to see if anything else fails sometimes. I'll focus my attention on the Odroic C4 now. |
Oops, merged it with smp-tests instead of master, not sure how to fix that. |
I'll fix it on the smp-test branch (dropping that commit). We can then merge that on master. One sec. |
Now in #136 |
Let's see if this fixes the recent new failures and the old ones for the disabled CI tests.
Would be nice to test this with seL4/ci-actions#379, but no idea how to do that.