Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace volatiles with atomic for SMP tests #134

Merged
merged 4 commits into from
Mar 5, 2025

Conversation

Indanz
Copy link
Contributor

@Indanz Indanz commented Feb 26, 2025

Let's see if this fixes the recent new failures and the old ones for the disabled CI tests.

Would be nice to test this with seL4/ci-actions#379, but no idea how to do that.

@Indanz Indanz added the hw-test set to run sel4test hardware test for this PR label Feb 26, 2025
@Indanz Indanz added hw-build set to run all sel4test hardware builds on this PR and removed hw-test set to run sel4test hardware test for this PR labels Feb 26, 2025
@Indanz
Copy link
Contributor Author

Indanz commented Feb 26, 2025

Probably need to add -mno-outline-atomics to the build flags for sel4test, but I don't know how to do that with cmake. With clang 21 I'm getting:

/usr/bin/aarch64-linux-gnu-ld: /usr/lib/gcc-cross/aarch64-linux-gnu/10/libgcc.a(lse-init.o): in function init_have_lse_atomics': (.text.startup+0xc): undefined reference to __getauxval'

Would also be good to at least have -O1 or -O2, currently it's generating ridiculous assembly for debug builds.

Edit: Also need to manually set CMAKE_OBJCOPY, otherwise it will try to use the wrong one.

@Indanz Indanz added the hw-test set to run sel4test hardware test for this PR label Feb 26, 2025
@lsf37
Copy link
Member

lsf37 commented Feb 26, 2025

Would be nice to test this with seL4/ci-actions#379, but no idea how to do that.

Will set that up.

@lsf37 lsf37 changed the base branch from master to smp-tests February 26, 2025 22:22
@lsf37 lsf37 added hw-test set to run sel4test hardware test for this PR and removed hw-build set to run all sel4test hardware builds on this PR hw-test set to run sel4test hardware test for this PR labels Feb 26, 2025
void sched_context_0014_helper_fn(_Atomic int *state)
{
while (1) {
*state = *state + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want this or *state += 1? The difference is whether the compiler generates either a load acquire and a separate store release or it generates an atomic increment I think (From some aarch64 compiler tests).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could really be *state = 1 here. I wouldn't expect the compiler to generate different code though, as += 1 is just a shorthand.

If doing the increment, calling seL4_Yield() first would make the test more robust I think. So I think I'll either change it to = 1 or add the yield.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect the compiler to generate different code though, as += 1 is just a shorthand.

I was looking at what _Atomic actually does to an int and an int* and on my aarch64 compiler (12.2.0), a = a + 1; generated different atomic intrinsic than a += 1. The first being a separate load acquire/store release, and the second was a looping load exclusive/store conditional

Copy link
Contributor Author

@Indanz Indanz Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was that with the actual build? For debug build there is no optimisation at all enabled so the generated assembly is horrible. But for optimised builds I would expect the same code, though neither gcc nor clang do:

https://godbolt.org/z/qbzxqPYTz

There's some subtle difference between the two that eludes me. For some reason neither generate atomic instructions for *num = *num + 1;.

Edit: I guess C gives stronger guarantees about ++ and += 1 than for separate reads and writes on atomics. That's nice to know, but very subtle...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your godbolt link is what I was seeing too. += and ++ with _Atomic are treated as a __atomic_fetch_add() whereas a = a + 1 is treated as a sequence of __atomic_load_n -> add 1 -> __atomic_store_n. So there's ordering but the fetch and add and write back isn't atomic.

        ldar    w8, [x0]
        add     w8, w8, #1
        stlr    w8, [x0]

ldar and stlr are "atomics" in that they're __atomic_load_n(num, __ATOMIC_ACQUIRE) and __atomic_store_n(num, val, __ATOMIC_RELEASE)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, I missed that ldar and stlr have acquire/release semantics. As long as the store is release it shouldn't really matter for the test though.

@kent-mcleod
Copy link
Member

Probably need to add -mno-outline-atomics to the build flags for sel4test, but I don't know how to do that with cmake. With clang 21 I'm getting:

It's already added for user level builds here: https://github.com/seL4/seL4_tools/blob/master/cmake-tool/helpers/environment_flags.cmake#L68

So you could change the condition to make it not conditional on a specific compiler version.

@lsf37
Copy link
Member

lsf37 commented Feb 27, 2025

SCHED_CONTEXT_0014 is still there, but the other failures seem to be fixed.

@Indanz
Copy link
Contributor Author

Indanz commented Feb 27, 2025

Probably need to add -mno-outline-atomics to the build flags for sel4test, but I don't know how to do that with cmake. With clang 21 I'm getting:

It's already added for user level builds here: https://github.com/seL4/seL4_tools/blob/master/cmake-tool/helpers/environment_flags.cmake#L68

Thanks, I'll add it there.

So you could change the condition to make it not conditional on a specific compiler version.

Then you get compile errors for unknown options, so we need some checks.

@Indanz
Copy link
Contributor Author

Indanz commented Feb 27, 2025

Made the loop exactly the same as for the other code (so following Kent's suggestion basically), let's see if that helps.

@Indanz Indanz added hw-build set to run all sel4test hardware builds on this PR and removed hw-test set to run sel4test hardware test for this PR labels Feb 27, 2025
@lsf37
Copy link
Member

lsf37 commented Feb 27, 2025

Looks like we need to merge #135 first. @kent-mcleod, can that go in independently from the other ones?

@kent-mcleod
Copy link
Member

Looks like we need to merge #135 first. @kent-mcleod, can that go in independently from the other ones?

No, this seems to be testing on top of an old commit of the kernel that doesn't have the change in it yet:

   sel4test-manifest: b6a33334 Updating default.xml                           (origin/master)
                seL4: 81679f2b SMP: Fix IPI Race                              (pull/871/head)
            musllibc: 9798aedb Makefile: update config.mak generation         (seL4/sel4)
           seL4_libs: 54539a72 Bump minimum CMake version                     (seL4/master)
  sel4_projects_libs: bca46738 libsel4vm: Proper error for map_vm_memory_res  (seL4/master)
         sel4runtime: fb26509f Arm: Support KernelArmTLSReg                   (seL4/master)

Why is it targeting #871 and not seL4/master?

@Indanz
Copy link
Contributor Author

Indanz commented Feb 27, 2025

Why is it targeting #871 and not seL4/master?

I had a test-with in my comment, but removed it today.

@Indanz Indanz added hw-test set to run sel4test hardware test for this PR and removed hw-build set to run all sel4test hardware builds on this PR labels Feb 27, 2025
@kent-mcleod
Copy link
Member

I had a test-with in my comment, but removed it today.

oh right, thanks for the explaination. I thought there was some extra magic that I didn't know about happening.

Add the right TLS variable settings for the gnu-elf-abi used by our
compilers.

Signed-off-by: Kent McLeod <[email protected]>
@Indanz
Copy link
Contributor Author

Indanz commented Feb 28, 2025

SCHED_CONTEXT_0014 might be a real problem, it's still there. Will need to look at generated assembly to be sure.

@Indanz
Copy link
Contributor Author

Indanz commented Mar 1, 2025

After running the test multiple times, all failures are a combination of *_SMP_hyp_MCS_clang_64. That's too much of a coincidence to not be a real bug somewhere. Same as @lsf37's results before actually, but I missed that they were all HYP configs.

The first run SERSERV_CLI_PROC_005 failed for TX2_verification_SMP_hyp_MCS_gcc_64, that's something else to keep an eye on, but probably independent.

@Indanz Indanz force-pushed the volatile2atomic branch 2 times, most recently from cdc596b to 916ecab Compare March 1, 2025 16:59
@lsf37
Copy link
Member

lsf37 commented Mar 1, 2025

How about we merge this one, since it does improve the other tests, and I enable those other tests in CI while we are looking more closely at SCHED_CONTEXT_0014.

I have not seen SERSERV_CLI_PROC_005 fail before, that's definitely something to keep an eye on. As you says, it does not look like it is because of something in here, though.

@Indanz
Copy link
Contributor Author

Indanz commented Mar 2, 2025

I did a lot of testing with tx2. Conclusion is that it's all caused by a task having MIN_BUDGET budget not making forward progress.

The test succeeds if the helper task manages to increment the counter within its budget. If it doesn't, it will hang forever and so will the whole test.

The reason it fails on some configs and not on others seems to be purely timing dependent: Some configs have more overhead and cache pressure than others.

seL4_DebugDumpScheduler() confirms my theory, the helper thread is stuck in the release queue forever at the same instruction. However, the reason why it doesn't make progress is surprising:

It's because the helper_thread function uses a floating point instruction (movi) which triggers the lazy FPU loading trap. After the trap the task doesn't have enough budget left, so it gets put into the release queue and the FPU gets disabled again. This will repeat forever.

(Edit: tx1 -> tx2)

@lsf37
Copy link
Member

lsf37 commented Mar 2, 2025

Interesting. So that means that MIN_BUDGET is set too small for that config, right?

start_helper would do unnecessary complicated thread startup code
which may contain FPU instructions with some compilers and configs.

Normally this would be harmless (though it would thwart the test's
intention), but if the task runs out of budget before it finishes
the FPU instruction, it will be put on the release queue and make
no forward progress (unclear why).

This is only a problem for tasks with budget equal to MIN_BUDGET,
any time lost because of syscalls or traps will cause the task to
hang.

Signed-off-by: Indan Zupancic <[email protected]>
@Indanz
Copy link
Contributor Author

Indanz commented Mar 2, 2025

My theory above is wrong, lazy FPU loading probably isn't to blame, as it doesn't do a budget check or schedule. And if I add printfs to the FPU fault handling, it doesn't show up at all when the task hangs. However, my latest push without the movi works fine, but if I add asm volatile("movi v0.2d, #0x0"); just to before the while loop in the helper, it hangs without any prints.

All in all it still makes no sense and might be some subtle kernel bug after all, I don't know anymore.

Interesting. So that means that MIN_BUDGET is set too small for that config, right?

Maybe, I'm not sure. The helper task doesn't do any syscalls at all. Biggest interrupt latency I measured was 2 us, not counting IPIs, But no IPIs are sent after the helper is started, except if the test finishes and everything works.

I'm quite mystified at this point. I thought I knew what was going on yesterday, except I can't verify my theory and code review proofs me wrong.

How about we merge this one, since it does improve the other tests, and I enable those other tests in CI while we are looking more closely at SCHED_CONTEXT_0014.

With the latest commit it should work fine on all boards I think.

I have not seen SERSERV_CLI_PROC_005 fail before, that's definitely something to keep an eye on. As you says, it does not look like it is because of something in here, though.

I've seen one or the other SERSERV tests fail here and there recently, both before and after my SMP commits.

Sneaked in when style checking was temporarily broken.

Signed-off-by: Indan Zupancic <[email protected]>
@Indanz Indanz added hw-test set to run sel4test hardware test for this PR and removed hw-test set to run sel4test hardware test for this PR labels Mar 2, 2025
@lsf37 lsf37 added hw-test set to run sel4test hardware test for this PR and removed hw-test set to run sel4test hardware test for this PR labels Mar 2, 2025
@lsf37
Copy link
Member

lsf37 commented Mar 3, 2025

How about we merge this one, since it does improve the other tests, and I enable those other tests in CI while we are looking more closely at SCHED_CONTEXT_0014.

With the latest commit it should work fine on all boards I think.

Almost: the Odroid C4 still has SCHED_CONTEXT_0014 failures, there rest is fine:
https://github.com/seL4/sel4test/actions/runs/13620049487/job/38075368124?pr=134

@Indanz
Copy link
Contributor Author

Indanz commented Mar 3, 2025

Almost: the Odroid C4 still has SCHED_CONTEXT_0014 failures, there rest is fine: https://github.com/seL4/sel4test/actions/runs/13620049487/job/38075368124?pr=134

Indeed, FAILED tests:

  • ODROID_C4_debug_SMP_MCS_gcc_64
  • ODROID_C4_debug_SMP_hyp_MCS_gcc_64
  • ODROID_C4_verification_SMP_MCS_gcc_64
  • ODROID_C4_verification_SMP_hyp_MCS_gcc_64

At least it's also non-hyp and gcc in there. Probably should rerun the tests to see if anything else fails sometimes. I'll focus my attention on the Odroic C4 now.

@Indanz Indanz merged commit 184d5e4 into seL4:smp-tests Mar 5, 2025
49 of 81 checks passed
@Indanz
Copy link
Contributor Author

Indanz commented Mar 5, 2025

Oops, merged it with smp-tests instead of master, not sure how to fix that.

@lsf37
Copy link
Member

lsf37 commented Mar 6, 2025

Oops, merged it with smp-tests instead of master, not sure how to fix that.

I'll fix it on the smp-test branch (dropping that commit). We can then merge that on master. One sec.

@lsf37 lsf37 mentioned this pull request Mar 6, 2025
@lsf37
Copy link
Member

lsf37 commented Mar 6, 2025

Now in #136

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hw-test set to run sel4test hardware test for this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants