-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: cranelift: omit function prologue and epilogue on x64 if possible #8516
base: main
Are you sure you want to change the base?
WIP: cranelift: omit function prologue and epilogue on x64 if possible #8516
Conversation
Neat! I was thinking about working on this soon, because the work that Trevor and I have been doing on cleaning up tail calls has fixed some of the issues that would have made this even harder. But it's even cooler to see somebody else tackling this and I would love to see you continue. One thing I notice is that After deleting That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions. So this isn't specifically a SIMD issue, and it doesn't apply to most SIMD instructions either. But I'm also not sure where the segfault you saw came from. Off hand, I don't know why omitting a frame pointer would cause a function call to segfault. So more investigation is in order, I think! |
Yes, that's what I found too.
My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault. Is this the code for the libcall?
At the top of that file, it says:
Maybe part of the problem is that x86pshufb requires |
From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue. Take this with a grain of salt, since it has been a while since I last looked at this. Edit: I should also note that not all libcalls will come from |
Interesting! @afonso360, do you know of a way that I could check that hypothesis?
Good to know! Thanks! I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls? |
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less). And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage). You should be able to do this by running the test under It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that. One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there is a Otherwise, I'm not too sure, but people might have some ideas. |
We compute the frame layout after lowering to vcode, so I think we can make the decision about whether this is a leaf function after we've already made all the decisions about whether to use any libcalls. We "just" need to thread that information through correctly somehow. |
That sounds reasonable. I'll give that a shot! |
Oh, I also wanted to note: Good points, both of you, on stack alignment and vector arguments. I hadn't thought about either of those things and I think together they fully explain this symptom. I want to write down this conclusion for future reference although I don't think either of you need the explanation 😁 Pushing the return address leaves the stack 8-aligned, and normally pushing the frame pointer would make it 16-aligned again, but we skip that step here. Then, most x86 code can actually tolerate 8-aligned stacks, but moving between MMX registers and memory generally requires 16-alignment, and indeed it looks like Rust/LLVM are producing So that explains why you'd only see problems with this specific call. We still should ensure that all libcalls get the ABI-required stack alignment though, so the plan to have any function with a libcall not be a leaf function is still a good one. |
I did in fact need that conclusion 😄 Thanks! |
I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean or should we scan over the emitted instructions? Or maybe the I'm a bit overwhelmed at the moment :) |
Yeah, totally reasonable questions! I would start by looking at how In Note that if you're using Those There's another field on |
After a little more investigation: the |
This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.
However, this is not quite right yet, because (if I understand correctly) the
is_leaf
computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:I think this happens because it generated this assembly:
Note that there's a
callq
in there, which means that this is not a leaf function, but according to the architecture-independentis_leaf
function ofFunctionStencil
it looks like a leaf function, because that only considers calls in CLIF.With ssse3 the call disappears and the test passes.
This means that I need some advice on how to proceed. I could try to make the
is_leaf
function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great 😄This replaces #5352 and is a follow up of #2960. I'm not sure why I'm getting to the conclusion that SIMD is the issue, but it wasn't mentioned before. Maybe there are other issues as well. Has the testing setup become more robust in the meantime?