tertsdiepraam opened PR #8516 from tertsdiepraam:preserve-fp-false-x64
to bytecodealliance:main
:
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.
However, this is not quite right yet, because (if I understand correctly) the
is_leaf
computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:function %swizzle_i8x16(i8x16, i8x16) -> i8x16 { block0(v0: i8x16, v1: i8x16): v2 = swizzle v0, v1 return v2 }
I think this happens because it generated this assembly:
pushq %rbp movq %rsp, %rbp paddusb 0x14(%rip), %xmm1 movabsq $0, %r9 callq *%r9 movq %rbp, %rsp popq %rbp retq addb %al, (%rax) jo 0x92 jo 0x94 jo 0x96 jo 0x98 jo 0x9a jo 0x9c jo 0x9e jo 0xa0
Note that there's a
callq
in there, which means that this is not a leaf function, but according to the architecture-independentis_leaf
function ofFunctionStencil
it looks like a leaf function, because that only considers calls in CLIF.This means that I need some advice on how to proceed. I could try to make the
is_leaf
function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in?
tertsdiepraam edited PR #8516:
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.
However, this is not quite right yet, because (if I understand correctly) the
is_leaf
computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:function %swizzle_i8x16(i8x16, i8x16) -> i8x16 { block0(v0: i8x16, v1: i8x16): v2 = swizzle v0, v1 return v2 }
I think this happens because it generated this assembly:
pushq %rbp movq %rsp, %rbp paddusb 0x14(%rip), %xmm1 movabsq $0, %r9 callq *%r9 movq %rbp, %rsp popq %rbp retq addb %al, (%rax) jo 0x92 jo 0x94 jo 0x96 jo 0x98 jo 0x9a jo 0x9c jo 0x9e jo 0xa0
Note that there's a
callq
in there, which means that this is not a leaf function, but according to the architecture-independentis_leaf
function ofFunctionStencil
it looks like a leaf function, because that only considers calls in CLIF.With ssse3 the call disappears and the test passes.
This means that I need some advice on how to proceed. I could try to make the
is_leaf
function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in?
tertsdiepraam edited PR #8516:
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.
However, this is not quite right yet, because (if I understand correctly) the
is_leaf
computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:function %swizzle_i8x16(i8x16, i8x16) -> i8x16 { block0(v0: i8x16, v1: i8x16): v2 = swizzle v0, v1 return v2 }
I think this happens because it generated this assembly:
pushq %rbp movq %rsp, %rbp paddusb 0x14(%rip), %xmm1 movabsq $0, %r9 callq *%r9 movq %rbp, %rsp popq %rbp retq addb %al, (%rax) jo 0x92 jo 0x94 jo 0x96 jo 0x98 jo 0x9a jo 0x9c jo 0x9e jo 0xa0
Note that there's a
callq
in there, which means that this is not a leaf function, but according to the architecture-independentis_leaf
function ofFunctionStencil
it looks like a leaf function, because that only considers calls in CLIF.With ssse3 the call disappears and the test passes.
This means that I need some advice on how to proceed. I could try to make the
is_leaf
function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great :big_smile:
tertsdiepraam edited PR #8516:
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.
However, this is not quite right yet, because (if I understand correctly) the
is_leaf
computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:function %swizzle_i8x16(i8x16, i8x16) -> i8x16 { block0(v0: i8x16, v1: i8x16): v2 = swizzle v0, v1 return v2 }
I think this happens because it generated this assembly:
pushq %rbp movq %rsp, %rbp paddusb 0x14(%rip), %xmm1 movabsq $0, %r9 callq *%r9 movq %rbp, %rsp popq %rbp retq addb %al, (%rax) jo 0x92 jo 0x94 jo 0x96 jo 0x98 jo 0x9a jo 0x9c jo 0x9e jo 0xa0
Note that there's a
callq
in there, which means that this is not a leaf function, but according to the architecture-independentis_leaf
function ofFunctionStencil
it looks like a leaf function, because that only considers calls in CLIF.With ssse3 the call disappears and the test passes.
This means that I need some advice on how to proceed. I could try to make the
is_leaf
function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great :big_smile:This replaces https://github.com/bytecodealliance/wasmtime/pull/5352 and is a follow up of https://github.com/bytecodealliance/wasmtime/pull/2960. I'm not sure why I'm getting to the conclusion that SIMD is the issue, but it wasn't mentioned before. Maybe there are other issues as well. Has the testing setup become more robust in the meantime?
jameysharp commented on PR #8516:
Neat! I was thinking about working on this soon, because the work that Trevor and I have been doing on cleaning up tail calls has fixed some of the issues that would have made this even harder. But it's even cooler to see somebody else tackling this and I would love to see you continue.
One thing I notice is that
cranelift/filetests/filetests/isa/x64/simd-lane-access-compile.clif
turns on SSSE3, so the case you're running into isn't represented in our compile tests. That would be good to fix.After deleting
has_ssse3 has_sse41
from that file and re-running the filetests withCRANELIFT_TEST_BLESS=1
, I see that this call is a lib-call toX86Pshufb
.That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.
So this isn't specifically a SIMD issue, and it doesn't apply to most SIMD instructions either. But I'm also not sure where the segfault you saw came from. Off hand, I don't know why omitting a frame pointer would cause a function call to segfault. So more investigation is in order, I think!
tertsdiepraam commented on PR #8516:
After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.
Yes, that's what I found too.
That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.
My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault.
Is this the code for the libcall? https://github.com/bytecodealliance/wasmtime/blob/d911f4b10fe908bc819a1420482952e44b039db3/crates/wasmtime/src/runtime/vm/libcalls.rs#L736
At the top of that file, it says:
They must only contain basic, raw i32/i64/f32/f64/pointer parameters that are safe to pass across the system ABI.
Maybe part of the problem is that x86pshufb requires
__m128i
as arguments? Or that it requires bumping some things on the stack while the others don't? I don't really know, I'm just guessing at this point.
tertsdiepraam edited a comment on PR #8516:
After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.
Yes, that's what I found too.
That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.
My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault.
Is this the code for the libcall? https://github.com/bytecodealliance/wasmtime/blob/d911f4b10fe908bc819a1420482952e44b039db3/crates/wasmtime/src/runtime/vm/libcalls.rs#L736
At the top of that file, it says:
They must only contain basic, raw i32/i64/f32/f64/pointer parameters that are safe to pass across the system ABI.
Maybe part of the problem is that x86pshufb requires
__m128i
as arguments? Or that it requires bumping some things on the stack while the others don't? If I throw that libcall in godbolt, it does show that it pushes some thing to the stack, which might then be at the wrong location. I don't really know, I'm just guessing at this point.
afonso360 commented on PR #8516:
From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly.
Take this with a grain of salt, since it has been a while since I last looked at this.
afonso360 edited a comment on PR #8516:
From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the frame pointer.
Take this with a grain of salt, since it has been a while since I last looked at this.
afonso360 edited a comment on PR #8516:
From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue.
Take this with a grain of salt, since it has been a while since I last looked at this.
afonso360 edited a comment on PR #8516:
From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue.
Take this with a grain of salt, since it has been a while since I last looked at this.
Edit: I should also note that not all libcalls will come from
wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs
. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.
tertsdiepraam commented on PR #8516:
Interesting! @afonso360, do you know of a way that I could check that hypothesis?
I should also note that not all libcalls will come from wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.
Good to know! Thanks!
I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?
afonso360 commented on PR #8516:
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that that address ends up being under-aligned (i.e. only aligned to 8 bytes or less).
And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).
You should be able to do this by running the test under
gdb
/lldb
and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
afonso360 edited a comment on PR #8516:
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).
And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).
You should be able to do this by running the test under
gdb
/lldb
and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
afonso360 edited a comment on PR #8516:
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).
And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).
You should be able to do this by running the test under
gdb
/lldb
and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?
Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.
One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there was a
call
opcode in the function.Otherwise, I'm not too sure, but people might have some ideas.
afonso360 edited a comment on PR #8516:
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).
And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).
You should be able to do this by running the test under
gdb
/lldb
and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?
Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.
One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there was a
call
opcode in the function since any opcode that we couldn't implement would be legalized into acall
.Otherwise, I'm not too sure, but people might have some ideas.
afonso360 edited a comment on PR #8516:
Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).
And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).
You should be able to do this by running the test under
gdb
/lldb
and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.
I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?
Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.
One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there is a
call
opcode in the function since any opcode that we couldn't implement would be legalized into acall
.Otherwise, I'm not too sure, but people might have some ideas.
jameysharp commented on PR #8516:
We compute the frame layout after lowering to vcode, so I think we can make the decision about whether this is a leaf function after we've already made all the decisions about whether to use any libcalls. We "just" need to thread that information through correctly somehow.
tertsdiepraam commented on PR #8516:
That sounds reasonable. I'll give that a shot!
jameysharp commented on PR #8516:
Oh, I also wanted to note: Good points, both of you, on stack alignment and vector arguments. I hadn't thought about either of those things and I think together they fully explain this symptom. I want to write down this conclusion for future reference although I don't think either of you need the explanation :grin:
Pushing the return address leaves the stack 8-aligned, and normally pushing the frame pointer would make it 16-aligned again, but we skip that step here. Then, most x86 code can actually tolerate 8-aligned stacks, but moving between MMX registers and memory generally requires 16-alignment, and indeed it looks like Rust/LLVM are producing
movdqa
instructions with 16-aligned offsets from the stack pointer in this particular libcall.So that explains why you'd only see problems with this specific call. We still should ensure that all libcalls get the ABI-required stack alignment though, so the plan to have any function with a libcall not be a leaf function is still a good one.
tertsdiepraam commented on PR #8516:
I did in fact need that conclusion :big_smile: Thanks!
tertsdiepraam commented on PR #8516:
I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean or should we scan over the emitted instructions? Or maybe the
dfg
should also keep track of the libcalls that have been made in addition to thesignatures
andext_funcs
?I'm a bit overwhelmed at the moment :)
jameysharp commented on PR #8516:
I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean …
Yeah, totally reasonable questions!
I would start by looking at how
outgoing_args_size
andtail_args_size
get computed and used, since those are also properties of the function calls that appear in the current function.In
cranelift/codegen/src/machinst/abi.rs
, there are methods onCallee
namedaccumulate_outgoing_args_size
andaccumulate_tail_args_size
. You can look for where these are called. Those are places where, while lowering the body of the function, we found a function call. I don't remember exactly but libcalls should be similar.Note that if you're using
rust-analyzer
to navigate, there are some nearby parts of the call-graph that it can't find because they're in macros, so if you hit a point where things seem to be missing you might also want to try just searching for the function's name. Someday I hope to reduce our usage of macros…Those
*_args_size
fields are stored on theCallee
untilCallee::compute_frame_layout
is called, and then they're forwarded to the backend's implementation ofABIMachineSpec::compute_frame_layout
.There's another field on
Callee
that's also forwarded,is_leaf
. Maybe the thing to do here is just set that tofalse
if there's any kind of function call, including libcalls.
jameysharp commented on PR #8516:
After a little more investigation: the
is_leaf
field is currently set by callingFunctionStencil::is_leaf
, which tries to guess whether there are function calls or libcalls, and which is not used anywhere else. Let's just delete that method, and compute the true answer during lowering.
Last updated: Jan 24 2025 at 00:11 UTC