wasmtime / PR #8516 WIP: cranelift: omit function prologu... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #8516 WIP: cranelift: omit function prologu...

Wasmtime GitHub notifications bot (May 01 2024 at 09:58):

tertsdiepraam opened PR #8516 from tertsdiepraam:preserve-fp-false-x64 to bytecodealliance:main:

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.

However, this is not quite right yet, because (if I understand correctly) the is_leaf computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:
function %swizzle_i8x16(i8x16, i8x16) -> i8x16 {
block0(v0: i8x16, v1: i8x16):
    v2 = swizzle v0, v1
    return v2
}
I think this happens because it generated this assembly:
pushq   %rbp
movq    %rsp, %rbp
paddusb 0x14(%rip), %xmm1
movabsq $0, %r9
callq   *%r9
movq    %rbp, %rsp
popq    %rbp
retq
addb    %al, (%rax)
jo      0x92
jo      0x94
jo      0x96
jo      0x98
jo      0x9a
jo      0x9c
jo      0x9e
jo      0xa0
Note that there's a callq in there, which means that this is not a leaf function, but according to the architecture-independent is_leaf function of FunctionStencil it looks like a leaf function, because that only considers calls in CLIF.

This means that I need some advice on how to proceed. I could try to make the is_leaf function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in?

Wasmtime GitHub notifications bot (May 01 2024 at 10:08):

tertsdiepraam edited PR #8516:

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.

However, this is not quite right yet, because (if I understand correctly) the is_leaf computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:
function %swizzle_i8x16(i8x16, i8x16) -> i8x16 {
block0(v0: i8x16, v1: i8x16):
    v2 = swizzle v0, v1
    return v2
}
I think this happens because it generated this assembly:
pushq   %rbp
movq    %rsp, %rbp
paddusb 0x14(%rip), %xmm1
movabsq $0, %r9
callq   *%r9
movq    %rbp, %rsp
popq    %rbp
retq
addb    %al, (%rax)
jo      0x92
jo      0x94
jo      0x96
jo      0x98
jo      0x9a
jo      0x9c
jo      0x9e
jo      0xa0
Note that there's a callq in there, which means that this is not a leaf function, but according to the architecture-independent is_leaf function of FunctionStencil it looks like a leaf function, because that only considers calls in CLIF.

With ssse3 the call disappears and the test passes.

This means that I need some advice on how to proceed. I could try to make the is_leaf function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in?

Wasmtime GitHub notifications bot (May 01 2024 at 10:14):

tertsdiepraam edited PR #8516:

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.

However, this is not quite right yet, because (if I understand correctly) the is_leaf computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:
function %swizzle_i8x16(i8x16, i8x16) -> i8x16 {
block0(v0: i8x16, v1: i8x16):
    v2 = swizzle v0, v1
    return v2
}
I think this happens because it generated this assembly:
pushq   %rbp
movq    %rsp, %rbp
paddusb 0x14(%rip), %xmm1
movabsq $0, %r9
callq   *%r9
movq    %rbp, %rsp
popq    %rbp
retq
addb    %al, (%rax)
jo      0x92
jo      0x94
jo      0x96
jo      0x98
jo      0x9a
jo      0x9c
jo      0x9e
jo      0xa0
Note that there's a callq in there, which means that this is not a leaf function, but according to the architecture-independent is_leaf function of FunctionStencil it looks like a leaf function, because that only considers calls in CLIF.

With ssse3 the call disappears and the test passes.

This means that I need some advice on how to proceed. I could try to make the is_leaf function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great :big_smile:

Wasmtime GitHub notifications bot (May 01 2024 at 10:31):

tertsdiepraam edited PR #8516:

This makes cranelift omit the function prologue and epilogue on the x64 architecture if possible, like it already does for aarch64.

However, this is not quite right yet, because (if I understand correctly) the is_leaf computation is wrong for x64 when SIMD instructions are involved. With this change, this function gives a segfault:
function %swizzle_i8x16(i8x16, i8x16) -> i8x16 {
block0(v0: i8x16, v1: i8x16):
    v2 = swizzle v0, v1
    return v2
}
I think this happens because it generated this assembly:
pushq   %rbp
movq    %rsp, %rbp
paddusb 0x14(%rip), %xmm1
movabsq $0, %r9
callq   *%r9
movq    %rbp, %rsp
popq    %rbp
retq
addb    %al, (%rax)
jo      0x92
jo      0x94
jo      0x96
jo      0x98
jo      0x9a
jo      0x9c
jo      0x9e
jo      0xa0
Note that there's a callq in there, which means that this is not a leaf function, but according to the architecture-independent is_leaf function of FunctionStencil it looks like a leaf function, because that only considers calls in CLIF.

With ssse3 the call disappears and the test passes.

This means that I need some advice on how to proceed. I could try to make the is_leaf function architecture-dependent or conservatively treat (some?) SIMD instructions as calls in general. Or maybe the information that I need is already computed somewhere and I just need to pass that in? Or I could remove this optimization if ssse3 or later is not supported, but that doesn't seem great :big_smile:

This replaces https://github.com/bytecodealliance/wasmtime/pull/5352 and is a follow up of https://github.com/bytecodealliance/wasmtime/pull/2960. I'm not sure why I'm getting to the conclusion that SIMD is the issue, but it wasn't mentioned before. Maybe there are other issues as well. Has the testing setup become more robust in the meantime?

Wasmtime GitHub notifications bot (May 01 2024 at 22:09):

jameysharp commented on PR #8516:

Neat! I was thinking about working on this soon, because the work that Trevor and I have been doing on cleaning up tail calls has fixed some of the issues that would have made this even harder. But it's even cooler to see somebody else tackling this and I would love to see you continue.

One thing I notice is that cranelift/filetests/filetests/isa/x64/simd-lane-access-compile.clif turns on SSSE3, so the case you're running into isn't represented in our compile tests. That would be good to fix.

After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.

That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.

So this isn't specifically a SIMD issue, and it doesn't apply to most SIMD instructions either. But I'm also not sure where the segfault you saw came from. Off hand, I don't know why omitting a frame pointer would cause a function call to segfault. So more investigation is in order, I think!

Wasmtime GitHub notifications bot (May 02 2024 at 08:23):

tertsdiepraam commented on PR #8516:

After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.

Yes, that's what I found too.

That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.

My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault.

Is this the code for the libcall? https://github.com/bytecodealliance/wasmtime/blob/d911f4b10fe908bc819a1420482952e44b039db3/crates/wasmtime/src/runtime/vm/libcalls.rs#L736

At the top of that file, it says:

They must only contain basic, raw i32/i64/f32/f64/pointer parameters that are safe to pass across the system ABI.

Maybe part of the problem is that x86pshufb requires __m128i as arguments? Or that it requires bumping some things on the stack while the others don't? I don't really know, I'm just guessing at this point.

Wasmtime GitHub notifications bot (May 02 2024 at 08:26):

tertsdiepraam edited a comment on PR #8516:

After deleting has_ssse3 has_sse41 from that file and re-running the filetests with CRANELIFT_TEST_BLESS=1, I see that this call is a lib-call to X86Pshufb.

Yes, that's what I found too.

That's not the only lib-call we may emit on x86; there are also functions implementing floating-point fused multiply-add (FmaF32/64) and various floating-point rounding modes (Ceil/Floor/Nearest/Trunc), for older CPUs without the corresponding instructions.

My (unsubstantiated) guess is that the X86Pshufb function is more complex than those other lib-calls and therefore causes trouble. Replacing the lib call with a custom function in the final x86 does not seem to cause any trouble, so it's not any call that leads to a segfault.

Is this the code for the libcall? https://github.com/bytecodealliance/wasmtime/blob/d911f4b10fe908bc819a1420482952e44b039db3/crates/wasmtime/src/runtime/vm/libcalls.rs#L736

At the top of that file, it says:

They must only contain basic, raw i32/i64/f32/f64/pointer parameters that are safe to pass across the system ABI.

Maybe part of the problem is that x86pshufb requires __m128i as arguments? Or that it requires bumping some things on the stack while the others don't? If I throw that libcall in godbolt, it does show that it pushes some thing to the stack, which might then be at the wrong location. I don't really know, I'm just guessing at this point.

Wasmtime GitHub notifications bot (May 02 2024 at 13:20):

afonso360 commented on PR #8516:

From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly.

Take this with a grain of salt, since it has been a while since I last looked at this.

Wasmtime GitHub notifications bot (May 02 2024 at 13:22):

afonso360 edited a comment on PR #8516:

From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the frame pointer.

Take this with a grain of salt, since it has been a while since I last looked at this.

Wasmtime GitHub notifications bot (May 02 2024 at 13:27):

afonso360 edited a comment on PR #8516:

From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue.

Take this with a grain of salt, since it has been a while since I last looked at this.

Wasmtime GitHub notifications bot (May 02 2024 at 13:47):

afonso360 edited a comment on PR #8516:

From what I remember when I looked into this a year or so ago was that this was an issue with stack alignment. I think that the libcalls expect a 16byte aligned stack on entry (due to ABI) and will use aligned loads and stores on the stack directly, and we weren't respecting that when omitting the prologue.

Take this with a grain of salt, since it has been a while since I last looked at this.

Edit: I should also note that not all libcalls will come from wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.

Wasmtime GitHub notifications bot (May 02 2024 at 14:55):

tertsdiepraam commented on PR #8516:

Interesting! @afonso360, do you know of a way that I could check that hypothesis?

I should also note that not all libcalls will come from wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs. They can be provded by the system's libc or whatever the linker decides, in cases where cranelift isn't used with wasmtime.

Good to know! Thanks!

I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

Wasmtime GitHub notifications bot (May 02 2024 at 15:24):

afonso360 commented on PR #8516:

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that that address ends up being under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.

Wasmtime GitHub notifications bot (May 02 2024 at 15:26):

afonso360 edited a comment on PR #8516:

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.

Wasmtime GitHub notifications bot (May 02 2024 at 15:39):

afonso360 edited a comment on PR #8516:

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.

I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.

One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there was a call opcode in the function.

Otherwise, I'm not too sure, but people might have some ideas.

Wasmtime GitHub notifications bot (May 02 2024 at 15:40):

afonso360 edited a comment on PR #8516:

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.

I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.

One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there was a call opcode in the function since any opcode that we couldn't implement would be legalized into a call.

Otherwise, I'm not too sure, but people might have some ideas.

Wasmtime GitHub notifications bot (May 02 2024 at 15:49):

afonso360 edited a comment on PR #8516:

Well, I'd check if the code segfaults in a 16byte aligned load inside the libcall that is referencing the stack and that, that address is under-aligned (i.e. only aligned to 8 bytes or less).

And additionally the stack at the entrypoint of that libcall was only aligned to 8 (this by itself is an ABI breakage).

You should be able to do this by running the test under gdb/lldb and setting a breakpoint in the libcall function and checking what the stack pointer is, and where it first crashes, etc...

It would also be nice if we had a libcall that didn't use any SSE instructions or referenced the stack, that would be also another clue that our issue was with that, but we don't.

I wonder if there's some easy way that we can enable this for everything that doesn't involve libcalls? Is there a way to check whether a function will contain libcalls?

Unfortunately I don't think so. Since these libcalls only end up being placed during lowering if we don't have a direct instruction for that, it means that any instruction could end up being lowered to a libcall. In practice there are only a few situations where we do that.

One way that came to mind was to complete #7321 and with the legalizations in place, we could simply check if there is a call opcode in the function since any opcode that we couldn't implement would be legalized into a call.

Otherwise, I'm not too sure, but people might have some ideas.

Wasmtime GitHub notifications bot (May 02 2024 at 16:37):

jameysharp commented on PR #8516:

We compute the frame layout after lowering to vcode, so I think we can make the decision about whether this is a leaf function after we've already made all the decisions about whether to use any libcalls. We "just" need to thread that information through correctly somehow.

Wasmtime GitHub notifications bot (May 02 2024 at 17:24):

tertsdiepraam commented on PR #8516:

That sounds reasonable. I'll give that a shot!

Wasmtime GitHub notifications bot (May 02 2024 at 17:42):

jameysharp commented on PR #8516:

Oh, I also wanted to note: Good points, both of you, on stack alignment and vector arguments. I hadn't thought about either of those things and I think together they fully explain this symptom. I want to write down this conclusion for future reference although I don't think either of you need the explanation :grin:

Pushing the return address leaves the stack 8-aligned, and normally pushing the frame pointer would make it 16-aligned again, but we skip that step here. Then, most x86 code can actually tolerate 8-aligned stacks, but moving between MMX registers and memory generally requires 16-alignment, and indeed it looks like Rust/LLVM are producing movdqa instructions with 16-aligned offsets from the stack pointer in this particular libcall.

So that explains why you'd only see problems with this specific call. We still should ensure that all libcalls get the ABI-required stack alignment though, so the plan to have any function with a libcall not be a leaf function is still a good one.

Wasmtime GitHub notifications bot (May 02 2024 at 17:45):

tertsdiepraam commented on PR #8516:

I did in fact need that conclusion :big_smile: Thanks!

Wasmtime GitHub notifications bot (May 17 2024 at 13:57):

tertsdiepraam commented on PR #8516:

I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean or should we scan over the emitted instructions? Or maybe the dfg should also keep track of the libcalls that have been made in addition to the signatures and ext_funcs?

I'm a bit overwhelmed at the moment :)

Wasmtime GitHub notifications bot (May 17 2024 at 20:34):

jameysharp commented on PR #8516:

I'm looking at this again and I might need some help. I'm struggling to find where this information should originate from. Is there some struct that can keep track of a boolean …

Yeah, totally reasonable questions!

I would start by looking at how outgoing_args_size and tail_args_size get computed and used, since those are also properties of the function calls that appear in the current function.

In cranelift/codegen/src/machinst/abi.rs, there are methods on Callee named accumulate_outgoing_args_size and accumulate_tail_args_size. You can look for where these are called. Those are places where, while lowering the body of the function, we found a function call. I don't remember exactly but libcalls should be similar.

Note that if you're using rust-analyzer to navigate, there are some nearby parts of the call-graph that it can't find because they're in macros, so if you hit a point where things seem to be missing you might also want to try just searching for the function's name. Someday I hope to reduce our usage of macros…

Those *_args_size fields are stored on the Callee until Callee::compute_frame_layout is called, and then they're forwarded to the backend's implementation of ABIMachineSpec::compute_frame_layout.

There's another field on Callee that's also forwarded, is_leaf. Maybe the thing to do here is just set that to false if there's any kind of function call, including libcalls.

Wasmtime GitHub notifications bot (May 17 2024 at 20:37):

jameysharp commented on PR #8516:

After a little more investigation: the is_leaf field is currently set by calling FunctionStencil::is_leaf, which tries to guess whether there are function calls or libcalls, and which is not used anywhere else. Let's just delete that method, and compute the true answer during lowering.

Last updated: Apr 17 2025 at 10:03 UTC