s0me0ne-unkn0wn opened issue #4473:
Researching the stack frame allocation logic in Cranelift, I've come across behavior that I consider suboptimal and cannot explain.
I generated void N-ary WASM functions with an empty body for N=[1..99]. Arguments are i64 for the sake of simplicity. Then I compiled them all for the x86_64-none-linux-gnu target and explored the generated machine code.
Functions with arity from 1 to 4 show some minimal yet useless preamble/postamble code:
    push %rbp
    mov %rsp,%rbp
    mov %rbp,%rsp
    pop %rbp
    ret
Starting from arity 5, argument loading code is generated, although the loaded values are never used (the example is a 7-ary function):
    push %rbp
    mov %rsp,%rbp
    mov 0x10(%rbp),%rax
    mov 0x18(%rbp),%r10
    mov 0x20(%rbp),%r11
    mov %rbp,%rsp
    pop %rbp
    ret
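The offsets in the listing above follow a simple pattern, which can be sketched as below. This is only an illustration: the constant `REG_PARAMS = 4` is an assumption inferred from the observation that loads first appear at arity 5, not a value taken from Cranelift's source, and the function name is hypothetical. The `0x10` base is the usual layout of a frame-pointer-based x86-64 frame, where the saved `%rbp` and the return address occupy the 16 bytes above `%rbp`.

```rust
/// Number of wasm-level i64 parameters that stay in registers on x86-64.
/// ASSUMPTION: inferred from the listings (stack loads start at arity 5),
/// not taken from Cranelift's source.
const REG_PARAMS: usize = 4;

/// The saved %rbp (8 bytes) and the return address (8 bytes) sit between
/// %rbp and the first stack-passed argument.
const FIRST_STACK_ARG_OFFSET: usize = 0x10;

/// Returns the %rbp-relative offsets of the stack-passed parameters of an
/// `arity`-ary function whose parameters are all 8-byte values.
/// (Hypothetical helper, for illustration only.)
fn stack_arg_offsets(arity: usize) -> Vec<usize> {
    (0..arity.saturating_sub(REG_PARAMS))
        .map(|slot| FIRST_STACK_ARG_OFFSET + 8 * slot)
        .collect()
}
```

For the 7-ary example this yields the three offsets 0x10, 0x18 and 0x20 seen in the loads above.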
Starting from arity 8, a stack frame is generated as a result of num_spillslots from regalloc2 growing linearly with the number of arguments:
<_wasm_function_0>:
    push %rbp
    mov %rsp,%rbp
    mov 0x8(%rdi),%r10
    mov (%r10),%r10
    add $0x10,%r10
    cmp %rsp,%r10
    jbe <_wasm_function_0+0x1a>
    ud2
    sub $0x20,%rsp
    mov %r15,0x10(%rsp)
    mov 0x10(%rbp),%rax
    mov 0x18(%rbp),%r10
    mov 0x20(%rbp),%r11
    mov 0x28(%rbp),%r15
    mov 0x10(%rsp),%r15
    add $0x20,%rsp
    mov %rbp,%rsp
    pop %rbp
    ret
The higher the arity, the larger the frame. For a 99-ary function, a 784-byte frame is generated, although an empty-body function obviously cannot use it for anything, which looks like a problem to me.
Besides that, higher arities produce really weird argument loading code, which loads values into registers only to overwrite them immediately with other values:
    ...
    mov 0x60(%rbp),%rsi
    mov 0x68(%rbp),%rsi
    mov 0x70(%rbp),%rsi
    mov 0x78(%rbp),%rcx
    mov 0x80(%rbp),%rcx
    mov 0x88(%rbp),%rcx
    ...
Tested with the tip of the master branch of wasmtime, as of today.
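The test inputs described in the issue could be reproduced along the following lines. This is a minimal sketch, not the script the reporter used: the function name `nary_wat` and the export name `f` are hypothetical, and the output is plain WebAssembly text that would still have to be fed to a compiler separately.

```rust
/// Builds the WebAssembly text (WAT) for a module containing a single
/// exported function with an empty body, no results, and `n` parameters
/// of type i64. (Hypothetical helper; the export name "f" is arbitrary.)
fn nary_wat(n: usize) -> String {
    // "i64 i64 ... i64", repeated n times and separated by spaces.
    let params = vec!["i64"; n].join(" ");
    format!("(module (func (export \"f\") (param {params})))")
}
```

Calling `nary_wat(n)` for each `n` in `1..=99` and writing every result to its own file gives one module per arity to compile and disassemble.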
bjorn3 commented on issue #4473:
Duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148.
bjorn3 edited a comment on issue #4473:
Partially duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148.
bjorn3 edited a comment on issue #4473:
Partially duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148 (for the stack frame setup)
bjorn3 commented on issue #4473:
> Besides that, higher arities produce really weird argument loading code which just load values to registers only to overwrite them with other values at once:
I think regalloc is smart enough to collapse multiple unused virtual registers into the same register here. However, I don't think regalloc is allowed to remove unused instructions entirely. I'm guessing there would need to be some extra code in the backend to avoid loading unused arguments.
pepyakin commented on issue #4473:
(I started writing before your edits but figured it may still be relevant).
I think we've spotted similar behavior on aarch64, whereas the linked issue says it was fixed there. @s0me0ne-unkn0wn do you mind posting your findings for aarch64?
pepyakin labeled issue #4473:
s0me0ne-unkn0wn commented on issue #4473:
@pepyakin I didn't run as many tests for aarch64 as for x64, but in principle it goes the same way there. The generated frame is a little smaller on ARM (656 bytes instead of 784 for the 99-ary function) but still grows linearly with the number of arguments. The argument loading code looks similar to x64 too:
    ...
    ldr x1, [x29, #272]
    ldr x1, [x29, #280]
    ldr x1, [x29, #288]
    ldr x1, [x29, #296]
    ldr x1, [x29, #304]
    ldr x1, [x29, #312]
    ldr x1, [x29, #320]
    ldr x1, [x29, #328]
    ldr x1, [x29, #336]
    ldr x1, [x29, #344]
    ldr x2, [x29, #352]
    ldr x3, [x29, #360]
    ldr x4, [x29, #368]
    ldr x5, [x29, #376]
    ldr x6, [x29, #384]
    ldr x7, [x29, #392]
    ldr x1, [x29, #400]
    ldr x1, [x29, #408]
    ldr x1, [x29, #416]
    ldr x1, [x29, #424]
    ldr x1, [x29, #432]
    ldr x1, [x29, #440]
    ldr x1, [x29, #448]
    ...
s0me0ne-unkn0wn commented on issue #4473:
I've just checked the useless preamble case on aarch64, and this one is indeed fixed for ARM. Functions from unary to 6-ary look just like this:
<_wasm_function_0>:
    ret
alexcrichton commented on issue #4473:
cc @cfallin this was something that Nick and I actually ran into when working on the stack unwinding PR that I forgot to open an issue for. I think that this is probably happening because ABI bits in Wasmtime are modeled as moving the argument register into a virtual register unconditionally. When this mov instruction is between two registers it's later deleted during register allocation (or around there I think). When arguments are moved from the stack frame into a register, though, that's not detected as a non-side-effectful move which means that the mov instruction is left.
We were playing around with https://github.com/bytecodealliance/wasmtime/blob/main/tests/misc_testsuite/func-400-params.wast locally. One thing that we found which was odd was that each stack argument wasn't moved into exactly the same destination register. For example we saw:
    38: b84103a1 ldur w1, [x29, #16]
    3c: b84183a1 ldur w1, [x29, #24]
    40: b84203a1 ldur w1, [x29, #32]
    44: b84283a1 ldur w1, [x29, #40]
    48: b84303a1 ldur w1, [x29, #48]
    4c: b84383a1 ldur w1, [x29, #56]
    50: b84403a1 ldur w1, [x29, #64]
    54: b84483a1 ldur w1, [x29, #72]
    58: b84503a1 ldur w1, [x29, #80]
    5c: b84583a1 ldur w1, [x29, #88]
    60: b84603af ldur w15, [x29, #96]
    64: b84683a3 ldur w3, [x29, #104]
    68: b84703a4 ldur w4, [x29, #112]
    6c: b84783a5 ldur w5, [x29, #120]
    70: b84803a6 ldur w6, [x29, #128]
    74: b84883a7 ldur w7, [x29, #136]
    78: b84903a1 ldur w1, [x29, #144]
    7c: b84983a1 ldur w1, [x29, #152]
    80: b84a03a1 ldur w1, [x29, #160]
    84: b84a83a1 ldur w1, [x29, #168]
    88: b84b03a1 ldur w1, [x29, #176]
    8c: b84b83a1 ldur w1, [x29, #184]
    90: b84c03a1 ldur w1, [x29, #192]
    94: b84c83a1 ldur w1, [x29, #200]
    98: b84d03a1 ldur w1, [x29, #208]
    9c: b84d83a1 ldur w1, [x29, #216]
    a0: b84e03af ldur w15, [x29, #224]
    a4: b84e83a3 ldur w3, [x29, #232]
    a8: b84f03a4 ldur w4, [x29, #240]
    ac: b84f83a5 ldur w5, [x29, #248]
    b0: b94103a6 ldr w6, [x29, #256]
    b4: b9410ba7 ldr w7, [x29, #264]
    b8: b94113a1 ldr w1, [x29, #272]
    bc: b9411ba1 ldr w1, [x29, #280]
    c0: b94123a1 ldr w1, [x29, #288]
    c4: b9412ba1 ldr w1, [x29, #296]
    c8: b94133a1 ldr w1, [x29, #304]
as a subset of the function. It seemed odd that lots of different registers were being used when all the registers were dead anyway.
bjorn3 commented on issue #4473:
Regalloc2 does some randomization of the order in which it selects registers. Could be related to this.
cfallin commented on issue #4473:
So I think there are two separable issues here: (i) use of the frame pointer, and (ii) loads of stack arguments.
On (i) the frame pointer, we have ongoing discussions about this in #4431 and related issues but the main takeaway is that we will need an explicit frame pointer setup/teardown even in leaf functions in order to allow for stack walking / unwinding. There are other approaches one could take, and tradeoffs to make here; but, that's the reason. So I would quibble somewhat with the "useless" descriptor as this does have a use :-)
On (ii) stack argument loads, these are indeed useless, and are an artifact of ABI approach as @alexcrichton notes above:
> I think that this is probably happening because ABI bits in Wasmtime are modeled as moving the argument register into a virtual register unconditionally.
This is indeed the case; the proper fix is making regalloc aware of the initial location on the stack, so the same move-elision applies as for register arguments, but that requires more thinking around how exactly to expose the stack argument area as additional "spillslots".
I will note that there are some subtle correctness issues around reftypes and stackmaps here: taking ownership of all args immediately (by copying into vregs) lets us then note locations of ref-typed args, whereas if they stay in stack-arg position, we need to reason about that as well when generating stackmaps.
So all that to say: yes, should be improved; the generated code is correct now (not a bug) but suboptimal!
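The difference between register arguments and stack arguments discussed in this thread can be illustrated with a toy model. This is not Cranelift or regalloc2 code: the `Loc` type, the `needs_instruction` function and the `allocator_sees_stack_args` flag are all hypothetical, and the model ignores liveness, reftypes and stackmaps entirely. It only captures the idea that a copy can be dropped when the allocator knows both the source and the destination and they coincide.

```rust
/// Where a value lives. (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Loc {
    /// A physical register, identified by number.
    Reg(u8),
    /// A stack location, identified by its frame-pointer-relative offset.
    Stack(i32),
}

/// Decides whether copying an incoming argument from `src` into the
/// location `dst` chosen for its virtual register needs an instruction.
///
/// ASSUMPTION: with `allocator_sees_stack_args == false`, a stack argument
/// is modeled as an opaque load that is always kept, as described in the
/// thread. With `true`, the stack slot is treated like any other known
/// location, so a copy whose source and destination coincide is elided.
fn needs_instruction(src: Loc, dst: Loc, allocator_sees_stack_args: bool) -> bool {
    match src {
        // Register-to-register moves are elided when nothing actually moves.
        Loc::Reg(_) => src != dst,
        // Same rule for stack arguments, once the allocator knows about them.
        Loc::Stack(_) if allocator_sees_stack_args => src != dst,
        // Otherwise the load is opaque to the allocator and stays.
        Loc::Stack(_) => true,
    }
}
```

In this model, an argument arriving in a register and assigned to that same register costs nothing, while a stack argument always costs a load unless the allocator is told about its initial location.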
akirilov-arm labeled issue #4473:
Last updated: Nov 22 2024 at 17:03 UTC