wasmtime / issue #4473 Useless stack frame allocation in ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4473 Useless stack frame allocation in ...

Wasmtime GitHub notifications bot (Jul 20 2022 at 10:39):

s0me0ne-unkn0wn opened issue #4473:

While researching the stack frame allocation logic in Cranelift, I've come across behavior that I consider suboptimal and cannot explain.

I generated void N-ary WASM functions with an empty body for N=[1..99]. All arguments are i64 for simplicity. Then I compiled them all for the x86_64-none-linux-gnu target and inspected the generated machine code.
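The test inputs described above could be produced along these lines. This is a minimal sketch, not the script actually used for the issue: the helper names `wat_for_arity` and `all_modules` and the export name `"f"` are assumptions made for illustration.

```python
def wat_for_arity(n: int) -> str:
    """Return WAT text for a module with one empty, void function of n i64 params."""
    if n < 0:
        raise ValueError("arity must be non-negative")
    params = " ".join(["i64"] * n)
    # `(param i64 i64 ...)` is the abbreviated form for several anonymous params.
    sig = f" (param {params})" if n else ""
    # The export name "f" is hypothetical; the issue does not say how the
    # functions were named or whether they were exported.
    return f'(module\n  (func (export "f"){sig})\n)\n'


def all_modules(lo: int = 1, hi: int = 99) -> dict:
    """Map each arity in [lo, hi] to its WAT source."""
    return {n: wat_for_arity(n) for n in range(lo, hi + 1)}


if __name__ == "__main__":
    print(wat_for_arity(3))
```

Each generated module can then be compiled for the target of interest and disassembled to compare the output across arities.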

Functions with arity from 1 to 4 show a minimal yet useless prologue/epilogue:
push   %rbp
mov    %rsp,%rbp
mov    %rbp,%rsp
pop    %rbp
ret
Starting from arity 5, argument loading code is generated, although the loaded values are never used (the example is a 7-ary function):
push   %rbp
mov    %rsp,%rbp
mov    0x10(%rbp),%rax
mov    0x18(%rbp),%r10
mov    0x20(%rbp),%r11
mov    %rbp,%rsp
pop    %rbp
ret
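The offsets in the listing above follow a simple pattern, which can be sketched as below. The sketch assumes that four wasm arguments travel in registers (inferred only from the observation that loads first appear at arity 5) and that the first stack argument sits at 0x10 from the frame pointer, just above the saved %rbp and the return address. The function name and parameters are hypothetical.

```python
WORD = 8  # bytes per i64 argument slot


def stack_arg_offsets(arity: int, reg_args: int = 4, first_offset: int = 0x10) -> list:
    """Frame-pointer-relative offsets of the wasm arguments passed on the stack.

    reg_args=4 is inferred from the listings (loads first appear at arity 5);
    first_offset=0x10 skips the saved frame pointer and the return address.
    Both defaults are assumptions, not values taken from Cranelift's source.
    """
    n_stack = max(0, arity - reg_args)
    return [first_offset + WORD * k for k in range(n_stack)]


if __name__ == "__main__":
    print([hex(o) for o in stack_arg_offsets(7)])
```

For arity 7 this yields 0x10, 0x18 and 0x20, the three addresses loaded in the example, and for arity 4 or less it yields no loads at all.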
Starting from arity 8, a stack frame is allocated as a result of num_spillslots from regalloc2 growing linearly with the number of arguments:
<_wasm_function_0>:
push   %rbp
mov    %rsp,%rbp
mov    0x8(%rdi),%r10
mov    (%r10),%r10
add    $0x10,%r10
cmp    %rsp,%r10
jbe    <_wasm_function_0+0x1a>
ud2
sub    $0x20,%rsp
mov    %r15,0x10(%rsp)
mov    0x10(%rbp),%rax
mov    0x18(%rbp),%r10
mov    0x20(%rbp),%r11
mov    0x28(%rbp),%r15
mov    0x10(%rsp),%r15
add    $0x20,%rsp
mov    %rbp,%rsp
pop    %rbp
ret
The higher the arity, the larger the frame. For a 99-ary function, a 784-byte frame is allocated, although an empty-body function obviously cannot use it for anything, which looks like a problem to me.

Besides that, higher arities produce really weird argument loading code which just loads values into registers only to overwrite them immediately with other values:
...
mov    0x60(%rbp),%rsi
mov    0x68(%rbp),%rsi
mov    0x70(%rbp),%rsi
mov    0x78(%rbp),%rcx
mov    0x80(%rbp),%rcx
mov    0x88(%rbp),%rcx
...
Tested with the tip of the master branch of wasmtime, as of today.

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:08):

bjorn3 commented on issue #4473:

Duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148.

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:09):

bjorn3 edited a comment on issue #4473:

Partially duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148.

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:11):

bjorn3 edited a comment on issue #4473:

Partially duplicate of https://github.com/bytecodealliance/wasmtime/issues/1148 (for the stack frame setup)

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:11):

bjorn3 commented on issue #4473:

> Besides that, higher arities produce really weird argument loading code which just load values to registers only to overwrite them with other values at once:

I think regalloc is smart enough to collapse multiple unused virtual registers into the same register here. However, I don't think regalloc is allowed to remove unused instructions entirely. I'm guessing there would need to be some extra code in the backend to avoid loading unused arguments.

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:18):

pepyakin commented on issue #4473:

(I started writing before your edits but figured it may still be relevant).

I think we've spotted similar behavior on aarch64, whereas the linked issue says it was fixed there. @s0me0ne-unkn0wn do you mind posting your findings for aarch64?

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:18):

pepyakin labeled issue #4473:

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:18):

pepyakin labeled issue #4473:

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:27):

s0me0ne-unkn0wn commented on issue #4473:

@pepyakin I didn't do as many tests for aarch64 as for x64, but in principle it goes the same way there. The generated frame is a little smaller on ARM (656 bytes instead of 784 for a 99-ary function) but still grows linearly with the number of arguments. The argument loading code looks similar to x64 too:
...
ldr x1, [x29, #272]
ldr x1, [x29, #280]
ldr x1, [x29, #288]
ldr x1, [x29, #296]
ldr x1, [x29, #304]
ldr x1, [x29, #312]
ldr x1, [x29, #320]
ldr x1, [x29, #328]
ldr x1, [x29, #336]
ldr x1, [x29, #344]
ldr x2, [x29, #352]
ldr x3, [x29, #360]
ldr x4, [x29, #368]
ldr x5, [x29, #376]
ldr x6, [x29, #384]
ldr x7, [x29, #392]
ldr x1, [x29, #400]
ldr x1, [x29, #408]
ldr x1, [x29, #416]
ldr x1, [x29, #424]
ldr x1, [x29, #432]
ldr x1, [x29, #440]
ldr x1, [x29, #448]
...

Wasmtime GitHub notifications bot (Jul 20 2022 at 11:48):

s0me0ne-unkn0wn commented on issue #4473:

I've just checked the useless prologue case on aarch64, and that one is indeed fixed for ARM. Functions from unary to 6-ary all look like this:
<_wasm_function_0>:
    ret

Wasmtime GitHub notifications bot (Jul 20 2022 at 14:42):

alexcrichton commented on issue #4473:

cc @cfallin this was something that Nick and I actually ran into when working on the stack unwinding PR that I forgot to open an issue for. I think that this is probably happening because ABI bits in Wasmtime are modeled as moving the argument register into a virtual register unconditionally. When this mov instruction is between two registers it's later deleted during register allocation (or around there I think). When arguments are moved from the stack frame into a register, though, that's not detected as a non-side-effectful move which means that the mov instruction is left.
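The behavior described above can be pictured with a toy model. This is only a sketch of the idea, not Cranelift or regalloc2 code: the `Arg` type, the location strings and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arg:
    index: int
    location: str  # hypothetical encoding, e.g. "reg:rdx" or "stack:0x10"


def lower_args(args):
    """Model the unconditional 'copy every argument into a fresh vreg' step."""
    return [(f"v{a.index}", a.location) for a in args]


def after_regalloc(moves):
    """Toy post-regalloc view: reg-to-reg copies vanish, stack loads stay."""
    kept = []
    for vreg, src in moves:
        kind, _, where = src.partition(":")
        if kind == "reg":
            # The allocator can assign the vreg to the source register,
            # which turns the copy into a no-op that is then dropped.
            continue
        # A load from the frame is not recognised as an elidable move,
        # so it survives even when the vreg is never read.
        kept.append(f"load {vreg} <- [fp+{where}]")
    return kept


if __name__ == "__main__":
    args = [Arg(0, "reg:rdx"), Arg(1, "reg:rcx"), Arg(2, "stack:0x10")]
    print(after_regalloc(lower_args(args)))
```

In this model only the stack-passed argument leaves an instruction behind, which matches the shape of the listings in the issue.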

We were playing around with https://github.com/bytecodealliance/wasmtime/blob/main/tests/misc_testsuite/func-400-params.wast locally. One thing that we found which was odd was that each stack argument wasn't moved into exactly the same destination register. For example we saw:

      38:       b84103a1        ldur    w1, [x29, #16]
      3c:       b84183a1        ldur    w1, [x29, #24]
      40:       b84203a1        ldur    w1, [x29, #32]
      44:       b84283a1        ldur    w1, [x29, #40]
      48:       b84303a1        ldur    w1, [x29, #48]
      4c:       b84383a1        ldur    w1, [x29, #56]
      50:       b84403a1        ldur    w1, [x29, #64]
      54:       b84483a1        ldur    w1, [x29, #72]
      58:       b84503a1        ldur    w1, [x29, #80]
      5c:       b84583a1        ldur    w1, [x29, #88]
      60:       b84603af        ldur    w15, [x29, #96]
      64:       b84683a3        ldur    w3, [x29, #104]
      68:       b84703a4        ldur    w4, [x29, #112]
      6c:       b84783a5        ldur    w5, [x29, #120]
      70:       b84803a6        ldur    w6, [x29, #128]
      74:       b84883a7        ldur    w7, [x29, #136]
      78:       b84903a1        ldur    w1, [x29, #144]
      7c:       b84983a1        ldur    w1, [x29, #152]
      80:       b84a03a1        ldur    w1, [x29, #160]
      84:       b84a83a1        ldur    w1, [x29, #168]
      88:       b84b03a1        ldur    w1, [x29, #176]
      8c:       b84b83a1        ldur    w1, [x29, #184]
      90:       b84c03a1        ldur    w1, [x29, #192]
      94:       b84c83a1        ldur    w1, [x29, #200]
      98:       b84d03a1        ldur    w1, [x29, #208]
      9c:       b84d83a1        ldur    w1, [x29, #216]
      a0:       b84e03af        ldur    w15, [x29, #224]
      a4:       b84e83a3        ldur    w3, [x29, #232]
      a8:       b84f03a4        ldur    w4, [x29, #240]
      ac:       b84f83a5        ldur    w5, [x29, #248]
      b0:       b94103a6        ldr     w6, [x29, #256]
      b4:       b9410ba7        ldr     w7, [x29, #264]
      b8:       b94113a1        ldr     w1, [x29, #272]
      bc:       b9411ba1        ldr     w1, [x29, #280]
      c0:       b94123a1        ldr     w1, [x29, #288]
      c4:       b9412ba1        ldr     w1, [x29, #296]
      c8:       b94133a1        ldr     w1, [x29, #304]

as a subset of the function. It seemed odd that so many different registers were being used when all of them were dead anyway.

Wasmtime GitHub notifications bot (Jul 20 2022 at 15:16):

bjorn3 commented on issue #4473:

Regalloc2 does some randomization of the order in which it selects registers. Could be related to this.

Wasmtime GitHub notifications bot (Jul 20 2022 at 17:14):

cfallin commented on issue #4473:

So I think there are two separable issues here: (i) use of the frame pointer, and (ii) loads of stack arguments.

On (i) the frame pointer, we have ongoing discussions about this in #4431 and related issues but the main takeaway is that we will need an explicit frame pointer setup/teardown even in leaf functions in order to allow for stack walking / unwinding. There are other approaches one could take, and tradeoffs to make here; but, that's the reason. So I would quibble somewhat with the "useless" descriptor as this does have a use :-)
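To illustrate why the frame-pointer setup matters for stack walking, here is a minimal model of a walker that follows the chain of saved frame pointers. It is a sketch under stated assumptions, not Wasmtime's unwinder: memory is a plain dictionary, and a saved frame pointer of 0 is assumed to end the chain.

```python
def walk_frames(memory: dict, fp: int, max_frames: int = 64) -> list:
    """Collect return addresses by following saved frame pointers.

    Layout assumed at each frame pointer (as set up by `push %rbp; mov %rsp,%rbp`):
      memory[fp]     = caller's saved frame pointer
      memory[fp + 8] = return address into the caller
    A saved frame pointer of 0 terminates the chain (an assumption of this model).
    """
    return_addrs = []
    while fp != 0 and len(return_addrs) < max_frames:
        return_addrs.append(memory[fp + 8])
        fp = memory[fp]
    return return_addrs


if __name__ == "__main__":
    # Three nested frames with made-up addresses: leaf -> middle -> outermost.
    memory = {
        0x1000: 0x1040, 0x1008: 0xAAAA,  # leaf frame
        0x1040: 0x1080, 0x1048: 0xBBBB,  # middle frame
        0x1080: 0x0,    0x1088: 0xCCCC,  # outermost frame
    }
    print([hex(a) for a in walk_frames(memory, 0x1000)])
```

If a leaf function skipped the `push`/`mov` pair, the frame pointer register would still refer to its caller's frame while the leaf executes, so a walker of this kind would not see the leaf's own frame record in the chain.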

On (ii) stack argument loads, these are indeed useless, and are an artifact of ABI approach as @alexcrichton notes above:

> I think that this is probably happening because ABI bits in Wasmtime are modeled as moving the argument register into a virtual register unconditionally.

This is indeed the case; the proper fix is making regalloc aware of the initial location on the stack, so the same move-elision applies as for register arguments, but that requires more thinking around how exactly to expose the stack argument area as additional "spillslots".
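The effect of that fix can be sketched with a toy model. Everything here is hypothetical (the function name, the location strings, the `stack_args_are_slots` switch) and it says nothing about how regalloc2 would actually expose the stack argument area; it only contrasts the two outcomes.

```python
def entry_loads(args, used, stack_args_are_slots):
    """Return the loads emitted at function entry in a toy model.

    args: list of (vreg, location) pairs, location like "reg:rdx" or "stack:0x10"
    used: set of vregs that the function body actually reads
    stack_args_are_slots: True models an allocator that knows the value already
        lives in its incoming stack slot (the proposed fix); False models the
        unconditional copy into a vreg.
    """
    loads = []
    for vreg, loc in args:
        if not loc.startswith("stack:"):
            continue  # register arguments: the copy is already elided
        if stack_args_are_slots and vreg not in used:
            continue  # dead value can simply stay in its incoming slot
        loads.append(f"load {vreg} <- {loc}")
    return loads


if __name__ == "__main__":
    args = [("v0", "reg:rdx"), ("v4", "stack:0x10"), ("v5", "stack:0x18")]
    print(entry_loads(args, set(), False))
    print(entry_loads(args, set(), True))
```

With an empty body (no vreg is used), the first call still lists a load per stack argument while the second lists none.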

I will note that there are some subtle correctness issues around reftypes and stackmaps here: taking ownership of all args immediately (by copying into vregs) lets us then note locations of ref-typed args, whereas if they stay in stack-arg position, we need to reason about that as well when generating stackmaps.

So all that to say: yes, should be improved; the generated code is correct now (not a bug) but suboptimal!

Wasmtime GitHub notifications bot (Sep 02 2022 at 15:44):

akirilov-arm labeled issue #4473:

Last updated: Apr 17 2025 at 01:31 UTC