fitzgen opened PR #6478 from fitzgen:stack-to-pop-in-ret to bytecodealliance:main:
This is necessary for implementing callee-pops calling conventions, as required for tail calls. This is just a small part of tail calls and doesn't implement everything, but it is a good piece to land on its own so that the eventual PR isn't so huge.
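A quick sketch of the difference between the usual caller-pops convention and a callee-pops one, assuming a hypothetical x86-64 call that passes 16 bytes of arguments on the stack (the 16-byte figure and label are arbitrary for the illustration):

```asm
; Caller-pops (e.g. System V): the caller removes its own stack arguments.
    sub  rsp, 16        ; reserve space for the stack arguments
    ; ... store the arguments at [rsp], [rsp+8] ...
    call callee
    add  rsp, 16        ; caller pops the arguments after the call returns

; Callee-pops (what tail calls need): the callee removes them on return.
callee:
    ; ... body ...
    ret  16             ; pop the return address, then pop 16 bytes of arguments
```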
fitzgen requested cfallin for a review on PR #6478.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #6478.
fitzgen edited PR #6478:
This is necessary for implementing callee-pops calling conventions, as required for tail calls. This is just a small part of tail calls and doesn't implement everything, but it is a good piece to land on its own so that the eventual PR isn't so huge. Just trying to shrink the yak stack.
cfallin submitted PR review:
Looks great; thanks for splitting this out!
One thought on the pseudo-assembly printing in VCode (for aarch64 but similar thoughts on riscv64); and one request for a bit more in a doc-comment; but otherwise LGTM.
cfallin created PR review comment:
Rather than print a pseudo-inst here, could we print e.g. `add sp, sp, #imm ; ret` or similar?
cfallin created PR review comment:
"At the same time as returning" is slightly ambiguous: does the pop happen logically just before popping the return address, or just after? (I think after?) A diagram of the stack before-and-after might be helpful actually, if it's not too much trouble to draw one quickly!
jameysharp created PR review comment:
The exact sequence of events is target-specific. The return address may not be on the stack at all by the time we get to the return instruction; x86 is the only target we support where `ret` pops an address off the stack for you. So yes, on x86 this emits an instruction which first pops the return address and then removes the arguments from the caller's frame. Everywhere else, the return address is in a register by that point, so we can emit separate instructions to adjust the stack pointer and then branch through the appropriate register.
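A sketch of the stack layout and per-target sequences being described, with an arbitrary 16-byte pop:

```asm
; x86-64: `ret 16` pops the return address first, then removes the 16 bytes
; of stack arguments that sit *above* it in the caller's frame.
;
;   higher addresses
;   |  stack arg 1    |  \ removed after the return address is popped
;   |  stack arg 0    |  /
;   |  return address |  popped first by `ret`
;   +-----------------+  <- rsp at the `ret 16`
;   lower addresses
    ret  16

; aarch64: the return address is already in lr, so the callee adjusts sp
; separately and then branches (riscv64 is analogous, using ra):
;   add  sp, sp, #16
;   ret
```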
fitzgen updated PR #6478.
fitzgen updated PR #6478.
fitzgen has enabled auto merge for PR #6478.
cfallin created PR review comment:
Unresolving -- this could still use a diagram or better description in the comment, I think.
The sequencing question is really about the order of items on the stack: are the additionally-popped items before (below) the return address, or after (above)? I think the latter, but I have to reason through it to come to that conclusion; I'd prefer for it to be clear in the comment here.
cfallin created PR review comment:
Ah, I see the just-pushed commit does resolve the ambiguity, so I'm happy now!
fitzgen updated PR #6478.
fitzgen updated PR #6478.
bjorn3 created PR review comment:
Less commonly used instructions from the 16-bit x86 era can be slower than writing multiple simpler instructions. Does this also apply to `retn`, or is it just as fast as `add rsp, n` + `ret`?
cfallin created PR review comment:
According to Agner Fog's instruction tables, and picking a random semi-recent microarchitecture (Skylake), `ret` and `ret imm` are both 2 µops; the latter has a throughput of 0.5 per cycle, the former 1 per cycle. So it looks like there's a tiny penalty, but it's no worse than rearranging the stack manually. (And keep in mind that the popped data is above the return address, so one would need to load that, store it higher up, move rsp, and `ret`; the store alone in that sequence would kill the perf advantage.)
fitzgen created PR review comment:
`add rsp, n; ret` isn't equivalent, because the return address must be on top of the stack in both cases. That is, `ret n` pops the RA and then pops `n` bytes. So I don't know whether it is actually slower or not, but it is not semantically equivalent, so the point is moot to some degree.
The actual alternative would be something like `mov tmp, [rsp]; add rsp, n; mov [rsp], tmp; ret`. I kind of doubt that is any better than `ret n`, but it is certainly possible. Would be interesting to see what LLVM does here.
cfallin edited PR review comment.
fitzgen merged PR #6478.
cfallin created PR review comment:
I did a fun little experiment with a `ret imm` implementation, a "move the return address and `ret`" implementation, and a "load the return address, manually pop, and jump-indirect" implementation. On my system (Zen 2 core, Ryzen 3900X) the first two have identical performance, and the last is 25% faster. Huh!
I sort of suspect what's happening here is:
- With enough spare pipeline capacity, the store of the return address in the new location is "free", and the memory renaming that Zen is known to do (tracking constant offsets of rsp stored from registers and using those registers directly) eliminates a critical path on re-loading the RA as part of `ret`.
- With only one returned-to address in this microbenchmark, the `jmp rdi` is trivial for the indirect predictor to get right; but this example will overflow the return-stack predictor, and so in a real program it would do poorly.
- In other words, this is kind of hard to test without having a "real program" surrounding it.
Anyway, this is sort of getting into the weeds, and `ret imm` isn't obviously worse, so I think we should keep this until/unless we discover it becoming a bottleneck.
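A rough sketch of what the three variants might look like, assuming 16 bytes of callee-popped stack arguments and arbitrary scratch registers (the actual benchmark code isn't shown in this thread):

```asm
; 1. `ret imm`: hardware pops the return address, then 16 bytes of arguments.
    ret  16

; 2. Move the return address and `ret`: re-store the RA at the new stack top.
    mov  rcx, [rsp]     ; load the return address
    add  rsp, 16        ; advance rsp by the 16 argument bytes
    mov  [rsp], rcx     ; re-store the RA at the new top of stack
    ret                 ; plain ret now leaves rsp where `ret 16` would

; 3. Load the RA, manually pop, and jump indirect.
    mov  rdi, [rsp]     ; load the return address
    add  rsp, 24        ; pop the RA plus the 16 bytes of arguments
    jmp  rdi            ; indirect branch back to the caller
```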
bjorn3 created PR review comment:
Makes sense. I didn't know that `ret imm` would move the stack pointer after reading the return address.