cfallin opened PR #11727 from cfallin:direct-stack-loads-stores to bytecodealliance:main:
We provide `stack_load` / `stack_store` / `stack_addr` instructions in Cranelift to operate on stack slots, and the first two are legalized to a `stack_addr` plus an ordinary load or store instruction.
We currently have lowerings for `stack_addr` that materialize an SP-relative address into a register: for example, `leaq 8(%rsp), %rax` on x86-64 or `add x0, sp, #8` on aarch64.
Taken together, we see sequences like (aarch64 / x86-64)
    add x0, sp, #8 / leaq 8(%rsp), %rax
    str x1, [x0] / movq %rdx, (%rax)
when using `stack_store`s. In particular, we do not use the direct SP-relative form, which would look like
    str x1, [sp, #8] / movq %rdx, 8(%rsp)
and which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores.
This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit).
This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
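To make the transformation concrete, here is a toy Rust model of folding a materialized stack address into the store that consumes it. It is only a sketch: the `Inst` enum, `uses`, and `fold_stack_stores` are hypothetical names invented for illustration, and the actual PR expresses this as ISLE lowering rules over the existing synthetic address mode rather than as a peephole pass over a flat instruction list.

```rust
/// Toy instruction set. These names are illustrative assumptions,
/// not Cranelift's real machine-instruction types.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// dst = SP + offset (models a lowered `stack_addr`).
    StackAddr { dst: u32, offset: i32 },
    /// Store register `src` to the address held in register `addr`.
    Store { src: u32, addr: u32 },
    /// Store register `src` directly to [SP + offset].
    StoreSp { src: u32, offset: i32 },
}

/// Returns true if `inst` reads register `reg`.
fn uses(inst: &Inst, reg: u32) -> bool {
    match inst {
        Inst::StackAddr { .. } => false,
        Inst::Store { src, addr } => *src == reg || *addr == reg,
        Inst::StoreSp { src, .. } => *src == reg,
    }
}

/// Folds `StackAddr` + `Store` pairs into a single SP-relative store
/// when the address register has no later reader (a conservative check).
fn fold_stack_stores(insts: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < insts.len() {
        if let (Inst::StackAddr { dst, offset }, Some(Inst::Store { src, addr })) =
            (&insts[i], insts.get(i + 1))
        {
            // The address must feed only this store for the fold to be safe.
            let addr_dead_after = !insts[i + 2..].iter().any(|inst| uses(inst, *dst));
            if addr == dst && src != dst && addr_dead_after {
                out.push(Inst::StoreSp { src: *src, offset: *offset });
                i += 2;
                continue;
            }
        }
        out.push(insts[i].clone());
        i += 1;
    }
    out
}
```

In this model, the pair `StackAddr { dst: 0, offset: 8 }` followed by `Store { src: 1, addr: 0 }` collapses to `StoreSp { src: 1, offset: 8 }`, mirroring the `add` + `str` to `str x1, [sp, #8]` example above. If the address register is read again later, the sequence is left unchanged.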
cfallin requested abrown for a review on PR #11727.
cfallin requested wasmtime-compiler-reviewers for a review on PR #11727.
cfallin requested pchickey for a review on PR #11727.
cfallin requested wasmtime-core-reviewers for a review on PR #11727.
github-actions[bot] commented on PR #11727:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
</details>
bjorn3 commented on PR #11727:
This is a much cleaner implementation than what I did for https://bytecodealliance.zulipchat.com/#narrow/channel/217117-cranelift/topic/stack_addr.20.2B.20load.2Fstore.20merging/with/540466352, while still having the exact same performance on x86_64 (aka cg_clif produces faster executables than llvm `-O0`) and also working on arm64. This passes the full cg_clif test suite on x86_64.
On arm64 I'm getting a test failure with the jit mode however. There is a call to `printf` with `0x10000e73c18d0` as address, but the expected string can be found at `0xffffe73c18d0` on the stack (the stack is from `0xfffffffdf000` to `0x1000000000000`).
bjorn3 edited a comment on PR #11727:
You can reproduce this by running `./test.sh` after patching the Cargo.toml of cg_clif to use the Cranelift from this PR.
Edit: Never mind. The test failure is unrelated to this PR.
Edit2: https://github.com/bytecodealliance/wasmtime/pull/11734 has the fix.
cfallin edited PR #11727:
Fixes #1064.
cfallin commented on PR #11727:
(In case others didn't see email updates from edits in bjorn3's comment above: the failure came from a `cg_clif` upgrade of Cranelift picking up a different regression; it is unrelated to this PR, which remains ready for review.)
abrown submitted PR review:
Makes sense!
cfallin merged PR #11727.
Last updated: Dec 06 2025 at 06:05 UTC