cfallin opened PR #8221 from cfallin:a-load-a-day-keeps-the-torn-stores-away
to bytecodealliance:main
:
As discussed at in #7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that is legal might succeed.
This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above.
Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V.
This PR implements the idea first proposed [here], namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect.
This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified.
On an Apple M2 Max machine (aarch64), this was measured to have a 2% performance impact when running
spidermonkey.wasm
with a simple recursive Fibonacci program. It can be used via the-Ccranelift-ensure_precise_store_traps=true
flag to Wasmtime.[here]:
https://github.com/WebAssembly/design/issues/1490#issuecomment-1762043224<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested elliottt for a review on PR #8221.
cfallin requested wasmtime-compiler-reviewers for a review on PR #8221.
cfallin requested fitzgen for a review on PR #8221.
cfallin requested wasmtime-core-reviewers for a review on PR #8221.
cfallin commented on PR #8221:
@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.
cfallin updated PR #8221.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Stack stores are guaranteed to succeed when stack probing (or some other kind of stack size check) is used.
github-actions[bot] commented on PR #8221:
Subscribe to Label Action
cc @peterhuene
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:meta", "wasmtime:api"Thus the following users have been cc'd because of the following labels:
- peterhuene: wasmtime:api
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
this was measured to have a 2% performance impact
How does this only have a 2% perf impact? Is some optimization pass removing part of the loads or is it the M2 chip being able to exploit a lot of instruction level parallelism?
cfallin commented on PR #8221:
Probably the latter -- the loads are not removed, and make it all the way to machine code.
cfallin commented on PR #8221:
(More specifically, a load inserted by this pass could later be removed by alias analysis, but only if subsumed by another earlier load to the same address, so there is still always some load before the store.)
cfallin commented on PR #8221:
One other performance-related note: this will possibly have worse impact when the pooling allocator with copy-on-write is enabled in Wasmtime, as the "first touch" to a page matters: if demanded in by a load first, a read-only mapping is created, which might immediately fault again on the store (doing the actual CoW). We've seen such "mapping change" operations become bottlenecks in multithreaded processes due to IPIs. I don't expect we would turn this option on in any production setting though; rather we'd ensure (as is already the case on x86) any production hardware does not have the torn-write behavior. But that's something to consider if we ever discuss having this on by default in all cases.
afonso360 commented on PR #8221:
I can confirm that this fixes the torn writes at least on QEMU. :tada:
Interestingly QEMU RISC-V only shows this torn write test when using the vector extensions to perform the write (
v128
), but not using the normali64.store
.I'm going to test this on real hardware over the weekend, and also see if I can run some benchmarks.
For completeness here's the WAST test that I used.
<details><summary>WAST Test</summary>(module (memory 1) (func (export "i128.store") (param i32 i64) local.get 0 v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff v128.store align=1) (func (export "i32.load8_u") (param i32) (result i32) local.get 0 i32.load8_u)) (assert_trap (invoke "i128.store" (i32.const 65529) (i64.const 0xffffffffffffffff)) "out of bounds memory access") ;; Partial bytes were not written. (assert_return (invoke "i32.load8_u" (i32.const 65529)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65530)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65531)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65532)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65533)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65534)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65535)) (i32.const 0))
</details>
cfallin edited PR #8221:
As discussed at in #7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) does not also have some side-effects. In particular, the part of the store that is legal might succeed.
This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above.
Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and we have observed it in tests on RISC-V.
This PR implements the idea first proposed [here], namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect.
This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified.
On an Apple M2 Pro machine (aarch64), this was measured to have a 2% performance impact when running
spidermonkey.wasm
with a simple recursive Fibonacci program. It can be used via the-Ccranelift-ensure_precise_store_traps=true
flag to Wasmtime.[here]:
https://github.com/WebAssembly/design/issues/1490#issuecomment-1762043224<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I think atomics can be exempted here because they are already required to be aligned? At least at the wasm layer that's validated, I'm not sure about the Cranelift layer. This isn't needed for the wasm-level problem though.
alexcrichton created PR review comment:
Can this be skipped if
flags.trap_code().is_none()
or ifflags.aligned()
? Only possibly-trapping non-aligned loads (which is all wasm loads, but not anything to a vmctx for example)
tschneidereit commented on PR #8221:
I fully assume that I'm missing something and the answer is no, but just in case: would it not also work to first attempt to do the part of the store operation touching the highest page, and then do the rest of the write? ISTM that if any writes would fail that'd be guaranteed to include the highest page, and if that one succeeds, a write to all lower pages for the rest of the store operation is certain to succeed as well?
cfallin commented on PR #8221:
The problem I see with that approach is that we only know that dynamically: consider if we have a 64-bit store (allowed to be unaligned) to address
X
, we would have to decompose it into bytes and perform them in reverse order (storeX+7
, thenX+6
, then ...) always, because we know nothing aboutX
's alignment. A static analysis could sometimes knowX
's alignment, but usually not, e.g. if it is a value loaded from memory itself. So that is a store-amplification factor of 8 for 64-bit ops, 4 for 32-bit ops, etc, likely to have quite a big perf impact, I think.
tschneidereit commented on PR #8221:
Hmm, I see, I think. Could we then instead write a single byte (of whatever value) at the highest address, and do the actual store if that succeeded?
cfallin commented on PR #8221:
I think that would work, yeah. There are some other interesting performance tradeoffs here -- two that come to mind are "partial-store forwarding" (partially overlapping loads/stores can be problematic for uarch memory perf) and the fact that stores in general are more expensive than loads (store buffers typically fewer entries than load buffers so can stall the OoO window, stores take extra effort at retire) that, balanced against all the other considerations here (we want something we could turn on by default on tier-2 architecture(s) if needed, and aren't as concerned with interactions with IPIs in pooling allocator where we're more likely to be very careful to select the right configuration and avoid this mechanism), I think I'd still prefer the load+store over narrow-store+store.
tschneidereit commented on PR #8221:
That all makes sense to me. I was mainly concerned about the impact on the pooling allocator, and the missing piece was the ability to select a configuration that'd allow us to avoid any of this.
cfallin commented on PR #8221:
Ah yes, indeed. This is only a problem on aarch64 and riscv64, and AFAICT all aarch64 hardware one would want to run a high-performance server on (Apple implementations, AWS machines, Ampere CPUs, even RPi4) do not exhibit store-tearing, so on aarch64 it's mostly theoretical and/or on very small (low-power / embedded) cores I think.
afonso360 commented on PR #8221:
I can confirm that this fix also works in real hardware. I ran a few benchmarks at random, and it does have a bit of impact on smaller cores (as expected). Most benchmarks are 5-10% slower, but there is one outlier at 40%
The results below are on a Dual-Issue In Order Core (Sifive u74).
<details><summary>Results</summary>
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 1459784.00 ± 60785.14 (confidence = 99%) noload.so is 1.07x to 1.07x faster than load.so! [22663159 22821360.40 22862178] load.so [21326450 21361576.40 21405304] noload.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm Δ = 2560349.00 ± 1538031.16 (confidence = 99%) noload.so is 1.01x to 1.04x faster than load.so! [110537308 111942491.90 113530980] load.so [108818829 109382142.90 111609052] noload.so instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [3465 3954.40 6169] load.so [3444 4122.00 6204] noload.so execution :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 92094.60 ± 29972.88 (confidence = 99%) noload.so is 1.06x to 1.12x faster than load.so! [1100973 1120327.60 1169522] load.so [1019878 1028233.00 1083605] noload.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [2281318 2391173.30 2505976] load.so [2212191 2312207.60 2376594] noload.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [1028 1098.00 1448] load.so [1050 1103.10 1170] noload.so execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 66417.80 ± 543.64 (confidence = 99%) noload.so is 1.39x to 1.40x faster than load.so! [233706 234547.90 235270] load.so [167648 168130.10 168742] noload.so compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [4541616 4965704.40 5992445] load.so [4423989 4751356.50 5875859] noload.so instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [1714 2113.50 2415] load.so [1798 2135.50 2299] noload.so compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm Δ = 44240.30 ± 40968.63 (confidence = 99%) noload.so is 1.00x to 1.03x faster than load.so! [2848804 2880309.90 2923861] load.so [2803879 2836069.60 2912357] noload.so instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm No difference in performance. [1186 1391.10 1495] load.so [1185 1445.70 1604] noload.so execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm No difference in performance. [9675 9782.10 10005] load.so [9572 9637.80 9814] noload.so instantiation :: cycles :: benchmarks/regex/benchmark.wasm Δ = 340.10 ± 243.11 (confidence = 99%) noload.so is 1.03x to 1.19x faster than load.so! [2939 3406.10 3681] load.so [2992 3066.00 3281] noload.so execution :: cycles :: benchmarks/regex/benchmark.wasm Δ = 245010.10 ± 1279.07 (confidence = 99%) noload.so is 1.06x to 1.06x faster than load.so! [4076050 4077873.90 4079406] load.so [3831298 3832863.80 3834736] noload.so compilation :: cycles :: benchmarks/regex/benchmark.wasm Δ = 369482.20 ± 93448.03 (confidence = 99%) noload.so is 1.03x to 1.04x faster than load.so! [10706461 10872794.20 10932079] load.so [10360994 10503312.00 10590922] noload.so
</details>
candymate commented on PR #8221:
@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.
I'm sorry that I saw this just now - I'll check this in a day or two from now and let you know.
candymate edited a comment on PR #8221:
@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.
I'm sorry that I saw this just now - I'll check this in a day or two from now and let you know.
Tiny update: I'm sorry to say this, but I'm not available till 5/1. I'll see this again at that time.
I'm not sure about this, but I think I observed that this problem persists with QEMU 8.1.5 - need recheck...
Last updated: Jan 24 2025 at 00:11 UTC