fitzgen edited issue #7237:
Apparently on ARM when
- a store is unaligned
- and crosses a page boundary
- and one page faults but the other doesn't
then there is no guarantee that the part of the store to the page that didn't fault will not succeed, despite the other part raising a signal.
This means that partially out-of-bounds Wasm stores that trigger a trap can potentially mutate memory for the in-bounds portion of the write, which is not spec compliant.
Apparently it is implementation-defined behavior, so it may or may not be an issue on any given ARM machine.
Thus far, @cfallin tested on the ARM machines he has access to and none of the following have failed the attached test case:
- Apple M2
- RPi4
- the BA ARM server ("Neoverse N1 core, I think, in an Ampere Altra CPU")
Test Case
(module (memory 1) (func (export "i64.store") (param i32 i64) local.get 0 local.get 1 i64.store) (func (export "i32.load8_u") (param i32) (result i32) local.get 0 i32.load8_u)) (assert_trap (invoke "i64.store" (i32.const 65529) (i64.const 0xffffffffffffffff)) "out of bounds memory access") ;; Partial bytes were not written. (assert_return (invoke "i32.load8_u" (i32.const 65529)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65530)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65531)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65532)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65533)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65534)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65535)) (i32.const 0))
See Also
fitzgen edited issue #7237:
Apparently on ARM (edit: and riscv) when
- a store is unaligned
- and crosses a page boundary
- and one page faults but the other doesn't
then there is no guarantee that the part of the store to the page that didn't fault will not succeed, despite the other part raising a signal.
This means that partially out-of-bounds Wasm stores that trigger a trap can potentially mutate memory for the in-bounds portion of the write, which is not spec compliant.
Apparently it is implementation-defined behavior, so it may or may not be an issue on any given ARM machine.
Thus far, @cfallin tested on the ARM machines he has access to and none of the following have failed the attached test case:
- Apple M2
- RPi4
- the BA ARM server ("Neoverse N1 core, I think, in an Ampere Altra CPU")
The test case does fail on the following machines we have tested:
- JH7110 (SiFive u74 core) (riscv)
Test Case
(module (memory 1) (func (export "i64.store") (param i32 i64) local.get 0 local.get 1 i64.store) (func (export "i32.load8_u") (param i32) (result i32) local.get 0 i32.load8_u)) (assert_trap (invoke "i64.store" (i32.const 65529) (i64.const 0xffffffffffffffff)) "out of bounds memory access") ;; Partial bytes were not written. (assert_return (invoke "i32.load8_u" (i32.const 65529)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65530)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65531)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65532)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65533)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65534)) (i32.const 0)) (assert_return (invoke "i32.load8_u" (i32.const 65535)) (i32.const 0))
See Also
sparker-arm commented on issue #7237:
if the platform doesn't guarantee consistency in its behavior
The Arm architecture doesn't give you any guarantees, but if a specific OS only runs on specific microarchitectures, that do provide the semantics, then it will be conforming. But, of course, there's no guarantee that future microarchitectures on that platform will continue to provide the necessary semantics.
bjorn3 commented on issue #7237:
It also has the issue of migrating between cpu's. Whether for live vm migration, or for something like rr.
alexcrichton commented on issue #7237:
This was rediscovered recently with fuzzing
cfallin commented on issue #7237:
I'll copy here the idea I posted in the Wasm design thread, which AFAIK is the lowest-overhead potential solution today if the spec doesn't change: insert a load of the same size before every store. This is correct in the sense that if we're about to store to an address, a load to that same address should be legal, and if it traps the store would have trapped; and prevents side-effects because if one instruction occurs before another and traps, the side-effects of the latter must not occur at all, architecturally at least (here we're not thinking about speculation; also there was some discussion in the other thread about memory models and atomics, but this is fully within a single thread and relies only on instruction order in that thread). I think I will prototype this idea just to see what the overhead is, and to allow folks with write-tearing machines to test.
sparker-arm commented on issue #7237:
I will be very interested in hearing the results! I've currently got a prototype patch for V8 which enables trap handling for byte stores and for any store for which we know is aligned, and that gives a 2% uplift in some benchmarks.
cfallin commented on issue #7237:
I've implemented the idea in #8221 and I see a 2% perf degradation (on an Apple M2 Max machine when running spidermonkey.wasm in Wasmtime). That seems reasonable to me to enable under some conditions -- not universally -- if the spec remains the same. It's still a hard question exactly what those conditions are: require a non-default setting for precise post-trap memory state? Perform a torn-store test at startup and fail if option was not enabled during compilation (or turn it on, the same way we infer native flags otherwise -- "non-tearing" hardware is then like an ISA extension in that sense)?
pervognsen commented on issue #7237:
Alex Bradbury pointed me to this thread. I was working on the specification of a low-level sandbox ISA and realized that partially visible page-crossing stores would be a potential issue if I overspecified it relative to what guarantees existing ISAs could offer. In my case I have a page-based memory model with both read and read/write permissions per page, so load probing does not suffice.
However, if you specify maximally-complete bytewise store semantics (i.e. store every byte that can be stored given the page permissions) I believe you should be able to implement it with software emulation in the fault handler via idempotent stores: the architectural state at the faulting instruction still has the address and the source operand accessible (in registers in my case or on the stack in Wasm's case). This would have zero performance impact on the non-faulting code path. In Wasm's case I don't know if this would be inconsistent with the specification in other areas (e.g. concurrency) but I thought I'd mention it.
pervognsen edited a comment on issue #7237:
Alex Bradbury pointed me to this thread. I was working on the specification of a low-level sandbox ISA and realized that partially visible page-crossing stores would be a potential issue if I overspecified it relative to what guarantees existing ISAs could offer. In my case I have a page-based memory model with both read and read/write permissions per page, so load probing does not suffice Edit: on second thought, load probing _would_ suffice but you'd need to keep the load-probed bytes in a temp register so the fault handler could undo a partial store by restoring the original bytes.
However, if you specify maximally-complete bytewise store semantics (i.e. store every byte that can be stored given the page permissions) I believe you should be able to implement it with software emulation in the fault handler via idempotent stores: the architectural state at the faulting instruction still has the address and the source operand accessible (in registers in my case or on the stack in Wasm's case). This would have zero performance impact on the non-faulting code path. In Wasm's case I don't know if this would be inconsistent with the specification in other areas (e.g. concurrency) but I thought I'd mention it.
pervognsen edited a comment on issue #7237:
Alex Bradbury pointed me to this thread. I was working on the specification of a low-level sandbox ISA and realized that partially visible page-crossing stores would be a potential issue if I overspecified it relative to what guarantees existing ISAs could offer. In my case I have a page-based memory model with both read and read/write permissions per page, so load probing does not suffice Edit: on second thought, load probing _would_ suffice if you kept the load-probed bytes in a temp register so the fault handler could undo a partial store by restoring the original bytes.
However, if you specify maximally-complete bytewise store semantics (i.e. store every byte that can be stored given the page permissions) I believe you should be able to implement it with software emulation in the fault handler via idempotent stores: the architectural state at the faulting instruction still has the address and the source operand accessible (in registers in my case or on the stack in Wasm's case). This would have zero performance impact on the non-faulting code path. In Wasm's case I don't know if this would be inconsistent with the specification in other areas (e.g. concurrency) but I thought I'd mention it.
pervognsen edited a comment on issue #7237:
Alex Bradbury pointed me to this thread. I was working on the specification of a low-level sandbox ISA and realized that partially visible page-crossing stores would be a potential issue if I overspecified it relative to what guarantees existing ISAs could offer. In my case I have a page-based memory model with both read and read/write permissions per page, so load probing does not suffice Edit: on second thought, load probing _would_ suffice if you kept all loadable bytes in a temp register so the fault handler could undo a partial store by restoring the original bytes. But this requires you to split the unaligned load probe into two aligned load probes which also adds some alignment math.
However, if you specify maximal bytewise store semantics (i.e. store every byte that can be stored given the page permissions) I believe you should be able to implement it with software emulation in the fault handler via idempotent stores: the architectural state at the faulting instruction still has the address and the source operand accessible (in registers in my case or on the stack in Wasm's case). This would have zero performance impact on the non-faulting code path. In Wasm's case I don't know if this would be inconsistent with the specification in other areas (e.g. concurrency) but I thought I'd mention it.
Last updated: Jan 24 2025 at 00:11 UTC