gfx opened issue #13461:
Summary
A Wasm GC
(ref null $t)value that is live across a call safepoint can read back as
null after the call, making aref.as_non_null(orstruct.get/array.get) on a
provably non-null value trap withwasm trap: null reference. The same module runs
correctly on 44.0.0.
git bisect(overv44.0.1..v45.0.0) points at:24938c41630e2662ebcdb0a37137f02fd48c4d07 — "Handle alias values in
SafepointSpiller::rewrite_use" (#13228)That PR makes the safepoint spiller reload aliased needs-stack-map values from their
stack slot after safepoints (needed for the new moving collector). For the affected
value, the post-safepoint reload appears to read a slot that was never spilled, so the
use seesnull.
Reproduces with every collector (
drc/null/copying) and at **every
Cranelift opt level**, because the reload code is emitted unconditionally.The follow-up #13449 ("include every SSA value bound to a variable in stack maps")
does not fix this case.Still present on
main— testedfa50c6a150(46.0.0-dev).Reproduction
A self-contained regression test (
.wast) is on this branch:https://github.com/wado-lang/wasmtime/blob/gfx/repro-gc-safepoint-reload-null/tests/misc_testsuite/gc/safepoint-reload-aliased-ref-null.wast
It is a standard
tests/misc_testsuite/gc/test —(assert_return (invoke "run"))on a
GC module.runreturns without trapping on 44.0.0 and traps withnull referenceon
45.0.0 andmain.You can also run the module directly:
$ FEATURES="-Wgc=y -Wfunction-references=y -Wwide-arithmetic=y -Wsimd=y -Wthreads=y -Wreference-types=y" # 44.0.0 — OK (exit 0) $ wasmtime run $FEATURES --invoke run repro.wasm # 45.0.0 / main — traps $ wasmtime run $FEATURES --invoke run repro.wasm ... wasm trap: null referenceThe module was produced by a real frontend (the Wado
compiler): it JSON-deserializes a two-field struct in a loop. The trapping
ref.as_non_nullreads a struct-access object that is built withstruct.new(so it
cannot be null) and is held in a local that is live across the call fetching the next
field; on the second loop iteration the reloaded local reads back null. The module passes
wasm-tools validate --features all.On minimization
I could not reduce this with
wasm-tools shrink(0% — the module is already DCE'd and the
miscompile is register-allocation-sensitive, so any structural mutation hides it). Happy
to provide the source program or help with a CLIF-level reduction.Environment
- Cranelift/wasmtime 45.0.0 (
377cd917a) andmain(fa50c6a150, 46.0.0-dev)- host: aarch64-apple-darwin, Cranelift backend
- first bad commit:
24938c41630e2662ebcdb0a37137f02fd48c4d07(#13228)
gfx edited issue #13461:
Summary
A Wasm GC
(ref null $t)value that is live across a call safepoint can read back as
null after the call, making aref.as_non_null(orstruct.get/array.get) on a
provably non-null value trap withwasm trap: null reference. The same module runs
correctly on 44.0.0.
git bisect(overv44.0.1..v45.0.0) points at:24938c41630e2662ebcdb0a37137f02fd48c4d07 — "Handle alias values in
SafepointSpiller::rewrite_use" (#13228)That PR makes the safepoint spiller reload aliased needs-stack-map values from their
stack slot after safepoints (needed for the new moving collector). For the affected
value, the post-safepoint reload appears to read a slot that was never spilled, so the
use seesnull.
Reproduces with every collector (
drc/null/copying) and at **every
Cranelift opt level**, because the reload code is emitted unconditionally.The follow-up #13449 ("include every SSA value bound to a variable in stack maps")
does not fix this case.Still present on
main— testedfa50c6a150(46.0.0-dev).Reproduction
A self-contained regression test (
.wast) is on this branch:https://github.com/wado-lang/wasmtime/blob/gfx/repro-gc-safepoint-reload-null/tests/misc_testsuite/gc/safepoint-reload-aliased-ref-null.wast
It is a standard
tests/misc_testsuite/gc/test —(assert_return (invoke "run"))on a
GC module.runreturns without trapping on 44.0.0 and traps withnull referenceon
45.0.0 andmain.You can also run the module directly:
$ FEATURES="-Wgc=y -Wfunction-references=y -Wwide-arithmetic=y -Wsimd=y -Wthreads=y -Wreference-types=y" # 44.0.0 — OK (exit 0) $ wasmtime run $FEATURES --invoke run repro.wasm # 45.0.0 / main — traps $ wasmtime run $FEATURES --invoke run repro.wasm ... wasm trap: null referenceThe module was produced by a real frontend (the Wado
compiler): it JSON-deserializes a two-field struct in a loop. The trapping
ref.as_non_nullreads a struct-access object that is built withstruct.new(so it
cannot be null) and is held in a local that is live across the call fetching the next
field; on the second loop iteration the reloaded local reads back null. The module passes
wasm-tools validate --features all.On minimization
I could not reduce this with
wasm-tools shrink(0% — the module is already DCE'd and the
miscompile is register-allocation-sensitive, so any structural mutation might hide it). Happy
to provide the source program or help with a CLIF-level reduction.Environment
- Cranelift/wasmtime 45.0.0 (
377cd917a) andmain(fa50c6a150, 46.0.0-dev)- host: aarch64-apple-darwin, Cranelift backend
- first bad commit:
24938c41630e2662ebcdb0a37137f02fd48c4d07(#13228)
alexcrichton added the wasm-proposal:gc label to Issue #13461.
alexcrichton commented on issue #13461:
cc @fitzgen
vouillon commented on issue #13461:
Claude suggests to swap these two steps:
The spiller's stack-slot allocator reuses slots via a free-list. The reuse invariant is: a slot freed at a def can only be handed back out to values processed strictly later in the pass (= strictly earlier in program order), so liveness guarantees no overlap. Per-instruction the pass currently runs
rewrite_def→rewrite_safepoint→rewrite_use.That order breaks the invariant whenever a single instruction is both a safepoint and the def of a needs-stack-map value (e.g.
v_def = call $alloc).rewrite_deffreesv_def's slot into the pool, thenrewrite_safepointrequests slots for everything live across the call; any live-across value that doesn't yet have a slot popsv_def's just-freed slot. The IR ends up storingv_definto a slot that another live-across value is also reading from, and in a loop the next iteration's reload returnsv_definstead of the original value — manifesting aswasm trap: nullreference onref.as_non_null.The bug pre-existed #13228; that commit only made it observable by rewriting aliased uses into
stack_loadsinstead of leaving them as direct register reads.Fix: do
rewrite_safepointbeforerewrite_defso live-across slots are reserved beforev_def's slot is returned to the pool. The two steps commute (disjoint state: stack-map metadata vs. inserting astack_store), andrewrite_def's freed slot then only becomes available to values processed later in the pass, restoring the invariant.
cfallin commented on issue #13461:
Claude suggests to swap these two steps:
Hi @vouillon -- without commenting on the rest of your message, please do not do this. Our AI tool usage policy requires a human in the loop, and copying/pasting something straight from your agent is explicitly against the spirit of our policy. We are talking to you, and you take responsibility for everything you say and all code that you post here. If you do not have enough of a solid grasp on your contributions to speak yourself about them, but have to transparently delegate to your agent, then that is a good sign that you should spend more time understanding it yourself first. (You are of course free to use an agent internally; but we interact with you only.) Doing otherwise is an "extractive contribution" per our policy: it says that you don't want to bother parsing and working through details but just want to throw that work at us. Thanks!
vouillon commented on issue #13461:
Claude suggests to swap these two steps:
Hi @vouillon -- without commenting on the rest of your message, _please do not do this_. Our AI tool usage policy requires a human in the loop, and copying/pasting something straight from your agent is _explicitly_ against the spirit of our policy. We are talking to _you_, and _you_ take responsibility for everything you say and all code that you post here. If you do not have enough of a solid grasp on your contributions to speak yourself about them, but have to transparently delegate to your agent, then that is a good sign that you should spend more time understanding it yourself first. (You are of course free to use an agent internally; but we interact with _you_ only.) Doing otherwise is an "extractive contribution" per our policy: it says that you don't want to bother parsing and working through details but just want to throw that work at us. Thanks!
Sorry, I should have phrased this differently, and should have been clearer. This change fixes the issue, and does not seem to break anything, including a quite large test suite with which I found two other issues (#13449 and #13450). I have also read the AI-generated explanation, and it makes a lot of sense to me. So, even though I have not taken the time to make a PR, I thought it would save you some time to have this information. ("Our golden rule is that a contribution should be worth more to the project than the time it takes to review it.")
cfallin commented on issue #13461:
Per our AI guidelines, such a note would be quite welcome -- if written by a human rather than "Claude seems to say: \<dump\>". I hope you can appreciate that we get increasingly large mountains of very verbose AI output, some of it right, some of it wrong, and some of it not even wrong in extraordinarily strange ways, and so the only realistic approach with our limited time is to consider all such text as un-reviewed by a human until proven otherwise. That's the essence behind our policy. A "I've tested this fix and it works, what do you think?" note would have led to a much better reception!
cfallin edited a comment on issue #13461:
Per our AI guidelines, such a note would be quite welcome -- if written by a human rather than "Claude suggests: \<dump\>". I hope you can appreciate that we get increasingly large mountains of very verbose AI output, some of it right, some of it wrong, and some of it not even wrong in extraordinarily strange ways, and so the only realistic approach with our limited time is to consider all such text as un-reviewed by a human until proven otherwise. That's the essence behind our policy. A "I've tested this fix and it works, what do you think?" note would have led to a much better reception!
vouillon edited a comment on issue #13461:
Swapping these two steps seems to fix the issue:
The rational is that one should reserve slots for all the needs-stack-map values before releasing the slot for the value defined by the instruction. Otherwise, one can assign the same slot for both a value live across the instruction and the value just defined.
gfx commented on issue #13461:
We're hitting another miscompile from #13228 that #13469 does not fix — a second case on a non-moving (
drc) collector, which is what Wado uses.Bisected (clean, over
[#13217..v45.0.0]) to #13228 (24938c4163). The same wado-generated wasm runs on wasmtime 44 and traps on 45; stockv45.0.0traps, and so does our fork with #13469 already applied.Reproduce (failing source + details in this gist: https://gist.github.com/gfx/50948a70258ab94e1dd52011894a4848):
git clone https://github.com/wado-lang/wado cd wado git checkout gfx/wasmtime-45 git submodule update --init vendor/wasmtime # = wado-lang/wasmtime @ gfx/wasmtime-45 (v45.0.0 + #13469) cargo build --release --bin wado ./target/release/wado test -O3 package-gale/src/codegen_lr_label_test.wadoTraps in a GC-reference-heavy codegen function:
error while executing at wasm backtrace: 0: ...!./codegen.wado/emit_lr_alt_struct 3: ...!./codegen.wado/generate 4: ...!__test_0_labeled_left_recursive_alt_routes_through_gen_scan_lr_functionsThe same source on wasmtime 44 (wado
main) passes. It needs-O3(aggressive inlining) and reproduces with cranelift opt levelnone(the test harness usesConfig::cranelift_opt_level(OptLevel::None)).I tried to reduce this to a standalone
wasmtime runcase but couldn't get one to fire: invoking the same.wasmthrough its component export does not trap — it only reproduces through the test harness's instantiate-and-call path, so it's regalloc-fragile like the original #13461 repro. Happy to dig further if a smaller case would help.Since #13228's own commit message notes these reloads are unnecessary for non-moving collectors, and it has now produced two distinct miscompiles for them (this and #13461), I'd lean toward reverting #13228 rather than fixing it case by case.
gfx edited a comment on issue #13461:
We're hitting another miscompile from #13228 that #13469 does not fix — a second case on a non-moving (
drc) collector, which is what Wado uses.Bisected (clean, over
[#13217..v45.0.0]) to #13228 (24938c4163). The same wado-generated wasm runs on wasmtime 44 and traps on 45; stockv45.0.0traps, and so does our fork with #13469 already applied.Reproduce (failing source + details in this gist: https://gist.github.com/gfx/50948a70258ab94e1dd52011894a4848):
git clone https://github.com/wado-lang/wado cd wado git checkout gfx/wasmtime-45 git submodule update --init vendor/wasmtime # = wado-lang/wasmtime @ gfx/wasmtime-45 (v45.0.0 + #13469) cargo build --release --bin wado ./target/release/wado test -O3 package-gale/src/codegen_lr_label_test.wadoTraps in a GC-reference-heavy codegen function:
error while executing at wasm backtrace: 0: ...!./codegen.wado/emit_lr_alt_struct 3: ...!./codegen.wado/generate 4: ...!__test_0_labeled_left_recursive_alt_routes_through_gen_scan_lr_functionsThe same source on wasmtime 44 (wado
main) passes. It needswado -O3and reproduces with cranelift opt levelnone(the test harness usesConfig::cranelift_opt_level(OptLevel::None)).I tried to reduce this to a standalone
wasmtime runcase but couldn't get one to fire: invoking the same.wasmthrough its component export does not trap — it only reproduces through the test harness's instantiate-and-call path, so it's regalloc-fragile like the original #13461 repro. Happy to dig further if a smaller case would help.Since #13228's own commit message notes these reloads are unnecessary for non-moving collectors, and it has now produced two distinct miscompiles for them (this and #13461), I'd lean toward reverting #13228 rather than fixing it case by case.
vouillon commented on issue #13461:
We're hitting another miscompile from #13228 that #13469 does not fix — a second case on a non-moving (
drc) collector, which is what Wado uses.This other issue should be fixed by #13480.
vouillon commented on issue #13461:
Per our AI guidelines, such a note would be quite welcome -- if written by a human rather than "Claude suggests: <dump>". I hope you can appreciate that we get increasingly large mountains of very verbose AI output, some of it right, some of it wrong, and some of it _not even wrong_ in extraordinarily strange ways, and so the only realistic approach with our limited time is to consider all such text as un-reviewed by a human until proven otherwise. That's the essence behind our policy. A "I've tested this fix and it works, what do you think?" note would have led to a _much_ better reception!
I totally understand. I apologize for my clumsy message. I'll be more careful going forward.
fitzgen closed issue #13461:
Summary
A Wasm GC
(ref null $t)value that is live across a call safepoint can read back as
null after the call, making aref.as_non_null(orstruct.get/array.get) on a
provably non-null value trap withwasm trap: null reference. The same module runs
correctly on 44.0.0.
git bisect(overv44.0.1..v45.0.0) points at:24938c41630e2662ebcdb0a37137f02fd48c4d07 — "Handle alias values in
SafepointSpiller::rewrite_use" (#13228)That PR makes the safepoint spiller reload aliased needs-stack-map values from their
stack slot after safepoints (needed for the new moving collector). For the affected
value, the post-safepoint reload appears to read a slot that was never spilled, so the
use seesnull.
Reproduces with every collector (
drc/null/copying) and at **every
Cranelift opt level**, because the reload code is emitted unconditionally.The follow-up #13449 ("include every SSA value bound to a variable in stack maps")
does not fix this case.Still present on
main— testedfa50c6a150(46.0.0-dev).Reproduction
A self-contained regression test (
.wast) is on this branch:https://github.com/wado-lang/wasmtime/blob/gfx/repro-gc-safepoint-reload-null/tests/misc_testsuite/gc/safepoint-reload-aliased-ref-null.wast
It is a standard
tests/misc_testsuite/gc/test —(assert_return (invoke "run"))on a
GC module.runreturns without trapping on 44.0.0 and traps withnull referenceon
45.0.0 andmain.You can also run the module directly:
$ FEATURES="-Wgc=y -Wfunction-references=y -Wwide-arithmetic=y -Wsimd=y -Wthreads=y -Wreference-types=y" # 44.0.0 — OK (exit 0) $ wasmtime run $FEATURES --invoke run repro.wasm # 45.0.0 / main — traps $ wasmtime run $FEATURES --invoke run repro.wasm ... wasm trap: null referenceThe module was produced by a real frontend (the Wado
compiler): it JSON-deserializes a two-field struct in a loop. The trapping
ref.as_non_nullreads a struct-access object that is built withstruct.new(so it
cannot be null) and is held in a local that is live across the call fetching the next
field; on the second loop iteration the reloaded local reads back null. The module passes
wasm-tools validate --features all.On minimization
I could not reduce this with
wasm-tools shrink(0% — the module is already DCE'd and the
miscompile is register-allocation-sensitive, so any structural mutation might hide it). Happy
to provide the source program or help with a CLIF-level reduction.Environment
- Cranelift/wasmtime 45.0.0 (
377cd917a) andmain(fa50c6a150, 46.0.0-dev)- host: aarch64-apple-darwin, Cranelift backend
- first bad commit:
24938c41630e2662ebcdb0a37137f02fd48c4d07(#13228)
Last updated: Jun 01 2026 at 09:49 UTC