matthargett opened PR #13446 from rebeckerspecialties:claude/pulley-fusion-xband-brif-upstream to bytecodealliance:main:
TL;DR
Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the `call_indirect` lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a `call_indirect{1,2,3,4}` family mirroring direct-call `call{1,2,3,4}`. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 A14 Icestorm E-core, Apple Watch SE2 S8, and M4 Sawtooth E-core. iPhone XS A12 Tempest is mixed; branch-prediction pressure is the cross-microarch variable.
Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on `xmrsplayer` on Apple Watch SE2, the closest cross-device result in our matrix.
Dependency
Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against `main` includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on `is_eagerly_initialized_funcref_table` (the predicate added in #13445), so it only fires when the table-mutability proof holds.
Stack
13 commits on top of #13445:
- Phases 1–3: collapse `band + brif + 2 xloads` at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
- Phase 4: `call_indirect{1,2,3,4}` opcodes mirror direct-call `call{1,2,3,4}`. `Inst::IndirectCall` bundles the first 4 integer ABI args into the call opcode instead of synthesising `xmov`s via regalloc `reg_fixed_use`.
- Correctness: handlers trap on null (a slow-path-aliasing review concern: `sink_pure_inst` of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate); `#[cfg(debug_assertions)]` predecessor-count assertion in `pre_lower` to catch future structural regressions.
Wallclock medians, N=10, phase-4 vs `table-mutability-tracking` baseline
`BENCH_TARGET_MS=2000`; `.utility` QoS on iOS / `taskpolicy -b` on M4.
| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
|---|---|---|---|
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |
PMU buckets (single 12 s xctrace `CPU Counters` per workload)
iPhone 12 attach-mode + M4 launch-mode. Phase-4 bucket shares:
- A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.
- A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded — interpreter-loop mispredict pressure.
- M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width.
Common bottleneck across A14 + M4 is Processing (back-end-bound). A12 Tempest doesn't expose `CounterMetricByThread`; iPhone XS is wallclock-only.
Verification
- 2237 / 2237 cranelift filetests
- 13 / 13 `cranelift` `pulley_call_*` integration tests
- 21 min `cargo fuzz run differential --no-default-features` with `ALLOWED_ENGINES=pulley,wasmtime`: 0 crashes / 0 divergences
Cross-device measurement harness + raw N=10 logs + per-phase writeups: rebeckerspecialties/wasm-benchmark#1.
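For readers who want to reproduce the percentage columns in the wallclock table, here is a minimal sketch of how a "median of N runs, relative to baseline" figure can be computed. The exact formula used by the harness is not stated in this PR, so the function names and the sample timings below are illustrative assumptions, not taken from the benchmark repository.

```rust
/// Median of a set of wallclock samples (e.g. milliseconds per run).
fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("samples must not be NaN"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        // Even count (such as N=10): average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Relative change of the candidate median against the baseline median,
/// in percent. Negative means the candidate is faster.
fn percent_delta(baseline: &[f64], candidate: &[f64]) -> f64 {
    let b = median(baseline);
    let c = median(candidate);
    (c - b) / b * 100.0
}

fn main() {
    // Made-up timings for illustration only; not measurements from this PR.
    let baseline = [
        2000.0, 2010.0, 1990.0, 2005.0, 1995.0, 2002.0, 1998.0, 2001.0, 1999.0, 2000.0,
    ];
    let candidate = [
        1900.0, 1910.0, 1890.0, 1905.0, 1895.0, 1902.0, 1898.0, 1901.0, 1899.0, 1900.0,
    ];
    println!("{:+.2} %", percent_delta(&baseline, &candidate));
}
```

Using the median rather than the mean keeps a single throttled or preempted run from skewing the result, which matters on devices measured under background QoS.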
matthargett requested cfallin for a review on PR #13446.
matthargett requested wasmtime-compiler-reviewers for a review on PR #13446.
matthargett requested wasmtime-core-reviewers for a review on PR #13446.
matthargett requested fitzgen for a review on PR #13446.
matthargett requested wasmtime-default-reviewers for a review on PR #13446.
matthargett edited PR #13446:
TL;DR
Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the `call_indirect` lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a `call_indirect{1,2,3,4}` family mirroring direct-call `call{1,2,3,4}`. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.
Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on `xmrsplayer` on Apple Watch SE2, the closest cross-device result in our matrix.
Dependency
Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against `main` includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on `is_eagerly_initialized_funcref_table` (the predicate added in #13445), so it only fires when the table-mutability proof holds.
Stack
13 commits on top of #13445:
- Phases 1–3: collapse `band + brif + 2 xloads` at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
- Phase 4: `call_indirect{1,2,3,4}` opcodes mirror direct-call `call{1,2,3,4}`. `Inst::IndirectCall` bundles the first 4 integer ABI args into the call opcode instead of synthesising `xmov`s via regalloc `reg_fixed_use`.
- Correctness: handlers trap on null (a slow-path-aliasing review concern: `sink_pure_inst` of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate); `#[cfg(debug_assertions)]` predecessor-count assertion in `pre_lower` to catch future structural regressions (let me know if this is overkill).
Wallclock medians, N=10, phase-4 vs `table-mutability-tracking` baseline
`BENCH_TARGET_MS=2000`; `.utility` QoS on iOS / `taskpolicy -b` on M4.
| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
|---|---|---|---|
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |
PMU buckets (single 12 s xctrace `CPU Counters` per workload)
- A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.
- A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded. This looks like interpreter-loop mispredict pressure, but it's erratic and I couldn't pin it down.
- M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.
Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; I can only get wallclock time and power draw.
Verification
- 2237 / 2237 cranelift filetests
- 13 / 13 `cranelift` `pulley_call_*` integration tests
- 21 min `cargo fuzz run differential --no-default-features` with `ALLOWED_ENGINES=pulley,wasmtime`: 0 crashes / 0 divergences
Extra credit
Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1
I submitted two PRs to WAMR to support exceptions and relaxed SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.
matthargett updated PR #13446.
matthargett updated PR #13446.
matthargett edited PR #13446:
TL;DR
Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the `call_indirect` lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a `call_indirect{1,2,3,4}` family mirroring direct-call `call{1,2,3,4}`. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.
Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on `xmrsplayer` on Apple Watch SE2, the closest cross-device result in our matrix.
Dependency
Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against `main` includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on `is_eagerly_initialized_funcref_table` (the predicate added in #13445), so it only fires when the table-mutability proof holds.
Stack
13 commits on top of #13445:
- Phases 1–3: collapse `band + brif + 2 xloads` at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
- Phase 4: `call_indirect{1,2,3,4}` opcodes mirror direct-call `call{1,2,3,4}`. `Inst::IndirectCall` bundles the first 4 integer ABI args into the call opcode instead of synthesising `xmov`s via regalloc `reg_fixed_use`.
- Correctness: handlers trap on null (a slow-path-aliasing review concern: `sink_pure_inst` of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate).
Wallclock medians, N=10, phase-4 vs `table-mutability-tracking` baseline
`BENCH_TARGET_MS=2000`; `.utility` QoS on iOS / `taskpolicy -b` on M4.
| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
|---|---|---|---|
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |
PMU buckets (single 12 s xctrace `CPU Counters` per workload)
- A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.
- A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded. This looks like interpreter-loop mispredict pressure, but it's erratic and I couldn't pin it down.
- M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.
Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; I can only get wallclock time and power draw.
Verification
- 2237 / 2237 cranelift filetests
- 13 / 13 `cranelift` `pulley_call_*` integration tests
- 21 min `cargo fuzz run differential --no-default-features` with `ALLOWED_ENGINES=pulley,wasmtime`: 0 crashes / 0 divergences
Extra credit
Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1
I submitted two PRs to WAMR to support exceptions and relaxed SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.
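To make the "handlers trap on null" point in the Stack section concrete, here is a toy model of a fused indirect-call handler that fails closed. It is not Pulley's interpreter code: the `Slot` alias, the `Trap` enum, and this standalone `call_indirect1` function are hypothetical stand-ins, and the signature check a real `call_indirect` performs is omitted. The idea it illustrates is that once the table is known to be eagerly initialized, a null entry is treated as a trap instead of a branch back to a lazy-init slow path.

```rust
/// Hypothetical stand-in for a funcref table slot; `None` models a null entry.
type Slot = Option<fn(i64) -> i64>;

/// Hypothetical trap kinds for this toy model.
#[derive(Debug, PartialEq)]
enum Trap {
    TableOutOfBounds,
    NullFuncref,
}

/// Toy fused handler: bounds check, slot load, null check and call happen
/// in one step, with the single integer argument carried alongside the call.
fn call_indirect1(table: &[Slot], index: usize, arg0: i64) -> Result<i64, Trap> {
    let slot = table.get(index).copied().ok_or(Trap::TableOutOfBounds)?;
    // Fail closed: under the eager-initialization assumption a null slot is
    // never expected, so trap rather than rejoin a lazy-init slow path.
    let callee = slot.ok_or(Trap::NullFuncref)?;
    Ok(callee(arg0))
}

fn double(x: i64) -> i64 {
    x * 2
}

fn negate(x: i64) -> i64 {
    -x
}

fn main() {
    let table: [Slot; 3] = [
        Some(double as fn(i64) -> i64),
        None,
        Some(negate as fn(i64) -> i64),
    ];
    for index in 0..4 {
        println!("slot {index}: {:?}", call_indirect1(&table, index, 21));
    }
}
```

Returning a trap for the null case keeps the fast path free of any control-flow edge back into initialization code, which is the property the predicate from #13445 is meant to justify.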
matthargett updated PR #13446.
matthargett updated PR #13446.
:cross_mark: matthargett closed without merge PR #13446.
matthargett commented on PR #13446:
Reopened as #13447 (renamed branch). Same commits, same code; just a branch-name cleanup. CI is running there now.
Last updated: Jun 01 2026 at 09:49 UTC