wasmtime / PR #13446 pulley/cranelift: opcode fusion at c... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #13446 pulley/cranelift: opcode fusion at c...

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett opened PR #13446 from rebeckerspecialties:claude/pulley-fusion-xband-brif-upstream to bytecodealliance:main:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 A14 Icestorm E-core, Apple Watch SE2 S8, and M4 Sawtooth E-core. iPhone XS A12 Tempest is mixed — branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.
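The vtable_poly4 ratios quoted above line up with the iPhone 12 wallclock delta reported for the same workload (−8.61 %). A minimal sanity-check sketch, assuming WAMR's time is identical in both measurements; `ratio_after` is a hypothetical helper, not part of the PR:

```rust
/// Pulley/WAMR wallclock ratio after Pulley's time changes by `delta_pct` percent.
/// Assumption: the WAMR time in the denominator does not change.
fn ratio_after(ratio_before: f64, delta_pct: f64) -> f64 {
    ratio_before * (1.0 + delta_pct / 100.0)
}

/// Share of the excess over parity (ratio - 1.0) that was removed, in percent.
fn excess_closed_pct(ratio_before: f64, ratio_after: f64) -> f64 {
    (ratio_before - ratio_after) / (ratio_before - 1.0) * 100.0
}
```

With `ratio_before = 1.73` and `delta_pct = -8.61`, `ratio_after` lands at roughly 1.58, matching the figure in the message.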

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

Phases 1–3: collapse band + brif + 2 xloads at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).

Phase 4: call_indirect{1,2,3,4} opcodes mirror direct-call call{1,2,3,4}. Inst::IndirectCall bundles first 4 integer ABI args into the call opcode instead of synthesising xmovs via regalloc reg_fixed_use.

Correctness: handlers trap on null (a slow-path-aliasing review concern: sink_pure_inst of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate); #[cfg(debug_assertions)] predecessor-count assertion in pre_lower to catch future structural regressions.
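The dispatch-count claim (5 → 2) can be illustrated with a toy interpreter. This is a sketch under stated assumptions, not Pulley's code: the `Op` names, the tagged-element encoding, and the `run` loop are all hypothetical stand-ins. It only shows that one fused handler can do the work of band + brif + 2 loads, and that the fused form traps on null instead of taking a slow path.

```rust
// Toy model only: hypothetical op names, not Pulley's real opcode set.
#[derive(Clone, Copy, Debug)]
enum Op {
    Band,           // strip the "initialized" tag bit from the table element
    BrIfNull,       // leave for the lazy-init slow path if the pointer is null
    LoadCode,       // load the callee code pointer from the funcref
    LoadVmctx,      // load the callee vmctx from the funcref
    FusedCheckLoad, // mask + null check + both loads in one handler; traps on null
    CallIndirect,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Called { code: u32, vmctx: u32 },
    SlowPath,
    Trap,
}

struct FuncRef {
    code: u32,
    vmctx: u32,
}

/// Hypothetical encoding: 0 is null, otherwise ((index + 1) << 1) | 1.
fn tag(index: usize) -> usize {
    ((index + 1) << 1) | 1
}

/// Runs `ops` against one table element; returns the outcome and the dispatch count.
fn run(ops: &[Op], element: usize, funcrefs: &[FuncRef]) -> (Outcome, usize) {
    let (mut ptr, mut code, mut vmctx) = (element, 0u32, 0u32);
    let mut dispatches = 0;
    for op in ops {
        dispatches += 1; // one interpreter dispatch per op
        match op {
            Op::Band => ptr &= !1,
            Op::BrIfNull => {
                if ptr == 0 {
                    return (Outcome::SlowPath, dispatches);
                }
            }
            Op::LoadCode => code = funcrefs[(ptr >> 1) - 1].code,
            Op::LoadVmctx => vmctx = funcrefs[(ptr >> 1) - 1].vmctx,
            Op::FusedCheckLoad => {
                ptr &= !1;
                if ptr == 0 {
                    return (Outcome::Trap, dispatches); // fails closed
                }
                let f = &funcrefs[(ptr >> 1) - 1];
                code = f.code;
                vmctx = f.vmctx;
            }
            Op::CallIndirect => return (Outcome::Called { code, vmctx }, dispatches),
        }
    }
    (Outcome::Trap, dispatches)
}
```

Running `[Band, BrIfNull, LoadCode, LoadVmctx, CallIndirect]` and `[FusedCheckLoad, CallIndirect]` on the same initialized element reaches the same callee in 5 and 2 dispatches respectively. On a null element the first sequence reports the slow path and the second traps.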

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

workload	iPhone 12 (A14)	iPhone XS (A12)	Watch SE2 (S8)
call_indirect	−2.48 %	+0.18 %	−0.50 %
vtable_bi	−7.45 %	+4.84 %	−4.26 %
vtable_poly4	−8.61 %	−2.96 %	−4.76 %
vtable_poly6	−5.32 %	+5.41 %	−4.64 %
xmrsplayer	+0.26 %	−6.04 %	+0.25 %
graphql (AS)	−0.13 %	−3.63 %	−0.74 %
graphql (Porffor)	+2.13 %	+0.21 %	+0.58 %
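For readers reproducing figures of this kind, a minimal sketch of how a median-of-N wallclock delta might be computed. The function names are hypothetical and the actual harness may well differ:

```rust
/// Median of a non-empty sample set; for even N, the mean of the two middle values.
fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Percent change of the candidate median relative to the baseline median.
/// Negative means the candidate is faster.
fn pct_delta(baseline: &[f64], candidate: &[f64]) -> f64 {
    let (b, c) = (median(baseline), median(candidate));
    (c - b) / b * 100.0
}
```

With N=10 the median is the mean of the 5th and 6th sorted samples, so a single outlier run does not move the reported delta.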

PMU buckets (single 12 s xctrace CPU Counters per workload)

iPhone 12 attach-mode + M4 launch-mode. Phase-4 bucket shares:

A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.

A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded — interpreter-loop mispredict pressure.

M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width.

Common bottleneck across A14 + M4 is Processing (back-end-bound). A12 Tempest doesn't expose CounterMetricByThread; iPhone XS is wallclock-only.

Verification

2237 / 2237 cranelift filetests

13 / 13 craneliftpulley_call_* integration tests

21 min cargo fuzz run differential --no-default-features with ALLOWED_ENGINES=pulley,wasmtime — 0 crashes / 0 divergences

Cross-device measurement harness + raw N=10 logs + per-phase writeups: rebeckerspecialties/wasm-benchmark#1.


Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested cfallin for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-core-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested fitzgen for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-default-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:44):

matthargett edited PR #13446:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

Phases 1–3: collapse band + brif + 2 xloads at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).

Phase 4: call_indirect{1,2,3,4} opcodes mirror direct-call call{1,2,3,4}. Inst::IndirectCall bundles first 4 integer ABI args into the call opcode instead of synthesising xmovs via regalloc reg_fixed_use.

Correctness: handlers trap on null (a slow-path-aliasing review concern: sink_pure_inst of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate); #[cfg(debug_assertions)] predecessor-count assertion in pre_lower to catch future structural regressions, but lmk if this is overkill.

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

workload	iPhone 12 (A14)	iPhone XS (A12)	Watch SE2 (S8)
call_indirect	−2.48 %	+0.18 %	−0.50 %
vtable_bi	−7.45 %	+4.84 %	−4.26 %
vtable_poly4	−8.61 %	−2.96 %	−4.76 %
vtable_poly6	−5.32 %	+5.41 %	−4.64 %
xmrsplayer	+0.26 %	−6.04 %	+0.25 %
graphql (AS)	−0.13 %	−3.63 %	−0.74 %
graphql (Porffor)	+2.13 %	+0.21 %	+0.58 %

PMU buckets (single 12 s xctrace CPU Counters per workload)

A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.

A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded (interpreter-loop mispredict pressure, but it's kinda squirrely and I couldn't pin it down).

M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; I can only get wall clock time and power draw.

Verification

2237 / 2237 cranelift filetests

13 / 13 craneliftpulley_call_* integration tests

21 min cargo fuzz run differential --no-default-features with ALLOWED_ENGINES=pulley,wasmtime — 0 crashes / 0 divergences

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1

I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.


Wasmtime GitHub notifications bot (May 22 2026 at 03:26):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 03:35):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 03:36):

matthargett edited PR #13446:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

Phases 1–3: collapse band + brif + 2 xloads at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).

Phase 4: call_indirect{1,2,3,4} opcodes mirror direct-call call{1,2,3,4}. Inst::IndirectCall bundles first 4 integer ABI args into the call opcode instead of synthesising xmovs via regalloc reg_fixed_use.

Correctness: handlers trap on null (a slow-path-aliasing review concern: sink_pure_inst of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate).

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

workload	iPhone 12 (A14)	iPhone XS (A12)	Watch SE2 (S8)
call_indirect	−2.48 %	+0.18 %	−0.50 %
vtable_bi	−7.45 %	+4.84 %	−4.26 %
vtable_poly4	−8.61 %	−2.96 %	−4.76 %
vtable_poly6	−5.32 %	+5.41 %	−4.64 %
xmrsplayer	+0.26 %	−6.04 %	+0.25 %
graphql (AS)	−0.13 %	−3.63 %	−0.74 %
graphql (Porffor)	+2.13 %	+0.21 %	+0.58 %

PMU buckets (single 12 s xctrace CPU Counters per workload)

A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.

A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded (interpreter-loop mispredict pressure, but it's kinda squirrely and I couldn't pin it down).

M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; I can only get wall clock time and power draw.

Verification

2237 / 2237 cranelift filetests

13 / 13 craneliftpulley_call_* integration tests

21 min cargo fuzz run differential --no-default-features with ALLOWED_ENGINES=pulley,wasmtime — 0 crashes / 0 divergences

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1

I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.


Wasmtime GitHub notifications bot (May 22 2026 at 03:40):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 04:08):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 04:11):

matthargett closed without merge PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett commented on PR #13446:

Reopened as #13447 (renamed branch). Same commits, same code; just a branch-name cleanup. CI is running there now.

Last updated: Jul 29 2026 at 05:03 UTC
