Stream: git-wasmtime

Topic: wasmtime / PR #13446 pulley/cranelift: opcode fusion at c...


Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett opened PR #13446 from rebeckerspecialties:claude/pulley-fusion-xband-brif-upstream to bytecodealliance:main:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 A14 Icestorm E-core, Apple Watch SE2 S8, and M4 Sawtooth E-core. iPhone XS A12 Tempest is mixed — branch-prediction pressure is the cross-microarch variable.
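The call_indirect{1,2,3,4} family can be pictured as arity specialization: for small argument counts, the argument registers ride along in the call op instead of being moved by separate instructions. The following is a minimal sketch under that assumption. `IndirectCallOp` and `select_indirect_call` are hypothetical names for illustration, not the PR's code or Pulley's opcode definitions.

```rust
/// Hypothetical opcode choices for an indirect call (illustrative only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum IndirectCallOp {
    /// No arguments need to be placed before the call.
    Generic,
    /// One to four argument registers are encoded in the op itself.
    WithArgs(u8),
    /// More than four arguments: separate moves, then the generic call.
    MovesThenGeneric,
}

/// Pick an op variant from the number of register arguments at the call site.
fn select_indirect_call(arg_count: usize) -> IndirectCallOp {
    match arg_count {
        0 => IndirectCallOp::Generic,
        n @ 1..=4 => IndirectCallOp::WithArgs(n as u8),
        _ => IndirectCallOp::MovesThenGeneric,
    }
}
```

In this sketch a two-argument call site would select `WithArgs(2)`, saving the interpreter two move dispatches per call. Whether the PR handles the zero-argument and five-plus cases this way is not stated in the post.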

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.
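How much of a gap a ratio change "closes" depends on the definition, and the post does not say which one it uses. The sketch below shows two common readings. Both helper names are hypothetical.

```rust
/// Percent reduction of the slowdown ratio itself.
fn ratio_reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Percent of the excess over parity (ratio - 1.0) that was removed.
fn excess_closed_pct(before: f64, after: f64) -> f64 {
    (before - after) / (before - 1.0) * 100.0
}
```

With the vtable_poly4 figures quoted above (1.73 and 1.58), the first reading gives roughly 8.7 % and the second roughly 20.5 %.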

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.
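The gating described above can be sketched as a peephole pass that only rewrites the dispatch sequence when the table is known to be eagerly initialized. This is a toy model, not the PR's implementation. The `Op` variants and `fuse_dispatch` are hypothetical, the boolean stands in for the is_eagerly_initialized_funcref_table predicate, and it is an assumption here that the fused op can drop the lazy-init branch entirely.

```rust
/// Toy ops standing in for a lowered call_indirect sequence (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Op {
    /// Load the funcref slot for a table index.
    LoadTableEntry,
    /// Branch to the lazy-init slow path if the slot is not yet initialized.
    BrifUninit,
    /// Call through the loaded funcref.
    CallIndirect,
    /// Single op covering load + call, with no lazy-init branch.
    FusedDispatch,
    /// Any unrelated instruction.
    Other(&'static str),
}

/// Replace each load/brif/call triple with one fused op, but only when the
/// caller has proven the table is eagerly initialized.
fn fuse_dispatch(ops: &[Op], table_is_eager: bool) -> Vec<Op> {
    let pattern = [Op::LoadTableEntry, Op::BrifUninit, Op::CallIndirect];
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        if table_is_eager && ops[i..].starts_with(&pattern) {
            out.push(Op::FusedDispatch);
            i += pattern.len();
        } else {
            out.push(ops[i].clone());
            i += 1;
        }
    }
    out
}
```

When the proof does not hold, the pass returns the sequence unchanged, so the lazy-init slow path stays reachable.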

Stack

13 commits on top of #13445:

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
| --- | --- | --- | --- |
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |
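Each cell above is a percent change between two medians. A minimal sketch of that calculation follows, assuming the samples are wallclock times and that a negative delta means the candidate is faster. `median` and `percent_delta` are hypothetical helpers, not part of the linked benchmark harness.

```rust
/// Median of a set of samples; returns None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count (for example N=10): average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Percent change of the candidate median relative to the baseline median.
fn percent_delta(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

Using the median rather than the mean keeps a single slow run, such as a thermal or scheduling outlier, from shifting the reported delta.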

PMU buckets (single 12 s xctrace CPU Counters per workload)

iPhone 12 attach-mode + M4 launch-mode. Phase-4 bucket shares:

Common bottleneck across A14 + M4 is Processing (back-end-bound). A12 Tempest doesn't expose CounterMetricByThread; iPhone XS is wallclock-only.

Verification

Cross-device measurement harness + raw N=10 logs + per-phase writeups: rebeckerspecialties/wasm-benchmark#1.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested cfallin for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-core-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested fitzgen for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-default-reviewers for a review on PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 02:44):

matthargett edited PR #13446:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
| --- | --- | --- | --- |
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |

PMU buckets (single 12 s xctrace CPU Counters per workload)

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; on those devices I can only get wall-clock time and power draw.

Verification

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1

I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally reach functional parity without losing its performance edge.

Wasmtime GitHub notifications bot (May 22 2026 at 03:26):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 03:35):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 03:36):

matthargett edited PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 03:40):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 04:08):

matthargett updated PR #13446.

Wasmtime GitHub notifications bot (May 22 2026 at 04:11):

matthargett closed PR #13446 without merging.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett commented on PR #13446:

Reopened as #13447 (renamed branch). Same commits, same code; just a branch-name cleanup. CI is running there now.


Last updated: Jun 01 2026 at 09:49 UTC