wasmtime / PR #13447 pulley/cranelift: opcode fusion at c... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #13447 pulley/cranelift: opcode fusion at c...

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett opened PR #13447 from rebeckerspecialties:pulley-fusion-dispatch-tail-upstream to bytecodealliance:main:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.
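A rough sketch of the arithmetic behind the gap numbers above. The helper names are hypothetical, and the PR does not say which definition of "gap" it uses, so both readings are shown. Treating 1.73× and 1.58× as Pulley-to-WAMR wallclock ratios, the drop is about 8.7 % of the ratio itself (in line with the −8.61 % vtable_poly4 entry for iPhone 12), or about 20.5 % of the distance to parity at 1.0×.

```rust
/// Percent by which the slowdown ratio itself shrank.
/// Hypothetical helper for illustration, not code from the PR.
fn ratio_reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Percent of the distance to parity (1.0x) that was closed.
/// Assumes `before` is strictly greater than 1.0.
fn gap_to_parity_closed_pct(before: f64, after: f64) -> f64 {
    (before - after) / (before - 1.0) * 100.0
}
```

Calling `ratio_reduction_pct(1.73, 1.58)` and `gap_to_parity_closed_pct(1.73, 1.58)` gives the two readings.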

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

Phases 1–3: collapse band + brif + 2 xloads at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).

Phase 4: call_indirect{1,2,3,4} opcodes mirror direct-call call{1,2,3,4}. Inst::IndirectCall bundles first 4 integer ABI args into the call opcode instead of synthesising xmovs via regalloc reg_fixed_use.

Correctness: handlers trap on null (a slow-path-aliasing review concern: sink_pure_inst of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate).
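To make the "5 dispatches → 2" claim concrete, here is a toy register interpreter. It is a sketch under assumptions, not Pulley's real opcode set: every type and opcode name is hypothetical, the "heap" is a word-indexed slice, and the lazy-init slow path is replaced by a trap. It only shows how folding the mask, the null check, and the two loads into one handler cuts the number of interpreter dispatches, and how the fused handler fails closed on a null entry.

```rust
// Toy model only: none of these names come from Pulley or Cranelift.

#[derive(Debug, PartialEq)]
enum Trap {
    NullFuncRef,
}

#[derive(Clone, Copy)]
enum Op {
    /// regs[dst] = regs[src] & mask (strips the "initialized" tag bit).
    Band { dst: usize, src: usize, mask: u64 },
    /// Jump to `then_pc` if regs[cond] != 0, else to `else_pc`.
    BrIf { cond: usize, then_pc: usize, else_pc: usize },
    /// regs[dst] = heap[regs[base] + offset].
    XLoad { dst: usize, base: usize, offset: usize },
    /// Fused: mask, null check, and both loads in a single dispatch.
    FusedFuncRefLoad { code: usize, vmctx: usize, src: usize, mask: u64 },
    /// Ends the toy program and reports the callee that would be invoked.
    CallIndirect { code: usize, vmctx: usize },
    /// Stand-in for the lazy-init slow path, which this toy does not model.
    Trap,
}

struct Outcome {
    code: u64,
    vmctx: u64,
    dispatches: usize,
}

fn run(program: &[Op], table_entry: u64, heap: &[u64]) -> Result<Outcome, Trap> {
    let mut regs = [0u64; 4];
    regs[0] = table_entry;
    let (mut pc, mut dispatches) = (0usize, 0usize);
    loop {
        dispatches += 1; // one trip through the interpreter's dispatch loop
        match program[pc] {
            Op::Band { dst, src, mask } => {
                regs[dst] = regs[src] & mask;
                pc += 1;
            }
            Op::BrIf { cond, then_pc, else_pc } => {
                pc = if regs[cond] != 0 { then_pc } else { else_pc };
            }
            Op::XLoad { dst, base, offset } => {
                regs[dst] = heap[regs[base] as usize + offset];
                pc += 1;
            }
            Op::FusedFuncRefLoad { code, vmctx, src, mask } => {
                let ptr = (regs[src] & mask) as usize;
                if ptr == 0 {
                    return Err(Trap::NullFuncRef); // fail closed on null
                }
                regs[code] = heap[ptr];
                regs[vmctx] = heap[ptr + 1];
                pc += 1;
            }
            Op::CallIndirect { code, vmctx } => {
                return Ok(Outcome { code: regs[code], vmctx: regs[vmctx], dispatches });
            }
            Op::Trap => return Err(Trap::NullFuncRef),
        }
    }
}

/// band + brif + 2 loads + call: five dispatches on the fast path.
fn unfused_tail() -> Vec<Op> {
    vec![
        Op::Band { dst: 1, src: 0, mask: !1 },
        Op::BrIf { cond: 1, then_pc: 2, else_pc: 5 },
        Op::XLoad { dst: 2, base: 1, offset: 0 },
        Op::XLoad { dst: 3, base: 1, offset: 1 },
        Op::CallIndirect { code: 2, vmctx: 3 },
        Op::Trap,
    ]
}

/// fused load + call: two dispatches on the fast path.
fn fused_tail() -> Vec<Op> {
    vec![
        Op::FusedFuncRefLoad { code: 2, vmctx: 3, src: 0, mask: !1 },
        Op::CallIndirect { code: 2, vmctx: 3 },
    ]
}
```

Running both programs against the same tagged table entry yields the same callee, with the dispatch count dropping from 5 to 2. A zero entry makes the fused handler return the null trap instead of reaching the call.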

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

workload	iPhone 12 (A14)	iPhone XS (A12)	Watch SE2 (S8)
call_indirect	−2.48 %	+0.18 %	−0.50 %
vtable_bi	−7.45 %	+4.84 %	−4.26 %
vtable_poly4	−8.61 %	−2.96 %	−4.76 %
vtable_poly6	−5.32 %	+5.41 %	−4.64 %
xmrsplayer	+0.26 %	−6.04 %	+0.25 %
graphql (AS)	−0.13 %	−3.63 %	−0.74 %
graphql (Porffor)	+2.13 %	+0.21 %	+0.58 %
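For anyone reproducing the table, here is a minimal sketch of how one cell could be derived, assuming each percentage is the relative change between the median of the N=10 phase-4 runs and the median of the N=10 baseline runs. The PR does not spell out the formula, and the function names are made up for illustration.

```rust
/// Median of a non-empty slice of wallclock samples (e.g. milliseconds).
/// Panics on an empty slice; this sketch assumes at least one sample.
fn median(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        // Even count: average the two middle samples.
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Relative change of the candidate median versus the baseline median,
/// in percent. Negative means the candidate is faster.
fn percent_delta(baseline: &[f64], candidate: &[f64]) -> f64 {
    (median(candidate) / median(baseline) - 1.0) * 100.0
}
```

With ten baseline samples whose median is 100 ms and ten phase-4 samples whose median is 92 ms, `percent_delta` returns roughly −8.0, which is how a cell such as −8.61 % would read under this assumption.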

PMU buckets (single 12 s xctrace CPU Counters per workload)

A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.

A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded. That looks like interpreter-loop mispredict pressure, but it's kinda squirrelly and I couldn't pin it down.

M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; there I can only get wall-clock time and power draw.

Verification

2237 / 2237 cranelift filetests

13 / 13 craneliftpulley_call_* integration tests

21 min cargo fuzz run differential --no-default-features with ALLOWED_ENGINES=pulley,wasmtime — 0 crashes / 0 divergences

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1

I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.


Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested cfallin for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-core-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested fitzgen for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-default-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:30):

matthargett commented on PR #13447:

CI status

13/14 jobs are green. Remaining failure is Nightly tests → wasmtime-fuzzing/oom, but it doesn't look related to this PR:

All individual oom::* tests print ok (including table_grow, the last to log).

Process then aborts with memory allocation of 8 bytes failed + SIGABRT during teardown.

Same pattern on every run since rebase onto current main.

Our changes that touch instance/allocation are guarded on is_eagerly_initialized_funcref_table, so they only fire for module instantiation with funcref tables under the predicate. The failing tests create Table::new directly with enable_compiler(false), with no module instantiation, so the new code path is never reached.

Same Nightly tests job passes on recent main runs (26267350231, 26204292060, 26140078343) and was skipped on the parent PR #13445 (no test-nightly trigger). The 8-byte OOM at teardown after all tests passed reads as runner-side memory pressure rather than a real regression. Happy to dig further if maintainers can confirm whether this is a known infra issue, or if you want me to bisect locally — my pinned nightly toolchain doesn't accept the arc_try_new cfg gate so I can't reproduce the exact CI build flags here.

Wasmtime GitHub notifications bot (May 22 2026 at 07:28):

matthargett updated PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 07:29):

matthargett edited a comment on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 07:30):

matthargett edited a comment on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift:area:machinst on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift:meta on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label isle on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label wasmtime:api on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label pulley on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] commented on PR #13447:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:meta", "isle", "pulley", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

cfallin: isle

fitzgen: isle, pulley

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

Wasmtime GitHub notifications bot (May 23 2026 at 15:22):

alexcrichton commented on PR #13447:

Thanks for the PR, and like the previous PR this is stacked on we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...

Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.

Wasmtime GitHub notifications bot (May 28 2026 at 21:37):

matthargett commented on PR #13447:

> Thanks for the PR, and like the previous PR this is stacked on we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...
>
> Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.

PMU (performance monitoring unit) refers to the hardware counters that Xcode's Instruments / xctrace read for profiling. Processing and Discarded are two of the pipeline buckets it reports: Processing is the back-end-bound share, and Discarded is work thrown away when branch prediction is wrong, so the speculative execution results are pure waste. Icestorm is the name for the efficiency cores (E-cores) of the device. I'm focusing on those to get the best performance-per-watt out of these decisions, rather than treating an uplift on high-wattage performance cores as enough to consider the work "done".

I fully admit I wasn't up to speed on all the jargon and code/marketing names for the separate silicon IP and profiler tooling in these devices until I started this work. I've tried to use higher-level terms like "efficiency cores" where possible, but I also want to make it easy for people to do their own web/AI searches so they have an easier time than I did on the research and execution. I've been deep into x86, x64, Qualcomm XR2, and Zen 2 profiling and optimization before, but not on Apple hardware.

The benchmarks are in the repo I've referenced before: https://github.com/rebeckerspecialties/wasm-benchmark/ . This should allow anyone to reproduce the WASM runtime performance "bakeoff" on their own Apple devices (iPhone, iPad, Apple TV 4K, Apple Watch, etc). I'm trying to show my work as much as I can here so this can have the best user-visible uplift possible for the extra complexity. IME, performance profiling and optimization can often be a bit hyperbolic, and I'm doing my best to show that I'm being data-driven on the real devices I have available to me.

Wasmtime GitHub notifications bot (May 28 2026 at 23:38):

alexcrichton edited a comment on PR #13447.

Wasmtime GitHub notifications bot (Jun 11 2026 at 02:18):

matthargett updated PR #13447.

Wasmtime GitHub notifications bot (Jul 07 2026 at 21:31):

matthargett updated PR #13447.

Wasmtime GitHub notifications bot (Jul 10 2026 at 08:57):

matthargett updated PR #13447.

Wasmtime GitHub notifications bot (Jul 10 2026 at 21:45):

matthargett updated PR #13447.

Last updated: Jul 29 2026 at 05:03 UTC
