Stream: git-wasmtime

Topic: wasmtime / PR #13447 pulley/cranelift: opcode fusion at c...


Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett opened PR #13447 from rebeckerspecialties:pulley-fusion-dispatch-tail-upstream to bytecodealliance:main:

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring the direct-call call{1,2,3,4} family. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.
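To make the call_indirect{1,2,3,4} idea concrete, here is a toy sketch in Rust. It is not Pulley's actual opcode set or encoding: `Op`, `run`, the 8-register file, and the `add`/`mul` callees are all hypothetical. It only illustrates how folding argument moves into the call op cuts the number of interpreter dispatches per call.

```rust
/// Toy bytecode. These are NOT Pulley's real opcodes (hypothetical names).
#[derive(Clone, Copy)]
enum Op {
    /// regs[dst] = regs[src]
    Mov { dst: usize, src: usize },
    /// Call table[regs[idx]]; arguments are expected in regs[0..].
    CallIndirect { idx: usize },
    /// Fused form: move two arguments into regs[0] and regs[1], then call.
    CallIndirect2 { idx: usize, a0: usize, a1: usize },
}

/// A callee reads its arguments from regs[0..] and writes its result to regs[0].
type Func = fn(&mut [i64; 8]);

/// Runs `code` and returns how many ops the loop dispatched.
fn run(code: &[Op], table: &[Func], regs: &mut [i64; 8]) -> usize {
    let mut dispatched = 0;
    for op in code {
        dispatched += 1;
        match *op {
            Op::Mov { dst, src } => regs[dst] = regs[src],
            Op::CallIndirect { idx } => {
                let callee = table[regs[idx] as usize];
                callee(regs);
            }
            Op::CallIndirect2 { idx, a0, a1 } => {
                // Read the callee index and both sources before writing, so
                // idx/a0/a1 may alias regs[0] or regs[1].
                let callee = table[regs[idx] as usize];
                let (x, y) = (regs[a0], regs[a1]);
                regs[0] = x;
                regs[1] = y;
                callee(regs);
            }
        }
    }
    dispatched
}

fn add(regs: &mut [i64; 8]) {
    regs[0] = regs[0] + regs[1];
}

fn mul(regs: &mut [i64; 8]) {
    regs[0] = regs[0] * regs[1];
}
```

With arguments 6 and 7 in regs[4] and regs[5] and the table index of `mul` in regs[6], the unfused sequence (two `Mov` ops plus `CallIndirect`) dispatches three ops, while a single `CallIndirect2` dispatches one; both leave 42 in regs[0].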

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and pushes WAMR within 1.16× on xmrsplayer on Apple Watch SE2 — the closest cross-device result in our matrix.
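The text does not say how "~10 % of the gap" is computed. As a sketch, here are two plausible readings of the 1.73× → 1.58× change; the function names are hypothetical and neither is confirmed as the definition used here.

```rust
/// Relative reduction of the slowdown ratio itself: (before - after) / before.
fn ratio_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// Fraction of the gap above parity (1.0x) that was closed:
/// (before - after) / (before - 1.0).
/// Assumes `before` is strictly greater than 1.0.
fn gap_closed(before: f64, after: f64) -> f64 {
    (before - after) / (before - 1.0)
}
```

For 1.73 and 1.58 the first reading gives roughly 8.7 % and the second roughly 20.5 %. The first is the closer match to "~10 %", and it also lines up with the −8.61 % vtable_poly4 result reported for iPhone 12.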

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.
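The gating described above can be sketched as a simple selection between two lowerings. This is an illustration only: `TableFacts`, `CallIndirectLowering`, and `choose_lowering` are hypothetical names, and the boolean field merely stands in for whatever is_eagerly_initialized_funcref_table actually computes in #13445.

```rust
/// Hypothetical per-table summary; Wasmtime's real data structures differ.
#[derive(Clone, Copy)]
struct TableFacts {
    /// Stand-in for the result of `is_eagerly_initialized_funcref_table`.
    eagerly_initialized_funcref: bool,
}

/// The two code shapes a call_indirect site could lower to in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum CallIndirectLowering {
    /// Keep the existing lazy-init check and branch before the call.
    LazyInitCheck,
    /// Emit the fused dispatch op.
    FusedDispatch,
}

/// Only pick the fused form when the table-mutability proof holds;
/// otherwise fall back to the unfused path.
fn choose_lowering(table: TableFacts) -> CallIndirectLowering {
    if table.eagerly_initialized_funcref {
        CallIndirectLowering::FusedDispatch
    } else {
        CallIndirectLowering::LazyInitCheck
    }
}
```

Keeping the decision in one place like this means a table that fails the predicate never sees the new code path.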

Stack

13 commits on top of #13445:

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

workload            iPhone 12 (A14)   iPhone XS (A12)   Watch SE2 (S8)
call_indirect       −2.48 %           +0.18 %           −0.50 %
vtable_bi           −7.45 %           +4.84 %           −4.26 %
vtable_poly4        −8.61 %           −2.96 %           −4.76 %
vtable_poly6        −5.32 %           +5.41 %           −4.64 %
xmrsplayer          +0.26 %           −6.04 %           +0.25 %
graphql (AS)        −0.13 %           −3.63 %           −0.74 %
graphql (Porffor)   +2.13 %           +0.21 %           +0.58 %
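For readers who want to recompute deltas like these from raw timings, here is a minimal Rust sketch. It assumes each cell is the percent change of the candidate median relative to the baseline median, with negative meaning faster. That reading fits the "wins" described above, but the harness's exact reduction is not shown here, and `median` and `percent_delta` are hypothetical helper names.

```rust
/// Median of a sample set; returns None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Percent change of the candidate median relative to the baseline median.
/// Negative values mean the candidate ran faster than the baseline.
/// The caller is expected to pass non-zero baseline timings.
fn percent_delta(baseline_ms: &[f64], candidate_ms: &[f64]) -> Option<f64> {
    let base = median(baseline_ms)?;
    let cand = median(candidate_ms)?;
    Some((cand - base) / base * 100.0)
}
```

As a made-up example, baseline timings of 100, 90, and 110 ms against candidate timings of 92, 91, and 95 ms give medians of 100 and 92, so the delta is −8 %.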

PMU buckets (single 12 s xctrace CPU Counters per workload)

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; there I can only get wallclock time and power draw.

Verification

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1

I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested cfallin for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-core-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested fitzgen for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:13):

matthargett requested wasmtime-default-reviewers for a review on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 04:30):

matthargett commented on PR #13447:

CI status

13/14 jobs are green. The remaining failure is Nightly tests (wasmtime-fuzzing/oom), but it doesn't look related to this PR:

Our changes that touch instance/allocation are guarded on is_eagerly_initialized_funcref_table — only fires for module instantiation with funcref tables under the predicate. The failing tests create Table::new directly with enable_compiler(false), no module instantiation; the new code path is never reached.

Same Nightly tests job passes on recent main runs (26267350231, 26204292060, 26140078343) and was skipped on the parent PR #13445 (no test-nightly trigger). The 8-byte OOM at teardown after all tests passed reads as runner-side memory pressure rather than a real regression. Happy to dig further if maintainers can confirm whether this is a known infra issue, or if you want me to bisect locally — my pinned nightly toolchain doesn't accept the arc_try_new cfg gate so I can't reproduce the exact CI build flags here.

Wasmtime GitHub notifications bot (May 22 2026 at 07:28):

matthargett updated PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 07:29):

matthargett edited a comment on PR #13447:

CI status

13/14 jobs are green. The remaining failure is Nightly tests (wasmtime-fuzzing/oom), but it doesn't look related to this PR:

Our changes that touch instance/allocation are guarded on is_eagerly_initialized_funcref_table, so it only fires for module instantiation with funcref tables under the predicate. The failing tests create Table::new directly with enable_compiler(false), no module instantiation; the new code path is never reached.

Same Nightly tests job passes on recent main runs (26267350231, 26204292060, 26140078343) and was skipped on the parent PR #13445 (no test-nightly trigger). The 8-byte OOM at teardown after all tests passed reads as runner-side memory pressure rather than a real regression. Happy to dig further if maintainers can confirm whether this is a known infra issue, or if you want me to bisect locally — my pinned nightly toolchain doesn't accept the arc_try_new cfg gate so I can't reproduce the exact CI build flags here.

Wasmtime GitHub notifications bot (May 22 2026 at 07:30):

matthargett edited a comment on PR #13447:

CI status

13/14 jobs are green. The remaining failure is Nightly tests (wasmtime-fuzzing/oom), but it doesn't look related to this PR:

Our changes that touch instance/allocation are guarded on is_eagerly_initialized_funcref_table, so it only fires for module instantiation with funcref tables under the predicate. The failing tests create Table::new directly with enable_compiler(false), no module instantiation; the new code path is never reached.

Same Nightly tests job passes on recent main runs (26267350231, 26204292060, 26140078343) and was skipped on the parent PR #13445 (no test-nightly trigger). The 8-byte OOM at teardown after all tests passed reads as runner-side memory pressure rather than a real regression. Happy to dig further if folks can confirm whether this is a known infra issue, or if you want me to bisect locally — my pinned nightly toolchain doesn't accept the arc_try_new cfg gate so I can't reproduce the exact CI build flags here.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift:area:machinst on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label cranelift:meta on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label isle on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label wasmtime:api on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label pulley on PR #13447.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] commented on PR #13447:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:meta", "isle", "pulley", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

</details>

Wasmtime GitHub notifications bot (May 23 2026 at 15:22):

alexcrichton commented on PR #13447:

Thanks for the PR, and like the previous PR this is stacked on we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...

Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.

Wasmtime GitHub notifications bot (May 28 2026 at 21:37):

matthargett commented on PR #13447:

> Thanks for the PR, and like the previous PR this is stacked on we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...
>
> Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.

PMU is a reference to Xcode/xctrace's Instruments for profiling, and Processing vs Discarded is CPU pipeline frontend vs backend efficiency. (Discarded is when branch prediction is wrong, so speculative execution results are pure waste.) Icestorm is the name for the efficiency cores (E-cores) of the device, which I'm focusing on to try to get the best performance-per-watt out of these decisions, rather than treating high-wattage and/or performance-core uplift as enough to consider the work "done".

I fully admit I wasn't up to speed on all the jargon and code/marketing names for the separate silicon IP and profiler tooling in these devices until I started this work. I've tried to use the higher-level term "efficiency cores" where possible, but I also want to make it easy for people to do their own web/AI searches so they have an easier time than I did on the research and execution. I've been deep into x86, x64, Qualcomm XR2, and Zen 2 profiling and optimization before, but not on Apple hardware.

The benchmarks are in the repo I've referenced before (https://github.com/rebeckerspecialties/wasm-benchmark/), which should allow anyone to reproduce the WASM runtime performance "bakeoff" on their own Apple devices (iPhone, iPad, Apple TV 4K, Apple Watch, etc). I'm trying to show my work as much as I can here so this can have the best user-visible uplift possible for the extra complexity. IME, performance profiling and optimization can often be a bit hyperbolic, and I'm doing my best to show that I'm being data-driven on the real devices I have available to me.

Wasmtime GitHub notifications bot (May 28 2026 at 23:38):

alexcrichton edited a comment on PR #13447:

Thanks for the PR, and like the previous PR this is stacked on we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...

Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.


Last updated: Jun 01 2026 at 09:49 UTC