matthargett opened PR #13447 from rebeckerspecialties:pulley-fusion-dispatch-tail-upstream to bytecodealliance:main:
TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the `call_indirect` lazy-init `brif` site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a `call_indirect{1,2,3,4}` family mirroring direct-call `call{1,2,3,4}`. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and brings Pulley to within 1.16× of WAMR on `xmrsplayer` on Apple Watch SE2, the closest cross-device result in our matrix.

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against `main` includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on `is_eagerly_initialized_funcref_table` (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack
13 commits on top of #13445:
- Phases 1–3: collapse `band + brif + 2 xloads` at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
- Phase 4: `call_indirect{1,2,3,4}` opcodes mirror direct-call `call{1,2,3,4}`. `Inst::IndirectCall` bundles the first 4 integer ABI args into the call opcode instead of synthesising `xmov`s via regalloc `reg_fixed_use`.
- Correctness: handlers trap on null (a slow-path-aliasing review concern: `sink_pure_inst` of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate).

Wallclock medians, N=10, phase-4 vs `table-mutability-tracking` baseline

`BENCH_TARGET_MS=2000`; `.utility` QoS on iOS / `taskpolicy -b` on M4.
| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
| --- | --- | --- | --- |
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |

PMU buckets (single 12 s xctrace `CPU Counters` per workload)
- A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.
- A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded. This looks like interpreter-loop mispredict pressure, but the signal is noisy and I couldn't pin it down.
- M4 Sawtooth, every workload: 33–47 % Processing. Back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.

Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; there I can only get wall-clock time and power draw.
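For readers checking the wallclock table, here is a minimal sketch of how a per-workload delta could be derived from N=10 runs. It assumes the delta is the candidate median divided by the baseline median, minus one, expressed as a percentage; the PR does not spell out the exact formula, and the function names `median` and `percent_delta` are hypothetical, not taken from the benchmark harness.

```rust
/// Median of a non-empty sample set (panics on an empty slice).
/// For an even count such as N=10, this averages the two middle values.
fn median(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Assumed definition: relative change of the candidate median versus the
/// baseline median, in percent. Negative means the candidate is faster.
fn percent_delta(baseline: &[f64], candidate: &[f64]) -> f64 {
    (median(candidate) / median(baseline) - 1.0) * 100.0
}
```

Under this definition a negative entry in the table is a speedup and a positive entry is a slowdown, and using the median rather than the mean keeps a single outlier run from moving the result.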
Verification
- 2237 / 2237 cranelift filetests
- 13 / 13 cranelift `pulley_call_*` integration tests
- 21 min `cargo fuzz run differential --no-default-features` with `ALLOWED_ENGINES=pulley,wasmtime`: 0 crashes / 0 divergences

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: https://github.com/rebeckerspecialties/wasm-benchmark/pull/1
I submitted two PRs to WAMR to support exceptions and relaxed SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.
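To illustrate the dispatch-count argument behind Phases 1–3 in the Stack section, here is a toy register interpreter. It is a sketch under stated assumptions, not Pulley's implementation: the `Op` variants, register layout, word-addressed memory, and the low tag bit on the table element are all hypothetical. It only shows why folding the mask, the null check, and the two loads into one handler takes a call site from 5 dispatches to 2, with the fused handler trapping on null.

```rust
// Toy model only: these opcodes are hypothetical, not Pulley's real encoding.
#[derive(Clone, Copy, Debug)]
enum Op {
    Band { dst: usize, src: usize, mask: u64 },
    BrIfZero { cond: usize, target: usize },
    XLoad { dst: usize, base: usize, offset: u64 },
    CallIndirect { code: usize, ctx: usize },
    // Fused: mask + null check + both loads in a single dispatch.
    FusedLoadFuncRef { elem: usize, code: usize, ctx: usize, mask: u64 },
    Trap,
}

#[derive(Debug, PartialEq)]
struct Outcome {
    callee: u64,
    vmctx: u64,
    dispatches: usize,
}

/// Runs until the indirect call is reached, counting interpreter dispatches.
fn run(program: &[Op], regs: &mut [u64; 4], mem: &[u64]) -> Result<Outcome, &'static str> {
    let (mut pc, mut dispatches) = (0usize, 0usize);
    loop {
        let Some(&op) = program.get(pc) else {
            return Err("pc out of range");
        };
        dispatches += 1;
        pc += 1;
        match op {
            Op::Band { dst, src, mask } => regs[dst] = regs[src] & mask,
            Op::BrIfZero { cond, target } => {
                if regs[cond] == 0 {
                    pc = target;
                }
            }
            Op::XLoad { dst, base, offset } => regs[dst] = mem[(regs[base] + offset) as usize],
            Op::CallIndirect { code, ctx } => {
                return Ok(Outcome { callee: regs[code], vmctx: regs[ctx], dispatches });
            }
            Op::FusedLoadFuncRef { elem, code, ctx, mask } => {
                let ptr = regs[elem] & mask;
                if ptr == 0 {
                    return Err("null funcref"); // fail closed instead of a slow path
                }
                regs[code] = mem[ptr as usize];
                regs[ctx] = mem[ptr as usize + 1];
            }
            Op::Trap => return Err("null funcref"),
        }
    }
}

/// Five dispatches: band, brif, two loads, call. Index 5 is the null path.
fn unfused_program() -> Vec<Op> {
    vec![
        Op::Band { dst: 1, src: 0, mask: !1 },
        Op::BrIfZero { cond: 1, target: 5 },
        Op::XLoad { dst: 2, base: 1, offset: 0 },
        Op::XLoad { dst: 3, base: 1, offset: 1 },
        Op::CallIndirect { code: 2, ctx: 3 },
        Op::Trap,
    ]
}

/// Two dispatches: the fused load, then the call.
fn fused_program() -> Vec<Op> {
    vec![
        Op::FusedLoadFuncRef { elem: 0, code: 2, ctx: 3, mask: !1 },
        Op::CallIndirect { code: 2, ctx: 3 },
    ]
}
```

Calling `run` on both programs with register 0 holding a tagged element pointer yields the same callee and context, with `dispatches` of 5 and 2 respectively. In this toy the unfused null path is reduced to a trap for brevity, whereas the real unfused path rejoins after lazy initialization.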
matthargett requested cfallin for a review on PR #13447.
matthargett requested wasmtime-core-reviewers for a review on PR #13447.
matthargett requested wasmtime-compiler-reviewers for a review on PR #13447.
matthargett requested fitzgen for a review on PR #13447.
matthargett requested wasmtime-default-reviewers for a review on PR #13447.
matthargett commented on PR #13447:
CI status

13/14 jobs are green. The remaining failure is `Nightly tests` → `wasmtime-fuzzing/oom`, but it doesn't look related to this PR:
- All individual `oom::*` tests print `ok` (including `table_grow`, the last to log).
- Process then aborts with `memory allocation of 8 bytes failed` + SIGABRT during teardown.
- Same pattern on every run since rebase onto current `main`.

Our changes that touch instance/allocation are guarded on `is_eagerly_initialized_funcref_table`, so they only fire for module instantiation with funcref tables under the predicate. The failing tests create `Table::new` directly with `enable_compiler(false)`, with no module instantiation; the new code path is never reached.

The same `Nightly tests` job passes on recent `main` runs (26267350231, 26204292060, 26140078343) and was skipped on the parent PR #13445 (no test-nightly trigger). The 8-byte OOM at teardown after all tests passed reads as runner-side memory pressure rather than a real regression. Happy to dig further if maintainers can confirm whether this is a known infra issue, or if you want me to bisect locally. My pinned nightly toolchain doesn't accept the `arc_try_new` cfg gate, so I can't reproduce the exact CI build flags here.
matthargett updated PR #13447.
github-actions[bot] added the label cranelift on PR #13447.
github-actions[bot] added the label cranelift:area:machinst on PR #13447.
github-actions[bot] added the label cranelift:meta on PR #13447.
github-actions[bot] added the label isle on PR #13447.
github-actions[bot] added the label wasmtime:api on PR #13447.
github-actions[bot] added the label pulley on PR #13447.
github-actions[bot] commented on PR #13447:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:meta", "isle", "pulley", "wasmtime:api"
Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle, pulley
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
alexcrichton commented on PR #13447:
Thanks for the PR, and like the previous PR this is stacked on, we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...
Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.
matthargett commented on PR #13447:
> Thanks for the PR, and like the previous PR this is stacked on, we'd appreciate a bit more care taken when communicating here. Extensively documenting all these numbers is fine, but for example I've no idea what any of these benchmarks are or where their source is. I also don't really know how to reason about how the speedups seem to be balanced by slowdowns, especially in the middle column for the iPhone XS. For "PMU", "Icestorm", "Processing", "Discarded", etc, could you explain what those words all mean? I'm not entirely sure myself...
> Procedurally it's fine to stack PRs on one another, but given the quantity of commits here this'll probably not get a review until after https://github.com/bytecodealliance/wasmtime/pull/13445 has landed and this is rebased. Alternatively, if you'd like, feel free to split out things from this PR (for example the pulley opcodes) and have them land separately.
PMU (performance monitoring unit) here refers to the CPU counters exposed by Xcode/xctrace's Instruments for profiling. Processing and Discarded are pipeline-efficiency buckets: Processing is time bound on the back end, and Discarded is work thrown away when branch prediction is wrong, so the speculative execution results are pure waste. Icestorm is the name of the efficiency cores (E-cores) on the A14. I'm focusing on those to get the best performance-per-watt out of these decisions, rather than counting high-wattage and/or performance-core uplift as the work being "done".
I fully admit I wasn't up to speed on all the jargon and code/marketing names for the separate silicon IP and profiler tooling in these devices until I started this work. I've tried to use the higher-level term "efficiency cores" where possible, but I also want to make it easy for people to do their own web/AI searches so they have an easier time than I did on the research and execution. I've done deep profiling and optimization on x86, x64, Qualcomm XR2, and Zen 2 before, but not on Apple hardware.
The benchmarks are in the repo I've referenced before: https://github.com/rebeckerspecialties/wasm-benchmark/ . This should allow anyone to reproduce the WASM runtime performance "bakeoff" on their own Apple devices (iPhone, iPad, Apple TV 4K, Apple Watch, etc). I'm trying to show my work as much as I can here so this can have the best user-visible uplift possible for the extra complexity. IME, performance profiling and optimization can often be a bit hyperbolic, and I'm doing my best to show that I'm being data-driven on the real devices I have available to me.
Last updated: Jun 01 2026 at 09:49 UTC