Hi! I've been benchmarking wasmtime's interpreter against WAMR's interpreter on Apple M4, iPhone XS, and Apple Watch 6/SE2 with some homegrown benchmarks based on my real application that drove me away from WasmEdge interpreter. I see that there's been a few unresolved issues and pull requests discussing this area, but I'm not sure if I should dive into those existing threads or make a new one.
While I was able to get WasmEdge purely on e-cores (saving battery life) with a bunch of local optimizations, the CPU usage is still relatively high -- especially when playing music sequencer type workloads: SNES SPC player, MOD/S3M player, etc. Even when optimizing with simd128 (sometimes doing the opcodes by hand), to reduce the opcode execution and stack operation pressure, WasmEdge was still using ~25% CPU and that's not acceptable for battery life.
iPhone XS (A12) — Pulley wins: matmul SIMD (+90%), matmul FMA (only Pulley supports relaxed-simd), tail-call (+52%), convolution (+9%). WAMR wins: bulk_memory (2.2×), call_indirect (2×), audio_dsp (+28%), sieve (+26%).
Apple Watch SE2 (S8) — Pulley wins: matmul SIMD (2.4×), tail-call (+57%), sieve (+23%), fib (+25%). WAMR wins: bulk_memory (2×), call_indirect (+58%), audio_dsp (+39%).
Key shifts from M4 host:
so, while wasmtime interpreter generally wins, it seems like call_indirect is what would have the most uplift in the "audio_dsp" workload that represents the SNES SPC player. I'm up for contributing the optimization that would help with the weak(er) branch prediction on the target device's CPUs, but would love some guidance on the history and approach you'd prefer I take.
I'm up for contributing the optimization
which optimizations do you have in mind?
FWIW the implementation of Wasm call_indirect isn't super-fast because it has to do CFI (control-flow integrity) things -- table bounds-check, target signature check, null check. That carries over from Wasm semantics, it's not Pulley-specific. Indeed branch predictor quality will matter too.
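To make the three checks concrete, here is a minimal Rust sketch of what a `call_indirect` has to verify before it can jump. The types (`FuncRef`, `Trap`) and the function `resolve_indirect` are hypothetical names for illustration only, not Wasmtime's or Pulley's actual data structures.

```rust
/// Why a toy `call_indirect` can fail. Illustrative names, not Wasmtime's.
#[derive(Debug, PartialEq, Eq)]
pub enum Trap {
    TableOutOfBounds,
    NullFuncRef,
    SignatureMismatch,
}

/// A toy function reference: a signature id plus the code to run.
#[derive(Clone, Copy)]
pub struct FuncRef {
    pub sig_id: u32,
    pub code: fn(i32) -> i32,
}

/// Resolve a table index to a callable target, performing the three
/// checks in order: bounds, null, signature.
pub fn resolve_indirect(
    table: &[Option<FuncRef>],
    index: u32,
    expected_sig: u32,
) -> Result<fn(i32) -> i32, Trap> {
    // 1. Table bounds check.
    let slot = table.get(index as usize).ok_or(Trap::TableOutOfBounds)?;
    // 2. Null check: an uninitialized slot holds no function.
    let func = slot.as_ref().ok_or(Trap::NullFuncRef)?;
    // 3. Signature check against the type the call site expects.
    if func.sig_id != expected_sig {
        return Err(Trap::SignatureMismatch);
    }
    Ok(func.code)
}

/// Example callee used to exercise the sketch.
pub fn double(x: i32) -> i32 {
    x * 2
}
```

Each `?` and the `if` is a conditional branch that is almost never taken, and the final call through the returned pointer is the indirect branch whose prediction quality varies most between microarchitectures.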
There aren't really systematic optimizations we could do easily that would address BP accuracy, though, that I'm aware of -- call_indirect fundamentally has a runtime choice, so unless we can fully devirtualize with heavyweight AOT analysis (turn into a direct call with one provable target), there always has to be some branch... happy to hear what you have in mind of course.
Does arm have a way to hint that a branch is almost always taken/not taken? If so that would be possible to use for checks.
I'm not aware of branch-hint instructions on aarch64 (and quick searching doesn't find any; someone correct me if wrong). That said the "default" on most modern BPUs is that the branch is invisible until taken once -- the branch won't be inserted into the relevant CPU frontend tables until a taken-mispredict forces a correction and insertion. So with our branch-to-cold-path-on-failure we're already best-case for our conditionals. I suspect the one making the difference between uarches is the indirect...
One thing we haven't done is a sort of "macro call_indirect" pulley instruction which might help here quite a bit, basically slurp up a bunch of pulley opcodes into a single opcode to do the call_indirect logic, effectively taking fewer turns of the interpreter loop which can often be a nice win
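A rough sketch of the "macro op" idea, using a toy bytecode rather than Pulley's real instruction set: the same indirect call is expressed once as four small ops and once as a single fused op, and the interpreter counts how many turns of its loop each version takes. All names here (`Op`, `Entry`, `run`) are made up for the example.

```rust
/// Toy bytecode, not Pulley's real instruction set.
#[derive(Clone, Copy)]
pub enum Op {
    /// Fail unless `index < table.len()`.
    BoundsCheck,
    /// Load the table entry and fail if it is null.
    LoadEntry,
    /// Fail unless the loaded entry's signature matches.
    SigCheck(u32),
    /// Call the loaded entry.
    Call,
    /// All four steps above in a single opcode.
    CallIndirectFused(u32),
}

pub struct Entry {
    pub sig: u32,
    pub func: fn(i32) -> i32,
}

/// Runs `code` for one indirect call.
/// Returns `(result, dispatch_count)`, or `None` if any check fails.
pub fn run(
    code: &[Op],
    table: &[Option<Entry>],
    index: usize,
    arg: i32,
) -> Option<(i32, usize)> {
    let mut dispatches = 0;
    let mut entry: Option<&Entry> = None;
    let mut result = None;
    for op in code {
        dispatches += 1; // one turn of the interpreter loop
        match *op {
            Op::BoundsCheck => {
                if index >= table.len() {
                    return None;
                }
            }
            Op::LoadEntry => entry = Some(table.get(index)?.as_ref()?),
            Op::SigCheck(sig) => {
                if entry?.sig != sig {
                    return None;
                }
            }
            Op::Call => result = Some((entry?.func)(arg)),
            Op::CallIndirectFused(sig) => {
                let e = table.get(index)?.as_ref()?;
                if e.sig != sig {
                    return None;
                }
                result = Some((e.func)(arg));
            }
        }
    }
    result.map(|r| (r, dispatches))
}

/// Example callee used to exercise the sketch.
pub fn inc(x: i32) -> i32 {
    x + 1
}
```

The work done is identical in both encodings; the fused form only removes three decode-and-dispatch steps, which is where an interpreter tends to spend its time.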
there's a bunch of stuff we could theorize, but it's hard to consider anything without concrete benchmarks in hand that we can run ourselves
thanks for the quick responses! I'm coming at this from the perspective of pulley/interpreter on a constrained platform, so while I'm considerate of the ripple effects to JIT, I'm focused on how to get reasonably fast Time To First Instruction without causing a ton of heap fragmentation.
Based on my profiling (and filtering that through professional experience), I think that doing something akin to what WAMR and Luau do would help: they pre-resolve indirect-call targets at preprocessing time, whereas Pulley dispatches through the function table per call. I don't think this amounts to requiring full devirtualization (an optimization I helped contribute to GCC back in ~2010), but making passes generalized so they can layer with future passes is obviously Very Nice To Have.
it sounds like I should prototype a thesis, benchmark the daylights out of it, and give something more tangible for you all to react to?
@Matt Hargett yes, absolutely, a prototype showing speedups would be really valuable! (And ideally any speedups you see on the little cores would transfer at least partially to the big cores in our development machines so we could evaluate too, but we'll take your word on the impact on real hardware)
I'd be interested to read more about how WAMR pre-resolves targets -- I haven't looked into this before. If the source is still a function pointer (ie an index into the funcref table) I think we still need all checks, unless one can see that e.g. all table entries are non-null and all signatures match (latter is unlikely for all address-taken funcs in a real program). But curious to see what you come up with
would you say that your cfallin/experiment-fast-calls branch is a good leaping off point? or is there something you learned in that branch you want to avoid, or do better/differently?
ah, yes, I was playing with caching the last target; this was in the context of JS compiled to Wasm where almost every bytecode became an inline cache (indirect funcref call). I removed it because I didn't actually see any speedups
If you want to rebuild a 1-entry resolved-function-pointer cache and if it shows speedups, I think we'd be happy to put it back in though. (There's a version I actually landed in tree, then deleted a few months later, if you want to find that)
yea, I'm testing on Apple Watch SE2, iPhone XS, iPhone 12, iPhone 16, and M4 to try and make sure things remain a win in a reasonably diversified way. I have x64 Windows machines I can test on, but frankly I need to keep scope in mind since I have other paid projects going on :)
it looked like your branch didn't do polymorphic ICs, cross tiers, and seems to leave invalidation to a generation-counter scheme. is that right? also, what hardware did you benchmark/profile on?
table.set/... existed that could mutate it, and table not exported), so the cache remained valid for rest of execution
Benchmarking was on both a generic desktop AMD x86 and Apple hardware (M1 and M2) at the time
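The immutability condition mentioned here (no table-mutating instruction in the module, and the table not exported) can be sketched as a simple predicate. This is a toy model: `Instr`, `TableInfo`, and `table_is_immutable` are hypothetical names, and treating imported tables as mutable is an extra conservative assumption that is not stated in the thread.

```rust
/// Toy instruction kinds relevant to table mutability. Hypothetical
/// model, not Wasmtime's IR.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Instr {
    TableGet,
    TableSet,
    TableGrow,
    TableFill,
    TableCopy,
    TableInit,
    CallIndirect,
    Other,
}

/// Summary of how one table is used by a module.
pub struct TableInfo<'a> {
    /// Exported tables can be mutated by the host or other instances.
    pub exported: bool,
    /// Assumption of this sketch: imported tables are treated as mutable.
    pub imported: bool,
    /// Every instruction in the module that refers to this table.
    pub instrs: &'a [Instr],
}

/// True when nothing visible can change the table after instantiation,
/// so a per-site cache of resolved targets would stay valid.
pub fn table_is_immutable(info: &TableInfo) -> bool {
    let mutated = info.instrs.iter().any(|i| {
        matches!(
            i,
            Instr::TableSet
                | Instr::TableGrow
                | Instr::TableFill
                | Instr::TableCopy
                | Instr::TableInit
        )
    });
    !info.exported && !info.imported && !mutated
}
```

Reads such as `table.get` and `call_indirect` do not affect the result; only writes and external visibility do.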
I'm guessing on any x64 (or Apple M-series) CPU from the last ~7 years, the great branch predictors (and huge BTB) means the bounds check, signature check, and indirect call gets pipelined and predicted within 10 cycles once the fast path is warmed up. if you looked at hardcore CPU counters (branch misses, L1 cache misses, etc) you'd probably see movement but not e2e movement until you paired that optimization with something else. on even newer iPhone and M2 efficiency cores, you'd probably get a wall clock difference (depending on the benchmark)
anyway, I'll come up with a prototype and some numbers in a PR in my fork for you all to react to. I'll try to solve for the multi-dimensional constraints as best I can :D
lmk if you have any guidance/concerns/preferences in the meantime!
sounds good -- no further thoughts other than that if the 1-entry cache is neutral for big cores but helps on small cores, that'd be a great outcome. looking forward to knowing either way!
Matt Hargett said:
I'm focused on how to get reasonably fast Time To First Instruction
not sure if you've seen these docs, but you'll want to make sure you are pre-compiling Wasm to pulley bytecode before sending it to your device: https://docs.wasmtime.dev/examples-pre-compiling-wasm.html
I need the p2p module transfer to be wasm standard Because Reasons(tm), so I'm okay with absorbing some first-launch cold-start time where the device will cache the IR/pcode for later launches. For reasonably fast, I'm thinking 10kb (module size)/sec and staying under 100MB memory usage, versus hardcore AOT (like in .NET Native) which is 2+GB and sometimes tens of minutes
That's fine (and definitely more network-efficient; .wasms are much smaller than the corresponding AOT-compiled artifacts), just be aware that we don't necessarily guarantee fast compilation, and Pulley still uses Cranelift (to target the Pulley virtual CPU) so you're still exposed to potential compilation-resource-exhaustion DoS vectors if you don't trust the .wasm. Of course we still try our best to avoid those. And if your compilation speed target is 10kB of Wasm bytecode per sec then you should be fine by one or two orders of magnitude for non-pathological inputs...
@Chris Fallin what was the JS compiled to WASM example you used for your benchmark? how was it compiled?
Ah, that is now all upstream in StarlingMonkey; build any interesting JS benchmark with --enable-aot and you'll get a .wasm full of indirect calls (I don't have one ready-made for you at the moment, sorry!)
(https://github.com/bytecodealliance/StarlingMonkey/ for reference)
for testing you could write something as simple as a five-line iterative Fibonacci computation or something; the important part is having an inner loop with JS operators that have ICs (which is ~all of them, e.g. +)
besides my Rust SPC and S3M player workload, I'll use a JS workload that I used to drive interpreter optimizations in Luau (when I was at Roblox), JSC and hermes (when I was at PlayStation): graphql. I'll cross-check with microbenchmarks as well, but I'm more interested in integration-case outcomes devs/users will "feel" :D
Sounds good! Note that if you do it in stock StarlingMonkey you'll get a wasi-http component out of the end of the compilation pipeline, so you'll need to put the benchmark inside a GET handler for / or whatever, then run with wasmtime serve and hit it with curl; a bit elaborate but that's what it was built for. There's a way to build the bare JS shell in a way that can directly be compiled with weval that I used for development, but that's way unnecessary for your purposes probably
(you can also run StarlingMonkey as a CLI tool with dynamically loaded JS)
oh, except that doesn't work with AOT compilation, of course
I opted to not use StarlingMonkey, but to try direct JS to WASM compilers: AssemblyScript and Porffor. I frankly never understood the scripting VM inside a VM thing like Uno (mono on WASM), and that overhead would kill my WASM interpreter-on-watch target. I extracted a subset of graphql-js's validation workload to test both AssemblyScript and Porffor, and it's pretty neat that their unique approaches pushed me to generalize things further.
I have a draft commit stack PR up at: https://github.com/rebeckerspecialties/wasmtime/pull/2
this was a bit tedious, but very interesting. with each commit starting at commit 2, we optimize at each CPU pipeline stage and move the "pileup" further down until we get stuck at the branch predictor. The next logical step is a fusion operation, but there's enough hard-won info here to get feedback/guidance/flames on what to do different/next.
Just to note, if you weren't aware, that binaryen has a 'Directize' pass that tries to convert indirect to direct calls.
I wasn't aware, thanks for the tip! :D
I can see why @Chris Fallin hit a wall in his branch and reverted the previous optimization PR for this area. it's pretty tricky to balance everything so that you get a user-visible, end-to-end uplift across many real-world sample programs. I think it's solvable with enough time and a dedicated benchmarking setup to cross-check assumptions.
I frankly never understood the scripting VM inside a VM thing like Uno (mono on WASM), and that overhead would kill my WASM interpreter-on-watch target.
The motivation is very straight-forward: compatibility. Neither AssemblyScript nor Porffor are anywhere close to being able to run most existing applications, and lots of us have that as a strong requirement. If someone gave me an AOT compiler from JS to Wasm that solves that issue, I'd drop work on StarlingMonkey immediately. In the meantime, we have weval giving us 3x-5x speedups for most code running in StarlingMonkey. Still not comparable to optimizing JITs, but very much sufficient for a lot of real-world use cases
I opted to not use StarlingMonkey, but to try direct JS to WASM compilers: AssemblyScript and Porffor.
FWIW, that's missing the original motivation for the benchmark then: wevaled-StarlingMonkey code has a whole bunch of indirect call sites, but to my knowledge neither AS nor Porffor use ICs/emit many indirect calls. You might as well drop the benchmark then (but if the others provide the coverage you want / still represent your real use-case then that's fine).
Interesting results, thanks. Extremely verbose (I would expect nothing less from our new clanker friends? -- no judgment, they're great for exploration) but interesting.
I'll note it seems that your branch doesn't do the one main optimization I had put in / experimented with before: a 1-entry target cache. Basically if we know the table is immutable, we can statically cache in the vmctx that (say) the last call at this call_indirect site took index 23, and that resolved to funcptr 0x1234. Unless you cache callee vmctx you'll want to gate on caller vmctx == callee vmctx (intra-instance call) as well. At least at one point in time during weval/SpiderMonkey's co-evolution, that gave a few percent speedup, but as noted above the speedup eventually disappeared with other optimizations.
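A minimal sketch of the 1-entry target cache as described, assuming the table has already been shown to be immutable. `SiteCache`, `FuncRef`, and `call_indirect_cached` are illustrative names, and `vmctx` is modelled as a plain integer id; none of this is the code that was landed and later removed from Wasmtime.

```rust
pub type Code = fn(i32) -> i32;

/// Toy table entry: signature id, owning instance, and code pointer.
pub struct FuncRef {
    pub sig: u32,
    pub vmctx: usize,
    pub code: Code,
}

/// One cache slot per `call_indirect` site: last index seen and the
/// code pointer it resolved to. Counters are only for demonstration.
#[derive(Default)]
pub struct SiteCache {
    entry: Option<(u32, Code)>,
    pub hits: u64,
    pub misses: u64,
}

pub fn call_indirect_cached(
    cache: &mut SiteCache,
    table: &[Option<FuncRef>],
    caller_vmctx: usize,
    expected_sig: u32,
    index: u32,
    arg: i32,
) -> Option<i32> {
    // Fast path: same index as last time at this site. The expected
    // signature is fixed per site and the table is assumed immutable,
    // so the checks done when the entry was cached still hold.
    if let Some((cached_index, code)) = cache.entry {
        if cached_index == index {
            cache.hits += 1;
            return Some(code(arg));
        }
    }
    // Slow path: bounds check, null check, signature check.
    cache.misses += 1;
    let f = table.get(index as usize)?.as_ref()?;
    if f.sig != expected_sig {
        return None;
    }
    // The slot stores no callee vmctx, so only intra-instance targets
    // are cached; cross-instance calls always take the slow path.
    if f.vmctx == caller_vmctx {
        cache.entry = Some((index, f.code));
    }
    Some((f.code)(arg))
}

/// Example callees used to exercise the sketch.
pub fn add_one(x: i32) -> i32 {
    x + 1
}
pub fn negate(x: i32) -> i32 {
    -x
}
```

The fast path trades three checks for one compare, but it still ends in an indirect call, which is consistent with the observation that the win is small on cores with strong branch predictors.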
Also, overall, I agree with Alex above that merging Pulley ops into macro-ops is going to be by far the biggest win here, IMHO -- interpreter dispatch overhead dominates everything else.
they actually all have indirect call sites, I pulled in sqlite benchmark out of curiosity and to make sure I don't apply the optimizations when the table is mutable (which it is in sqlite due to the export)
| workload | source | call_indirect sites | table | dispatch shape |
|---|---|---|---|---|
| call_indirect.wasm | hand-written Rust | 22 | (table 17 17), internal | tight monomorphic loop, 200 K dispatches/iter |
| graphql-validation-as.wasm | AssemblyScript port | 13 | (table 107 107), slot 0 null | optimizer-friendly baseline; predicate doesn't fire |
| graphql-validation-porf.wasm | Porffor port (JS→wasm) | 131 | (table 46 46), internal | megamorphic JS dispatch |
| xmrsplayer.wasm | xmrsplayer 0.11.0 + unreal.s3m music file | 27 | (table 10 10), internal | real-world tracker-player dispatch through 12 dyn-trait sites in the per-tick effect pipeline. Each call renders one CoreAudio-shaped buffer (1024 stereo frames ≈ 23 ms of audio); player state is persistent across calls and the song loops indefinitely (max_loop_count = 0), so pick_iters lands on a real iter distribution rather than collapsing to a single 15 s shot |
| sqlite3.wasm | sightglass | 595 | (table 620), exported as `__indirect_function_table` | host-mutable; predicate correctly OFF |
happy to add StarlingMonkey into the mix here if it's something people are deploying to production today in the server context to make sure it's all balancing out.
the need for fused-ops/macro-ops/superinstructions is the logical outcome of my measurements as well and it's at the end of the fork PR's description. I'd love some feedback on the approach and ordering: xband_brif_eq_zero first, funcref_load_dispatch second, and then the peephole pass to leverage them both to greatest effect. being able to fuse them correctly/efficiently depends on the mutability knowledge/guarantee, so it would be an add-on. let me know in what order (and size) you'd want the contributions broken up into for easy review and incremental merging.
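As a rough illustration of the ordering described above, here is a toy peephole pass that fuses adjacent op pairs into superinstructions. The op names are stand-ins loosely modelled on the ones mentioned (`AndBrIfZero` for something like `xband_brif_eq_zero`, `LoadFuncRefDispatch` for something like `funcref_load_dispatch`); their exact semantics are assumptions of this sketch. Gating the funcref fusion on a `table_immutable` flag is likewise one reading of the mutability dependency, not the PR's actual logic.

```rust
/// Toy bytecode; the fused variants are hypothetical superinstructions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    /// Bitwise AND into a scratch register.
    And,
    /// Branch by the given offset if the previous result is zero.
    BrIfZero(i32),
    /// Fused `And` + `BrIfZero`.
    AndBrIfZero(i32),
    /// Load a function reference from the table.
    LoadFuncRef,
    /// Jump to the loaded function reference.
    Dispatch,
    /// Fused `LoadFuncRef` + `Dispatch`.
    LoadFuncRefDispatch,
    /// Anything else.
    Other,
}

/// Single left-to-right peephole pass over a straight-line op sequence.
/// A real pass would also have to keep branch targets from landing
/// between two ops it fuses and fix up offsets; both are ignored here.
pub fn fuse(ops: &[Op], table_immutable: bool) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        let next = ops.get(i + 1).copied();
        match (ops[i], next) {
            (Op::And, Some(Op::BrIfZero(offset))) => {
                out.push(Op::AndBrIfZero(offset));
                i += 2;
            }
            // Only fuse the funcref pair when the table cannot change.
            (Op::LoadFuncRef, Some(Op::Dispatch)) if table_immutable => {
                out.push(Op::LoadFuncRefDispatch);
                i += 2;
            }
            (op, _) => {
                out.push(op);
                i += 1;
            }
        }
    }
    out
}
```

Keeping the two fusions as separate match arms mirrors the proposed split: each superinstruction can be reviewed and landed on its own, with the pass wiring them together afterwards.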
I did look at your cache in both your branch and the PR that was merged and reverted. I thought it would be better to tee that (runtime optimizations) up better with static guarantees at bytecode loading/lowering/transforming time, but also so megamorphic chains (IC A -> IC B -> IC D) that appear common could be tackled as an incremental step with the same machinery. So, not really a difference of opinion on the approach -- they aren't mutually exclusive by any means :D
again, sorry for the wall of text. in the PR, I wanted to present all the facts and information as I ran into them so there was a clear trail of breadcrumbs in case I made a mistake or assumption at some step along the way. Verbosity is a general problem I have, but also when it comes to performance/optimization I try to be very rigorous and data-driven when proving/disproving my theory/approach.
ok, I've done some more integration and testing of the indirect cache. there's a couple of reasons why it's break-even or slightly regressing and bizarrely it's worse on M4 than on A12 due to quirks in the differences between the two microarchitectures when there is either an indirect cache Hit or a Miss. I just did a little experiment and I can get a consistent win if we make the indirect cache 2-way associative, and store it alongside the opcode, so that they're both in the L1 cache line. This would mean that the Pulley bytecode pages would need to be writable so that the indirect cache can be updated during execution.
question: can I make pulley bytecode memory pages writeable as a whole? or should I try and poke "holes" of writeable pages? or is this not a good idea because <Very Good Reason(tm)>?
note that this colocation of the IC is something done by the VM team at Roblox in response to some of my app teams' benchmarking and performance improvement requests. that runtime also had to deal with UGC and be resilient to fuzzing as well, but I don't know how much more sensitive to potential attack surface you all are :)
Hmm, we definitely don't want to make bytecode memory writable. The most immediate reason is that it's shared among multiple instances, so that's a data race (and hence undefined behavior) if you have more than one instance going.
I'm somewhat skeptical that you need to put the data alongside the opcode for good cache behavior though -- most or all of the opcodes around the call-indirect opcode will also actually be used (so that cache line is not "wasted" in terms of cache occupancy vs byte-level working set size) and you can separately pack all the cache slots together in another array (classical struct-of-arrays). As long as a given workload fits in L1, the L1D doesn't care how you permute the data. Also if you put adjacent call-indirects' caches next to each other and you're streaming through code, it'll be a fairly prefetchable streaming sequence.
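The layout suggested here can be sketched as follows: bytecode stays read-only and shared between instances, and each instance owns a densely packed side table of cache slots indexed by call-site number. `SharedCode`, `InstanceCaches`, and the idea that each call-indirect op carries a site index are assumptions made for the sketch.

```rust
use std::sync::Arc;

/// Read-only bytecode shared by every instance of a module.
pub struct SharedCode {
    pub bytes: Arc<[u8]>,
    /// Assumed: each call-indirect op carries a site index in 0..num_call_sites.
    pub num_call_sites: usize,
}

/// Sentinel marking an empty cache slot.
const EMPTY: u32 = u32::MAX;

/// Per-instance cache slots in struct-of-arrays form: adjacent call
/// sites have adjacent slots, so streaming through code walks these
/// arrays sequentially.
pub struct InstanceCaches {
    cached_index: Vec<u32>,
    cached_target: Vec<usize>,
}

impl InstanceCaches {
    pub fn new(code: &SharedCode) -> Self {
        Self {
            cached_index: vec![EMPTY; code.num_call_sites],
            cached_target: vec![0; code.num_call_sites],
        }
    }

    /// Returns the cached target if `site` last resolved `index`.
    pub fn lookup(&self, site: usize, index: u32) -> Option<usize> {
        if index != EMPTY && self.cached_index[site] == index {
            Some(self.cached_target[site])
        } else {
            None
        }
    }

    /// Records that `site` resolved `index` to `target`.
    pub fn update(&mut self, site: usize, index: u32, target: usize) {
        self.cached_index[site] = index;
        self.cached_target[site] = target;
    }
}
```

Because only `InstanceCaches` is ever written, two instances can run the same `SharedCode` concurrently without the data race that writable bytecode pages would introduce.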
ok, I definitely hit a wall across multiple platforms trying the indirect_call cache angle. TL;DR: the losses are in the branch predictors on lower-power cores (including Pentium Gold x64), across multiple compiler/transpiler outputs (C++ vtable from StarlingMonkey, GraphQL schema validation conversion from AssemblyScript, etc). This was super educational, thanks for giving me the feedback.
that said, the table mutability analysis and the opcode elision I think are defensibly useful. I'll submit that as a self-contained PR and start trying the fused op / superinstruction approach that gets WAMR its clearer wins. I'll keep it incremental, but if you want to combine the two halves together and land them at the same time, that's fine with me. I'll report back when I have undeniable gains with wall clock and CPU trace bottleneck evidence
I finally found a pretty great end-to-end outcome in the benchmarks, including the new ones I added derived from Starlight. I'd be curious for people to reproduce the benchmarks that were flat before.
https://github.com/rebeckerspecialties/wasmtime/pull/4
and I pushed up the sample app I've been using to drive the on-device measurements. I'm happy to donate/transfer this repo if that's useful.
https://github.com/rebeckerspecialties/wasm-benchmark
the win is more obvious, even if it doesn't generalize all the way back to iPhone XS's efficiency cores. I don't think the "phase 5" makes sense since it starts to break the design/structure of wasmtime to date, but I'm interested in other people's opinions.
Let me know if there's other workloads you'd like me to benchmark on (or devices, assuming I have them). if there's no major objections/requests, I'll do another fuzzing pass and then submit the PRs to the upstream repo (table mutation analysis and the fused opcode).
@Chris Fallin lmk if there's any other rigor or constraints that I missed
@Matt Hargett please note our policy on AI tool use. At the very least, we ask that AI tools aren't noted as co-authors, and that instead the human contributor takes full responsibility for the entirety of the changes.
Note that that doesn't necessarily mean that anything needs to change about the code itself—I don't have the domain expertise to assess that, nor do our policies forbid substantial use of LLMs. In addition to requesting removal of the by-line I mainly wanted to give a heads-up to please be prepared to fully engage in conversations about these changes yourself, and feel fully confident that they're right, instead of letting reviewers be the first line of assessment of the LLM's output. For all I know you're already in that position, so please don't take this as criticism!
oh, and since I hadn't seen it before: we also request that contributors not use automated AI review tools. The idea is that when a PR is opened, the contributor is confident that review by humans is the right next step, with any automated reviewing already done
@Till Schneidereit sure thing, I can clean those things up before I submit the PR into the bytecodealliance repo. right now, I'm staging in my fork to make sure CI passes there. I have codex reviewing all PRs in my org to cross-check anything I didn't think of, same as the remote CI pass. I review all diffs and run style checkers and unit tests before commits are pushed, and in this case I also do multi-device benchmarks.
Like not submitting a PR in your org for the indirect-call cache because I couldn't get the clear wins across CPU architectures (despite the approach working well in Luau when I was a Principal at Roblox). That said, I may ask for guidance when it's not clear what will be accepted, and I try to provide as much data as I can so people can follow the chain of logic (and point out exactly where I went wrong). This is how I've often worked in OSS (like my first merged Linux TCP/IP patch I emailed to Linus in 1994), even before coding agents :)
btw, if you're asking me to disable codex reviews on forks in my own org, I suppose I can do that, but the policy doesn't make it clear that automated bot responses are banned from contributors' orgs and repos outside of the bytecodealliance org.
I believe our policy applies only to interactions within the BA org. I suppose the line could be a bit fuzzy wrt what is "in BA" (e.g. linking large AI-generated walls of text here is a little bit against the spirit of "no extractive labor", but at the same time it's your fork). I would say, in any interaction where we are directly engaging e.g. a PR review, I would expect careful human review before the PR is posted, and I'd expect a coherent, human-written or at least human-edited not-wall-of-text PR description
re: your linked PR above, I will say it's an enormous wall of text, my time is fairly limited, so for best engagement I'd usually appreciate something like a paragraph summary ("I tried A, B, C; A and B worked well; I found the call-indirect cache fell down because most callsites are actually polymorphic / the loads miss the cache / ..."). What you have there looks more like the detailed play-by-play of a long experiment session
thanks for your time and patience, I'll try to not lay out _all_ the evidence like I usually do from now on.
TL;DR: with the fused op similar to what WAMR does, there's consistent wins of 4-8% across the benchmarks for iPhone 12 e-cores, M4 e-cores, and Apple Watch 6/SE2. the wins are there on iPhone XS, but not as consistent. branch prediction pressure is the thing that makes optimizing in this area hard to translate across microarchitectures. this demonstrably helps close the benchmark gap with WAMR on Apple Watch, which was the original motivator.
if that doesn't sound too incredible, you can skip the gory details in the PR description. the diff looks to me like it fits within the structure and style (and why I didn't implement/propose phase 5), so that's what I was looking to get feedback on. if you'd prefer that I hold off on asking for feedback until I've deployed the change to users, that is a fine barrier to set. (that used to be my default for OSS contribution when I was working within large companies to lend more credibility to the contribution.)
4-8% wins is pretty good, and if it doesn't regress anything else significantly, then that definitely seems like something we would be interested in (complexity and maintenance being the other thing that we would need to consider, but that can be done in a PR so we can see the diff, IMO)
+1 -- happy to take a look at a PR for the opcode fusing
@Matt Hargett my apologies, I clearly didn't look closely enough and missed the fact that the PR isn't to upstream :confused: Certainly our policies don't apply to what you're doing in your own repos!
As for laying out evidence, it's always a fine line to walk: we certainly do want evidence on whether changes are worthwhile, in particular if they're complex and have subtle implications, as is the case here. At the same time, it's very easy for maintainers to get overwhelmed by too much information, and sometimes hard to know which parts to really dig into and which to skim.
One way to try to thread that needle is to structure what you're sharing in a way that puts the top-level takeaways at, well, the top, mention in which order which details will follow, and then have sections for each aspect of details, so reviewers can dig into the aspects they find most critical to engage with more deeply
Also the <details> tag is a godsend
ok, I split it into two PRs for easier review, one can be merged before the other.
https://github.com/bytecodealliance/wasmtime/pull/13445
https://github.com/bytecodealliance/wasmtime/pull/13446
Last updated: Jun 01 2026 at 09:49 UTC