Stream: git-wasmtime

Topic: wasmtime / issue #13254 Compile-time performance of perfo...


view this post on Zulip Wasmtime GitHub notifications bot (May 01 2026 at 23:58):

alexcrichton opened issue #13254:

In the interest of turning on inlining by default, one of the things we identified as desirable to do ahead of time was to investigate the compile-time performance of enabling inlining. This analysis is done with https://github.com/bytecodealliance/wasmtime/pull/13250 as a base, notably where the intent is for -Cinlining=intrinsics to become the default.

The tests here were done on app.wasm.gz, a hello-world componentize-py application.

Initially, -Cinlining=no is 23% faster than -Cinlining=intrinsics.

<details>

Benchmark 1: ./baseline compile app.wasm -o/dev/null
  Time (mean ± σ):     498.1 ms ±  33.2 ms    [User: 7217.8 ms, System: 307.1 ms]
  Range (min … max):   464.9 ms … 552.3 ms    10 runs

Benchmark 2: ./baseline compile app.wasm -Cinlining=intrinsics -o/dev/null
  Time (mean ± σ):     612.3 ms ±  12.4 ms    [User: 7309.4 ms, System: 999.1 ms]
  Range (min … max):   587.1 ms … 627.2 ms    10 runs

Summary
  ./baseline compile app.wasm -o/dev/null ran
    1.23 ± 0.09 times faster than ./baseline compile app.wasm -Cinlining=intrinsics -o/dev/null

</details>

I dug more into this and did a few data structure optimizations which didn't really move the needle all that much. The main optimization opportunity I found was to prune the edges of the call graph that we create to only include "can this callee ever be inlined". This is an easy deduction for -Cinlining=intrinsics because we just look for FuncKey::UnsafeIntrinsic. This change reduced the number of strata layers from 39 to 1, where that's the precise number of layers to be expected if we only want to inline intrinsics.
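As a rough illustration of the pruning described above, the following Rust sketch filters a call graph down to the edges whose callee could ever be inlined under a given mode. The `FuncKey`, `Inlining`, and `CallGraph` types are simplified stand-ins invented for this example, not Wasmtime's actual definitions; only the idea of checking for an `UnsafeIntrinsic` callee comes from the description above.

```rust
use std::collections::HashMap;

/// Toy stand-in for a function identifier (hypothetical, not Wasmtime's `FuncKey`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum FuncKey {
    DefinedWasm(u32),
    UnsafeIntrinsic(u32),
}

/// The two modes compared in this issue: `-Cinlining=no` and `-Cinlining=intrinsics`.
#[derive(Clone, Copy)]
enum Inlining {
    No,
    Intrinsics,
}

/// Caller -> list of callees.
type CallGraph = HashMap<FuncKey, Vec<FuncKey>>;

/// "Can this callee ever be inlined?" under the given mode.
fn can_ever_inline(mode: Inlining, callee: FuncKey) -> bool {
    match mode {
        Inlining::No => false,
        Inlining::Intrinsics => matches!(callee, FuncKey::UnsafeIntrinsic(_)),
    }
}

/// Keep only the edges whose callee could be inlined; every caller stays in the map.
fn prune(graph: &CallGraph, mode: Inlining) -> CallGraph {
    graph
        .iter()
        .map(|(caller, callees)| {
            let kept: Vec<FuncKey> = callees
                .iter()
                .copied()
                .filter(|callee| can_ever_inline(mode, *callee))
                .collect();
            (*caller, kept)
        })
        .collect()
}

fn edge_count(graph: &CallGraph) -> usize {
    graph.values().map(Vec::len).sum()
}

/// A tiny made-up graph: two wasm functions calling each other, one calling an intrinsic.
fn example_graph() -> CallGraph {
    let mut graph = CallGraph::new();
    graph.insert(
        FuncKey::DefinedWasm(0),
        vec![FuncKey::DefinedWasm(1), FuncKey::UnsafeIntrinsic(0)],
    );
    graph.insert(FuncKey::DefinedWasm(1), vec![FuncKey::DefinedWasm(0)]);
    graph
}

fn main() {
    let graph = example_graph();
    for mode in [Inlining::No, Inlining::Intrinsics] {
        let pruned = prune(&graph, mode);
        println!("edges: {} -> {}", edge_count(&graph), edge_count(&pruned));
    }
}
```

In this toy graph the wasm-to-wasm cycle disappears once only intrinsic callees are kept, which is the kind of simplification that lets the later layering step collapse.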

With this change, plus the minor data structure optimizations, -Cinlining=no is still 16% faster than -Cinlining=intrinsics.

<details>

Benchmark 1: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null
  Time (mean ± σ):     515.4 ms ±  32.5 ms    [User: 7209.5 ms, System: 306.6 ms]
  Range (min … max):   468.7 ms … 558.4 ms    10 runs

Benchmark 2: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -Cinlining=intrinsics -o/dev/null
  Time (mean ± σ):     597.7 ms ±  20.4 ms    [User: 7266.1 ms, System: 1006.8 ms]
  Range (min … max):   570.9 ms … 629.0 ms    10 runs

Summary
  ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null ran
    1.16 ± 0.08 times faster than ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -Cinlining=intrinsics -o/dev/null

</details>

With further investigation I think this is a fundamental tradeoff that we have no realistic way of avoiding with the current architecture. To showcase this I modified the "compile without inlining" branch of Wasmtime to do that in two steps. The first step runs f(compiler) in parallel, and the next step performs finish_compiling in parallel.

With this change, I found that the single-parallel-loop approach is 17% faster than the two-parallel-loop approach.

<details>

(the hack here is that the "two parallel loops" is conditional on the A env var being present, I didn't make this a formal CLI option or something like that)

Benchmark 1: A=1 ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null
  Time (mean ± σ):     581.0 ms ±  19.6 ms    [User: 7073.9 ms, System: 1033.0 ms]
  Range (min … max):   556.8 ms … 614.8 ms    10 runs

Benchmark 2: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm  -o/dev/null
  Time (mean ± σ):     494.6 ms ±  18.7 ms    [User: 7238.7 ms, System: 298.8 ms]
  Range (min … max):   467.7 ms … 531.1 ms    10 runs

Summary
  ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm  -o/dev/null ran
    1.17 ± 0.06 times faster than A=1 ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null

</details>

So, in essence, what I'm finding is that after specializing the call graph to -Cinlining=intrinsics the slowdown is the exact same as if inlining didn't happen at all, assuming that there's a "join point" in compilation (which there isn't today because there's no need).


This is where I have reached the conclusion that this is a fundamental tradeoff right now. Compilation without inlining intrinsically has no necessary synchronization between functions and all functions can be compiled 100% in parallel. This lack of synchronization means we get to keep CPUs nice and busy the entire time. With inlining, however, we fundamentally have a synchronization point where for A to consider inlining B it means that B has to finish at least being translated. The architecture currently is to translate everything in parallel, then perform inlining, then optimize/codegen.
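To make the two shapes concrete, here is a minimal sketch contrasting a single parallel loop with a translate-then-finish pipeline that has a join point in the middle. It uses `std::thread::scope` as a self-contained stand-in for a parallel loop, and `translate`, `finish`, and `par_map` are hypothetical helpers invented for this example rather than Wasmtime functions.

```rust
use std::thread;

/// Hypothetical stage 1: turn a function index into some "IR".
fn translate(func: u32) -> String {
    format!("ir{func}")
}

/// Hypothetical stage 2: optimize/codegen the IR.
fn finish(ir: &str) -> String {
    format!("code({ir})")
}

/// Minimal parallel map: splits `items` into chunks, one scoped thread per chunk.
/// Leaving the scope joins every thread, so each call is one "parallel loop".
fn par_map<T: Sync, U: Send>(
    items: &[T],
    workers: usize,
    f: impl Fn(&T) -> U + Sync,
) -> Vec<U> {
    let chunk = items.len().div_ceil(workers.max(1)).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(f).collect::<Vec<U>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

/// No inlining: each function is translated and finished back to back,
/// with no synchronization between functions.
fn compile_fused(funcs: &[u32], workers: usize) -> Vec<String> {
    par_map(funcs, workers, |f| finish(&translate(*f)))
}

/// With inlining: every function must be translated before any is finished.
fn compile_two_phase(funcs: &[u32], workers: usize) -> Vec<String> {
    let irs = par_map(funcs, workers, |f| translate(*f));
    // Join point: inlining decisions would run here, with all IR available.
    par_map(&irs, workers, |ir| finish(ir))
}

fn main() {
    let funcs: Vec<u32> = (0..8).collect();
    assert_eq!(compile_fused(&funcs, 4), compile_two_phase(&funcs, 4));
    println!("{:?}", compile_two_phase(&funcs, 4));
}
```

Both variants produce the same output; the difference is only that the second cannot start finishing any function until the slowest translation is done.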

My hypothesis is that this "join point" drastically cuts the amount of parallelization that's possible. Most "big modules" end up having basically one function that takes forever to compile. The nice part is that while that big function is compiling all the little functions can get done in parallel. With this join point, however, we cut the amount of parallelism that happens by causing everything to wait.

For example, in the above wasm, I see wasm[31]::function[254]::__wasm_apply_data_relocs take 33ms to translate and 274ms to optimize/codegen. The next largest function is wasm[31]::function[3488]::_PyUnicode_InitStaticStrings clocking in at 15ms/93ms. The functions then rapidly decrease from there. If the long function gets stuck towards the end of compilation then it means that CPUs are going to be much more idle than they were previously since we're holding up work.


Overall I'm not sure what the best conclusion here is. Other possible schemes for synchronizing the inlining decision feel way more complicated than the current "just loop twice", making me hesitant. Given that, I wanted to open up an issue here. Do others have thoughts on this? For example is this a compile-time cost we're willing to eat? Is this so unacceptable we can never turn inlining on by default? Other ideas?

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2026 at 19:44):

fitzgen commented on issue #13254:

> I dug more into this and did a few data structure optimizations which didn't really move the needle all that much. The main optimization opportunity I found was to prune the edges of the call graph that we create to only include "can this callee ever be inlined". This is an easy deduction for -Cinlining=intrinsics because we just look for FuncKey::UnsafeIntrinsic. This change reduced the number of strata layers from 39 to 1, where that's the precise number of layers to be expected if we only want to inline intrinsics.

We should land this, even if it doesn't get us 100% to parity because it should be a speed up even in other scenarios (eg in inter-module-only inlining this would filter out all intra-module call graph edges).

> My hypothesis is that this "join point" drastically cuts the amount of parallelization that's possible. Most "big modules" end up having basically one function that takes forever to compile. The nice part is that while that big function is compiling all the little functions can get done in parallel. With this join point, however, we cut the amount of parallelism that happens by causing everything to wait.

Right now everything waits for the join point, but we could perhaps have only functions that appear as either a callee or caller of an inlinable edge in the call graph wait for that synchronization point, and have other functions that will never participate in inlining eagerly finish compilation instead of waiting? For the intrinsics-only mode, this should allow most functions to eagerly finish compilation.
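A minimal sketch of that partitioning, assuming functions are identified by plain indices and the inlinable edges have already been computed (both are assumptions of this example, as is the `partition` helper itself):

```rust
use std::collections::BTreeSet;

/// Splits functions `0..num_funcs` into those that appear on some inlinable
/// edge (and so must wait for the synchronization point) and those that can
/// finish compilation eagerly. Edges are `(caller, callee)` pairs.
fn partition(
    num_funcs: usize,
    inlinable_edges: &[(usize, usize)],
) -> (BTreeSet<usize>, BTreeSet<usize>) {
    let mut must_wait = BTreeSet::new();
    for &(caller, callee) in inlinable_edges {
        must_wait.insert(caller);
        must_wait.insert(callee);
    }
    let eager = (0..num_funcs)
        .filter(|f| !must_wait.contains(f))
        .collect();
    (must_wait, eager)
}

fn main() {
    // Five functions; functions 0 and 2 both call the inlinable function 4.
    let (must_wait, eager) = partition(5, &[(0, 4), (2, 4)]);
    println!("wait: {must_wait:?}, eager: {eager:?}");
}
```

In the intrinsics-only mode the `must_wait` set would contain just the intrinsics and their direct callers, so everything in `eager` could skip the join point entirely.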

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2026 at 20:22):

alexcrichton commented on issue #13254:

Agreed yeah I was mostly just waiting til after https://github.com/bytecodealliance/wasmtime/pull/13250 to sort out the changes.

For architectural changes I'm not sure what would be feasible really. What we want is to start all translation in parallel, halt if there are inlinable callees until the callees have their IR, and then continue. Rayon doesn't provide such synchronization (or at least not that I'm aware of), and even a coarse "wait for all inlinable callees" I don't think is easily possible with Rayon. AFAIK Rayon is basically intended for "big parallel loop" style workflows, and something in the middle, like we have with a synchronization point, I think would involve refactoring entirely away from Rayon for compilation. At that point I'm not entirely sure how worth it it would be.
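For a sense of what "halt until the callees have their IR" could look like outside of Rayon, here is a small sketch using only the standard library. Every name in it (`Latch`, `compile_all`) is hypothetical, and it sidesteps the hard scheduling question by spawning one OS thread per function, which would not be practical for real modules; with a fixed-size worker pool, blocking like this could leave no thread free to translate the callee being waited on.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// One-shot "this function's IR is ready" flag that other threads can block on.
#[derive(Default)]
struct Latch {
    ready: Mutex<bool>,
    cv: Condvar,
}

impl Latch {
    fn set(&self) {
        *self.ready.lock().unwrap() = true;
        self.cv.notify_all();
    }

    fn wait(&self) {
        let mut ready = self.ready.lock().unwrap();
        while !*ready {
            ready = self.cv.wait(ready).unwrap();
        }
    }
}

/// `callees[i]` lists the inlinable callees of function `i`.
/// Returns, for each function, how many callee IRs it waited for.
fn compile_all(callees: &[Vec<usize>]) -> Vec<usize> {
    let latches: Vec<Latch> = callees.iter().map(|_| Latch::default()).collect();
    let latches = &latches;
    thread::scope(|s| {
        let handles: Vec<_> = callees
            .iter()
            .enumerate()
            .map(|(i, deps)| {
                s.spawn(move || {
                    // Stage 1: translate function `i` (elided), then publish its IR.
                    latches[i].set();
                    // Stage 2: block only on this function's inlinable callees.
                    // Translation never waits, so call-graph cycles cannot deadlock here.
                    for &callee in deps {
                        latches[callee].wait();
                    }
                    // Stage 3: inlining, optimization, and codegen would go here.
                    deps.len()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // Function 0 may inline 1 and 2; function 2 may inline 0 (a cycle).
    let waited = compile_all(&[vec![1, 2], vec![], vec![0]]);
    println!("{waited:?}");
}
```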

One, orthogonal, thing I've realized. Here's the speedup of "no inlining" vs -Cinlining=intrinsics w.r.t. threads:

| threads | speedup |
|---------|---------|
| 1       | 4%      |
| 2       | 5%      |
| 4       | 9%      |
| 6       | 9%      |
| 8       | 12%     |
| 12      | 18%     |
| 32      | 18%     |

(I realize "speedup of the default" is a bit confusing, but that's how hyperfine is printing results and my brain hurts trying to convert "N% speedup" into "M% slowdown" since IIRC it's not 1:1)

In that sense, as expected, the slowdown gets worse with more cores. Not that that changes anything fundamental here, but I figured I'd note it.
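On the conversion between the two framings: a minimal sketch, assuming hyperfine's "X times faster" figure is the ratio of the slower mean time to the faster mean time. The function names are made up for this example.

```rust
/// `ratio` = slow_time / fast_time, e.g. 1.23 for "1.23 times faster".
/// How much longer the slower configuration takes, relative to the faster one.
fn extra_time_pct(ratio: f64) -> f64 {
    (ratio - 1.0) * 100.0
}

/// How much less time the faster configuration takes, relative to the slower one.
fn time_saved_pct(ratio: f64) -> f64 {
    (1.0 - 1.0 / ratio) * 100.0
}

fn main() {
    let ratio = 1.23;
    println!("slower config takes {:.1}% more time", extra_time_pct(ratio));
    println!("faster config takes {:.1}% less time", time_saved_pct(ratio));
}
```

For the initial 1.23 ratio this gives roughly 23% more time for -Cinlining=intrinsics, or equivalently about 18.7% less time for the default, which is why the two percentages are not interchangeable.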

view this post on Zulip Wasmtime GitHub notifications bot (May 06 2026 at 18:43):

alexcrichton commented on issue #13254:

We talked about this topic in today's Cranelift meeting and the conclusion was:

The final part here is deferred for future work. In the meantime https://github.com/bytecodealliance/wasmtime/pull/13300 is intended to be the solution for wasip3

view this post on Zulip Wasmtime GitHub notifications bot (May 06 2026 at 18:43):

alexcrichton added the cranelift:goal:compile-time label to Issue #13254.

view this post on Zulip Wasmtime GitHub notifications bot (May 06 2026 at 20:01):

fitzgen commented on issue #13254:

Longer-term the ideal would be to be able to enable inlining without such a large performance hit. This will require something akin to an async executor where each function's compilation will have synchronization points to wait for inlinable callees to have CLIF and then additionally wait for a function to be inlined into callers. This has open questions about how best to orchestrate it, how much the synchronization overhead will be, and how to handle issues like loops in the call graph to avoid deadlock.

An additional wrinkle with any non-bottom-up/SCC approach that Chris and I identified after the meeting is that we can get $O(n^2)$ code bloat in call graph chains. Consider this call graph:

a -> b -> c -> d -> e -> f -> g

Without doing bottom-up/SCC-based inlining, and instead making inlining decisions independently and without synchronization for each function, post-inlining we could end up with:


Last updated: Jun 01 2026 at 09:49 UTC