alexcrichton opened issue #13254:
In the interest of turning on inlining by default, one of the things we identified as desirable to do ahead of time was to investigate the compile-time performance impact of enabling inlining. This analysis is done with https://github.com/bytecodealliance/wasmtime/pull/13250 as a base, notably where the intent is for `-Cinlining=intrinsics` to become the default. The tests here were done on app.wasm.gz, a hello-world componentize-py application.
Initially, `-Cinlining=no` is 23% faster than `-Cinlining=intrinsics`.

<details>
Benchmark 1: ./baseline compile app.wasm -o/dev/null
  Time (mean ± σ):     498.1 ms ±  33.2 ms    [User: 7217.8 ms, System: 307.1 ms]
  Range (min … max):   464.9 ms … 552.3 ms    10 runs

Benchmark 2: ./baseline compile app.wasm -Cinlining=intrinsics -o/dev/null
  Time (mean ± σ):     612.3 ms ±  12.4 ms    [User: 7309.4 ms, System: 999.1 ms]
  Range (min … max):   587.1 ms … 627.2 ms    10 runs

Summary
  ./baseline compile app.wasm -o/dev/null ran
    1.23 ± 0.09 times faster than ./baseline compile app.wasm -Cinlining=intrinsics -o/dev/null

</details>
I dug more into this and did a few data structure optimizations which didn't really move the needle all that much. The main optimization opportunity I found was to prune the edges of the call graph that we create to only include edges where the callee can ever be inlined. This is an easy deduction for `-Cinlining=intrinsics` because we just look for `FuncKey::UnsafeIntrinsic`. This change reduced the number of strata layers from 39 to 1, which is the precise number of layers to expect if we only want to inline intrinsics.
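To illustrate the idea, here is a minimal sketch of that pruning (not Wasmtime's actual internals: `FuncKey::UnsafeIntrinsic` is from the description above, but the other variant, types, and functions are hypothetical stand-ins):

```rust
// Hypothetical model of call-graph pruning: keep only edges whose callee
// could ever be inlined under the current policy.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum FuncKey {
    DefinedWasmFunction(u32), // hypothetical variant for ordinary wasm functions
    UnsafeIntrinsic(u32),
}

/// Under `-Cinlining=intrinsics` only intrinsics are ever inlined, so any
/// edge to a non-intrinsic callee never needs to appear in the call graph.
fn callee_can_be_inlined(callee: FuncKey) -> bool {
    matches!(callee, FuncKey::UnsafeIntrinsic(_))
}

/// Build a call graph that keeps only edges whose callee might be inlined.
fn pruned_call_graph(raw_edges: &[(FuncKey, FuncKey)]) -> HashMap<FuncKey, Vec<FuncKey>> {
    let mut graph: HashMap<FuncKey, Vec<FuncKey>> = HashMap::new();
    for &(caller, callee) in raw_edges {
        if callee_can_be_inlined(callee) {
            graph.entry(caller).or_default().push(callee);
        }
    }
    graph
}

fn main() {
    let edges = [
        (FuncKey::DefinedWasmFunction(0), FuncKey::UnsafeIntrinsic(1)),
        (FuncKey::DefinedWasmFunction(0), FuncKey::DefinedWasmFunction(2)),
        (FuncKey::DefinedWasmFunction(2), FuncKey::DefinedWasmFunction(3)),
    ];
    // Only the intrinsic edge survives, so stratifying the graph needs a
    // single layer rather than one layer per call-chain depth.
    println!("{:?}", pruned_call_graph(&edges));
}
```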
With this change, plus the minor data structure optimizations, `-Cinlining=no` is still 16% faster than `-Cinlining=intrinsics`.

<details>
Benchmark 1: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null
  Time (mean ± σ):     515.4 ms ±  32.5 ms    [User: 7209.5 ms, System: 306.6 ms]
  Range (min … max):   468.7 ms … 558.4 ms    10 runs

Benchmark 2: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -Cinlining=intrinsics -o/dev/null
  Time (mean ± σ):     597.7 ms ±  20.4 ms    [User: 7266.1 ms, System: 1006.8 ms]
  Range (min … max):   570.9 ms … 629.0 ms    10 runs

Summary
  ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null ran
    1.16 ± 0.08 times faster than ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -Cinlining=intrinsics -o/dev/null

</details>
With further investigation I think this is a fundamental tradeoff that we have no realistic way of avoiding with the current architecture. To showcase this I modified the "compile without inlining" branch of Wasmtime to do the compilation in two steps: the first step runs `f(compiler)` in parallel, and the next step performs `finish_compiling` in parallel. With this change, I found that the single-parallel-loop approach is 17% faster than the two-parallel-loop approach.
<details>
(the hack here is that the "two parallel loops" path is conditional on the `A` env var being present; I didn't make this a formal CLI option or anything like that)

Benchmark 1: A=1 ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null
  Time (mean ± σ):     581.0 ms ±  19.6 ms    [User: 7073.9 ms, System: 1033.0 ms]
  Range (min … max):   556.8 ms … 614.8 ms    10 runs

Benchmark 2: ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null
  Time (mean ± σ):     494.6 ms ±  18.7 ms    [User: 7238.7 ms, System: 298.8 ms]
  Range (min … max):   467.7 ms … 531.1 ms    10 runs

Summary
  ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null ran
    1.17 ± 0.06 times faster than A=1 ./target/x86_64-unknown-linux-gnu/release/wasmtime compile app.wasm -o/dev/null

</details>
So, in essence, what I'm finding is that after specializing the call graph to `-Cinlining=intrinsics`, the slowdown is the same as if inlining didn't happen at all but a "join point" were still present in compilation (there isn't one today because there's no need for it).
This is why I have reached the conclusion that this is a fundamental tradeoff right now. Compilation without inlining has no necessary synchronization between functions, so all functions can be compiled 100% in parallel. This lack of synchronization means we get to keep CPUs nice and busy the entire time. With inlining, however, we fundamentally have a synchronization point: for A to consider inlining B, B has to at least finish being translated. The architecture currently is to translate everything in parallel, then perform inlining, then optimize/codegen.
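As a rough mental model of that architecture (hypothetical names again, not the real compilation driver), the join point is the moment the fully-collected translations become available to the inliner:

```rust
// Hypothetical sketch of the translate -> inline -> optimize/codegen pipeline
// described above; none of these types or functions are Wasmtime's real API.
use rayon::prelude::*;

#[derive(Clone)]
struct Translated(usize);
struct MachineCode(usize);

fn translate(index: usize) -> Translated {
    Translated(index)
}

// Inlining a function may need to read the translated body of any callee, so
// it cannot run until *every* translation has finished.
fn inline(func: &Translated, _all_translations: &[Translated]) -> Translated {
    func.clone()
}

fn optimize_and_codegen(func: Translated) -> MachineCode {
    MachineCode(func.0)
}

fn compile_with_inlining(indices: &[usize]) -> Vec<MachineCode> {
    // Phase 1: translate everything in parallel.
    let translated: Vec<Translated> =
        indices.par_iter().map(|&i| translate(i)).collect();

    // Join point: nothing below starts until the slowest translation is done.

    // Phases 2 and 3: inline against the full set of translations, then
    // optimize/codegen, again in parallel.
    translated
        .par_iter()
        .map(|func| optimize_and_codegen(inline(func, &translated)))
        .collect()
}

fn main() {
    let indices: Vec<usize> = (0..8).collect();
    println!("compiled {} functions", compile_with_inlining(&indices).len());
}
```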
My hypothesis is that this "join point" drastically cuts the amount of parallelization that's possible. Most "big modules" end up having basically one function that takes forever to compile. The nice part today is that while that big function is compiling, all the little functions can get done in parallel. With this join point, however, we cut the amount of parallelism by causing everything to wait.
For example, in the above wasm, I see `wasm[31]::function[254]::__wasm_apply_data_relocs` take 33ms to translate and 274ms to optimize/codegen. The next largest function is `wasm[31]::function[3488]::_PyUnicode_InitStaticStrings`, clocking in at 15ms/93ms. The functions then rapidly decrease from there. If the long function gets stuck towards the end of compilation then it means that CPUs are going to be much more idle than they were previously since we're holding up work; that one function's 274ms of codegen is more than half of the roughly 500ms of total wall-clock compile time.
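One way to see the effect in the numbers already posted above is to divide user time by wall-clock time from the two-parallel-loop experiment, which gives the average number of busy cores in each configuration (nothing here beyond the hyperfine figures is assumed):

```rust
// Average core utilization implied by the benchmark above (user / wall time).
fn main() {
    let single_loop = 7238.7_f64 / 494.6; // ≈ 14.6 cores busy on average
    let two_loops = 7073.9_f64 / 581.0;   // ≈ 12.2 cores busy on average
    println!("single parallel loop: {single_loop:.1} cores busy on average");
    println!("two parallel loops:   {two_loops:.1} cores busy on average");
}
```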
Overall I'm not sure what the best conclusion here is. Other possible schemes for synchronizing the inlining decision feel way more complicated than the current "just loop twice", making me hesitant. Given that, I wanted to open up an issue here. Do others have thoughts on this? For example is this a compile-time cost we're willing to eat? Is this so unacceptable we can never turn inlining on by default? Other ideas?