fitzgen opened PR #11630 from fitzgen:organize-compiled-func-locs-by-func-key to bytecodealliance:main:
This treats compiled functions homogeneously, removing the need to add new metadata tables to places like `CompiledModuleInfo` whenever we add a new kind of function, and simplifying the process of constructing the metadata for a final, linked compilation artifact. This also paves the way to doing gc-sections in our linking (getting smaller code sizes by removing functions that have been inlined into every caller, etc.) as we no longer assume that certain types of function index spaces are dense.

This does, however, replace a couple of operations that were previously O(1) table lookups with O(log n) binary searches. And, notably, some of these are on the `VMFuncRef`-creation path, and therefore on the force-initialization-of-a-lazy-funcref-table-slot path, when we look up a Wasm function and its trampolines. Our call-indirect micro-benchmarks show that indirect-calling every funcref once in a table of 64Ki slots went from taking ~2.6ms to ~3.8ms (a +46% slowdown). Note that this edge case is both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path. All other call-indirect benchmarks are within the noise, which is what we would expect.

Also, the size of `.cwasm`s is slightly larger: `spidermonkey.wasm`'s `.cwasm` size went from 19750632 bytes to 19785872 bytes, which is a 0.178% increase.

Ultimately, I believe that the simplification, and the possibility of doing gc-sections in the future, is worth the downsides. That said, if others feel differently, there are some things we could try to improve the situation, although most things I can think of off the top of my head (e.g. LEB128s and delta encoding, or making certain `FuncKey` kinds' index spaces dense) will improve one of code size or lookup times while pessimizing the other. I'm sure we could come up with something given enough effort, though.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [144.14 µs 145.26 µs 146.56 µs]
        thrpt:  [447.15 Melem/s 451.15 Melem/s 454.68 Melem/s]
 change:
        time:   [−5.5066% −3.6611% −1.9130%] (p = 0.00 < 0.05)
        thrpt:  [+1.9503% +3.8002% +5.8275%]
        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [3.8128 ms 3.8433 ms 3.8763 ms]
        thrpt:  [16.907 Melem/s 17.052 Melem/s 17.188 Melem/s]
 change:
        time:   [+43.064% +46.066% +49.080%] (p = 0.00 < 0.05)
        thrpt:  [−32.922% −31.538% −30.101%]
        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [130.27 µs 131.66 µs 133.40 µs]
        thrpt:  [491.26 Melem/s 497.75 Melem/s 503.09 Melem/s]
 change:
        time:   [−6.4798% −4.1871% −1.8965%] (p = 0.00 < 0.05)
        thrpt:  [+1.9332% +4.3701% +6.9288%]
        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [176.22 µs 178.49 µs 180.99 µs]
        thrpt:  [362.10 Melem/s 367.18 Melem/s 371.90 Melem/s]
 change:
        time:   [−18.431% −15.397% −12.330%] (p = 0.00 < 0.05)
        thrpt:  [+14.064% +18.200% +22.595%]
        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```
</details>
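The O(1)-table-lookup versus O(log n)-binary-search tradeoff described above can be sketched roughly as follows. This is a minimal, illustrative sketch, not Wasmtime's actual code: `FuncKey`'s real definition differs, and `FunctionLoc`, `lookup_dense`, and `lookup_sorted` are hypothetical names.

```rust
/// A compiled function's location within the linked artifact
/// (illustrative stand-in, not Wasmtime's actual type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct FunctionLoc {
    start: u32,
    length: u32,
}

/// Illustrative stand-in for a `FuncKey`: which kind of function
/// (i.e. which index space) plus the index within that space.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FuncKey {
    WasmFunc(u32),
    ArrayToWasmTrampoline(u32),
}

/// Old style: one dense side table per kind of function; O(1) lookup,
/// but a new table must be threaded through the metadata for each new
/// kind of function.
fn lookup_dense(table: &[FunctionLoc], index: u32) -> FunctionLoc {
    table[index as usize]
}

/// New style: a single table of all functions, sorted by key; O(log n)
/// lookup, but all kinds of functions are handled homogeneously and the
/// index space may contain holes.
fn lookup_sorted(table: &[(FuncKey, FunctionLoc)], key: FuncKey) -> Option<FunctionLoc> {
    table
        .binary_search_by_key(&key, |(k, _)| *k)
        .ok()
        .map(|i| table[i].1)
}
```

Sorting the combined table once, at artifact-construction time, by `FuncKey`'s ordering is what keeps the binary search correct at runtime.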
fitzgen requested cfallin for a review on PR #11630.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #11630.
fitzgen requested alexcrichton for a review on PR #11630.
fitzgen requested wasmtime-core-reviewers for a review on PR #11630.
fitzgen edited PR #11630:
This treats compiled functions homogeneously, removing the need to add new metadata tables to places like `CompiledModuleInfo` whenever we add a new kind of function, and simplifying the process of constructing the metadata for a final, linked compilation artifact. This also paves the way to doing gc-sections in our linking (getting smaller code sizes by removing functions that have been inlined into every caller, etc.) as we no longer assume that certain types of function index spaces are dense.

This does, however, replace a couple of operations that were previously O(1) table lookups with O(log n) binary searches. And, notably, some of these are on the `VMFuncRef`-creation path, and therefore on the force-initialization-of-a-lazy-funcref-table-slot path, when we look up a Wasm function and its trampolines. Our call-indirect micro-benchmarks show that indirect-calling every funcref once in a table of 64Ki slots went from taking ~2.6ms to ~3.8ms (a +46% slowdown). Note that this edge case is both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path. All other call-indirect benchmarks are within the noise, which is what we would expect.

Also, the size of `.cwasm`s is slightly larger: `spidermonkey.wasm`'s `.cwasm` size went from 19_750_632 bytes to 19_785_872 bytes, which is a 0.178% increase.

Ultimately, I believe that the simplification, and the possibility of doing gc-sections in the future, is worth the downsides. That said, if others feel differently, there are some things we could try to improve the situation, although most things I can think of off the top of my head (e.g. LEB128s and delta encoding, or making certain `FuncKey` kinds' index spaces dense) will improve one of code size or lookup times while pessimizing the other. I'm sure we could come up with something given enough effort, though.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [144.14 µs 145.26 µs 146.56 µs]
        thrpt:  [447.15 Melem/s 451.15 Melem/s 454.68 Melem/s]
 change:
        time:   [−5.5066% −3.6611% −1.9130%] (p = 0.00 < 0.05)
        thrpt:  [+1.9503% +3.8002% +5.8275%]
        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [3.8128 ms 3.8433 ms 3.8763 ms]
        thrpt:  [16.907 Melem/s 17.052 Melem/s 17.188 Melem/s]
 change:
        time:   [+43.064% +46.066% +49.080%] (p = 0.00 < 0.05)
        thrpt:  [−32.922% −31.538% −30.101%]
        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [130.27 µs 131.66 µs 133.40 µs]
        thrpt:  [491.26 Melem/s 497.75 Melem/s 503.09 Melem/s]
 change:
        time:   [−6.4798% −4.1871% −1.8965%] (p = 0.00 < 0.05)
        thrpt:  [+1.9332% +4.3701% +6.9288%]
        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [176.22 µs 178.49 µs 180.99 µs]
        thrpt:  [362.10 Melem/s 367.18 Melem/s 371.90 Melem/s]
 change:
        time:   [−18.431% −15.397% −12.330%] (p = 0.00 < 0.05)
        thrpt:  [+14.064% +18.200% +22.595%]
        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```
</details>
fitzgen updated PR #11630.
fitzgen updated PR #11630.
github-actions[bot] commented on PR #11630:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "wasmtime:api", "winch"

Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
alexcrichton created PR review comment:
Could this expand on what a `None` case means?

(Also, if it's purely to be able to use `Module::default`, I think it'd be ok to remove that construction method.)
alexcrichton submitted PR review:
Really nice how this turned out, thanks for pushing on this!
One more benchmark, though, before merging, I now realize: component instantiation. IIRC we confirmed core wasm instantiation was largely unaffected by this, but reading over this I'm remembering that component instantiation, when it hits various initializers for trampolines/builtins/etc., will do the `FuncKey` lookup now. Can you make a simple-ish `spidermonkey.wasm` component and compare before/after instantiation numbers?
fitzgen submitted PR review.
fitzgen created PR review comment:
Addressed in https://github.com/bytecodealliance/wasmtime/pull/11694
fitzgen updated PR #11630.
fitzgen commented on PR #11630:
@alexcrichton mind taking another look at this? Redid a bunch of stuff so that there is actually a (very small) code size improvement now, the various index spaces have nice newtypes, and lookups into dense index spaces are O(1) again.
fitzgen edited PR #11630:
This commit refactors our metadata, treating compiled functions homogeneously and removing the need to add new tables to places like `CompiledModuleInfo` whenever we add a new kind of function. This also simplifies the process of constructing the metadata for a final, linked compilation artifact. Finally, it paves the way to doing gc-sections during our linking process (which would give us smaller code sizes by removing functions that have been inlined into every caller, for example) as we now allow holes in certain types of function index spaces that were previously always densely populated.

We have two kinds of index spaces:

- Mostly-dense index spaces, which take O(max_index) space and provide O(1) lookups.
- Sparse index spaces, which take O(num_members) space and provide O(log n) lookups.

Most of our function index spaces are currently dense, but we can tweak that in the future if necessary.

Furthermore, the code size of `.cwasm` binaries has shrunk very slightly with this refactoring. Consider `spidermonkey.wasm`'s compiled `.cwasm`:

- Size before: 218756 `.wasmtime.info` section bytes, 20052632 total bytes
- Size after: 213761 `.wasmtime.info` section bytes, 20047640 total bytes

That is a 2.28% reduction in the size of the `.wasmtime.info` section, or a 0.025% reduction in total.

However, we previously did a single metadata lookup to get the locations of both a Wasm function itself and its array-to-Wasm trampoline, while in the new version of the code two lookups are performed. This is slightly slower, as shown in our call-indirect micro-benchmark that combines lazy table initialization (which delays looking up a function element's location until runtime) with indirect-calling each table element exactly once (which defeats the amortization of that lookup). This micro-benchmark is thus both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path.

Ultimately, I believe that the simplification is worth the regression in this micro-benchmark.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [152.77 µs 154.92 µs 157.39 µs]
        thrpt:  [416.40 Melem/s 423.04 Melem/s 428.99 Melem/s]
 change:
        time:   [−13.749% −10.205% −6.2864%] (p = 0.00 < 0.05)
        thrpt:  [+6.7081% +11.365% +15.941%]
        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [4.3564 ms 4.4641 ms 4.5843 ms]
        thrpt:  [14.296 Melem/s 14.681 Melem/s 15.044 Melem/s]
 change:
        time:   [+38.134% +44.404% +50.927%] (p = 0.00 < 0.05)
        thrpt:  [−33.743% −30.750% −27.606%]
        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [144.91 µs 148.41 µs 152.02 µs]
        thrpt:  [431.10 Melem/s 441.58 Melem/s 452.24 Melem/s]
 change:
        time:   [−13.665% −10.470% −7.2626%] (p = 0.00 < 0.05)
        thrpt:  [+7.8313% +11.694% +15.828%]
        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [195.18 µs 200.67 µs 206.49 µs]
        thrpt:  [317.38 Melem/s 326.59 Melem/s 335.77 Melem/s]
 change:
        time:   [−15.936% −11.568% −7.0835%] (p = 0.00 < 0.05)
        thrpt:  [+7.6235% +13.081% +18.957%]
        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
```
</details>
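The two kinds of index spaces the PR description distinguishes, mostly-dense (O(max_index) space, O(1) lookup, holes allowed) versus sparse (O(num_members) space, O(log n) lookup), can be sketched roughly as follows. This is a hedged, minimal sketch under assumed names: `IndexSpace` and its `get` method are illustrative, not Wasmtime's actual types.

```rust
/// Illustrative sketch of two index-space representations
/// (not Wasmtime's actual types).
enum IndexSpace<T> {
    /// Mostly-dense: O(max_index) space, O(1) lookup. `None` entries are
    /// holes, e.g. left behind when gc-sections removes a function.
    Dense(Vec<Option<T>>),
    /// Sparse: O(num_members) space, O(log n) lookup. The pairs must be
    /// kept sorted by index for the binary search to be correct.
    Sparse(Vec<(u32, T)>),
}

impl<T: Copy> IndexSpace<T> {
    /// Look up the value at `index`, if any.
    fn get(&self, index: u32) -> Option<T> {
        match self {
            // Direct indexing; out-of-range indices and holes both yield None.
            IndexSpace::Dense(v) => v.get(index as usize).copied().flatten(),
            // Binary search over the sorted (index, value) pairs.
            IndexSpace::Sparse(v) => v
                .binary_search_by_key(&index, |(i, _)| *i)
                .ok()
                .map(|pos| v[pos].1),
        }
    }
}
```

The space/time tradeoff falls out directly: a dense space pays for every possible index up to the maximum (even holes), while a sparse space pays only per member but gives up constant-time access.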
alexcrichton submitted PR review.
fitzgen merged PR #11630.
Last updated: Dec 06 2025 at 07:03 UTC