fitzgen opened PR #11630 from fitzgen:organize-compiled-func-locs-by-func-key to bytecodealliance:main:
This treats compiled functions homogeneously, removing the need to add new metadata tables to places like `CompiledModuleInfo` whenever we add a new kind of function, and simplifying the process of constructing the metadata for a final, linked compilation artifact. This also paves the way to doing gc-sections in our linking (getting smaller code sizes by removing functions that have been inlined into every caller, etc.) as we no longer assume that certain types of function index spaces are dense.

This does, however, replace a couple of operations that were previously O(1) table lookups with O(log n) binary searches. And, notably, some of these are on the `VMFuncRef`-creation path, and therefore on the force-initialization-of-a-lazy-funcref-table-slot path, when we look up a Wasm function and its trampolines. Our call-indirect micro-benchmarks show that indirect-calling every funcref once in a table of 64Ki slots went from taking ~2.6ms to ~3.8ms (a +46% slowdown). Note that this edge case is both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path. All other call-indirect benchmarks are within the noise, which is what we would expect.

Also, the size of `.cwasm`s is slightly larger: `spidermonkey.wasm`'s `.cwasm` size went from 19750632 bytes to 19785872 bytes, which is a 0.178% increase.

Ultimately, I believe that the simplification, and the possibility of doing gc-sections in the future, is worth the downsides. That said, if others feel differently, there are some things we could try to improve the situation, although most things I can think of off the top of my head (e.g. LEB128s and delta encoding, or making certain `FuncKey` kinds' index spaces dense) will improve one of code size or lookup times while pessimizing the other. I'm sure we could come up with something given enough effort, though.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [144.14 µs 145.26 µs 146.56 µs]
        thrpt:  [447.15 Melem/s 451.15 Melem/s 454.68 Melem/s]
 change:
        time:   [−5.5066% −3.6611% −1.9130%] (p = 0.00 < 0.05)
        thrpt:  [+1.9503% +3.8002% +5.8275%]
        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [3.8128 ms 3.8433 ms 3.8763 ms]
        thrpt:  [16.907 Melem/s 17.052 Melem/s 17.188 Melem/s]
 change:
        time:   [+43.064% +46.066% +49.080%] (p = 0.00 < 0.05)
        thrpt:  [−32.922% −31.538% −30.101%]
        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [130.27 µs 131.66 µs 133.40 µs]
        thrpt:  [491.26 Melem/s 497.75 Melem/s 503.09 Melem/s]
 change:
        time:   [−6.4798% −4.1871% −1.8965%] (p = 0.00 < 0.05)
        thrpt:  [+1.9332% +4.3701% +6.9288%]
        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [176.22 µs 178.49 µs 180.99 µs]
        thrpt:  [362.10 Melem/s 367.18 Melem/s 371.90 Melem/s]
 change:
        time:   [−18.431% −15.397% −12.330%] (p = 0.00 < 0.05)
        thrpt:  [+14.064% +18.200% +22.595%]
        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```
</details>
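The O(1)-table-lookup versus O(log n)-binary-search tradeoff described above can be sketched roughly as follows. This is a minimal, illustrative sketch, not Wasmtime's actual code: `FuncKey`'s real definition differs, and `FunctionLoc`, `lookup_dense`, and `lookup_sorted` are hypothetical names.

```rust
/// A compiled function's location within the linked artifact
/// (illustrative stand-in, not Wasmtime's actual type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct FunctionLoc {
    start: u32,
    length: u32,
}

/// Illustrative stand-in for a `FuncKey`: which kind of function
/// (i.e. which index space) plus the index within that space.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FuncKey {
    WasmFunc(u32),
    ArrayToWasmTrampoline(u32),
}

/// Old style: one dense side table per kind of function; O(1) lookup,
/// but a new table must be threaded through the metadata for each new
/// kind of function.
fn lookup_dense(table: &[FunctionLoc], index: u32) -> FunctionLoc {
    table[index as usize]
}

/// New style: a single table of all functions, sorted by key; O(log n)
/// lookup, but all kinds of functions are handled homogeneously and the
/// index space may contain holes.
fn lookup_sorted(table: &[(FuncKey, FunctionLoc)], key: FuncKey) -> Option<FunctionLoc> {
    table
        .binary_search_by_key(&key, |(k, _)| *k)
        .ok()
        .map(|i| table[i].1)
}
```

Sorting the combined table once, at artifact-construction time, by `FuncKey`'s ordering is what keeps the binary search correct at runtime.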
fitzgen requested cfallin for a review on PR #11630.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #11630.
fitzgen requested alexcrichton for a review on PR #11630.
fitzgen requested wasmtime-core-reviewers for a review on PR #11630.
fitzgen edited PR #11630:
This treats compiled functions homogeneously, removing the need to add new metadata tables to places like `CompiledModuleInfo` whenever we add a new kind of function, and simplifying the process of constructing the metadata for a final, linked compilation artifact. This also paves the way to doing gc-sections in our linking (getting smaller code sizes by removing functions that have been inlined into every caller, etc.) as we no longer assume that certain types of function index spaces are dense.

This does, however, replace a couple of operations that were previously O(1) table lookups with O(log n) binary searches. And, notably, some of these are on the `VMFuncRef`-creation path, and therefore on the force-initialization-of-a-lazy-funcref-table-slot path, when we look up a Wasm function and its trampolines. Our call-indirect micro-benchmarks show that indirect-calling every funcref once in a table of 64Ki slots went from taking ~2.6ms to ~3.8ms (a +46% slowdown). Note that this edge case is both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path. All other call-indirect benchmarks are within the noise, which is what we would expect.

Also, the size of `.cwasm`s is slightly larger: `spidermonkey.wasm`'s `.cwasm` size went from 19_750_632 bytes to 19_785_872 bytes, which is a 0.178% increase.

Ultimately, I believe that the simplification, and the possibility of doing gc-sections in the future, is worth the downsides. That said, if others feel differently, there are some things we could try to improve the situation, although most things I can think of off the top of my head (e.g. LEB128s and delta encoding, or making certain `FuncKey` kinds' index spaces dense) will improve one of code size or lookup times while pessimizing the other. I'm sure we could come up with something given enough effort, though.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [144.14 µs 145.26 µs 146.56 µs]
        thrpt:  [447.15 Melem/s 451.15 Melem/s 454.68 Melem/s]
 change:
        time:   [−5.5066% −3.6611% −1.9130%] (p = 0.00 < 0.05)
        thrpt:  [+1.9503% +3.8002% +5.8275%]
        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [3.8128 ms 3.8433 ms 3.8763 ms]
        thrpt:  [16.907 Melem/s 17.052 Melem/s 17.188 Melem/s]
 change:
        time:   [+43.064% +46.066% +49.080%] (p = 0.00 < 0.05)
        thrpt:  [−32.922% −31.538% −30.101%]
        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [130.27 µs 131.66 µs 133.40 µs]
        thrpt:  [491.26 Melem/s 497.75 Melem/s 503.09 Melem/s]
 change:
        time:   [−6.4798% −4.1871% −1.8965%] (p = 0.00 < 0.05)
        thrpt:  [+1.9332% +4.3701% +6.9288%]
        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [176.22 µs 178.49 µs 180.99 µs]
        thrpt:  [362.10 Melem/s 367.18 Melem/s 371.90 Melem/s]
 change:
        time:   [−18.431% −15.397% −12.330%] (p = 0.00 < 0.05)
        thrpt:  [+14.064% +18.200% +22.595%]
        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```
</details>
fitzgen updated PR #11630.
fitzgen updated PR #11630.
github-actions[bot] commented on PR #11630:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "wasmtime:api", "winch"

Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
alexcrichton created PR review comment:
Could this expand on what a `None` case means?

(Also, if it's purely to be able to use `Module::default`, I think it'd be ok to remove that construction method.)
alexcrichton submitted PR review:
Really nice how this turned out, thanks for pushing on this!
One more benchmark, though, before merging, I now realize: component instantiation. IIRC we confirmed core wasm instantiation was largely unaffected by this, but reading over this I'm remembering that component instantiation, when it hits various initializers for trampolines/builtins/etc., will do the `FuncKey` lookup now. Can you make a simple-ish `spidermonkey.wasm` component and compare before/after instantiation numbers?
fitzgen submitted PR review.
fitzgen created PR review comment:
Addressed in https://github.com/bytecodealliance/wasmtime/pull/11694
fitzgen updated PR #11630.
fitzgen commented on PR #11630:
@alexcrichton mind taking another look at this? Redid a bunch of stuff so that there is actually a (very small) code size improvement now, the various index spaces have nice newtypes, and lookups into dense index spaces are O(1) again.
fitzgen edited PR #11630:
This commit refactors our metadata, treating compiled functions homogeneously and removing the need to add new tables to places like `CompiledModuleInfo` whenever we add a new kind of function. This also simplifies the process of constructing the metadata for a final, linked compilation artifact. Finally, it paves the way to doing gc-sections during our linking process (which would give us smaller code sizes by removing functions that have been inlined into every caller, for example) as we now allow holes in certain types of function index spaces that were previously always densely populated.

We have two kinds of index spaces:

- Mostly-dense index spaces, which take O(max_index) space and provide O(1) lookups.
- Sparse index spaces, which take O(num_members) space and provide O(log n) lookups.

Most of our function index spaces are currently dense, but we can tweak that in the future if necessary.

Furthermore, the code size of `.cwasm` binaries has shrunk very slightly with this refactoring. Consider `spidermonkey.wasm`'s compiled `.cwasm`:

- Size before: 218756 `.wasmtime.info` section bytes, 20052632 total bytes
- Size after: 213761 `.wasmtime.info` section bytes, 20047640 total bytes

That is a 2.28% reduction in the size of the `.wasmtime.info` section, or a 0.025% reduction in total.

However, we previously did a single metadata lookup to get the locations of both a Wasm function itself and its array-to-Wasm trampoline, while in the new version of the code two lookups are performed. This is slightly slower, as shown in our call-indirect micro-benchmark that combines lazy table initialization (which delays looking up a function element's location until runtime) with indirect-calling each table element exactly once (which defeats the amortization of that lookup). This micro-benchmark is thus both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, only the force-initialization-of-a-lazy-funcref-table-slot path.

Ultimately, I believe that the simplification is worth the regression in this micro-benchmark.

<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
        time:   [152.77 µs 154.92 µs 157.39 µs]
        thrpt:  [416.40 Melem/s 423.04 Melem/s 428.99 Melem/s]
 change:
        time:   [−13.749% −10.205% −6.2864%] (p = 0.00 < 0.05)
        thrpt:  [+6.7081% +11.365% +15.941%]
        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
        time:   [4.3564 ms 4.4641 ms 4.5843 ms]
        thrpt:  [14.296 Melem/s 14.681 Melem/s 15.044 Melem/s]
 change:
        time:   [+38.134% +44.404% +50.927%] (p = 0.00 < 0.05)
        thrpt:  [−33.743% −30.750% −27.606%]
        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

call-indirect/same-callee/table-init-strict/65536-calls
        time:   [144.91 µs 148.41 µs 152.02 µs]
        thrpt:  [431.10 Melem/s 441.58 Melem/s 452.24 Melem/s]
 change:
        time:   [−13.665% −10.470% −7.2626%] (p = 0.00 < 0.05)
        thrpt:  [+7.8313% +11.694% +15.828%]
        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
        time:   [195.18 µs 200.67 µs 206.49 µs]
        thrpt:  [317.38 Melem/s 326.59 Melem/s 335.77 Melem/s]
 change:
        time:   [−15.936% −11.568% −7.0835%] (p = 0.00 < 0.05)
        thrpt:  [+7.6235% +13.081% +18.957%]
        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
```
</details>
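The two kinds of index spaces the PR description distinguishes, mostly-dense (O(max_index) space, O(1) lookup, holes allowed) versus sparse (O(num_members) space, O(log n) lookup), can be sketched roughly as follows. This is a hedged, minimal sketch under assumed names: `IndexSpace` and its `get` method are illustrative, not Wasmtime's actual types.

```rust
/// Illustrative sketch of two index-space representations
/// (not Wasmtime's actual types).
enum IndexSpace<T> {
    /// Mostly-dense: O(max_index) space, O(1) lookup. `None` entries are
    /// holes, e.g. left behind when gc-sections removes a function.
    Dense(Vec<Option<T>>),
    /// Sparse: O(num_members) space, O(log n) lookup. The pairs must be
    /// kept sorted by index for the binary search to be correct.
    Sparse(Vec<(u32, T)>),
}

impl<T: Copy> IndexSpace<T> {
    /// Look up the value at `index`, if any.
    fn get(&self, index: u32) -> Option<T> {
        match self {
            // Direct indexing; out-of-range indices and holes both yield None.
            IndexSpace::Dense(v) => v.get(index as usize).copied().flatten(),
            // Binary search over the sorted (index, value) pairs.
            IndexSpace::Sparse(v) => v
                .binary_search_by_key(&index, |(i, _)| *i)
                .ok()
                .map(|pos| v[pos].1),
        }
    }
}
```

The space/time tradeoff falls out directly: a dense space pays for every possible index up to the maximum (even holes), while a sparse space pays only per member but gives up constant-time access.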
alexcrichton submitted PR review.
fitzgen merged PR #11630.
Last updated: Dec 06 2025 at 07:03 UTC