cfallin edited PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
(Updated after refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated after refactor)
cfallin submitted PR review.
cfallin created PR review comment:
This was borrowing-related, but I avoided it in the refactor by putting the `register_*` methods on the `Store` directly so it can pass in the `&Engine` from its `&Arc`, separately borrowed -- updated in refactor.
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
The lack of alignment passed in to `deserialize_raw` is generally problematic, but I think I'd prefer to address this in a separate PR -- (i) this is a change to public invariants/APIs, and (ii) this is a separate topic and existing deficit that we'll want to think a little carefully about, I think!
cfallin submitted PR review.
cfallin created PR review comment:
Yep, I think we'll want to think carefully about this in its own right -- if you don't mind, could we defer alignment-related things (since it appears to be a pre-existing issue as well)?
cfallin updated PR #12051.
cfallin commented on PR #12051:
OK, I've now refactored in the direction we discussed above -- I feel a lot better about the types ensuring we're not leaking the wrong kind of function pointers (i.e. not executing the private copy) now!
I've updated the initial comment based on the new comment message above -- please see for some of the caveats and reasoning. Let me know what you think!
alexcrichton submitted PR review:
To me this all looks great, thanks for pushing on the refactor!
> this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized

I didn't run across this myself, so I wanted to clarify -- is this still present, or did this end up getting resolved during refactoring?
alexcrichton created PR review comment:
Similar to below, mind lifting the `let` out of the `unsafe`?
alexcrichton created PR review comment:
Could this be `#[cfg(feature = "debug")]` to help optimize this enum representation on debug-less builds? (Or is that too much `#[cfg]` noise for too little benefit?)
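The suggestion can be sketched roughly like this -- a hypothetical `CodeSource` enum (illustrative names, not wasmtime's actual types), where the debug-only variant is compiled out entirely when the feature is disabled, letting the remaining variants pack tighter:

```rust
// Hypothetical sketch: a feature-gated enum variant that only exists
// on builds with the (assumed) "debug" Cargo feature enabled.
enum CodeSource {
    // Shared, engine-wide code image (stand-in payload).
    Shared(u32),
    // Compiled in only when the "debug" feature is on; without it,
    // the enum needs no discriminant space for this variant.
    #[cfg(feature = "debug")]
    PrivateCopy { base: u32, len: u32 },
}

fn describe(src: &CodeSource) -> u32 {
    match src {
        CodeSource::Shared(n) => *n,
        // Match arms can be cfg-gated the same way as variants.
        #[cfg(feature = "debug")]
        CodeSource::PrivateCopy { base, .. } => *base,
    }
}

fn main() {
    let src = CodeSource::Shared(7);
    assert_eq!(describe(&src), 7);
    println!("size = {} bytes", std::mem::size_of::<CodeSource>());
}
```

On a debug-less build the single-variant enum can shrink to the size of its payload, which is the representation win being asked about.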
alexcrichton created PR review comment:
This should probably have been done prior, but mind lifting the `let` out of the `unsafe` to make it clear that the `unsafe` is just for the `get_exported_func`?
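The refactor being requested, as a minimal sketch (the types and `get_exported_func` signature here are stand-ins, not wasmtime's real API):

```rust
// Hypothetical stand-in for an instance holding raw function pointers.
struct Instance {
    funcs: Vec<usize>,
}

impl Instance {
    /// Pretend this is unsafe because it hands out a raw pointer.
    unsafe fn get_exported_func(&self, index: usize) -> usize {
        self.funcs[index]
    }
}

fn before(instance: &Instance) -> usize {
    // The `let` lives inside the `unsafe` block, obscuring which
    // operation actually needs it.
    unsafe {
        let func = instance.get_exported_func(0);
        func
    }
}

fn after(instance: &Instance) -> usize {
    // The `let` is lifted out; `unsafe` now wraps only the unsafe call.
    let func = unsafe { instance.get_exported_func(0) };
    func
}

fn main() {
    let instance = Instance { funcs: vec![42] };
    assert_eq!(before(&instance), 42);
    assert_eq!(after(&instance), 42);
}
```

Keeping the `unsafe` block minimal makes audits easier: the reader sees at a glance that only the one call carries a safety obligation.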
alexcrichton created PR review comment:
As a follow-up and/or pending issue, it might be nice to only clone the text section here, as opposed to anything else, since everything else in the module is irrelevant for the cloning purpose. Modules can often be 2x larger than the text section for the `.wasmtime.addrmap` section alone, so we might have some nice wins cutting down on the size. My hope is that this'd be a pretty easy refactor to tack on in the future, though, where the only caveat would be to redirect some accesses of metadata to the `EngineCode` layer, if any, rather than using it through `StoreCode` (and this may already very well be solved entirely in the refactorings of this PR).
alexcrichton created PR review comment:
Could this check for `module.engine().config().guest_debug` and return an error if it's enabled? And/or somehow gracefully fail later on, or something like that?
alexcrichton created PR review comment:
As a note here, this is pretty cool that the `get_foo_and_bar` pattern worked here (and everywhere else in this PR) without needing any `unsafe` anywhere. It's definitely a drag that we need it at all, but hopefully future Rust features will come to save us one day...
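For readers unfamiliar with the pattern being praised, here is a minimal sketch (illustrative names, not wasmtime's actual `Store`): Rust's borrow checker can't see that two `&mut` accessor methods on the same struct touch disjoint fields, so a combined accessor returns both borrows at once, splitting them in a single call without `unsafe`:

```rust
// Hypothetical stand-in for a store with two independent sub-parts.
struct Store {
    gc_store: Vec<u8>,
    registry: Vec<u8>,
}

impl Store {
    fn gc_store_mut(&mut self) -> &mut Vec<u8> {
        &mut self.gc_store
    }
    fn registry_mut(&mut self) -> &mut Vec<u8> {
        &mut self.registry
    }

    // Holding both borrows through the methods above at once is
    // rejected (each borrows all of `self`), so the combined accessor
    // splits the borrow in one call -- no `unsafe` required.
    fn gc_store_and_registry_mut(&mut self) -> (&mut Vec<u8>, &mut Vec<u8>) {
        (&mut self.gc_store, &mut self.registry)
    }
}

fn main() {
    let mut store = Store { gc_store: vec![], registry: vec![] };
    // Sequential one-at-a-time borrows are fine:
    store.gc_store_mut().push(0);
    store.registry_mut().push(0);
    // But to hold both simultaneously, use the combined accessor:
    let (gc, reg) = store.gc_store_and_registry_mut();
    gc.push(1);
    reg.push(2);
    assert_eq!(store.gc_store, vec![0, 1]);
    assert_eq!(store.registry, vec![0, 2]);
}
```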
alexcrichton created PR review comment:
Could this be replaced with the equivalent that `Module` is using, rather than deleting it? Some options for handling this, though, are:

- Documenting that this may return misleading results with `guest_debug`.
- Changing it to return an `Option` which is `None` if `guest_debug` is enabled.

As an introspection utility for embedders, though, I think this is something we'll want to keep.
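The second option could look something like this sketch (hypothetical `Config`/`Module` types and field names, not wasmtime's API):

```rust
// Hypothetical types sketching the "return None under guest_debug"
// option, so the introspection result can't be misleading.
struct Config {
    guest_debug: bool,
}

struct Module {
    config: Config,
    text_len: usize,
}

impl Module {
    /// Size of the executable text, or `None` when `guest_debug` is
    /// enabled and the number could misrepresent the code actually run.
    fn text_size(&self) -> Option<usize> {
        if self.config.guest_debug {
            None
        } else {
            Some(self.text_len)
        }
    }
}

fn main() {
    let plain = Module { config: Config { guest_debug: false }, text_len: 4096 };
    let debug = Module { config: Config { guest_debug: true }, text_len: 4096 };
    assert_eq!(plain.text_size(), Some(4096));
    assert_eq!(debug.text_size(), None);
}
```

An `Option` return pushes the "this may be misleading" caveat into the type signature rather than the docs, at the cost of a slightly noisier call site.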
alexcrichton created PR review comment:
Since this is on the hot path of instantiation, can you rerun the instantiation benchmarks we have in-repo (or some interesting subset of them) to see the impact of the extra maps here?
alexcrichton created PR review comment:
To help resolve this, could this return an error if `guest_debug` is enabled? (And then have a fixme-pointing-to-issue to resolve it in the future if necessary.)
alexcrichton created PR review comment:
As a small change, could this return an error for `ExternallyOwned` and then use the host page size for alignment for `MmapVec::Mmap`?
fitzgen submitted PR review.
fitzgen created PR review comment:
Yeah for sure, was not suggesting anything that needs to be dealt with here and now.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Sure, done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Yeah, definitely! It got a little silly with `optional_gc_store_and_registry_and_instance_mut`, but, well, what happens in no-partial-borrows-land stays in no-partial-borrows-land...
cfallin updated PR #12051.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Yep, great point -- filed #12104 and added a comment here.
cfallin updated PR #12051.
cfallin created PR review comment:
Ah, yep, I think I removed this without remembering what I had done for the `Module` case; over there I added some comments describing that it may not represent all code actually executed.
cfallin submitted PR review.
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Done! Filed #12105.
cfallin submitted PR review.
cfallin created PR review comment:
Sure -- looks like about a 3% hit on a small module, sequential instantiation:

```
% cargo bench --bench instantiation -- sequential/pooling/wasi.wasm

[ baseline run ]
sequential/pooling/wasi.wasm
        time:   [1.4748 µs 1.4770 µs 1.4795 µs]

[ run with this PR's changes ]
sequential/pooling/wasi.wasm
        time:   [1.5173 µs 1.5184 µs 1.5196 µs]
        change: [+2.9095% +3.0865% +3.2436%] (p = 0.00 < 0.05)
        Performance has regressed.
```

Since this is constant overhead per module, the operative number is the extra ~40 ns; that seems OK to me?
cfallin commented on PR #12051:
> [indirect function calls]
>
> I didn't run across this myself, so I wanted to clarify -- is this still present or did this end up getting resolved during refactoring?

I was worried about the additional overhead of the `EngineCode` -> `StoreCode` lookup in `Instance::get_func_ref`, but that's one `BTreeMap` lookup, in a `BTreeMap` likely to have only one or a few entries, and only once per table slot invoked, so perhaps not too bad? Actually, let's benchmark -- Sightglass says:

```
$ RUST_LOG=info taskset 1 target/release/sightglass-cli benchmark -e ../wasmtime/old.so -e ../wasmtime/new.so --iterations-per-process 5 --processes 2 benchmarks/default.suite
[ ... ]
execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [71002288 72007774.20 72896266] new.so
  [70196726 71271162.70 73367632] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [449806273 459920466.40 477878694] new.so
  [455205740 461003924.50 470907233] old.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [4736966 4816051.60 5011779] new.so
  [4713058 4819706.60 5088233] old.so
```
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Hmm, actually, this seems to cause a test failure: the registration always happens, so we can't simply throw an error. I think maybe the best we can do (short of a refactor that carries the `ProfilingAgent` through to the point of instantiation, not just module loading, and registers each instantiation) is to have the caveat that `perf record` will not have registered profiling info for debugging-private copies of code. This may be fine (given an issue to track)?
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin commented on PR #12051:
OK, I think this is ready for another look -- thanks @alexcrichton!
fitzgen commented on PR #12051:
> Actually, let's benchmark -- Sightglass says:

2 processes x 5 iterations per process = 10 total samples per group. You can't generally do significance tests with less than 30 samples per group (and you probably want more than that for the `cycles` measure, since it is so noisy). If you're going to benchmark, then you should do it properly :-p

Also, fwiw, we have some call-indirect-specific micro-benchmarks here which might be better for measuring this specific thing.
fitzgen edited a comment on PR #12051:
> Actually, let's benchmark -- Sightglass says:

2 processes x 5 iterations per process = 10 total samples per group. You can't generally do significance tests with less than 30 samples per group (and you probably want more than that for the `cycles` measure, since it is so noisy). If you're going to benchmark, then you should do it properly :-p

If the amount of time taken to run the benchmarks was a concern, then you could pass `--benchmark-phase execution` to avoid re-compiling on each iteration.

Also, fwiw, we have some call-indirect-specific micro-benchmarks here which might be better for measuring this specific thing.
cfallin commented on PR #12051:
OK, cool, I'll rerun this in a bit. Indeed, I only had a small sliver of time last night and didn't want to wait hours for results. Good to know that one can reuse compilations, thanks.
My intent with using the real-program benchmarks rather than microbenchmarks was to measure the effect of call-indirects in proper context, but I can do the microbenchmarks too.
cfallin commented on PR #12051:
Here's the data with 100 data points:

```
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 50870.29 ± 39152.71 (confidence = 99%)
  old.so is 1.00x to 1.02x faster than new.so!
  [4490103 4689894.33 5160645] new.so
  [4502186 4639024.04 4995740] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3902443.50 ± 1435000.11 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new.so!
  [452564035 458920590.84 468422822] new.so
  [447394704 455018147.34 464078661] old.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [69507823 70879740.15 73834440] new.so
  [69568496 71123992.19 74071155] old.so
```

So a slight slowdown (1%) in SpiderMonkey, for example, which makes heavier use of virtual calls.
I might have some ideas for refactoring the `ModuleRegistry::store_code` lookup to return a thin wrapper over `EngineCode` when we aren't configured to do private cloning, which would eliminate the extra lookup -- let me see if that helps...
cfallin commented on PR #12051:
A little more experimentation: I made a small tweak so that we don't do the `StoreCode` lookup from `EngineCode` when debugging is not enabled, but rather directly use the `EngineCode`'s image. (Bikeshed the safety type wrappers, maybe, but the point is the perf experiment.)
It turns out that this doesn't help:
`old.so` is baseline, `new.so` is this PR, `new2.so` is the above commit:

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3293570.64 ± 801397.15 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new2.so!
  [444994186 457634102.82 472147181] new2.so
  [444580053 454340532.18 464774917] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 1711875.58 ± 601308.43 (confidence = 99%)
  new.so is 1.00x to 1.01x faster than new2.so!
  [443090662 452795699.65 464339929] new.so
  [445620051 454507575.23 465311385] new2.so
```

(Aside: Sightglass has a bug where, with `new.so` and `new2.so`, it strips the common prefix and reports results for `.so` and `2.so`; I manually edited the above. No energy to go fix that at the moment.)

So perhaps the ~1% overhead is in the additional plumbing in general on the table-funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) into debug and no-debug versions, I think. Maybe we eat the 1%. What do you think?
cfallin edited a comment on PR #12051:
A little more experimentation: I made a small tweak so that we don't do the `StoreCode` lookup from `EngineCode` when debugging is not enabled, but rather directly use the `EngineCode`'s image. (Bikeshed the safety type wrappers, maybe, but the point is the perf experiment.)
It turns out that this doesn't help:
`old.so` is baseline, `new.so` is this PR, `new2.so` is the above commit (two separate pairwise runs; 10 processes and 50 data points per process):

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3293570.64 ± 801397.15 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new2.so!
  [444994186 457634102.82 472147181] new2.so
  [444580053 454340532.18 464774917] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 1711875.58 ± 601308.43 (confidence = 99%)
  new.so is 1.00x to 1.01x faster than new2.so!
  [443090662 452795699.65 464339929] new.so
  [445620051 454507575.23 465311385] new2.so
```

(Aside: Sightglass has a bug where, with `new.so` and `new2.so`, it strips the common prefix and reports results for `.so` and `2.so`; I manually edited the above. No energy to go fix that at the moment.)

So perhaps the ~1% overhead is in the additional plumbing in general on the table-funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) into debug and no-debug versions, I think. Maybe we eat the 1%. What do you think?
fitzgen commented on PR #12051:
> didn't want to wait hours for results

If I do a full run of `default.suite`, it takes 27 mins on my old ThinkPad -- probably faster for you on a newer MBP, fwiw. Also faster if doing `--benchmark-phase execution`, or if measuring instructions retired, where you can drop to a small handful of iterations. But yeah, 27 mins is not great.

> So perhaps the ~1% overhead is in the additional plumbing in general on the table funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) on debug and no-debug versions I think. Maybe we eat the 1%. What do you think?
This is only the case for lazy table init, right? If it isn't lazy then we'd still be hitting these branches, but at instantiation time instead, and we would be doing them all at once, so presumably the branch predictor would be more useful in that case, no?
If this is only for lazy table init, then yeah seems fine by me. Although maybe Alex has more ideas here. I originally was thinking that the module registry could possibly contain a hash map instead of a btree map for the new member, but it sounds like the costs are not related to the data structure.
Slightly more concerning if this is a 1% slowdown regardless of lazy table init...
cfallin commented on PR #12051:
> This is only the case for lazy table init, right?

It seems so, yeah -- flipping the default to eager table init in tunables, I get (100 data points, 10 processes x 10 per process):

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [453510422 463605372.90 509169192] new-eager-init.so
  [453961879 467730247.66 531244575] old-eager-init.so
```
alexcrichton submitted PR review:
Code all looks good to me, thanks for the updates!
Questions on perf though, so it sounds like:
- Instantiation-wise this is measured as a 3% regression.
- `call_indirect`-wise it's measured as "no meaningful change".
Is that right?
I would view this as adding relatively low-hanging fruit to the "perf tree" to pick in the future, and to me it's mostly a question of whether we pick that fruit here or not. The extra data structures can likely be optimized with fewer lookups/insertions or something like that when debugging is enabled, is my thinking.
We're also not really measuring heap impact (IIRC B-Trees are relatively bad for that?) which might be another consideration here. Basically I would be surprised if these exact data structures lived as-is for a long period of time past perf testing and such. I'd be more comfortable if the addition of debugging-related features didn't impact the data structures when debugging wasn't turned on, but it's a pretty minor impact here.
Basically I'm ok with this personally, but I think if we land this as-is it's worth opening an issue describing the possible optimization opportunity. More-or-less it feels like we shouldn't need 4 btree insertions when a single module is instantiated in a store in the fast path. That feels like it's pretty overkill relative to "lightweight instantiation". I realize we probably already have 3 insertions or something like that today, but this is basically a ripe area for optimization in the future if someone's interested.
cfallin commented on PR #12051:
Yep, that's right, and I'm happy to file an issue for followup. The refactor I did above to remove the `StoreCode` lookup for the `get_func_ref` path could probably be carried further, to actually not materialize a true `StoreCode` at instantiation (instead building a view of text/range in `ModuleWithCode` when queried) in the non-debugging case; that should let us get back to the status quo. It may also be worth optimizing the one-module case (?) -- BTree lookups in a tree of one entry are fast, but there's still the allocation and initialization cost.
cfallin commented on PR #12051:
Followup filed at #12111.
cfallin merged PR #12051.
Last updated: Dec 06 2025 at 06:05 UTC