cfallin edited PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
(Updated after refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated after refactor)
cfallin submitted PR review.
cfallin created PR review comment:
This was borrowing-related, but I avoided it in the refactor by putting the `register_*` methods on the `Store` directly so it can pass in the `&Engine` from its `&Arc`, separately borrowed -- updated in refactor.
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin submitted PR review.
cfallin created PR review comment:
(Updated in refactor)
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
The lack of alignment passed in to `deserialize_raw` is generally problematic, but I think I'd prefer to address this in a separate PR -- (i) this is a change to public invariants/APIs, and (ii) this is a separate topic and existing deficit that we'll want to think a little carefully about, I think!
cfallin submitted PR review.
cfallin created PR review comment:
Yep, I think we'll want to think carefully about this in its own right -- if you don't mind, could we defer alignment-related things (since it appears to be a pre-existing issue as well)?
cfallin updated PR #12051.
cfallin commented on PR #12051:
OK, I've now refactored in the direction we discussed above -- I feel a lot better about the types ensuring we're not leaking the wrong kind of function pointers (i.e. not executing the private copy) now!
I've updated the initial comment based on the new comment message above -- please see for some of the caveats and reasoning. Let me know what you think!
alexcrichton submitted PR review:
To me this all looks great, thanks for pushing on the refactor!
> this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized

I didn't run across this myself, so I wanted to clarify -- is this still present, or did this end up getting resolved during refactoring?
alexcrichton created PR review comment:
Similar to below, mind lifting the `let` out of the `unsafe`?
alexcrichton created PR review comment:
Could this be `#[cfg(feature = "debug")]` to help optimize this enum representation on debug-less builds? (Or is that too much `#[cfg]` noise for too little benefit?)
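The suggestion can be sketched roughly like this -- a hypothetical `CodeSource` enum (illustrative names, not wasmtime's actual types), where the debug-only variant is compiled out entirely when the feature is disabled, letting the remaining variants pack tighter:

```rust
// Hypothetical sketch: a feature-gated enum variant that only exists
// on builds with the (assumed) "debug" Cargo feature enabled.
enum CodeSource {
    // Shared, engine-wide code image (stand-in payload).
    Shared(u32),
    // Compiled in only when the "debug" feature is on; without it,
    // the enum needs no discriminant space for this variant.
    #[cfg(feature = "debug")]
    PrivateCopy { base: u32, len: u32 },
}

fn describe(src: &CodeSource) -> u32 {
    match src {
        CodeSource::Shared(n) => *n,
        // Match arms can be cfg-gated the same way as variants.
        #[cfg(feature = "debug")]
        CodeSource::PrivateCopy { base, .. } => *base,
    }
}

fn main() {
    let src = CodeSource::Shared(7);
    assert_eq!(describe(&src), 7);
    println!("size = {} bytes", std::mem::size_of::<CodeSource>());
}
```

On a debug-less build the single-variant enum can shrink to the size of its payload, which is the representation win being asked about.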
alexcrichton created PR review comment:
This should probably have been done prior, but mind lifting the `let` out of the `unsafe` to make it clear that the `unsafe` is just for the `get_exported_func`?
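The refactor being requested, as a minimal sketch (the types and `get_exported_func` signature here are stand-ins, not wasmtime's real API):

```rust
// Hypothetical stand-in for an instance holding raw function pointers.
struct Instance {
    funcs: Vec<usize>,
}

impl Instance {
    /// Pretend this is unsafe because it hands out a raw pointer.
    unsafe fn get_exported_func(&self, index: usize) -> usize {
        self.funcs[index]
    }
}

fn before(instance: &Instance) -> usize {
    // The `let` lives inside the `unsafe` block, obscuring which
    // operation actually needs it.
    unsafe {
        let func = instance.get_exported_func(0);
        func
    }
}

fn after(instance: &Instance) -> usize {
    // The `let` is lifted out; `unsafe` now wraps only the unsafe call.
    let func = unsafe { instance.get_exported_func(0) };
    func
}

fn main() {
    let instance = Instance { funcs: vec![42] };
    assert_eq!(before(&instance), 42);
    assert_eq!(after(&instance), 42);
}
```

Keeping the `unsafe` block minimal makes audits easier: the reader sees at a glance that only the one call carries a safety obligation.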
alexcrichton created PR review comment:
As a follow-up and/or pending issue, it might be nice to only clone the text section here, as opposed to anything else, since everything else in the module is irrelevant for the cloning purpose. Modules can often be 2x larger than the text section for the `.wasmtime.addrmap` section alone, so we might have some nice wins cutting down on the size. My hope is that this'd be a pretty easy refactor to tack on in the future, though, where the only caveat would be to redirect some accesses of metadata to the `EngineCode` layer, if any, rather than using it through `StoreCode` (and this may already very well be solved entirely in the refactorings of this PR).
alexcrichton created PR review comment:
Could this check for `module.engine().config().guest_debug` and return an error if it's enabled? And/or somehow gracefully fail later on, or something like that?
alexcrichton created PR review comment:
As a note here, this is pretty cool that the `get_foo_and_bar` pattern worked here (and everywhere else in this PR) without needing any `unsafe` anywhere. It's definitely a drag that we need it at all, but hopefully future Rust features will come to save us one day...
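For readers unfamiliar with the pattern being praised, here is a minimal sketch (illustrative names, not wasmtime's actual `Store`): Rust's borrow checker can't see that two `&mut` accessor methods on the same struct touch disjoint fields, so a combined accessor returns both borrows at once, splitting them in a single call without `unsafe`:

```rust
// Hypothetical stand-in for a store with two independent sub-parts.
struct Store {
    gc_store: Vec<u8>,
    registry: Vec<u8>,
}

impl Store {
    fn gc_store_mut(&mut self) -> &mut Vec<u8> {
        &mut self.gc_store
    }
    fn registry_mut(&mut self) -> &mut Vec<u8> {
        &mut self.registry
    }

    // Holding both borrows through the methods above at once is
    // rejected (each borrows all of `self`), so the combined accessor
    // splits the borrow in one call -- no `unsafe` required.
    fn gc_store_and_registry_mut(&mut self) -> (&mut Vec<u8>, &mut Vec<u8>) {
        (&mut self.gc_store, &mut self.registry)
    }
}

fn main() {
    let mut store = Store { gc_store: vec![], registry: vec![] };
    // Sequential one-at-a-time borrows are fine:
    store.gc_store_mut().push(0);
    store.registry_mut().push(0);
    // But to hold both simultaneously, use the combined accessor:
    let (gc, reg) = store.gc_store_and_registry_mut();
    gc.push(1);
    reg.push(2);
    assert_eq!(store.gc_store, vec![0, 1]);
    assert_eq!(store.registry, vec![0, 2]);
}
```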
alexcrichton created PR review comment:
Could this be replaced with the equivalent that `Module` is using, rather than deleting it? Some options for handling this, though, are:

- Documenting that this may return misleading results with `guest_debug`.
- Changing it to return an `Option` which is `None` if `guest_debug` is enabled.

As an introspection utility for embedders, though, I think this is something we'll want to keep.
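The second option could look something like this sketch (hypothetical `Config`/`Module` types and field names, not wasmtime's API):

```rust
// Hypothetical types sketching the "return None under guest_debug"
// option, so the introspection result can't be misleading.
struct Config {
    guest_debug: bool,
}

struct Module {
    config: Config,
    text_len: usize,
}

impl Module {
    /// Size of the executable text, or `None` when `guest_debug` is
    /// enabled and the number could misrepresent the code actually run.
    fn text_size(&self) -> Option<usize> {
        if self.config.guest_debug {
            None
        } else {
            Some(self.text_len)
        }
    }
}

fn main() {
    let plain = Module { config: Config { guest_debug: false }, text_len: 4096 };
    let debug = Module { config: Config { guest_debug: true }, text_len: 4096 };
    assert_eq!(plain.text_size(), Some(4096));
    assert_eq!(debug.text_size(), None);
}
```

An `Option` return pushes the "this may be misleading" caveat into the type signature rather than the docs, at the cost of a slightly noisier call site.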
alexcrichton created PR review comment:
Since this is on the hot path of instantiation, can you rerun the instantiation benchmarks we have in-repo (or some interesting subset of them) to see the impact of the extra maps here?
alexcrichton created PR review comment:
To help resolve this, could this return an error if `guest_debug` is enabled? (And then have a fixme-pointing-to-issue to resolve it in the future if necessary.)
alexcrichton created PR review comment:
As a small change, could this return an error for `ExternallyOwned` and then use the host page size for alignment for `MmapVec::Mmap`?
fitzgen submitted PR review.
fitzgen created PR review comment:
Yeah for sure, was not suggesting anything that needs to be dealt with here and now.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin created PR review comment:
Sure, done!
cfallin submitted PR review.
cfallin created PR review comment:
Done!
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Yeah, definitely! It got a little silly with `optional_gc_store_and_registry_and_instance_mut`, but, well, what happens in no-partial-borrows-land stays in no-partial-borrows-land...
cfallin updated PR #12051.
cfallin created PR review comment:
Done!
cfallin submitted PR review.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Yep, great point -- filed #12104 and added a comment here.
cfallin updated PR #12051.
cfallin created PR review comment:
Ah, yep, I think I removed this without remembering what I had done for the `Module` case; over there I added some comments describing that it may not represent all code actually executed.
cfallin submitted PR review.
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Done! Filed #12105.
cfallin submitted PR review.
cfallin created PR review comment:
Sure -- looks like about a 3% hit on a small module, sequential instantiation:

```
% cargo bench --bench instantiation -- sequential/pooling/wasi.wasm

[ baseline run ]
sequential/pooling/wasi.wasm
        time:   [1.4748 µs 1.4770 µs 1.4795 µs]

[ run with this PR's changes ]
sequential/pooling/wasi.wasm
        time:   [1.5173 µs 1.5184 µs 1.5196 µs]
        change: [+2.9095% +3.0865% +3.2436%] (p = 0.00 < 0.05)
        Performance has regressed.
```

Since this is constant overhead per module, the operative number is the extra ~40 ns; that seems OK to me?
cfallin commented on PR #12051:
> [indirect function calls]
>
> I didn't run across this myself, so I wanted to clarify -- is this still present or did this end up getting resolved during refactoring?

I was worried about the additional overhead of the `EngineCode` -> `StoreCode` lookup in `Instance::get_func_ref`, but that's one `BTreeMap` lookup, in a `BTreeMap` likely to have only one or a few entries, and only once per table slot invoked, so perhaps not too bad? Actually, let's benchmark -- Sightglass says:

```
$ RUST_LOG=info taskset 1 target/release/sightglass-cli benchmark -e ../wasmtime/old.so -e ../wasmtime/new.so --iterations-per-process 5 --processes 2 benchmarks/default.suite
[ ... ]
execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [71002288 72007774.20 72896266] new.so
  [70196726 71271162.70 73367632] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [449806273 459920466.40 477878694] new.so
  [455205740 461003924.50 470907233] old.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [4736966 4816051.60 5011779] new.so
  [4713058 4819706.60 5088233] old.so
```
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin submitted PR review.
cfallin created PR review comment:
Hmm, actually, this seems to cause a test failure: the registration always happens, so we can't simply throw an error. I think maybe the best we can do (short of a refactor that carries the `ProfilingAgent` through to the point of instantiation, not just module loading, and registers each instantiation) is to have the caveat that `perf record` will not have registered profiling info for debugging-private copies of code. This may be fine (given an issue to track)?
cfallin updated PR #12051.
cfallin updated PR #12051.
cfallin commented on PR #12051:
OK, I think this is ready for another look -- thanks @alexcrichton!
fitzgen commented on PR #12051:
> Actually, let's benchmark -- Sightglass says:

2 processes x 5 iterations per process = 10 total samples per group. You can't generally do significance tests with less than 30 samples per group (and you probably want more than that for the `cycles` measure, since it is so noisy). If you're going to benchmark, then you should do it properly :-p

Also, fwiw, we have some call-indirect-specific micro-benchmarks here which might be better for measuring this specific thing.
fitzgen edited a comment on PR #12051:
> Actually, let's benchmark -- Sightglass says:

2 processes x 5 iterations per process = 10 total samples per group. You can't generally do significance tests with less than 30 samples per group (and you probably want more than that for the `cycles` measure, since it is so noisy). If you're going to benchmark, then you should do it properly :-p

If the amount of time taken to run the benchmarks was a concern, then you could pass `--benchmark-phase execution` to avoid re-compiling on each iteration.

Also, fwiw, we have some call-indirect-specific micro-benchmarks here which might be better for measuring this specific thing.
cfallin commented on PR #12051:
OK, cool, I'll rerun this in a bit. Indeed, I only had a small sliver of time last night and didn't want to wait hours for results. Good to know that one can reuse compilations, thanks.
My intent with using the real-program benchmarks rather than microbenchmarks was to measure the effect of call-indirects in proper context, but I can do the microbenchmarks too.
cfallin commented on PR #12051:
Here's the data with 100 data points:

```
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 50870.29 ± 39152.71 (confidence = 99%)
  old.so is 1.00x to 1.02x faster than new.so!
  [4490103 4689894.33 5160645] new.so
  [4502186 4639024.04 4995740] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3902443.50 ± 1435000.11 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new.so!
  [452564035 458920590.84 468422822] new.so
  [447394704 455018147.34 464078661] old.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [69507823 70879740.15 73834440] new.so
  [69568496 71123992.19 74071155] old.so
```

So a slight slowdown (1%) in SpiderMonkey, for example, which makes heavier use of virtual calls.
I might have some ideas for refactoring the `ModuleRegistry::store_code` lookup to return a thin wrapper over `EngineCode` when we aren't configured to do private cloning, which would eliminate the extra lookup -- let me see if that helps...
cfallin commented on PR #12051:
A little more experimentation: I made a small tweak so that we don't do the `StoreCode` lookup from `EngineCode` when debugging is not enabled, but rather directly use the `EngineCode`'s image. (Bikeshed the safety type wrappers, maybe, but the point is the perf experiment.)
It turns out that this doesn't help:
`old.so` is baseline, `new.so` is this PR, `new2.so` is the above commit:

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3293570.64 ± 801397.15 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new2.so!
  [444994186 457634102.82 472147181] new2.so
  [444580053 454340532.18 464774917] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 1711875.58 ± 601308.43 (confidence = 99%)
  new.so is 1.00x to 1.01x faster than new2.so!
  [443090662 452795699.65 464339929] new.so
  [445620051 454507575.23 465311385] new2.so
```

(Aside: Sightglass has a bug where, with `new.so` and `new2.so`, it strips the common prefix and reports results for `.so` and `2.so`; I manually edited the above. No energy to go fix that at the moment.)

So perhaps the ~1% overhead is in the additional plumbing in general on the table-funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) into debug and no-debug versions, I think. Maybe we eat the 1%. What do you think?
cfallin edited a comment on PR #12051:
A little more experimentation: I made a small tweak so that we don't do the `StoreCode` lookup from `EngineCode` when debugging is not enabled, but rather directly use the `EngineCode`'s image. (Bikeshed the safety type wrappers, maybe, but the point is the perf experiment.)
It turns out that this doesn't help:
`old.so` is baseline, `new.so` is this PR, `new2.so` is the above commit (two separate pairwise runs; 10 processes and 50 data points per process):

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3293570.64 ± 801397.15 (confidence = 99%)
  old.so is 1.01x to 1.01x faster than new2.so!
  [444994186 457634102.82 472147181] new2.so
  [444580053 454340532.18 464774917] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 1711875.58 ± 601308.43 (confidence = 99%)
  new.so is 1.00x to 1.01x faster than new2.so!
  [443090662 452795699.65 464339929] new.so
  [445620051 454507575.23 465311385] new2.so
```

(Aside: Sightglass has a bug where, with `new.so` and `new2.so`, it strips the common prefix and reports results for `.so` and `2.so`; I manually edited the above. No energy to go fix that at the moment.)

So perhaps the ~1% overhead is in the additional plumbing in general on the table-funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) into debug and no-debug versions, I think. Maybe we eat the 1%. What do you think?
fitzgen commented on PR #12051:
> didn't want to wait hours for results

If I do a full run of `default.suite`, it takes 27 mins on my old ThinkPad -- probably faster for you on a newer MBP, fwiw. Also faster if doing `--benchmark-phase execution`, or if measuring instructions retired, where you can drop to a small handful of iterations. But yeah, 27 mins is not great.

> So perhaps the ~1% overhead is in the additional plumbing in general on the table funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) on debug and no-debug versions I think. Maybe we eat the 1%. What do you think?
This is only the case for lazy table init, right? If it isn't lazy then we'd still be hitting these branches, but at instantiation time instead, and we would be doing them all at once, so presumably the branch predictor would be more useful in that case, no?
If this is only for lazy table init, then yeah seems fine by me. Although maybe Alex has more ideas here. I originally was thinking that the module registry could possibly contain a hash map instead of a btree map for the new member, but it sounds like the costs are not related to the data structure.
Slightly more concerning if this is a 1% slowdown regardless of lazy table init...
cfallin commented on PR #12051:
> This is only the case for lazy table init, right?

It seems so, yeah -- flipping the default to eager table init in tunables, I get (100 data points, 10 processes x 10 per process):

```
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [453510422 463605372.90 509169192] new-eager-init.so
  [453961879 467730247.66 531244575] old-eager-init.so
```
alexcrichton submitted PR review:
Code all looks good to me, thanks for the updates!
Questions on perf though, so it sounds like:
- Instantiation-wise this is measured as a 3% regression.
- `call_indirect`-wise it's measured as "no meaningful change".
Is that right?
I would view this as adding relatively low-hanging fruit to the "perf tree" to pick in the future, and to me it's mostly a question of whether we pick that fruit here or not. The extra data structures can likely be optimized with fewer lookups/insertions or something like that when debugging is enabled, is my thinking.
We're also not really measuring heap impact (IIRC B-Trees are relatively bad for that?) which might be another consideration here. Basically I would be surprised if these exact data structures lived as-is for a long period of time past perf testing and such. I'd be more comfortable if the addition of debugging-related features didn't impact the data structures when debugging wasn't turned on, but it's a pretty minor impact here.
Basically I'm ok with this personally, but I think if we land this as-is it's worth opening an issue describing the possible optimization opportunity. More-or-less it feels like we shouldn't need 4 btree insertions when a single module is instantiated in a store in the fast path. That feels like it's pretty overkill relative to "lightweight instantiation". I realize we probably already have 3 insertions or something like that today, but this is basically a ripe area for optimization in the future if someone's interested.
cfallin commented on PR #12051:
Yep, that's right, and I'm happy to file an issue for followup. The refactor I did above to remove the `StoreCode` lookup for the `get_func_ref` path could probably be carried further, to actually not materialize a true `StoreCode` at instantiation (instead building a view of text/range in `ModuleWithCode` when queried) in the non-debugging case; that should let us get back to the status quo. It may also be worth optimizing the one-module case (?) -- BTree lookups in a tree of one entry are fast, but there's still the allocation and initialization cost.
cfallin commented on PR #12051:
Followup filed at #12111.
cfallin merged PR #12051.
Last updated: Dec 06 2025 at 06:05 UTC