wasmtime / PR #13445 cranelift: per-table mutability trac... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #13445 cranelift: per-table mutability trac...

Wasmtime GitHub notifications bot (May 22 2026 at 02:25):

matthargett opened PR #13445 from rebeckerspecialties:table-mutability-tracking-upstream to bytecodealliance:main:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

constant-index dispatches lower to direct call F

sig check is elided when every elem in the table shares the call_indirect's sig

null check is elided when no precomputed slot is null

bounds check + per-dispatch bound load are elided on non-growable tables

the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.
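A minimal Rust sketch of how the gating described above could look. The `TableFacts` and `Elisions` types and the `elisions_for` function are hypothetical names for illustration, not Wasmtime's types, and the sketch assumes that "no mutation" already implies "cannot grow", so the table bound can be treated as a constant.

```rust
/// Hypothetical per-table facts a translator might collect (not Wasmtime's types).
#[derive(Debug, Clone)]
struct TableFacts {
    /// True if any table.set / table.fill / table.copy (as destination) /
    /// table.grow / table.init may target this table.
    mutated: bool,
    /// Precomputed contents: `Some(sig)` for a function with that signature
    /// id, `None` for a null slot.
    slots: Vec<Option<u32>>,
}

/// Which runtime checks a `call_indirect` through this table could skip.
#[derive(Debug, PartialEq, Eq, Default)]
struct Elisions {
    bound_load: bool,
    null_check: bool,
    sig_check: bool,
}

fn elisions_for(table: &TableFacts, callee_sig: u32) -> Elisions {
    // A table that may be mutated proves nothing: keep every check.
    if table.mutated {
        return Elisions::default();
    }
    Elisions {
        // Assumption: no mutation implies no table.grow, so the bound is constant.
        bound_load: true,
        // Every precomputed slot holds a function.
        null_check: table.slots.iter().all(|s| s.is_some()),
        // Every non-null slot has the signature this call site expects.
        sig_check: table.slots.iter().flatten().all(|&sig| sig == callee_sig),
    }
}
```

Calling `elisions_for` once per `call_indirect` site keeps the decision local to the site: the same table can allow a signature-check elision at one site and forbid it at another that expects a different signature.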

Why

This is the predicate-factory PR. Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).

The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.

Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.
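The tagged-null corner can be illustrated with a small Rust sketch. It assumes a tagging scheme in which bit 0 of a slot marks it as initialized and the remaining bits hold the function-reference address; the constants, the `Slot` enum, and `classify` are illustrative names, not taken from the PR.

```rust
/// Assumed tagging scheme: bit 0 marks a slot as initialized, the remaining
/// bits hold the function-reference address.
const INIT_BIT: usize = 1;
const PTR_MASK: usize = !INIT_BIT;

#[derive(Debug, PartialEq, Eq)]
enum Slot {
    /// Raw value 0: never touched, needs the lazy-init slow path.
    Uninitialized,
    /// Raw value 1: initialized, but holds a null reference.
    TaggedNull,
    /// A non-null function reference at this address.
    Func(usize),
}

fn classify(raw: usize) -> Slot {
    // Testing the masked value treats a tagged pointer and an untagged,
    // eagerly stored pointer the same way.
    let masked = raw & PTR_MASK;
    if masked != 0 {
        Slot::Func(masked)
    } else if raw & INIT_BIT != 0 {
        Slot::TaggedNull
    } else {
        Slot::Uninitialized
    }
}
```

Under these assumptions a raw value of 1 masks to zero just like an untouched slot does, which is why a null-check elision has to rule that pattern out separately.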

c1-7 vs c1-8 elision floor

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % Discarded-bucket increase on iPhone 12 Icestorm E-core PMU at N=3 without a wallclock improvement. Kept the c1-7 form (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested cfallin for a review on PR #13445.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13445.

Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-core-reviewers for a review on PR #13445.

Wasmtime GitHub notifications bot (May 22 2026 at 03:26):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (May 22 2026 at 04:36):

cfallin commented on PR #13445:

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

"This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-call_indirect cost" -- so it sounds like this is not just computing predicates but using them?

"c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.

"...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

Wasmtime GitHub notifications bot (May 22 2026 at 06:47):

matthargett commented on PR #13445:

> @matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.
> * "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-`call_indirect` cost" -- so it sounds like this is not just computing predicates but using them?

sorry, this is a term we used in the code for my product called BugScan back in 2003. in that context and this one, it means analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice.

This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications use those bits to elide some opcodes, based on the analysis proving that the predicates hold.

The reason there's a split between the PRs is that while some low-level CPU counters and profiler stats improve with this PR alone, wallclock/e2e results on my devices didn't show an above-noise uplift.

> * "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.

c here means commit, again sorry for the sourceforge-era shorthand. I tried to keep the commit stack clean so that each change and why it was necessary was discrete and easy to reason about one after the other (and not duplicate this information in the PR description). if it was one big commit, then some of the complexity around making the changes resilient to the fuzzer might seem unrelated or unnecessary.

the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.
> * "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

the "discarded" metric comes from the Xcode profiling tools, specifically xctrace's template for CPU bottlenecks. it counts how many branch mispredictions happen, and it's relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).

> These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, _especially_ before diving in to a 2k-line PR.

sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.

> (And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

I re-reviewed diffs in between each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house.

I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure.

Wasmtime GitHub notifications bot (May 22 2026 at 06:56):

matthargett edited PR #13445:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

constant-index dispatches lower to direct call F

sig check is elided when every elem in the table shares the call_indirect's sig

null check is elided when no precomputed slot is null

bounds check + per-dispatch bound load are elided on non-growable tables

the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).

The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.

Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (May 22 2026 at 07:02):

matthargett commented on PR #13445:

I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now.

Wasmtime GitHub notifications bot (May 22 2026 at 07:11):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label wasmtime:api on PR #13445.

Wasmtime GitHub notifications bot (May 23 2026 at 14:21):

alexcrichton commented on PR #13445:

Would you be amenable to splitting up this PR into separate PRs for each optimization applied here? The optimizations here look sound to me and reasonable to implement, but I find it a bit difficult to consider them all at once vs one-at-a-time. Some high-level comments I have on this are applicable to this area, for example:

It looks like there's a bunch of duplication between the various predicates added here. I think it'd be reasonable to have some sort of helper on Module for example to encapsulate most of this rather than duplicating all of it.

Testing here is something I'm a bit worried about. We don't typically test the internals of translation all that much because that can be quite brittle and difficult to update over time. The disas tests are good here, but I'd like to see more comprehensive runtime tests. For example these tests are quite suitable for the *.wast format I believe where tests could be done to ensure, for tables of a particular shape, that all runtime behavior is as expected.

The part about "some lazy init tables aren't actually lazily initialized" is something that I think I'd like to at least personally consider in more depth. That seems quite subtle and is something worth poring over for a bit, which I feel would be best as a separate, isolated, PR.

From a compilation perspective this is going to incur what I suspect is a nontrivial slowdown to parse the entire code section of a module just looking for table mutation instructions. What's implemented here does seem like the simplest solution, but I'd like to separately consider the cost of doing this. The code section of a wasm module is typically the longest to parse and this is using wasmparser's more inefficient API plus a lack of parallelism that otherwise happens today. I'd like to ideally consider this in isolation and consider sightglass benchmarks, for example, before committing to this.

At a high-level, as well, can you describe the shapes of modules that would benefit from these sorts of optimizations? For example I would expect constant-index dispatches to be optimized by the frontend, the entire table sharing one signature to be relatively rare, and the null-check not mattering much in practice since it's something where we just catch the fault. For the bounds-check I believe we already optimized fixed-size tables (statically fixed-size at least via the type) and the lazy-init changes I'm not totally sold on yet myself (e.g. would want to benchmark/analyze more). Overall it feels like a pretty slim shape of module that would fit within these constraints, but that doesn't mean that they're not important. The optimizations here are pretty easy to read and reason about, so that's why I'm curious to understand more about this use case.

Wasmtime GitHub notifications bot (Jun 10 2026 at 23:38):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jun 10 2026 at 23:44):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 07 2026 at 21:21):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 07 2026 at 21:27):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 10 2026 at 08:44):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 17 2026 at 05:35):

matthargett commented on PR #13445:

> Would you be amenable to splitting up this PR into separate PRs for each optimization applied here? The optimizations here look sound to me and reasonable to implement, but I find it a bit difficult to consider them all at once vs one-at-a-time.

if you're comfortable with merging individual pieces that may not have e2e benchmark uplift individually, that's fine by me :)

based on the changes and my fresh reading since i originally proposed this, it seems like dropping the constant-index to direct-call rewrite entirely is a good idea. you're right that frontends get there first. I checked my whole benchmark corpus (18 modules: Rust/LLVM, AssemblyScript, Porffor, Emscripten-built sqlite3) — of 912 call_indirect sites, zero have a constant index. they're all fed by loads (vtable slots) or locals. Binaryen's Directize pass already does this transform toolchain-side where it's already provable.
> * It looks like there's a bunch of duplication between the various predicates added here. I think it'd be reasonable to have some sort of helper on `Module` for example to encapsulate most of this rather than duplicating all of it.

ok, I see it, and it's slightly more than a refactor by my eye. the mutability bit currently lives on ModuleTranslation (compile-only), so consolidating means promoting it to a serialized field on Module next to table_initialization, with helpers like table_is_immutable(TableIndex) / static_funcref_image(TableIndex) replacing the three hand-written predicate chains in func_environ.rs. is that what you were thinking?
> * Testing here is something I'm a bit worried about. We don't typically test the internals of translation all that much because that can be quite brittle and difficult to update over time. The disas tests are good here, but I'd like to see more comprehensive runtime tests. For example these tests are quite suitable for the `*.wast` format I believe where tests could be done to ensure, for tables of a particular shape, that all runtime behavior is as expected.

no problem, I just found that in the devirtualization optimization that I helped shepherd into GCC in ~2010, by the time an e2e test failed, the individual pieces of plumbing that slowly drifted in multiple areas represented a nested problem where each fix risked other regressions. (GCC ended up disabling our test cases one by one, refusing to revert changes that verifiably caused regressions, and LLVM eventually caught up around ~2016 and blew past durably.)

looking closer at the coverage turned up real gaps: nothing in the tree (pre-existing or added here) runtime-tests table.grow-then-call_indirect into the grown region, or table.set of a null/wrong-signature entry followed by a dispatch that must trap. I've written a *.wast covering those must-not-fire shapes (including an exported table clobbered by a second module through the import, which is the wast-expressible analog of host mutation) — it passes on all four engine configs against this branch with the last commit I pushed. lmk if you have other test scenarios in mind.

something else I noticed: wasmparser already defines the shared-everything-threads table.atomic.set / table.atomic.rmw.* operators. They can't reach translation today (wasm_unsupported!), but the analysis will match them anyway so it can't silently go stale if that proposal lands. I was thinking/reaching ahead a bit, but a nice effect of how things are organized.
> * The part about "some lazy init tables aren't actually lazily initialized" is something that I think I'd like to at least personally consider in more depth. That seems quite subtle and is something worth poring over for a bit, which I feel would be best as a separate, isolated, PR.

ok.
> * From a compilation perspective this is going to incur what I suspect is a nontrivial slowdown to parse the entire code section of a module just looking for table mutation instructions. What's implemented here does seem like the simplest solution, but I'd like to separately consider the cost of doing this. The code section of a wasm module is typically the longest to parse and this is using wasmparser's more inefficient API plus a lack of parallelism that otherwise happens today. I'd like to ideally consider this in isolation and consider sightglass benchmarks, for example, before committing to this.

your suspicion is right and I have some first numbers. As written (serial OperatorsReader::read over every body), the scan costs ~83% of a full serial validate_all on Emscripten-built sqlite3.wasm (it's 833 KB and 4.8 ms scan vs 5.8 ms validate e2e). I agree that's too much overhead. Using the VisitOperator API and running bodies in parallel (same shape as validation) brings it to 0.57 ms (e2e). There are also free bail-outs: skip the walk when the module has no defined funcref tables (10 of my 18 benchmark modules), when every table is already conservatively marked (imported/exported), and early-exit once all tables are marked. lmk if this is agreeable and I can make the change (here or in a separate PR slice).
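The bail-outs listed above can be sketched in Rust as follows. This is a serial illustration only: the `Op` enum stands in for real wasm operators, and `scan_for_mutations` is a hypothetical function that does not use wasmparser's API. It returns the per-table marks plus the number of operators visited, so the early exit is observable.

```rust
/// Stand-in for the wasm operators that matter to the scan (hypothetical).
#[derive(Debug, Clone, Copy)]
enum Op {
    TableSet { table: usize },
    TableFill { table: usize },
    TableCopy { dst: usize },
    TableGrow { table: usize },
    TableInit { table: usize },
    Other,
}

impl Op {
    /// The table this operator may write to, if any.
    fn mutated_table(self) -> Option<usize> {
        match self {
            Op::TableSet { table }
            | Op::TableFill { table }
            | Op::TableGrow { table }
            | Op::TableInit { table } => Some(table),
            Op::TableCopy { dst } => Some(dst),
            Op::Other => None,
        }
    }
}

/// `marked[i]` starts out true for tables that are already conservatively
/// marked (for example imported or exported ones).
fn scan_for_mutations(bodies: &[Vec<Op>], mut marked: Vec<bool>) -> (Vec<bool>, usize) {
    let mut visited = 0;
    // Bail-out: no tables at all, or nothing left to learn.
    if marked.iter().all(|&m| m) {
        return (marked, visited);
    }
    for body in bodies {
        for &op in body {
            visited += 1;
            if let Some(t) = op.mutated_table() {
                if let Some(slot) = marked.get_mut(t) {
                    *slot = true;
                }
                // Early exit once every table is marked.
                if marked.iter().all(|&m| m) {
                    return (marked, visited);
                }
            }
        }
    }
    (marked, visited)
}
```

For a module with a single unmarked table, the walk stops at the first mutating operator. For a module with no tables, or only pre-marked ones, no operator is visited at all.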

> At a high-level, as well, can you describe the shapes of modules that would benefit from these sorts of optimizations? For example I would expect constant-index dispatches to be optimized by the frontend, the entire table sharing one signature to be relatively rare, and the null-check not mattering much in practice since it's something where we just catch the fault. For the bounds-check I believe we already optimized fixed-size tables (statically fixed-size at least via the type) and the lazy-init changes I'm not totally sold on yet myself (e.g. would want to benchmark/analyze more). Overall it feels like a pretty slim shape of module that would fit within these constraints, but that doesn't mean that they're not important. The optimizations here are pretty easy to read and reason about, so that's why I'm curious to understand more about this use case.

the target is Pulley on iOS/watchOS/tvOS/visionOS, where the App Store rules out JIT so everything is interpreted, and each retained check in the call_indirect sequence is one-plus extra interpreter dispatch per call rather than a folded native instruction. That also reframes two of your points: the null check is nearly free on native because the fault is the check, but Pulley (and any signals_based_traps(false) config) emits it explicitly; and uniform-signature tables are rare in big C/C++ modules (sqlite3 has ~25 signatures across ~620 elems) but common in the single-language modules this targets. 5 of the 8 table-bearing modules in my test corpus (both AssemblyScript builds, both Porffor builds, plus a Rust library called xmrsplayer) have exactly one signature across the whole table. Porffor funnels every indirect call through one giant calling-convention signature by design. The common shape across all 8: exactly one defined funcref table, one active element segment, and zero table mutation anywhere in the code section. btw, I looked at Porffor and AssemblyScript based on the questions/recommendations in the chat.

Wasmtime GitHub notifications bot (Jul 20 2026 at 20:02):

alexcrichton commented on PR #13445:

Separate PRs are fine, yeah, and is something we try to do where possible. For testing I'm basically asking you to write *.wast tests. These are about as deterministic as they can get and all have a well-defined meaning, so no risk of us randomly disabling them. And yeah thanks for the clarification of your operating target, that makes sense. If you're ok with it I think it'd be best to split this up into separately reviewable pieces and we can go from there.

If you're curious, the main driver for me to request separate PRs is that (a) it makes it much easier to review and (b) any mistake here is likely a CVE-in-waiting for Wasmtime. Correctness here is paramount and is much easier to review in smaller chunks to ensure that all corner cases are well-tested, everything's given proper thought, etc.

Wasmtime GitHub notifications bot (Jul 20 2026 at 20:07):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 20 2026 at 21:31):

matthargett commented on PR #13445:

> Separate PRs are fine, yeah, and is something we try to do where possible. For testing I'm basically asking you to write *.wast tests. These are about as deterministic as they can get and all have a well-defined meaning, so no risk of us randomly disabling them. And yeah thanks for the clarification of your operating target, that makes sense. If you're ok with it I think it'd be best to split this up into separately reviewable pieces and we can go from there.

no worries! I'll reframe this PR as the first slice and stack the next steps as drafts.

> If you're curious, the main driver for me to request separate PRs is that (a) it makes it much easier to review and (b) any mistake here is likely a CVE-in-waiting for Wasmtime. Correctness here is paramount and is much easier to review in smaller chunks to ensure that all corner cases are well-tested, everything's given proper thought, etc.

we're in total alignment: protecting against malicious/pathological programs matters in my interpreter-only deployment scenario as well. I did another adversarial review with the CVE framing, and it found a real soundness hole that my earlier fuzzing didn't find. finalize_table_init folds element segments into the precomputed image only up to the first non-foldable segment (dynamic offset or expressions-form); the rest are applied at instantiation. The elision predicates treat the image as the table's complete contents, so a deferred segment installing a wrong-signature function dodges an elided sig check (type confusion), and a deferred null write dodges an elided null check. (I'd say "derp", but it's actually a pretty complex scenario of interactions so I don't feel too embarrassed.) The rebuilt PR stack fixes it (once I push): static_funcref_image refuses tables with deferred segments, with regression coverage at three levels (wast modules that would miscall/crash on the current #13445 branch, environ unit tests, and the guard living inside the helper so no caller can forget it).
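The hole and its fix can be modeled in a few lines of Rust. The names below (`Segment`, `TableInit`, `fold_segments`, `static_image`) are hypothetical and only loosely modeled on the finalize_table_init / static_funcref_image helpers mentioned in the comment. As a simplification of this sketch, a static segment that does not fit the table is also treated as deferred.

```rust
/// Simplified element segment (hypothetical; not Wasmtime's representation).
#[derive(Debug, Clone)]
enum Segment {
    /// Constant offset plus a list of function indices: foldable at compile time.
    Static { offset: usize, funcs: Vec<u32> },
    /// Dynamic offset or expressions-form: only applied at instantiation.
    Deferred,
}

#[derive(Debug)]
struct TableInit {
    /// Precomputed slots; `None` means null.
    image: Vec<Option<u32>>,
    /// Number of segments left to apply at instantiation.
    deferred_segments: usize,
}

/// Fold segments into the image, stopping at the first non-foldable one.
fn fold_segments(size: usize, segments: &[Segment]) -> TableInit {
    let mut image = vec![None; size];
    for (i, seg) in segments.iter().enumerate() {
        match seg {
            Segment::Static { offset, funcs } if *offset + funcs.len() <= size => {
                for (j, f) in funcs.iter().enumerate() {
                    image[*offset + j] = Some(*f);
                }
            }
            // Everything from here on is applied at instantiation, in order.
            _ => {
                return TableInit { image, deferred_segments: segments.len() - i };
            }
        }
    }
    TableInit { image, deferred_segments: 0 }
}

/// The guard: the image describes the whole table only when nothing is deferred.
fn static_image(init: &TableInit) -> Option<&[Option<u32>]> {
    if init.deferred_segments == 0 {
        Some(init.image.as_slice())
    } else {
        None
    }
}
```

Putting the check inside `static_image` rather than at each call site mirrors the point made above: a caller that wants the image has to go through the guard, so it cannot forget it.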

Wasmtime GitHub notifications bot (Jul 20 2026 at 21:33):

matthargett requested alexcrichton for a review on PR #13445.

Wasmtime GitHub notifications bot (Jul 20 2026 at 21:33):

matthargett requested wasmtime-default-reviewers for a review on PR #13445.

Wasmtime GitHub notifications bot (Jul 20 2026 at 21:33):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 20 2026 at 21:42):

matthargett updated PR #13445.

Wasmtime GitHub notifications bot (Jul 21 2026 at 01:36):

matthargett edited PR #13445:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

constant-index dispatches lower to direct call F

sig check is elided when every elem in the table shares the call_indirect's sig

null check is elided when no precomputed slot is null

bounds check + per-dispatch bound load are elided on non-growable tables

the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table; every site qualifies. The original impetus for this optimization, the Rust xmrsplayer compiled to wasm, needs later PRs in this chain to see uplift.

Soundness

Three corners had to be tightened during development:

Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).

The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.

Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (commits 1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commit disable the commits 1-8 brif elision based on PMU evidence documents this. I'm looking beneath pure wallclock time into lower-level CPU counters to make sure we aren't silently regressing things in a way that surprises us later.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (Jul 21 2026 at 02:14):

matthargett edited PR #13445:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

constant-index dispatches lower to direct call F

sig check is elided when every elem in the table shares the call_indirect's sig

null check is elided when no precomputed slot is null

bounds check + per-dispatch bound load are elided on non-growable tables

the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table; every site qualifies. The original impetus for this optimization, the Rust xmrsplayer compiled to wasm, needs later PRs in this chain to see uplift. The Porffor and AssemblyScript concerns I was asked to also solve for don't get uplift from this PR; their uplift comes at the end of the PR stack.

Soundness

Three corners had to be tightened during development:

Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).

The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.

Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (commits 1-8: egraph-folds it to trapz) showed a ~14 % increase in branch mispredictions on the iPhone 12 E-core profiler across 3+ runs, without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); the commit "disable the commits 1-8 brif elision based on PMU evidence" documents this. I'm looking beneath pure wallclock time into lower-level CPU counters to make sure we aren't silently layering things in a way that surprises us later.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (Jul 22 2026 at 19:31):

alexcrichton commented on PR #13445:

A question for you: is the immutable table analysis here used in subsequent PRs for other optimizations? (I forget, it's been awhile since I originally looked at the whole stack of commits here)

If no, the main consequence here looks to be deducing that a table can't grow and adjusting the bounds check appropriately. This sort of optimization in theory could be in a guest toolchain (e.g. a wasm-to-wasm transform like binaryen, or fancier optimizations in LLVM), which is why I ask. Where possible optimizing in the toolchain I think is preferable (reducing the TCB of the engine). If, however, this analysis is used for later optimizations as well then I could see how it's not possible to put it entirely in the guest toolchain.

Wasmtime GitHub notifications bot (Jul 22 2026 at 23:44):

matthargett commented on PR #13445:

A question for you: is the immutable table analysis here used in subsequent PRs for other optimizations? (I forget, it's been awhile since I originally looked at the whole stack of commits here)

Yes. We need the immutability guarantee to do the optimizations later in the stack. If it's not clear in their descriptions, lmk. I agree it was easier to see when it was all in one place, that's why I submitted it that way initially :)

If no, the main consequence here looks to be deducing that a table can't grow and adjusting the bounds check appropriately. This sort of optimization in theory could be in a guest toolchain (e.g. a wasm-to-wasm transform like binaryen, or fancier optimizations in LLVM), which is why I ask. Where possible optimizing in the toolchain I think is preferable (reducing the TCB of the engine). If, however, this analysis is used for later optimizations as well then I could see how it's not possible to put it entirely in the guest toolchain.

At the very least, Rust 1.8x and 1.9x don't do the devirtualization, even with LTO and whole-program enabled, when emitting for the wasm32 platform target. Again, my driver was a Rust library compiled to wasm, and then doing the profiling work to eliminate skips in the music and keep the mobile CPU in as low a power mode as possible.

Wasmtime GitHub notifications bot (Jul 23 2026 at 02:35):

matthargett edited PR #13445:

TL;DR

Adds a per-table compile-time fact — "this table can never be mutated after instantiation" — and uses it for one optimization (in this PR): tables that can never grow get a constant bound, extending the existing min == max static-bound rule to tables whose type alone doesn't pin their size.
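The constant-bound rule described above can be sketched as follows. TableLimits and static_bound are hypothetical names, and this is a simplified model under stated assumptions rather than the PR's actual code.

```rust
/// Hypothetical, simplified table limits (not Wasmtime's real type).
struct TableLimits {
    min: u64,
    max: Option<u64>,
}

/// Returns a bound that can be treated as a compile-time constant, if any.
fn static_bound(limits: &TableLimits, never_mutated: bool) -> Option<u64> {
    // Existing rule: the type itself pins the size when min == max.
    if limits.max == Some(limits.min) {
        return Some(limits.min);
    }
    // Extension: if nothing can ever mutate (and so grow) the table,
    // its size stays at the declared minimum for the instance's lifetime.
    if never_mutated {
        return Some(limits.min);
    }
    // Otherwise the current size has to be loaded at each dispatch.
    None
}
```

For example, a table declared with min 4 and max 16 gets the constant bound 4 only when the analysis proves it is never mutated.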

The fact lives as a serialized set on Module behind Module::table_is_immutable, which documents the contract; follow-up PRs #13909 and #13910 use it to elide the call_indirect signature check on uniform-signature immutable tables and the null check on fully-covered ones. call_indirect was the long pole in my CPU profiling of xmrsplayer compiled to wasm by Rust 1.83, both when running on Apple A12 (iPhone XS) and M4 MacBook (efficiency CPU cores in both cases). The Porffor and AssemblyScript concerns I was asked to also solve for don't get uplift from this PR; their uplift comes at the end of the PR stack.

Compile-time cost

The analysis is one extra decode pass over the code section, addressed per the review discussion:

skipped when the module has no tables (10 of the 18 modules in the benchmark corpus that motivated this) or when every table is already conservatively marked — e.g. Emscripten output, which exports its function table;
bodies decode through VisitOperator and run on the rayon pool under parallel-compilation; the serial fallback stops as soon as every table is marked.

Measured on an 833 KiB Emscripten-built sqlite3.wasm: ~4 ms serial, ~0.6 ms parallel — and 0 in practice, since its exported table takes the skip path. Sightglass numbers to follow in this PR.

Testing

crates/environ/tests/table_mutability.rs: 14 unit cases for the analysis, including exported/imported pre-marking, table.copy marking only its destination, elem.drop marking nothing, and out-of-range indices in not-yet-validated bodies marking nothing.
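The marking rules those cases exercise can be sketched like this. The Op enum is a hypothetical, trimmed-down stand-in for the real operator stream, and mutated_tables is an illustrative name; the actual pass decodes bodies through VisitOperator.

```rust
use std::collections::HashSet;

/// Hypothetical subset of operators relevant to table mutation.
enum Op {
    TableSet { table: u32 },
    TableFill { table: u32 },
    TableGrow { table: u32 },
    TableInit { table: u32 },
    TableCopy { dst: u32, src: u32 },
    ElemDrop,
    Other,
}

/// Returns the set of table indices that may be mutated after instantiation.
/// `pre_marked` models tables marked up front (for example exported or
/// imported ones).
fn mutated_tables(num_tables: u32, pre_marked: &[u32], bodies: &[Vec<Op>]) -> HashSet<u32> {
    let total = num_tables as usize;
    let mut marked: HashSet<u32> =
        pre_marked.iter().copied().filter(|t| *t < num_tables).collect();
    // Skip path: no tables, or every table is already conservatively marked.
    if total == 0 || marked.len() == total {
        return marked;
    }
    for body in bodies {
        for op in body {
            let target = match op {
                Op::TableSet { table }
                | Op::TableFill { table }
                | Op::TableGrow { table }
                | Op::TableInit { table } => Some(*table),
                // Only the destination of a copy is written.
                Op::TableCopy { dst, .. } => Some(*dst),
                // Dropping a segment touches no table.
                Op::ElemDrop | Op::Other => None,
            };
            // Bodies are not validated yet, so ignore out-of-range indices.
            if let Some(t) = target {
                if t < num_tables {
                    marked.insert(t);
                }
            }
        }
        // Early exit once nothing is left to learn.
        if marked.len() == total {
            break;
        }
    }
    marked
}
```

A table is then treated as immutable exactly when its index is absent from the returned set.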

Runtime *.wast on both sides of the fence: immutable shapes (in-bounds results; OOB, null-slot, and sig-mismatch traps) and mutated shapes where nothing may be optimized: table.grow then dispatch into the grown region, table.set of null then dispatch, and an exported table written by a second module through its import. You can see my background in security/fuzzing shining through :P

A disas test pins the constant-bound shape for a min < max table that nothing grows; regenerated goldens show the same folding in table-init startup functions.

Follow-ups (separate PRs)

Signature-check elision on uniform-signature immutable tables, then null-check elision on fully-covered ones (it needs the sig PR's static-image helper), then the eager-init work, isolated as requested. The constant-index-to-direct-call rewrite from the original version of this PR is dropped — zero of 912 call_indirect sites in the corpus have a constant index, and toolchains (Binaryen Directize) already do it where provable.
Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (commits 1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I'm looking beneath pure wallclock time into lower-level CPU counters to make sure we aren't silently layering things in a way that surprises us with e2e visibility later.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (Jul 23 2026 at 02:35):

matthargett edited PR #13445:

TL;DR

Adds a per-table compile-time fact — "this table can never be mutated after instantiation" — and uses it for one optimization (in this PR): tables that can never grow get a constant bound, extending the existing min == max static-bound rule to tables whose type alone doesn't pin their size.

The fact lives as a serialized set on Module behind Module::table_is_immutable, which documents the contract; follow-up PRs #13909 and #13910 use it to elide the call_indirect signature check on uniform-signature immutable tables and the null check on fully-covered ones. call_indirect was the long pole in my CPU profiling of xmrsplayer compiled to wasm by Rust 1.83, both when running on Apple A12 (iPhone XS) and M4 MacBook (efficiency CPU cores in both cases). The Porffor and AssemblyScript concerns I was asked to also solve for don't get uplift from this PR; their uplift comes at the end of the PR stack.

Compile-time cost

The analysis is one extra decode pass over the code section, addressed per the review discussion:

skipped when the module has no tables (10 of the 18 modules in the benchmark corpus that motivated this) or when every table is already conservatively marked — e.g. Emscripten output, which exports its function table;
bodies decode through VisitOperator and run on the rayon pool under parallel-compilation; the serial fallback stops as soon as every table is marked.

Measured on an 833 KiB Emscripten-built sqlite3.wasm: ~4 ms serial, ~0.6 ms parallel — and 0 in practice, since its exported table takes the skip path. Sightglass numbers to follow in this PR.

Testing

crates/environ/tests/table_mutability.rs: 14 unit cases for the analysis, including exported/imported pre-marking, table.copy marking only its destination, elem.drop marking nothing, and out-of-range indices in not-yet-validated bodies marking nothing.

Runtime *.wast on both sides of the fence: immutable shapes (in-bounds results; OOB, null-slot, and sig-mismatch traps) and mutated shapes where nothing may be optimized: table.grow then dispatch into the grown region, table.set of null then dispatch, and an exported table written by a second module through its import. You can see my background in security/fuzzing shining through :P

A disas test pins the constant-bound shape for a min < max table that nothing grows; regenerated goldens show the same folding in table-init startup functions.

Follow-ups (separate PRs)

Signature-check elision on uniform-signature immutable tables, then null-check elision on fully-covered ones (it needs the sig PR's static-image helper), then the eager-init work, isolated as requested. The constant-index-to-direct-call rewrite from the original version of this PR is dropped — zero of 912 call_indirect sites in the corpus have a constant index, and toolchains (Binaryen Directize) already do it where provable.
Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (across the whole patch stack) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I'm looking beneath pure wallclock time into lower-level CPU counters to make sure we aren't silently layering things in a way that surprises us with e2e visibility later.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (Jul 23 2026 at 02:48):

matthargett edited PR #13445:

TL;DR

Adds a per-table compile-time fact — "this table can never be mutated after instantiation" — and uses it for one optimization (in this PR): tables that can never grow get a constant bound, extending the existing min == max static-bound rule to tables whose type alone doesn't pin their size.

The fact lives as a serialized set on Module behind Module::table_is_immutable, which documents the contract; follow-up PRs #13909 and #13910 use it to elide the call_indirect signature check on uniform-signature immutable tables and the null check on fully-covered ones. call_indirect was the long pole in my CPU profiling of xmrsplayer compiled to wasm by Rust 1.83, both when running on Apple A12 (iPhone XS) and M4 MacBook (efficiency CPU cores in both cases). The Porffor and AssemblyScript concerns I was asked to also solve for don't get uplift from this PR; their uplift comes at the end of the PR stack.

Compile-time cost

The analysis is one extra decode pass over the code section, addressed per the review discussion:

skipped when the module has no tables (10 of the 18 modules in the benchmark corpus that motivated this) or when every table is already conservatively marked — e.g. Emscripten output, which exports its function table;
bodies decode through VisitOperator and run on the rayon pool under parallel-compilation; the serial fallback stops as soon as every table is marked.

Measured on an 833 KiB Emscripten-built sqlite3.wasm: ~4 ms serial, ~0.6 ms parallel — and 0 in practice, since its exported table takes the skip path. Sightglass numbers to follow in this PR.

Testing

crates/environ/tests/table_mutability.rs: 14 unit cases for the analysis, including exported/imported pre-marking, table.copy marking only its destination, elem.drop marking nothing, and out-of-range indices in not-yet-validated bodies marking nothing.

Runtime *.wast on both sides of the fence: immutable shapes (in-bounds results; OOB, null-slot, and sig-mismatch traps) and mutated shapes where nothing may be optimized: table.grow then dispatch into the grown region, table.set of null then dispatch, and an exported table written by a second module through its import. You can see my background in security/fuzzing shining through :P

A disas test pins the constant-bound shape for a min < max table that nothing grows; regenerated goldens show the same folding in table-init startup functions.

Follow-ups (separate PRs)

Signature-check elision on uniform-signature immutable tables, then null-check elision on fully-covered ones (it needs the sig PR's static-image helper), then the eager-init work, isolated as requested. The constant-index-to-direct-call rewrite from the original version of this PR is dropped — zero of 912 call_indirect sites in the corpus have a constant index, and toolchains (Binaryen Directize) already do it where provable.
Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests. See #13909 and #13910 for the next steps in the patch stack.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (across the whole patch stack) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I'm looking beneath pure wallclock time into lower-level CPU counters to make sure we aren't silently layering things in a way that surprises us with e2e visibility later.

Tests

2237 / 2237 cranelift filetests

16 / 16 crates/environ/tests/table_mutability.rs integration

new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Wasmtime GitHub notifications bot (Jul 23 2026 at 17:12):

alexcrichton commented on PR #13445:

Ok I went and reviewed the other PRs as well again now too. From my understanding all of the optimizations that you want to apply are modelable in WebAssembly today, in theory. For example this PR is possible if a table's maximum size is the same as the minimum size. For https://github.com/bytecodealliance/wasmtime/pull/13909 it's possible to have a table of (ref null $func_ty) which would elide signature checks. For https://github.com/bytecodealliance/wasmtime/pull/13910 it's possible to have (ref func) or (ref $func_ty) to elide null checks. If this is the extent of the optimizations you'd like to perform, this is where I'd ideally like to lean on wasm toolchains rather than the runtime here. For example all the optimizations are redundant here with what Wasmtime is already doing in other contexts (for statically known types).
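As a rough model of the mapping described in this comment, the sketch below lists which per-dispatch checks remain for each table shape. ElemType, Checks, and checks_for are hypothetical names, and the model assumes $func_ty is exactly the type the call_indirect expects; it is not Wasmtime's implementation.

```rust
/// Hypothetical model of a table's declared element type.
#[derive(Clone, Copy)]
enum ElemType {
    /// `funcref`: untyped and nullable.
    FuncRef,
    /// `(ref func)`: untyped but non-null.
    NonNullFunc,
    /// `(ref null $func_ty)`: typed but nullable.
    NullableTyped,
    /// `(ref $func_ty)`: typed and non-null.
    NonNullTyped,
}

/// Which runtime work a dispatch through the table still needs.
#[derive(Debug, PartialEq, Eq)]
struct Checks {
    /// The bound must be loaded at runtime instead of being a constant.
    dynamic_bound: bool,
    signature: bool,
    null: bool,
}

fn checks_for(elem: ElemType, min: u64, max: Option<u64>) -> Checks {
    let (typed, nullable) = match elem {
        ElemType::FuncRef => (false, true),
        ElemType::NonNullFunc => (false, false),
        ElemType::NullableTyped => (true, true),
        ElemType::NonNullTyped => (true, false),
    };
    Checks {
        // A table whose max equals its min has a statically known size.
        dynamic_bound: max != Some(min),
        // A matching static element type makes the signature check redundant.
        signature: !typed,
        // A non-nullable element type makes the null check redundant.
        null: nullable,
    }
}
```

In this model a fixed-size table of (ref $func_ty) needs none of the three, which is the shape a toolchain transform would have to emit.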

It's certainly possible for Wasmtime to pick up optimizations that guest toolchains aren't doing, but this is something where I subjectively would prefer to push these sorts of optimizations into tooling. My primary reasoning is that WebAssembly is already expressive enough to model all the optimizations you want to do, and Wasmtime is already optimizing modules as you want if the modules have these shapes.

Now I don't mean to give you the impression that these optimizations should be happening in Rust/LLVM already. As you've seen Rust/LLVM itself does not do these optimizations, and the only tool in theory that might would be wasm-opt, and I'm not sure if it does optimizations like this. My point more broadly is that this should be eminently doable in tooling (e.g. with a wasm-to-wasm transform baked into wasm-opt/binaryen or maybe something built with wasm-tools crates). This then avoids the need for Wasmtime to parse the entire code section twice for all modules, for example.

Does what I'm saying make sense though? Would you be able to test out and confirm that modeling these optimizations with preexisting WebAssembly constructs would achieve the performance you're looking for?

Wasmtime GitHub notifications bot (Jul 24 2026 at 03:45):

:cross_mark: matthargett closed without merge PR #13445.

Wasmtime GitHub notifications bot (Jul 24 2026 at 03:45):

matthargett commented on PR #13445:

Does what I'm saying make sense though? Would you be able to test out and confirm that modeling these optimizations with preexisting WebAssembly constructs would achieve the performance you're looking for?

I am a little confused. I thought that with all of the middle-end transformations that are already happening, which are very similar to GCC and LLVM IR, and the memory/overhead that they take up, that taking advantage of the existing infra would be preferred. Adding another tool into the suite users have to know about, akin to webpack and myriad plugins in the JS ecosystem, which replicate the DFA and other plumbing already in the runtime, doesn't seem like great ergonomics to me for the user, but also seems dissonant with how much code and engineering has gone into the IR, predicates, and solves up to this point.

Overall, it sounds like this work stream and its mobile CPU leaning isn't a good fit for the project. I did learn a lot though, so I appreciate your time to walk me through all the dimensions of the constraints. It's definitely given me some concrete experience in the broader WASM community that clarifies some of the tech strategy decisions for my own project. Cheers! :D

Last updated: Jul 29 2026 at 05:03 UTC