matthargett opened PR #13445 from rebeckerspecialties:table-mutability-tracking-upstream to bytecodealliance:main:
TL;DR
Adds a
ModuleTranslation::tables_mutatedbit (set during translation bytable.set/table.fill/table.copy-dest /table.grow/table.initopcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks oncall_indirectagainst provably-immutable funcref tables:
- constant-index dispatches lower to direct
call F- sig check is elided when every elem in the table shares the call_indirect's sig
- null check is elided when no precomputed slot is null
- bounds check + per-dispatch bound load are elided on non-growable tables
- the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved
VMFuncRef *directly at instantiation)Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.
Why
This is the predicate-factory PR. Each elision saves a small per-
call_indirectcost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.Soundness
Three corners had to be tightened during development:
- Passive elem segments with dest tables are conservatively counted as mutations (else
elem.initagainst a slot the predicate said was immutable would slip through).- The constant-index direct-call rewrite loads the callee's
vmctxfrom the precomputedVMFuncRef, not the caller's.- Null-check elision skips the tagged-null pattern (slot value
1, produced bytable.fill(null)on a tagged table; excluded by the immutability half of the predicate).
tests/all/leftover_elem_segment_soundness.rs+ 4 disas filetests cover the soundness shapes.crates/environ/tests/table_mutability.rshas 12 cases for the predicate itself.c1-7 vs c1-8 elision floor
An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to
trapz) showed ~14 % Discarded-bucket increase on iPhone 12 Icestorm E-core PMU at N=3 without a wallclock improvement. Kept the c1-7 form (brif retained, mask + tagged-pointer test); commitdisable the c1-8 brif elision based on PMU evidencedocuments this.Tests
- 2237 / 2237 cranelift filetests
- 16 / 16
crates/environ/tests/table_mutability.rsintegration- new
tests/all/leftover_elem_segment_soundness.rsStacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.
matthargett requested cfallin for a review on PR #13445.
matthargett requested wasmtime-compiler-reviewers for a review on PR #13445.
matthargett requested wasmtime-core-reviewers for a review on PR #13445.
matthargett updated PR #13445.
cfallin commented on PR #13445:
@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:
- "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-
call_indirectcost" -- so it sounds like this is not just computing predicates but using them?- "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.
- "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.
These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.
(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)
matthargett commented on PR #13445:
@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:
I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.
* "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-`call_indirect` cost" -- so it sounds like this is not just computing predicates but using them?sorry, this is a term we used in the code for my product called BugScan back in 2003. in that and this context, analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice.
This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications uses those bits to do some elision of opcodes based on the proof of analysis that the predicates.
The reason there's a split between the PRs is that while there's some low-level CPU counters and profiler stats that improve just with this PR, but wallclock/e2e results on my devices didn't show an above-noise uplift.
* "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.
chere means commit, again sorry for the sourceforge-era shorthand. I tried to keep the commit stack clean so that each change and why it was necessary was discrete and easy reason about one after the other (and not duplicate this information in the PR description). if it was one big commit, then some of the complexity around making the changes resilient to the fuzzer might seem unrelated or unnecessary.the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.
* "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.the "discarded" metric comes from the XCode profiling tools, specificallt xctrace's template for CPU bottlenecks. it's how many branch prediction mis-predicts happen, and its relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).
These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, _especially_ before diving in to a 2k-line PR.
sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.
(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)
I re-reviewed diffs inbetween each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house.
I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure.
matthargett edited PR #13445:
TL;DR
Adds a
ModuleTranslation::tables_mutatedbit (set during translation bytable.set/table.fill/table.copy-dest /table.grow/table.initopcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks oncall_indirectagainst provably-immutable funcref tables:
- constant-index dispatches lower to direct
call F- sig check is elided when every elem in the table shares the call_indirect's sig
- null check is elided when no precomputed slot is null
- bounds check + per-dispatch bound load are elided on non-growable tables
- the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved
VMFuncRef *directly at instantiation)Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.
Why
Each elision saves a small per-
call_indirectcost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.Soundness
Three corners had to be tightened during development:
- Passive elem segments with dest tables are conservatively counted as mutations (else
elem.initagainst a slot the predicate said was immutable would slip through).- The constant-index direct-call rewrite loads the callee's
vmctxfrom the precomputedVMFuncRef, not the caller's.- Null-check elision skips the tagged-null pattern (slot value
1, produced bytable.fill(null)on a tagged table; excluded by the immutability half of the predicate).
tests/all/leftover_elem_segment_soundness.rs+ 4 disas filetests cover the soundness shapes.crates/environ/tests/table_mutability.rshas 12 cases for the predicate itself.Learnings / Caveats
An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to
trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commitdisable the c1-8 brif elision based on PMU evidencedocuments this.Tests
- 2237 / 2237 cranelift filetests
- 16 / 16
crates/environ/tests/table_mutability.rsintegration- new
tests/all/leftover_elem_segment_soundness.rsStacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.
matthargett commented on PR #13445:
I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now.
matthargett updated PR #13445.
github-actions[bot] added the label wasmtime:api on PR #13445.
alexcrichton commented on PR #13445:
Would you be amenable to splitting up this PR into separate PRs for each optimization applied here? The optimizations here look sound to me and reasonable to implement, but I find it a bit difficult to consider them all at once vs one-at-a-time. Some high-level comments I have on this are applicable to this area, for example:
- It looks like there's a bunch of duplication between the various predicates added here. I think it'd be reasonable to have some sort of helper on
Modulefor example to encapsulate most of this rather than duplicating all of it.- Testing here is something I'm a bit worried about. We don't typically test the internals of translation all that much because that can be quite brittle and difficult to update over time. The disas tests are good here, but I'd like to see more comprehensive runtime tests. For example these tests are quit suitable for the
*.wastformat I believe where tests could be done to ensure, for tables of a particular shape, that all runtime behavior is as expected.- The part about "some lazy init tables aren't actually lazily initialized" is something that I think I'd like to at least personally consider in more depth. That seems quite subtle and is something worth poring over for a bit, which I feel would be best as a separate, isolated, PR.
- From a compilation perspective this is going to incur what I suspect is a nontrivial slowdown to parse the entire code section of a module just looking for table mutation instructions. What's implemented here does seem like the simplest solution, but I'd like to separately consider the cost of doing this. The code section of a wasm module is typically the longest to parse and this is using wasmparser's more inefficient API plus a lack of parallelism that otherwise happens today. I'd like to ideally consider this in isolation and consider sightglass benchmarks, for example, before committing to this.
At a high-level, as well, can you describe the shapes of modules that would benefit from these sorts of optimizations? For example I would expect constant-index dispatches to be optimized by the frontend, the entire table sharing one signature to be relatively rare, and the null-check not mattering much in practice since it's something where we just catch the fault. For the bounds-check I believe we already optimized fixed-size tables (statically fixed-size at least via the type) and the lazy-init changes I'm not totally sold on yet myself (e.g. would want to benchmark/analyze more). Overall it feels like a pretty slim shape of module that would fit within these constraints, but that doesn't meant that they're not important. The optimizations here are pretty easy to read and reason about, so that's why I'm curious to understand more about this use case.
Last updated: Jun 01 2026 at 09:49 UTC