Stream: git-wasmtime

Topic: wasmtime / PR #13445 cranelift: per-table mutability trac...


view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 02:25):

matthargett opened PR #13445 from rebeckerspecialties:table-mutability-tracking-upstream to bytecodealliance:main:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

This is the predicate-factory PR. Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

c1-7 vs c1-8 elision floor

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % Discarded-bucket increase on iPhone 12 Icestorm E-core PMU at N=3 without a wallclock improvement. Kept the c1-7 form (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested cfallin for a review on PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-compiler-reviewers for a review on PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 02:26):

matthargett requested wasmtime-core-reviewers for a review on PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 03:26):

matthargett updated PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 04:36):

cfallin commented on PR #13445:

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 06:47):

matthargett commented on PR #13445:

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.

* "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-`call_indirect` cost" -- so it sounds like this is not just computing predicates but using them?

sorry, this is a term we used in the code for my product called BugScan back in 2003. in that and this context, analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice.

This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications uses those bits to do some elision of opcodes based on the proof of analysis that the predicates.

The reason there's a split between the PRs is that while there's some low-level CPU counters and profiler stats that improve just with this PR, but wallclock/e2e results on my devices didn't show an above-noise uplift.

* "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.

c here means commit, again sorry for the sourceforge-era shorthand. I tried to keep the commit stack clean so that each change and why it was necessary was discrete and easy reason about one after the other (and not duplicate this information in the PR description). if it was one big commit, then some of the complexity around making the changes resilient to the fuzzer might seem unrelated or unnecessary.

the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.

* "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

the "discarded" metric comes from the XCode profiling tools, specificallt xctrace's template for CPU bottlenecks. it's how many branch prediction mis-predicts happen, and its relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, _especially_ before diving in to a 2k-line PR.

sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

I re-reviewed diffs inbetween each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house.

I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 06:56):

matthargett edited PR #13445:

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 07:02):

matthargett commented on PR #13445:

I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 07:11):

matthargett updated PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2026 at 09:52):

github-actions[bot] added the label wasmtime:api on PR #13445.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 14:21):

alexcrichton commented on PR #13445:

Would you be amenable to splitting up this PR into separate PRs for each optimization applied here? The optimizations here look sound to me and reasonable to implement, but I find it a bit difficult to consider them all at once vs one-at-a-time. Some high-level comments I have on this are applicable to this area, for example:

At a high-level, as well, can you describe the shapes of modules that would benefit from these sorts of optimizations? For example I would expect constant-index dispatches to be optimized by the frontend, the entire table sharing one signature to be relatively rare, and the null-check not mattering much in practice since it's something where we just catch the fault. For the bounds-check I believe we already optimized fixed-size tables (statically fixed-size at least via the type) and the lazy-init changes I'm not totally sold on yet myself (e.g. would want to benchmark/analyze more). Overall it feels like a pretty slim shape of module that would fit within these constraints, but that doesn't meant that they're not important. The optimizations here are pretty easy to read and reason about, so that's why I'm curious to understand more about this use case.


Last updated: Jun 01 2026 at 09:49 UTC