Stream: git-wasmtime

Topic: wasmtime / issue #9349 Escape analysis for Wasm GC


view this post on Zulip Wasmtime GitHub notifications bot (Oct 01 2024 at 16:50):

fitzgen added the performance label to Issue #9349.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 01 2024 at 16:50):

fitzgen added the cranelift:goal:optimize-speed label to Issue #9349.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 01 2024 at 16:50):

fitzgen added the wasmtime:ref-types label to Issue #9349.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 01 2024 at 16:50):

fitzgen opened issue #9349:

We could implement an escape analysis for Wasm GC types to figure out which allocations do not escape their stack frame. Then we can do unboxing/SROA, replacing a struct's fields or array elements with a set of local SSA values.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 25 2025 at 17:37):

alexcrichton added the wasm-proposal:gc label to Issue #9349.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 23 2026 at 21:59):

fitzgen commented on issue #9349:

Thought about this a bit more recently.

When optimizing Wasm GC I think we will want the following pass ordering:

From there the pipeline continues as it does today:

The reason we want gc_{alloc,load,store} intrinsics is so that we can reason about them, and elide them when possible, during escape analysis. That is much harder to do once we've already exploded them into their inlined CLIF code (e.g. bump allocation requires a branch for when we run out of capacity, GC loads and stores might require barriers that involve branching, etc.).

Note that we want to do escape analysis before safepoint spilling. We don't want to make everything "escape" when we spill it to the stack, and making the analysis understand when spilling to the stack does vs doesn't count as escaping is a bit hairy. It's also unnecessary so long as escape analysis happens before safepoint spilling.

Note that we want to do escape analysis after inlining. An object might escape before inlining, because it is passed as an argument to a function call, but after we inline that function call we see that it no longer escapes.

Those two previous notes already imply a different phase ordering from what we do today. Today, we do safepoint spilling at the end of CLIF construction and before inlining. In this hypothetical future, we would do CLIF translation, then inlining, then safepoint spilling.

What would the GC intrinsics be and where would they be defined? I see two main alternatives:

  1. If we keep the GC safepoint spilling in cranelift_frontend, and outside of cranelift_codegen, then I guess these would be well-known ir::ExtFunc sentinels. This is a bit hacky, and it also means we wouldn't be able to use the regular filetest infrastructure, at least not without a bunch of additional work. I've already been feeling this pain with the lack of filetests for GC safepoint spilling, and with including whole CLIF function dumps in regular Rust unit tests. It is also not clear to me exactly how we would do inlining (which is part of cranelift_codegen) before GC safepoint spilling (still part of cranelift_frontend). Also, we wouldn't be able to attach alias regions to the GC intrinsics.

  2. We could move GC safepoint spilling back into cranelift_codegen, and define the escape analysis in cranelift_codegen as well, which would allow the GC intrinsics to be CLIF instructions. This would allow us to attach alias regions to them. It would allow us to use filetests. Inlining before GC safepoint spilling is not a crate dependency issue anymore. It does mean that we need to reject the GC instructions during lowering and report a "you should have legalized these away by now" error. I think that is a fine trade off personally.

I know that (2) may raise questions like "wasn't the whole point of user stack maps to move this stuff out of cranelift_codegen?" Sort of. The key bit was making it so that the mid-end's GVN and alias analysis and all that saw the GC safepoint spilling and wouldn't do optimizations that were invalid in the face of e.g. a moving GC. The proposed phase ordering above still preserves that property (if we did the optional mid-end optimizations before inlining, we would need to disallow GVNing GC intrinsics, as that would reopen the original can of worms that introducing user stack maps was meant to avoid).

So yeah, I think (2) is the way to go here. And it is unfortunate that this is a bunch of new passes over the whole IR that we don't do today, but I think that this is somewhat unavoidable. That said, they are only necessary when we are optimizing GC code. Anything that doesn't use GC can completely skip all that.

I also have thoughts on the details of the static analysis that we do for escape analysis, but I'll leave them for a follow up comment.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 23 2026 at 22:11):

fitzgen commented on issue #9349:

Oh also one other small wrinkle from (2) above: how to do GC legalization and expand the GC instructions into regular CLIF? In (2) we are doing this in cranelift_codegen, so we can't give a cranelift_frontend::FunctionBuilder to the user and tell them to just insert what they need directly. Idea: reuse our inlining infrastructure and have the user supply a whole ir::Function that we inline to replace the gc_{alloc,load,store}.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 24 2026 at 18:38):

fitzgen commented on issue #9349:

Okay, here's some thoughts on the escape analysis itself, rather than the phase ordering.

(For simplicity, I'm imagining we won't bother doing SROA for arrays or any GC object that is dynamically indexed into, at least for quite a while.)

I don't think we need a sophisticated points-to analysis. I think we can get away with a simple backwards analysis where

This is a backwards analysis because it propagates backwards from a value's uses.

Our transfer functions would look something like this:

// A value escapes when it is an argument to a function.
escapes(S, (call f(x))) = S union {x}

// A value escapes when it is stored to non-GC memory.
escapes(S, (store x, p)) = S union {x}

// A value escapes when it is returned from the function.
escapes(S, (return x)) = S union {x}

// A value escapes if it is a block arg to a block param that escapes.
escapes(S, (jump block(x))) =
    if block.params[0] in S {
        S union {x}
    } else {
        S
    }

// A value escapes if it is stored into a GC object that escapes.
escapes(S, (gc_store r[field], x)) =
    if r in S {
        S union {x}
    } else {
        S
    }

// Otherwise, an object escapes if it escapes in any successor.
escapes(S, x) = union( escapes(S, y) for y in successors(x) )

We would initialize S with our function parameters, since anything that came from outside this function has already escaped before we even began (and we can't elide allocations from this function's callers while optimizing this function anyways).
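As a toy model of those transfer functions (not Cranelift code; the instruction shapes here are simplified stand-ins for the proposed intrinsics), a single backwards pass over one basic block might look like:

```python
def block_escapes(escapes_out, insts):
    """One backwards pass of the transfer functions over a basic block.

    `insts` is a list of tuples like ("call", v), ("store", v),
    ("return", v), or ("gc_store", r, x) meaning "store x into a field
    of r". `escapes_out` is the set of values already known to escape
    in some successor.
    """
    s = set(escapes_out)
    for inst in reversed(insts):
        op = inst[0]
        if op in ("call", "store", "return"):
            # The value flows outside the stack frame.
            s.add(inst[1])
        elif op == "gc_store":
            r, x = inst[1], inst[2]
            if r in s:
                # Storing into an escaping object makes the stored
                # value escape too.
                s.add(x)
    return s

# v0 is returned, and v1 is stored into v0 before that, so both escape;
# v2 is only stored into v3, which never escapes.
insts = [
    ("gc_store", "v3", "v2"),
    ("gc_store", "v0", "v1"),
    ("return", "v0"),
]
assert block_escapes(set(), insts) == {"v0", "v1"}
```

Within a single block the backwards order is what lets the gc_store rule see later escapes of `r` before deciding about `x`.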

Note that we wouldn't actually compute this for every single instruction, but instead have escapes_in and escapes_out summary sets for each basic block, similar to what we do for liveness during safepoint spilling. Note that:

Given that information, we can do SROA of GC objects by walking the CFG in reverse post-order (so that we visit defs before uses) and doing the following:

The current_fields and current_aliases maps will need to be scoped hash maps.
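A scoped hash map here means a map with push/pop scope markers, so that entries inserted while visiting one dominator-tree subtree are discarded when we leave it. A minimal Python sketch of the shape (just an illustration, not Cranelift's actual implementation):

```python
class ScopedMap:
    """A map with nested scopes: entries added after push() vanish on
    the matching pop(). This is the shape of map the SROA rewrite
    would use for current_fields while walking the dominator tree."""

    def __init__(self):
        self._scopes = [{}]

    def push(self):
        self._scopes.append({})

    def pop(self):
        self._scopes.pop()

    def insert(self, key, value):
        self._scopes[-1][key] = value

    def get(self, key):
        # The innermost scope wins.
        for scope in reversed(self._scopes):
            if key in scope:
                return scope[key]
        return None

m = ScopedMap()
m.insert(("v2", 0), "v3")   # current_fields[v2][0] = v3
m.push()                    # enter a dominator-tree child
m.insert(("v2", 0), "v4")   # shadowed within the child
assert m.get(("v2", 0)) == "v4"
m.pop()                     # leaving the child discards the shadow
assert m.get(("v2", 0)) == "v3"
```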

We will need to do something about control flow join points to handle scenarios like this, where we need to insert new block parameters:

               block0(v0, v1):
                 ;; this allocation will get elided
                 v2 = gc_alloc(v0)
                 br_if v1, block1, block2
                /                    \
               /                      \
              /                        \
    block1:                             block2:
      v3 = call f()                       v4 = call g()
      ;; gc_store gets elided,            ;; gc_store gets elided,
      ;; current_fields[v2][0] = v3       ;; current_fields[v2][0] = v4
      gc_store v2[0], v3                  gc_store v2[0], v4
      jump block3                         jump block3
              \                        /
               \                      /
                \                    /
                 block3:
                   ;; gc_load gets elided, need
                   ;; to add a block param to
                   ;; block3 to get the current
                   ;; field
                   v5 = gc_load v2[0]
                   ;; v5 gets replaced with the
                   ;; block param
                   return v5
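
For concreteness, here is what that example might look like after the rewrite, assuming the allocation is fully elided (hypothetical CLIF; the exact shapes of these instructions are not settled):

               block0(v0, v1):
                 ;; gc_alloc elided entirely
                 br_if v1, block1, block2

               block1:
                 v3 = call f()
                 ;; pass the current field value
                 jump block3(v3)

               block2:
                 v4 = call g()
                 jump block3(v4)

               block3(v6):
                 ;; v6 is the inserted block param
                 ;; holding the current value of
                 ;; field 0
                 return v6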

Not totally sure how we want to do this yet, given that our existing code to do this stuff is in cranelift_frontend rather than cranelift_codegen (the Variables and SSA construction stuff). Might be okay just to move that stuff into cranelift_codegen and publicly export it to make it usable in both places?

view this post on Zulip Wasmtime GitHub notifications bot (Apr 24 2026 at 21:25):

cfallin commented on issue #9349:

Thanks for writing all of this up!

Regarding the pass order question and to write down here what we discussed already: yes, I think all of this is a good reason to pull safepoint spilling (in some form) back into the core compiler. The main simplification from my PoV that the earlier user stackmaps work brought (in addition to what you describe) is that it removed the load-bearingness of regalloc's and lowering's location tracking for GC correctness, and that would still be true here, so all is well.

I have some detailed comments on the escape analysis rules but on the whole, it sounds like a reasonable direction. To name another aspect explicitly, it's fairly interesting that it is a flow-sensitive analysis: typically I would imagine something more based on points-to analysis with heap abstractions, which would decide the escape status for each abstraction independent of program point. But I like the fact that we can tell where something escapes and box it only then (possibly e.g. on a rare exit path). Also: it's worth clarifying the aliasing invariants. Specifically, if we have e.g. p2 := p1 somewhere in the program (metaphorically at least -- say that p2 is offset into p1), and later we see that p1 escapes, we need to know that we escape p2 too, right?

My one uncertainty around your transfer-function definition: it's not clear to me how the "deep" escape works when a field is stored-to. Say that I allocate p1 = S1 { .x = ... }, p2 = S2 { .y = p1 }, p3 = S3 { .z = p2 }, ... and then eventually return or store p3 somewhere. We need to propagate all reachable objects to the escape-set, not just p3, right? This is where I guess heap abstractions would have come in handy. Interested to understand how this works in your analysis!

view this post on Zulip Wasmtime GitHub notifications bot (Apr 25 2026 at 09:25):

tschneidereit commented on issue #9349:

That said, they are only necessary when we are optimizing GC code. Anything that doesn't use GC can completely skip all that.

Does that include exception handling? And if so, how fine-grained can we be about it? Would Rust or C++ code that uses exception handling all have to be compiled with these new passes running as well, or only some or even none of it?

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2026 at 18:12):

fitzgen commented on issue #9349:

Does that include exception handling? And if so, how fine-grained can we be about it? Would Rust or C++ code that uses exception handling all have to be compiled with these new passes running as well, or only some or even none of it?

This is sort of hypothetical because it depends on what exactly we build.

The closest analog we have today is our detection of whether a module needs a GC heap, which happens via whether it dynamically requests a GcCompiler during Wasm-to-CLIF translation, which roughly corresponds to "was there any GC- or exn-related Wasm instruction". So if we reused that signal for enabling this GC-optimization pipeline, then yes it would include Rust w/ panic unwinds and C++ with exceptions (for just the functions that throw, effectively). It's possible that the number of functions this impacts is small enough that it doesn't actually slow things down in the grand scheme of things. But if it does, then we could do something like turn that bool into a count of how many times we needed a GcCompiler and then only trigger this pipeline if the ratio of that count over the function's byte size is high enough or something.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2026 at 18:17):

fitzgen commented on issue #9349:

My one uncertainty around your transfer-function definition: it's not clear to me how the "deep" escape works when a field is stored-to. Say that I allocate p1 = S1 { .x = ... }, p2 = S2 { .y = p1 }, p3 = S3 { .z = p2 }, ... and then eventually return or store p3 somewhere. We need to propagate _all_ reachable objects to the escape-set, not just p3, right? This is where I guess heap abstractions would have come in handy. Interested to understand how this works in your analysis!

That should be handled by the rule

// A value escapes if it is stored into a GC object that escapes.
escapes(S, (gc_store r[field], x)) =
    if r in S {
        S union {x}
    } else {
        S
    }

which cascades escaping from one GC object to another: if we are storing x into a field of r, and r escapes, then x must also escape.

(and a corresponding rule for the gc_alloc instruction, which I omitted; whether we need it depends on the design of gc_alloc: whether we allocate uninitialized and then fill in fields via subsequent gc_stores, or whether we provide initial field values to gc_alloc to eagerly initialize)

view this post on Zulip Wasmtime GitHub notifications bot (Apr 27 2026 at 18:34):

cfallin edited a comment on issue #9349:

which cascades escaping from one GC object to another

True, but in a flow-sensitive way, i.e. only if the object is known to escape at the current program point, right? So if we (i) store x into r, and then later escape r (add it to S), the rule that adds r to S doesn't also recurse into r and add all pointees (e.g. x) reachable from r.

(If S were flow-insensitive and maintained as a monotonically-ratcheting set globally, then a second pass would solve this issue, but it's not, as far as I can tell?)

view this post on Zulip Wasmtime GitHub notifications bot (Apr 28 2026 at 16:03):

fitzgen commented on issue #9349:

True, but in a flow-sensitive way, i.e. only if the object is known to escape _at the current program point_, right? So if we (i) store x into r, and _then_ later escape r (add it to S), the rule that adds r to S doesn't also recurse into r and add all pointees (e.g. x) reachable from r.

Okay, I see what you mean. For example given this:

block5:
  call fn0(v0)
  jump block6

block6:
  gc_store v1, v0+4
  return

This will

We could of course just also compute a global escapes set in addition to the escapes-{in,out} sets but that feels unsatisfying.

It feels like the fundamental problem is that we are trying to compute both the escape set and the first-escape block at the same time; perhaps it would be better to separate the two. Need to think about this some more.
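To make the failure concrete, here is the per-block backwards analysis (with the same simplified stand-in instruction shapes as before, not real CLIF) applied to the two blocks above; the flow-sensitive sets never mark v1, even though it is stored into an object that has already escaped:

```python
def block_escapes(escapes_out, insts):
    # One backwards pass of the transfer functions over a block.
    s = set(escapes_out)
    for inst in reversed(insts):
        op = inst[0]
        if op == "call":
            s.add(inst[1])
        elif op == "gc_store":
            r, x = inst[1], inst[2]
            if r in s:
                s.add(x)
    return s

block5 = [("call", "v0")]             # call fn0(v0)
block6 = [("gc_store", "v0", "v1")]   # gc_store v1, v0+4

# Backwards analysis: block6 first (it has no successors), then block5.
out6 = set()                       # nothing escapes after block6
in6 = block_escapes(out6, block6)  # v0 not in out6, so v1 is missed
in5 = block_escapes(in6, block5)   # v0 escapes here, but too late

assert "v1" not in in6             # the cascade never fires
assert in5 == {"v0"}
```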

view this post on Zulip Wasmtime GitHub notifications bot (Apr 29 2026 at 17:54):

fitzgen commented on issue #9349:

Chris and I just talked a bit about this and the takeaway was that the analysis really does need to be flow-insensitive (or else we must disallow SROA of nested objects; more on this later). We then considered two tweaks to make the analysis flow-insensitive:

  1. Make the lattice be Map<Value, Set<EscapeSite>>, where EscapeSite is a program point where the value escapes. Optionally each Set<EscapeSite> could contain only program points that do not dominate each other.

    This is answering the question "where are all the places this value escapes?" (Or, if the sets only contain program points that do not dominate each other, "where are the places where this value escapes for the first time?", which is exactly what we want for the SROA rewriting pass.)

    The SROA rewriting pass would take this information, eagerly SROA GC allocations, and then insert lazy GC allocations at each escape site that is not dominated by another escape site (this latter bit is easier here if we did more work in the initial static analysis, but it's a bit six of one, half a dozen of the other).

  2. Make the lattice be simple Set<Value> where inclusion in the set means the value escapes at some program point.

    This is answering the question "does this value ever escape?"

    The SROA rewriting pass would work similarly as in (1) but would need to additionally maintain a scoped set of values-that-have-escaped-thus-far as it walks the IR via the dom tree, (re)identifies instructions that cause values to escape, and inserts the lazy GC allocation on first escape.

The two (and a half) options are pretty similar; the difference is largely whether we shift the work of determining first escape sites from the static analysis to the rewrite pass that consumes the static analysis results, or vice versa. I don't think either choice has more inherent complexity than the other.

In terms of data structures required and the associated performance implications, I think (2) comes out on top, since it is just a flat set (which can be an EntitySet which is a compact bit set) and a scoped set (a couple good implementation options for this); the map of nested sets in (1) is going to involve more allocations (and many small allocations) and more pointer chasing.

Regarding SROA of nested objects: if we disallow this, and consider gc_store r[field], x to cause x to escape regardless of whether r escapes, then I believe we avoid the fault that Chris pointed out. But that would also limit the effectiveness of the analysis compared to the above options, and we would also end up with two Set<Value>s per basic block, so it seems worse than (2) perf-wise as well. Just worse all around, so probably not worth pursuing that variation any further.
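A toy version of option (2), again with simplified stand-in instruction shapes: iterate the same rules over the whole function with one monotonically growing set until fixpoint. Unlike the flow-sensitive per-block sets, this does catch the cascade from the earlier two-block example:

```python
def escape_fixpoint(insts):
    """Option (2): a single flow-insensitive Set<Value>. Re-run the
    rules over all instructions until the set stops growing."""
    escaped = set()
    changed = True
    while changed:
        changed = False
        for inst in insts:
            op = inst[0]
            if op == "call" and inst[1] not in escaped:
                escaped.add(inst[1])
                changed = True
            elif op == "gc_store":
                r, x = inst[1], inst[2]
                if r in escaped and x not in escaped:
                    escaped.add(x)
                    changed = True
    return escaped

# The store precedes the escape in program order, but the fixpoint
# still propagates: v0 escapes via the call, so v1 escapes via the
# store into v0.
insts = [
    ("gc_store", "v0", "v1"),  # gc_store v1, v0+4
    ("call", "v0"),            # call fn0(v0)
]
assert escape_fixpoint(insts) == {"v0", "v1"}
```

In Cranelift the flat set could be an EntitySet bitmap over Values, so each fixpoint pass stays cheap.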


Last updated: May 03 2026 at 22:13 UTC