fitzgen added the performance label to Issue #9349.
fitzgen added the cranelift:goal:optimize-speed label to Issue #9349.
fitzgen added the wasmtime:ref-types label to Issue #9349.
fitzgen opened issue #9349:
We could implement an escape analysis for Wasm GC types to figure out which allocations do not escape their stack frame. Then we can do unboxing/SROA, replacing a struct's fields or array elements with a set of local SSA values.
alexcrichton added the wasm-proposal:gc label to Issue #9349.
fitzgen commented on issue #9349:
Thought about this a bit more recently.
When optimizing Wasm GC I think we will want the following pass ordering:
- Wasm --> translator --> CLIF w/ `gc_{alloc,load,store}` intrinsics
- (optional) CLIF w/ GC intrinsics --> mid-end --> CLIF w/ GC intrinsics
- (optional) CLIF w/ GC intrinsics --> inliner --> CLIF w/ GC intrinsics
- CLIF w/ GC intrinsics --> escape analysis & SROA --> CLIF w/ GC intrinsics
  - finds GC objects that don't escape
  - does SROA to remove the allocation and replace their fields with SSA values
- CLIF w/ GC intrinsics --> GC legalization & safepoint spilling --> CLIF
  - replaces GC intrinsics with their expanded regular CLIF instructions
  - does the same GC safepoint spilling we do today
From there the pipeline continues as it does today:
- CLIF --> mid-end --> CLIF
- CLIF --> backend --> VCode
- VCode --> regalloc2 --> VCode & register allocations
- VCode & register allocations --> emit --> machine code
The reason we want `gc_{alloc,load,store}` intrinsics is so that we can reason about them, and elide them when possible, during escape analysis. That is much harder to do if we've already exploded them into their inlined CLIF code (e.g. bump allocation requires branching when we run out of capacity, GC loads and stores might require barriers which involve branching, etc.).

Note that we want to do escape analysis before safepoint spilling. We don't want to make everything "escape" when we spill it to the stack, and making the analysis understand when spilling to the stack does vs doesn't count as escaping is a bit hairy. It's also unnecessary so long as escape analysis happens before safepoint spilling.
Note that we want to do escape analysis after inlining. An object might escape before inlining, because it is passed as an argument to a function call, but after we inline that function call we see that it no longer escapes.
Those two previous notes already imply a different phase ordering from what we do today. Today, we do safepoint spilling at the end of CLIF construction and before inlining. In this hypothetical future, we would do CLIF translation, then inlining, then safepoint spilling.
What would the GC intrinsics be and where would they be defined? I see two main alternatives:
1. If we keep the GC safepoint spilling in `cranelift_frontend`, and outside of `cranelift_codegen`, then I guess these would be well-known `ir::ExtFunc` sentinels. This is a bit hacky, and also means we wouldn't be able to use the regular `filetest` infrastructure, at least without a bunch of additional work. I've already been feeling this pain a bit with the lack of filetests for GC safepoint spilling, and including whole CLIF function dumps in regular Rust unit tests. It is also not clear to me exactly how we would do inlining (which is part of `cranelift_codegen`) before GC safepoints (still part of `cranelift_frontend`). Also, we wouldn't be able to attach alias regions to the GC intrinsics.

2. We could move GC safepoint spilling back into `cranelift_codegen`, and define the escape analysis in `cranelift_codegen` as well, which would allow the GC intrinsics to be CLIF instructions. This would allow us to attach alias regions to them. It would allow us to use filetests. Inlining before GC safepoint spilling is not a crate dependency issue anymore. It does mean that we need to reject the GC instructions during lowering and report a "you should have legalized these away by now" error. I think that is a fine trade off personally.

I know that (2) may raise questions like "wasn't the whole point of user stack maps to move this stuff out of `cranelift_codegen`?" Sort of. The key bit was making it so that the mid-end's GVN and alias analysis and all that saw the GC safepoint spilling and wouldn't do optimizations that were invalid in the face of e.g. a moving GC. The proposed phase ordering above still preserves that property (if we did the optional mid-end optimizations before inlining, we would need to disallow GVNing GC intrinsics, as that would open back up the original can of worms we were trying to avoid by introducing user stack maps).

So yeah, I think (2) is the way to go here. And it is unfortunate that this is a bunch of new passes over the whole IR that we don't do today, but I think that is somewhat unavoidable. That said, they are only necessary when we are optimizing GC code. Anything that doesn't use GC can completely skip all that.
I also have thoughts on the details of the static analysis that we do for escape analysis, but I'll leave them for a follow up comment.
fitzgen commented on issue #9349:
Oh, also one other small wrinkle from (2) above: how to do GC legalization and expand the GC instructions into regular CLIF? In (2) we are doing this in `cranelift_codegen`, so we can't give a `cranelift_frontend::FunctionBuilder` to the user and tell them to just insert what they need directly. Idea: reuse our inlining infrastructure and have the user supply a whole `ir::Function` that we inline to replace the `gc_{alloc,load,store}`.
fitzgen commented on issue #9349:
Okay, here's some thoughts on the escape analysis itself, rather than the phase ordering.
(For simplicity, I'm imagining we won't bother doing SROA for arrays or any GC object that is dynamically indexed into, at least for quite a while.)
I don't think we need a sophisticated points-to analysis. I think we can get away with a simple backwards analysis where

- the lattice for each `ir::Value` is simply `top = escapes | bottom = does not escape`
- the lattice at a program point is then the power set of escaping `ir::Value`s, with set-union as the join function
- and therefore the lattice for the whole program is the map from program point to the set of escaping `ir::Value`s at that program point.

This is a backwards analysis because it propagates backwards from a value's uses.
Our transfer functions would look something like this:
```
// A value escapes when it is an argument to a function.
escapes(S, (call f(x))) = S union {x}

// A value escapes when it is stored to non-GC memory.
escapes(S, (store x, p)) = S union {x}

// A value escapes when it is returned from the function.
escapes(S, (return x)) = S union {x}

// A value escapes if it is a block arg to a block param that escapes.
escapes(S, (jump block(x))) = if block.params[0] in S { S union {x} } else { S }

// A value escapes if it is stored into a GC object that escapes.
escapes(S, (gc_store r[field], x)) = if r in S { S union {x} } else { S }

// Otherwise, an object escapes if it escapes in any successor.
escapes(S, x) = union( escapes(S, y) for y in successors(x) )
```

We would initialize `S` with our function parameters, since anything that came from outside this function has already escaped before we even began (and we can't elide allocations from this function's callers while optimizing this function anyways).

Note that we wouldn't actually compute this for every single instruction, but instead have `escapes_in` and `escapes_out` summary sets for each basic block, similar to what we do for liveness during safepoint spilling. Note that:
- An object escapes for the first time along that control-flow path when it appears in the `escapes_out` set but does not appear in the `escapes_in` set.
- Conversely, an object has already escaped when it is in both `escapes_out` and `escapes_in`.
- An object never escapes on this control-flow path if it is not in `escapes_out` (and it will therefore also not be in `escapes_in`).

Given that information, we can do SROA of GC objects by walking the CFG in reverse post-order (so that we visit defs before uses) and doing the following:
- When we define a GC object `v3 = gc_alloc(v0, v1, v2)`, we optimistically remove that allocation (even if `v3` escapes! we want to delay boxing until it actually escapes) and we set `current_fields[v3] = [v0, v1, v2]`.
- When we have `v7 = gc_load v5[field]` before `v5` escapes (if it ever does), then we remove the instruction and set `current_aliases[v7] = current_fields[v5][field]`.
- When we find uses of a value in `current_aliases`, we replace it with the associated alias value.
- When we have `gc_store v8[field], v6` before `v8` escapes (if it ever does), then we update `current_fields[v8][field] = v6`.
- When an object escapes for the first time, we insert the delayed `gc_alloc` just before that escaping operation.

The `current_fields` and `current_aliases` maps will need to be scoped hash maps.

We will need to do something about control-flow join points to handle scenarios like this, where we need to insert new block parameters:
```
block0(v0, v1):
  ;; this allocation will get elided
  v2 = gc_alloc(v0)
  br_if v1, block1, block2

             /      \
            /        \
           /          \

block1:                          block2:
  v3 = call f()                    v4 = call g()
  ;; gc_store gets elided,         ;; gc_store gets elided,
  ;; current_fields[v2][0] = v3    ;; current_fields[v2][0] = v4
  gc_store v2[0], v3               gc_store v2[0], v4
  jump block3                      jump block3

           \          /
            \        /
             \      /

block3:
  ;; gc_load gets elided, need
  ;; to add a block param to
  ;; block3 to get the current
  ;; field
  v5 = gc_load v2[0]
  ;; v5 gets replaced with the
  ;; block param
  return v5
```

Not totally sure how we want to do this yet, given that our existing code to do this stuff is in `cranelift_frontend` rather than `cranelift_codegen` (the `Variable`s and SSA construction stuff). Might be okay just to move that stuff into `cranelift_codegen` and publicly export it to make it usable in both places?
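For concreteness, here is a minimal, self-contained sketch of the backwards `escapes_in`/`escapes_out` fixed point described above. None of this is Cranelift's actual API: `Effect`, `BlockData`, and the integer `Value`/`Block` ids are invented stand-ins so the sketch can run on its own.

```rust
use std::collections::BTreeSet;

type Value = u32;
type Block = usize;

// Instruction effects that matter to the analysis, abstracted away from CLIF.
enum Effect {
    // `call f(x)`, `store x, p`, or `return x`: x escapes unconditionally.
    Escape(Value),
    // `gc_store r[field], x`: x escapes only if r escapes at this point.
    GcStore { object: Value, stored: Value },
}

struct BlockData {
    effects: Vec<Effect>,
    succs: Vec<Block>,
}

// Returns `(escapes_in, escapes_out)` for each block.
fn escape_analysis(
    blocks: &[BlockData],
    params: &[Value],
) -> Vec<(BTreeSet<Value>, BTreeSet<Value>)> {
    // Function parameters have already escaped before we even began.
    let init: BTreeSet<Value> = params.iter().copied().collect();
    let mut sets = vec![(init.clone(), init.clone()); blocks.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (b, data) in blocks.iter().enumerate() {
            // Backwards analysis: escapes_out is the join (set union) of the
            // successors' escapes_in sets.
            let mut out = init.clone();
            for &s in &data.succs {
                out.extend(sets[s].0.iter().copied());
            }
            // escapes_in: apply the transfer functions in reverse
            // instruction order, starting from escapes_out.
            let mut inn = out.clone();
            for eff in data.effects.iter().rev() {
                match eff {
                    Effect::Escape(x) => {
                        inn.insert(*x);
                    }
                    Effect::GcStore { object, stored } => {
                        if inn.contains(object) {
                            inn.insert(*stored);
                        }
                    }
                }
            }
            if sets[b] != (inn.clone(), out.clone()) {
                sets[b] = (inn, out);
                changed = true;
            }
        }
    }
    sets
}

fn main() {
    // block0: gc_store v10[f], v11 ; jump block1
    // block1: return v10           ;; v10 escapes here
    let blocks = vec![
        BlockData {
            effects: vec![Effect::GcStore { object: 10, stored: 11 }],
            succs: vec![1],
        },
        BlockData { effects: vec![Effect::Escape(10)], succs: vec![] },
    ];
    let sets = escape_analysis(&blocks, &[]);
    // v11 escapes in block0 because it is stored into v10, which is known
    // (at that program point) to escape later.
    assert!(sets[1].0.contains(&10));
    assert!(sets[0].0.contains(&11));
}
```

A real pass would presumably use compact `EntitySet` bit sets keyed by `ir::Value` and visit blocks in a CFG-derived order (e.g. post-order) for faster convergence, but the join and transfer structure would be the same.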
cfallin commented on issue #9349:
Thanks for writing all of this up!
Regarding the pass order question and to write down here what we discussed already: yes, I think all of this is a good reason to pull safepoint spilling (in some form) back into the core compiler. The main simplification from my PoV that the earlier user stackmaps work brought (in addition to what you describe) is that it removed the load-bearingness of regalloc's and lowering's location tracking for GC correctness, and that would still be true here, so all is well.
I have some detailed comments on the escape analysis rules but on the whole, it sounds like a reasonable direction. To name another aspect explicitly, it's fairly interesting that it is a flow-sensitive analysis: typically I would imagine something more based on points-to analysis with heap abstractions, which would decide the escape status for each abstraction independent of program point. But I like the fact that we can tell where something escapes and box it only then (possibly e.g. on a rare exit path).

Also: it's worth clarifying the aliasing invariants. Specifically, if we have e.g. `p2 := p1` somewhere in the program (metaphorically at least -- say that `p2` is offset into `p1`), and later we see that `p1` escapes, we need to know that we escape `p2` too, right?

My one uncertainty around your transfer-function definition: it's not clear to me how the "deep" escape works when a field is stored-to. Say that I allocate `p1 = S1 { .x = ... }`, `p2 = S2 { .y = p1 }`, `p3 = S3 { .z = p2 }`, ... and then eventually return or store `p3` somewhere. We need to propagate all reachable objects to the escape-set, not just `p3`, right? This is where I guess heap abstractions would have come in handy. Interested to understand how this works in your analysis!
tschneidereit commented on issue #9349:
That said, they are only necessary when we are optimizing GC code. Anything that doesn't use GC can completely skip all that.
Does that include exception handling? And if so, how fine-grained can we be about it? Would Rust or C++ code that uses exception handling all have to be compiled with these new passes running as well, or only some or even none of it?
fitzgen commented on issue #9349:
Does that include exception handling? And if so, how fine-grained can we be about it? Would Rust or C++ code that uses exception handling all have to be compiled with these new passes running as well, or only some or even none of it?
This is sort of hypothetical because it depends on what exactly we build.
The closest analog we have today is our detection of whether a module needs a GC heap, which happens via whether it dynamically requests a `GcCompiler` during Wasm-to-CLIF translation, which roughly corresponds to "was there any GC- or exn-related Wasm instruction". So if we reused that signal for enabling this GC-optimization pipeline, then yes, it would include Rust w/ panic unwinds and C++ with exceptions (for just the functions that throw, effectively). It's possible that the number of functions this impacts is small enough that it doesn't actually slow things down in the grand scheme of things. But if it does, then we could do something like turn that bool into a count of how many times we needed a `GcCompiler` and then only trigger this pipeline if the ratio of that count over the function's byte size is high enough or something.
fitzgen commented on issue #9349:
My one uncertainty around your transfer-function definition: it's not clear to me how the "deep" escape works when a field is stored-to. Say that I allocate `p1 = S1 { .x = ... }`, `p2 = S2 { .y = p1 }`, `p3 = S3 { .z = p2 }`, ... and then eventually return or store `p3` somewhere. We need to propagate _all_ reachable objects to the escape-set, not just `p3`, right? This is where I guess heap abstractions would have come in handy. Interested to understand how this works in your analysis!

That should be handled by the rule

```
// A value escapes if it is stored into a GC object that escapes.
escapes(S, (gc_store r[field], x)) = if r in S { S union {x} } else { S }
```

which cascades escaping from one GC object to another: if we are storing `x` into a field of `r`, and `r` escapes, then `x` must also escape.

(And a corresponding rule for the `gc_alloc` instruction, which I omitted but maybe we would need, depending on the design of the `gc_alloc` instruction and whether we allocate uninitialized and then fill in via subsequent `gc_store`s or provide initial field values to `gc_alloc` to eagerly initialize.)
cfallin commented on issue #9349:
which cascades escaping from one GC object to another
True, but in a flow-sensitive way, i.e. only if the object is known to escape at the current program point, right? So if we (i) store `x` into `r`, and then later escape `r` (add it to `S`), the rule that adds `r` to `S` doesn't also recurse into `r` and add all pointees (e.g. `x`) reachable from `r`.

(If `S` were flow-insensitive and maintained as a monotonically-ratcheting set globally, then a second pass would solve this issue, but it's not, as far as I can tell?)
fitzgen commented on issue #9349:
True, but in a flow-sensitive way, i.e. only if the object is known to escape _at the current program point_, right? So if we (i) store `x` into `r`, and _then_ later escape `r` (add it to `S`), the rule that adds `r` to `S` doesn't also recurse into `r` and add all pointees (e.g. `x`) reachable from `r`.

Okay, I see what you mean. For example, given this:
```
block5:
  call fn0(v0)
  jump block6

block6:
  gc_store v1, v0+4
  return
```

This will:
- Start with `block5.escapes_in = block5.escapes_out = block6.escapes_in = block6.escapes_out = {}`
- process `block6`:
  - `S = block6.escapes_out = {}`
  - process `gc_store v1, v0+4` and, because `v0` is not in `S`, will not add `v1` to `S`
  - `block6.escapes_in = S = {}`
- process `block5`:
  - `S = block5.escapes_out = union(b.escapes_in for b in block5.successors) = {}`
  - process `call fn0(v0)`, which adds `v0` to `S`
  - `block5.escapes_in = S = {v0}`
- and then we do that again and none of the `escapes_{in,out}` sets change, so we've hit our fixed point and stop, but we never made `v1` escape when we should have

We could of course just also compute a global escapes set in addition to the `escapes_{in,out}` sets, but that feels unsatisfying.
It feels like the fundamental problem is that we are trying to compute both the escape set and the first-escape block at the same time; perhaps it would be better to separate the two. Need to think about this some more.
fitzgen commented on issue #9349:
Chris and I just talked a bit about this, and the takeaway was that the analysis really does need to be flow-insensitive (or we must disallow SROA of nested objects; more on this later). We then considered two tweaks to make the analysis flow-insensitive:
1. Make the lattice be `Map<Value, Set<EscapeSite>>`, where `EscapeSite` is a program point where the value escapes. Optionally, each `Set<EscapeSite>` could contain only program points that do not dominate each other.

   This is answering the question "where are all the places this value escapes?" (Or, if the sets only contain program points that do not dominate each other, "where are the places where this value escapes for the first time?", which is exactly what we want for the SROA rewriting pass.)

   The SROA rewriting pass would take this information, eagerly SROA GC allocations, and then insert lazy GC allocations at each escape site that is not dominated by another escape site (this latter bit is easier here if we did more work in the initial static analysis, but it's a bit six-of-one-and-half-a-dozen-of-the-other).
2. Make the lattice be a simple `Set<Value>`, where inclusion in the set means the value escapes at some program point.

   This is answering the question "does this value ever escape?"

   The SROA rewriting pass would work similarly to (1) but would need to additionally maintain a scoped set of values-that-have-escaped-thus-far as it walks the IR via the dom tree, (re)identifies instructions that cause values to escape, and inserts the lazy GC allocation on first escape.
The two (and a half) options are pretty similar, and the differences are largely whether we shift the work of determining first escape sites from the static analysis to the rewrite pass that consumes the static analysis results or vice versa. I don't think either choice has more inherent complexity than the other.
In terms of data structures required and the associated performance implications, I think (2) comes out on top, since it is just a flat set (which can be an `EntitySet`, which is a compact bit set) and a scoped set (there are a couple of good implementation options for this); the map of nested sets in (1) is going to involve more allocations (many of them small) and more pointer chasing.

Regarding SROA of nested objects: if we disallow this, and consider `gc_store r[field], x` to cause `x` to escape regardless of whether `r` escapes, then I believe we avoid the fault that Chris pointed out. But that would also limit the effectiveness of the analysis compared to the above options, and we also end up with two `Set<Value>`s per basic block, so it seems worse than (2) perf-wise as well. Just worse all around, so probably not worth pursuing that variation any further.
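To illustrate option (2), here is a small self-contained sketch (with invented stand-in types, not Cranelift's API) of a flow-insensitive escape analysis over a flat set. Because the set only grows and we iterate the whole instruction list to a fixed point, a `gc_store` whose object is only discovered to escape later still cascades escaping to the stored value, avoiding the counterexample above.

```rust
use std::collections::BTreeSet;

type Value = u32;

// Instruction effects, abstracted; program-point information is deliberately
// dropped, since this analysis only answers "does this value ever escape?".
enum Inst {
    // `call f(x)` / `store x, p` / `return x`: x escapes unconditionally.
    Escape(Value),
    // `gc_store r[field], x`: x escapes iff r escapes anywhere.
    GcStore { object: Value, stored: Value },
}

fn escaping_values(insts: &[Inst], params: &[Value]) -> BTreeSet<Value> {
    // Function parameters have already escaped before we begin.
    let mut escaped: BTreeSet<Value> = params.iter().copied().collect();
    // The set ratchets monotonically, so iterating to a fixed point handles
    // a gc_store that appears before we learn that its object escapes.
    let mut changed = true;
    while changed {
        changed = false;
        for inst in insts {
            changed |= match inst {
                Inst::Escape(x) => escaped.insert(*x),
                Inst::GcStore { object, stored } => {
                    escaped.contains(object) && escaped.insert(*stored)
                }
            };
        }
    }
    escaped
}

fn main() {
    // The earlier counterexample, flattened. The gc_store is listed first on
    // purpose: the first sweep learns only that v0 escapes, and the second
    // sweep then cascades escaping to v1.
    let insts = [
        Inst::GcStore { object: 0, stored: 1 }, // gc_store v1, v0+4
        Inst::Escape(0),                        // call fn0(v0)
    ];
    let escaped = escaping_values(&insts, &[]);
    assert_eq!(escaped, BTreeSet::from([0, 1]));
}
```

The rewriting pass described in (2) would then consume this set while walking the dom tree, tracking where each escaping value first escapes so the lazy `gc_alloc` can be inserted there.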
Last updated: May 03 2026 at 22:13 UTC