fitzgen opened PR #8978 from fitzgen:user-stack-maps-and-moving-gc
to bytecodealliance:main
:
This refactors the way that user stack maps work so that they are compatible with moving GCs. We now keep every GC value in a stack slot, spill to the stack slot on the value's definition, and reload from the stack slot on every use. This means that we can generate a bunch of redundant loads, but we let the mid-end's alias analysis clean those up. This has the added benefit of shortening the live ranges of GC values, and it also means we don't need to proactively spill every single live GC value at each safepoint, because we know they are already spilled. That latter bit actually helps us avoid an accidentally quadratic issue with many long, overlapping live ranges, where we would do O(n^2) spills for a series of safepoints that keep creating more GC refs.

We also implement two further optimizations:
1. We lazily allocate slots for live GC values, which means that if a GC value is never live across a safepoint, then we never allocate a stack slot for it and never spill it to the stack. If we didn't do this, frames would be larger and we'd have a dead store to the stack slot, which would otherwise require the mid-end to grow a dead-store-elimination pass.
2. We keep a free list of available slots that will never be used again, and we reuse slots from there when possible. This means that a chain of non-overlapping live ranges, each of which still needs to appear in some safepoint's stack map, will all reuse the same stack slot, keeping frames from being bloated.
Finally, this commit also introduces some logs for the liveness analysis to ease future debugging.
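The lazy-allocation and free-list scheme described above can be sketched in a few lines of Rust. This is a hypothetical illustration, not the actual Cranelift implementation; `SlotAllocator` and its methods are invented names:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of lazily allocating stack slots for GC values,
/// with a free list so that non-overlapping live ranges share a slot.
struct SlotAllocator {
    /// Next fresh offset to carve a new slot from (frame size so far).
    next_offset: u32,
    /// Free slots that can never be live again, keyed by size in bytes.
    free: HashMap<u32, Vec<u32>>,
}

impl SlotAllocator {
    fn new() -> Self {
        SlotAllocator { next_offset: 0, free: HashMap::new() }
    }

    /// Called only when a GC value is actually live across a safepoint,
    /// so values that never cross a safepoint never get a slot at all.
    fn alloc(&mut self, size: u32) -> u32 {
        if let Some(offset) = self.free.get_mut(&size).and_then(|v| v.pop()) {
            return offset; // reuse a dead slot of the same size
        }
        let offset = self.next_offset;
        self.next_offset += size;
        offset
    }

    /// Return a slot to the free list once its value's live range ends.
    fn release(&mut self, size: u32, offset: u32) {
        self.free.entry(size).or_default().push(offset);
    }
}

fn main() {
    let mut slots = SlotAllocator::new();
    // A chain of non-overlapping live ranges reuses a single slot...
    let a = slots.alloc(8);
    slots.release(8, a);
    let b = slots.alloc(8);
    assert_eq!(a, b);
    // ...so the frame only ever grew by one 8-byte slot.
    assert_eq!(slots.next_offset, 8);
    println!("reused slot at offset {b}");
}
```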
fitzgen requested cfallin for a review on PR #8978.
fitzgen requested wasmtime-default-reviewers for a review on PR #8978.
fitzgen requested wasmtime-compiler-reviewers for a review on PR #8978.
cfallin submitted PR review:
I think this is correct, just some nits below -- I'm glad our idea to fall back to stackslots is working out!
I think we can land as-is if you prefer, but I do have some thoughts spawned from the handling of blockparams. My original thought when I suggested this was that (perhaps naively, but...) it would operate closer to the level of the frontend's "variables" that we use to represent Wasm locals, so that somehow we wouldn't need to handle blockparams -- since by construction, we insert blockparams only to carry the same variable from source block to dest block. I worry that inserting reloads and respills (if I've understood the above correctly) could still lead to unexpected worst-case behavior. (In the `vars_block_params_and_needs_stack_map` test example, `x` would have one stackslot, we would overwrite it in both `block1` and `block2`, and we wouldn't need loads at those blocks' tails and a re-store at `block3`.)

I think that if we have a closer correspondence between frontend variables (i.e. Wasm locals) and stackslots, we have a more understandable cost model from the Wasm producer's point of view as well: at least in baseline compilers, locals generally become actual stack slots, so producers try to minimize local count and reuse locals where possible. If we can carry that consolidation over to our IR, we have an understandable model of "one store per write to a Wasm local, one load per use of a Wasm local", which is great.

As things are now, I think we can still see higher-than-linear resulting code size from a linear number of reads/writes to refs: imagine a `br_table` that fans out to N blocks, the first (innermost) of which overwrites a reftyped Wasm local with an existing def. Then when we process `end` opcodes and generate N merge points in sequence, each merge point will have to reload and re-store the slot. Ideally we'd instead directly translate the update to the variable to a `stack_store` and the uses to `stack_load`s.

I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.

Thoughts?
cfallin created PR review comment:
doesn't make a semantic difference here but style nit: semicolon after call?
cfallin created PR review comment:
s/can clean these all up/should eventually be able to clean most of these up, when not needed/ (aspirational description is aspirational, hopefully we'll get there! but don't want to confuse in the meantime)
cfallin created PR review comment:
Could we have a comment here describing why we need to back up? Is it that we need to pass over the stack-stores we just inserted for blockparams above?
(And given that we do need this, is there any harm in making it unconditional, since it should be a no-op if we didn't insert anything?)
cfallin created PR review comment:
perhaps abstract out an analogous `process_use` lambda here to make things a bit clearer, as above for defs?
cfallin created PR review comment:
If it's not too bad to do, could we have an enum for stack-slot size rather than a bare
u32
? Something likeenum StackSlotSize { Slot4, Slot8 }
and perhaps.bytes()
and.align()
methods as needed; and aTryFrom<ir::Type>
impl maybe too...(To be clear, I'm suggesting this as a refactor wherever slot-sizes appear, the hashmap key here just prompted the thought)
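The suggestion above might look roughly like the following Rust sketch. This is hypothetical, not Cranelift's actual API; since `ir::Type` isn't available here, a `TryFrom<u32>` on the type's byte size stands in for the suggested `TryFrom<ir::Type>`:

```rust
/// Sketch of a closed enum of supported slot sizes instead of a bare u32.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum StackSlotSize {
    Slot4,
    Slot8,
}

impl StackSlotSize {
    fn bytes(self) -> u32 {
        match self {
            StackSlotSize::Slot4 => 4,
            StackSlotSize::Slot8 => 8,
        }
    }

    fn align(self) -> u32 {
        // Slots are naturally aligned in this sketch.
        self.bytes()
    }
}

/// Stand-in for `TryFrom<ir::Type>`: only sizes we have a slot for
/// convert; everything else is rejected at the boundary.
impl TryFrom<u32> for StackSlotSize {
    type Error = ();
    fn try_from(size: u32) -> Result<Self, ()> {
        match size {
            4 => Ok(StackSlotSize::Slot4),
            8 => Ok(StackSlotSize::Slot8),
            _ => Err(()),
        }
    }
}

fn main() {
    assert_eq!(StackSlotSize::try_from(8u32), Ok(StackSlotSize::Slot8));
    assert_eq!(StackSlotSize::Slot4.bytes(), 4);
    // A 64-byte type (e.g. f64x8) has no slot size defined here.
    assert!(StackSlotSize::try_from(64u32).is_err());
    println!("ok");
}
```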
fitzgen submitted PR review.
fitzgen created PR review comment:
We would actually need quite a few slot sizes, since we have types like `f64x8` in CLIF. I'm not fully convinced that we would actually benefit from such an `enum`, and even if we do want to go down that route, I'd want to do it in a separate PR.

I did waffle back and forth between a full hash map here vs. an array indexed by `log2(size)` like the old approach did all over the place. I opted to just go for the simplest, most-obvious implementation approach and figured we can come back and optimize if it is ever a perf bottleneck, but I suspect it really isn't.

So given all that: are you okay with keeping this as-is, for now at least?
cfallin submitted PR review.
cfallin created PR review comment:
Sure, it doesn't need to happen as part of this PR. (The reason for `TryFrom<ir::Type>` rather than `From` was exactly to address this -- I don't think we need to define a slot size for every type, only for the types that the `cranelift-frontend` user puts into stack maps -- and in general I find bare integers pretty error-prone; but happy to discuss further!)
fitzgen updated PR #8978.
fitzgen commented on PR #8978:
> operate closer to the level of the frontend's "variables"
I also thought about creating a new entity for these things, but I thought through how it would be used in `cranelift-wasm` and `wasmtime-cranelift`, and my conclusion was that it would be a bit of a mess. First, we aren't just dealing with Wasm locals, which are `Variable`s; we are also dealing with the results of calls and things like that, which produce `Value`s. So we have to have a solution for `Value`s either way. Second, `cranelift-wasm` doesn't make the decision whether certain types are managed by the GC or not; it is agnostic of runtime structures and implementation. `wasmtime-cranelift` makes those decisions based on its Wasmtime-specific knowledge. That means that `cranelift-wasm` would have to have some kind of `enum` of either a GC-managed `Variable` (or other new entity) or a plain `Value` and use that for its abstract operand stack, at which point essentially the whole crate is infected with this change and things are getting really hairy.

For this reason, we decided it was better, at least for now, to play nice with the existing system and its existing entities, rather than overhaul everything.
> I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.
Unfortunately this won't work, because the stack slots and liveness and all that are not created until we are finishing the function builder, long after calls to `builder.use_var`.

Trevor and I talked about another solution to this specific problem, though: if we maintain a backwards map from a `Value` to the `Variable` it is associated with (if any), then we can force all `Value`s for the same `Variable` to use the same stack slot. `Value`s that are not associated with a `Variable` would use the same approach we have in this PR. But this is work for the future.
fitzgen commented on PR #8978:
> I think we can land as-is if you prefer
I'm going to go ahead and enqueue this for merging, but I am still interested in your responses above and in the inline thread!
fitzgen merged PR #8978.
fitzgen submitted PR review.
fitzgen created PR review comment:
Did this in https://github.com/bytecodealliance/wasmtime/pull/8991
Last updated: Dec 23 2024 at 12:05 UTC