Stream: git-wasmtime

Topic: wasmtime / PR #8978 `cranelift-frontend`: Support moving ...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:21):

fitzgen opened PR #8978 from fitzgen:user-stack-maps-and-moving-gc to bytecodealliance:main:

This refactors the way that user stack maps work such that they are compatible with moving GCs. We now keep every GC value in a stack slot, spill to the stack slot on the value's definition, and reload from the stack slot on every use. This means that we can generate a bunch of redundant loads, but we let the mid-end's alias analysis clean that up. This has the added benefit of reducing the length of live ranges for GC values and also means we don't need to proactively spill every single live GC values at each safepoint, because we know they are already spilled. This latter bit actually helps us avoid an accidentally quadratic issue with many, long, overlapping live ranges where we would do O(n^2) spills for a series of safepoints that keep creating more GC refs.

We also implement two further optimizations:

  1. We lazily allocate slots for live GC values, which means that if a GC value is not ever live across a safepoint, then we do never allocate a stack slot for it and never spill it to the stack slot. If we didn't do this, then frames would be larger and we'd have a dead store to the stack slot that would otherwise require the mid-end to grow a dead-store-elimination pass.

  2. We keep a free list of available slots that will never be used again, and we reuse slots from here when possible. This means that a chain of non-overlapping live ranges, each of which still needs to appear in some safepoint's stack map, will all reuse the same stack slot, keeping frames from being bloated.

Finally, this commit also introduces some logs for the liveness analysis to ease future debugging.

<!--
Please make sure you include the following information:

Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.html

Please ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:21):

fitzgen requested cfallin for a review on PR #8978.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:21):

fitzgen requested wasmtime-default-reviewers for a review on PR #8978.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:21):

fitzgen requested wasmtime-compiler-reviewers for a review on PR #8978.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin submitted PR review:

I think this is correct, just some nits below -- I'm glad our idea to fall back to stackslots is working out!

I think we can land as-is if you prefer, but I do have some thoughts spawned from the handling of blockparams. My original thought when I suggested this was that (perhaps naively, but...) it would operate closer to the level of the frontend's "variables" that we use to represent Wasm locals, so that somehow we wouldn't need to handle blockparams -- since by construction, we insert blockparams only to carry the same variable from source block to dest block. I worry that inserting reloads and respills (if I've understood the above correctly) could lead to unexpected worst-case behavior still. (In the vars_block_params_and_needs_stack_map test example, x would have one stackslot, we would overwrite it in both block1 and block2, and we wouldn't need loads at those blocks' tails and a re-store at block3.)

I think that if we have a closer correspondence between frontend variables (i.e. Wasm locals) and stackslots, we have a more understandable cost-model from the Wasm producer's point of view as well: at least in baseline compilers, locals generally become actual stack slots, so producers try to minimize local count and reuse them where possible. If we can carry over that consolidation to our IR, we have an understandable model of "one store per write to a Wasm local, one load per use of a Wasm local" which is great.

As things are now I think we can still see higher-than-linear resulting code size from linear number of reads/writes to refs: imagine a br_table that fans out to N blocks, the first (innermost) of which overwrites a reftyped Wasm local with an existing def. Then when we process end opcodes and generate N merge-points in sequence, each merge point will have to reload and re-store the slot. Ideally we'd instead directly translate the update to the variable to a stack_store and the uses to stack_loads.

I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.

Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin submitted PR review:

I think this is correct, just some nits below -- I'm glad our idea to fall back to stackslots is working out!

I think we can land as-is if you prefer, but I do have some thoughts spawned from the handling of blockparams. My original thought when I suggested this was that (perhaps naively, but...) it would operate closer to the level of the frontend's "variables" that we use to represent Wasm locals, so that somehow we wouldn't need to handle blockparams -- since by construction, we insert blockparams only to carry the same variable from source block to dest block. I worry that inserting reloads and respills (if I've understood the above correctly) could lead to unexpected worst-case behavior still. (In the vars_block_params_and_needs_stack_map test example, x would have one stackslot, we would overwrite it in both block1 and block2, and we wouldn't need loads at those blocks' tails and a re-store at block3.)

I think that if we have a closer correspondence between frontend variables (i.e. Wasm locals) and stackslots, we have a more understandable cost-model from the Wasm producer's point of view as well: at least in baseline compilers, locals generally become actual stack slots, so producers try to minimize local count and reuse them where possible. If we can carry over that consolidation to our IR, we have an understandable model of "one store per write to a Wasm local, one load per use of a Wasm local" which is great.

As things are now I think we can still see higher-than-linear resulting code size from linear number of reads/writes to refs: imagine a br_table that fans out to N blocks, the first (innermost) of which overwrites a reftyped Wasm local with an existing def. Then when we process end opcodes and generate N merge-points in sequence, each merge point will have to reload and re-store the slot. Ideally we'd instead directly translate the update to the variable to a stack_store and the uses to stack_loads.

I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.

Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin created PR review comment:

doesn't make a semantic difference here but style nit: semicolon after call?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin created PR review comment:

s/can clean these all up/should eventually be able to clean most of these up, when not needed/ (aspirational description is aspirational, hopefully we'll get there! but don't want to confuse in the meantime)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin created PR review comment:

Could we have a comment here describing why we need to back up? Is it that we need to pass over the stack-stores we just inserted for blockparams above?

(And given that we do need this, is there any harm in making it unconditional, since it should be a no-op if we didn't insert anything?)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin created PR review comment:

perhaps abstract out to an analogous process_use lambda here to make things a bit clearer, as above for defs?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 18:52):

cfallin created PR review comment:

If it's not too bad to do, could we have an enum for stack-slot size rather than a bare u32? Something like enum StackSlotSize { Slot4, Slot8 } and perhaps .bytes() and .align() methods as needed; and a TryFrom<ir::Type> impl maybe too...

(To be clear, I'm suggesting this as a refactor wherever slot-sizes appear, the hashmap key here just prompted the thought)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 19:37):

fitzgen submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 19:37):

fitzgen created PR review comment:

We would actually need quite a few slot sizes, since we have types like f64x8 in CLIF. I'm not fully convinced that we would actually benefit from such an enum, and even if we do want to go down that route, I'd want to do it in a separate PR.

I did waffle back and forth between a full hash map here vs an array indexed by log2(size) like the old approach did all over the place. I opted to just go for the simplest, most-obvious implementation approach and figured we can come back and optimize if it is ever a perf bottleneck, but I suspect it really isn't.

So given all that: are you okay with keeping this as-is for now at least?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 19:41):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 19:41):

cfallin created PR review comment:

Sure, it doesn't need to happen as part of this PR. (The reason for TryFrom<ir::Type> rather than From was exactly to address this -- I don't think we need to define a slot size for every type, only for types that the cranelift-frontend user puts into stackmaps -- and in general I find bare integers pretty error-prone; but happy to discuss further!)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 20:15):

fitzgen updated PR #8978.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 20:27):

fitzgen commented on PR #8978:

operate closer to the level of the frontend's "variables"

I also thought about creating a new entity for these things but I thought through how it would be used in cranelift-wasm and wasmtime-cranelift and my conclusion was that it would be a bit of a mess. First, we aren't just dealing with wasm locals, which are Variables; we are also dealing with the results of calls and things like that, which produce Values. So we have to have a solution for Values either way. Second, cranelift-wasm doesn't make the decision whether certain types are managed by the GC or not, it is agnostic of runtime structures and implementation, wasmtime-cranelift makes those decisions based on its Wasmtime-specific knowledge. That means that cranelift-wasm would have to have some kind of enum of either a GC-managed Variable (or other new entity) or a plain Value and use that for its abstract operand stack, at which point the whole crate, essentially, is infected with this change and things are getting really hairy.

For this reason, we decided it was better, at least for now, to play nice with the existing system and its existing entities, rather than overhaul everything.

I think that might point the way to an implementation solution that also avoids the need for the extra pass: branch in the variable-to-SSA-value translation to an alternate "value just lives on stack" path.

Unfortunately this won't work because the stack slots and liveness and all that are not created until we are finishing the function builder, long after calls to builder.use_var.

Trevor and I talked about another solution for this specific problem though: if we maintain a backwards map from Value to the Variable that it is associated with (if any) then we can force all Values for the same Variable to use the same stack slot. For Values that are not associated with a Variable, they would use the same approach we have in this PR. But this is work for the future.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2024 at 20:49):

fitzgen commented on PR #8978:

I think we can land as-is if you prefer

I'm going to go ahead and enqueue this for merging, but I am still interested in your responses above and in the inline thread!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2024 at 17:46):

fitzgen merged PR #8978.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2024 at 19:46):

fitzgen submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2024 at 19:46):

fitzgen created PR review comment:

Did this in https://github.com/bytecodealliance/wasmtime/pull/8991


Last updated: Oct 23 2024 at 20:03 UTC