Stream: git-wasmtime

Topic: wasmtime / issue #10427 Cranelift: towards representing c...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift:area:clif label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift:discussion label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift:goal:optimize-speed label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift:mid-end label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen opened issue #10427:

Note: pie-in-the-sky ideas incoming. I'm not proposing we stop everything and do this immediately, just that we consider it for Cranelift's long-term evolution. Apologies if anything here isn't super clear; I didn't sleep well last night and I'm really tired.

In https://github.com/bytecodealliance/wasmtime/pull/10340, we fixed a bug where we performed code motion on a non-trapping, read-only load anywhere its explicit dependencies were satisfied. However, CLIF only represents data dependencies between instructions explicitly, not control or effect dependencies. It turns out the code motion was not safe because the load's safety and non-trapping-ness implicitly depended upon control flow and its block's location in the CFG: specifically, the load was in a block that was dominated by a br_if that performed a bounds check. Our optimizer could not see, and therefore did not abide by, that implicit constraint, and so performed "illegal" code motion leading to a miscompile. This bug could have been a very serious CVE, and basically was not, due to a combination of luck and the fact that it was lurking in less-exercised, off-by-default, not-yet-production-ready features that our fuzzers hadn't hammered on.
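
For concreteness, here is a minimal CLIF-style sketch of the shape of that bug (hypothetical values and blocks, not the actual reproducer from that PR). The load's only explicit dependencies are its address operands, so nothing in the IR itself stops the optimizer from hoisting it above the branch that guards it:

block0(v0: i64, v1: i32):  # v0 = base pointer, v1 = untrusted index
  v2 = icmp_imm ult v1, 4096  # bounds check
  brif v2, block1, block2

block1:  # dominated by the bounds check
  v3 = uextend.i64 v1
  v4 = iadd v0, v3
  v5 = load.i32 notrap readonly v4  # only the data dependency on v4 is explicit
  return v5

block2:
  trap heap_oob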

We addressed this issue with an initial stop-gap solution by adding a new can_move flag to ir::MemFlags, effectively conservatively disabling code motion for all memory operations by default, and requiring that a memory operation set this flag to opt into dependency-based code motion in the optimizer. I remain concerned that we have similar misoptimization bugs still lurking, that we annotated a load with can_move when it should not have been, or that we will make an innocuous change in the future that, from a distance, invalidates the soundness of a downstream instruction's can_move annotation. A more-principled solution would be to eliminate implicit dependencies and explicitly represent all dependencies in the IR itself. Taken to its limit, where you no longer have a control-flow graph of basic blocks and instead only have control dependency edges between instructions, this is the "sea of nodes" approach to IR design. I believe that, after some offline discussions and brainstorming with Chris and Alex, I may have found somewhat of a sweet spot for Cranelift that doesn't require fully boiling the ocean to completely redefine CLIF, transforming it beyond recognition (we can keep block arguments, for example, which feel very CLIF-y) into that full-blown sea-of-nodes limit, but nonetheless makes all dependencies explicit.
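
To give a flavor of what "all dependencies explicit" could look like (purely hypothetical syntax, loosely following the effect-value examples that appear later in this thread; exactly how a branch would define the effect its successor receives is left open here), the same guarded load could name its control dependency as an ordinary SSA operand, so hoisting it above the branch would violate plain dominance rules rather than an invisible constraint:

block0(v0: i64, v1: i32):
  v2 = icmp_imm ult v1, 4096
  brif v2, block1, block2
block1(v3: effect):  # hypothetical: "control reached here through the checked edge"
  v4 = uextend.i64 v1
  v5 = iadd v0, v4
  v6: i32 = load.i32 v3, v5+0  # the guard is now an explicit operand of the load
  return v6
block2:
  trap heap_oob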

Benefits of Explicit Dependencies

First, let me enumerate the benefits we can reap by making all of an IR instruction's dependencies explicit:

Sketch of the CLIF Changes

With that out of the way, here is a sketch of the changes I am imagining:

More Implications

And here are some more random things that fall out from this design:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 19 2025 at 23:56):

fitzgen added the cranelift:moonshot label to Issue #10427.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 25 2025 at 18:13):

bjorn3 commented on issue #10427:

V8 is moving the exact opposite direction as sea-of-nodes was found to have several disadvantages in practice for their use case at least. Not all of them would apply to Cranelift, but it may still be informative. https://v8.dev/blog/leaving-the-sea-of-nodes

view this post on Zulip Wasmtime GitHub notifications bot (Mar 25 2025 at 21:50):

fitzgen commented on issue #10427:

Thanks for the link. I'm glad they put their rationale into writing, because it was all tribal knowledge and rumors before this.

I'm going to go through their list of downsides and respond to each of them in the context of CLIF and what I wrote up above. All this should be taken with some amount of salt though, because they've obviously had _much_ more hands-on experience with effect dependencies in their IR than I have!

Manually/visually inspecting and understanding a Sea of Nodes graph is hard

I think that what I sketched above can largely sidestep this issue because we would still organize things into blocks, and effectively always have an instruction schedule, rather than a big soup of nodes. We would still have a familiar text format, etc...

Basically, this issue is proposing: "Let's add the control/effect dependencies, but not remove basic blocks. Then our goal is to do as much aggressive code motion in our existing CFG and basic blocks as we can while satisfying dependencies, rather than to reconstruct a CFG and schedule instructions in blocks from scratch." I think they are ultimately equivalent in theory, but I suspect that the former will be easier than the latter in practice (we basically already do the former during egraph elaboration).

Too many nodes are on the effect chain and/or have a control input

It is quite possible this would still be an issue.

The source program (Wasm or rustc MIR) is going to constrain things somewhat. Some programs just don't have as much wiggle room as others. This is the nature of compiler optimization in general though. Many programs might not benefit from a particular analysis, but the ones that do really benefit. The trick is finding the best bang for your buck and complexity budget.

It is worth mentioning that the source-to-CLIF frontend could also make things worse by unnecessarily introducing dependencies that aren't strictly needed (for example, by linearizing all effects during CLIF construction, even when they could have been a partial ordering) which would constrain the optimizer and give it less freedom for code motion.

Memory operations do not float easily

Looking at their example, it seems like the difficulty is somewhat incidental and not fundamental.

They have effect dependencies between their two loads, but neither load actually produces an effect AFAICT, and so they should both instead depend on start, not on each other. With that change, they should each be able to float to either side of the if, since all their dependencies will be satisfied at that point. It seems like this might be an instance of the issue I was just mentioning where the frontend introduces unnecessary dependencies. They talk about doing analyses to determine when they can relax dependencies and how this is no harder on a CFG than SoN. I don't see any reason that is false, but would prefer to avoid these analyses completely and instead invest in better dependency construction in the frontend.
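
In the encoding sketched in the OP (read-only, non-trapping loads consume an effect value but do not define one), their example would look something like this hypothetical snippet: both loads depend only on the state at entry, not on each other, so either one is free to float to either side of the if.

block0(v0: effect, v1: i64, v2: i64):
  v3: i32 = load.i32 v0, v1+0  # depends only on the incoming effect v0
  v4: i32 = load.i32 v0, v2+0  # likewise; not ordered after v3
  ...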

SoN is just CFG where pure nodes are floating

This is a really interesting/revealing statement because Cranelift already has "CFG where pure nodes are floating" and that is our egraph and side-effectful skeleton! And what I wrote up above is very much not just that, because we would not impose a total ordering on anything that produces or consumes side effects: these operations would be partially ordered, and this is exactly the property that allows for many different schedulings / more aggressive code motion and ultimately better optimization. I think a key part of avoiding a single effect chain with a total ordering would be the bits I wrote up about generalizing our alias regions to effect dependencies. If you linearize all memory operations into a total order, that is equivalent to using a single alias region, and we know that we get tons of benefit from using multiple alias regions already today.
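
As a hypothetical sketch of that point: if each alias region gets its own effect value, stores to disjoint regions neither consume nor define each other's state, so they stay unordered with respect to one another, whereas threading everything through a single effect value would put them back into one total order.

block0(v0: effect, v1: effect, v2: i64, v3: i64, v4: i32):
  # v0 = state of the "heap" region, v1 = state of some other region
  v5: effect = store.i32 v0, v4, v2+0  # touches only the heap region
  v6: effect = store.i32 v1, v4, v3+0  # touches only the other region
  # neither store depends on the other, so either may be scheduled first
  ...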

Managing the effect and control chains manually is hard

I believe it. Emitting partially-ordered effects puts even more of this kind of management responsibility on the frontend.

This doesn't, by itself, mean the whole endeavor isn't worthwhile.

The scheduler is too complex

I believe it.

But, as mentioned before, I think we can avoid that complexity by doing code motion on an existing schedule (something we already do!) rather than creating a new schedule from scratch. They use partial redundancy as an example: our egraph elaboration already avoids introducing partial redundancy and I don't see that getting any harder in this new world.

Finding a good order to visit the graph is difficult

Again, we can avoid this issue since we would retain a CFG and a schedule of instructions.

Cache unfriendliness

I'm skeptical of how generalizable these numbers are. A lot of this seems tied up in the exact data structures used for the IR.

That said, CFGs and basic blocks have implicit ordering and dependencies between instructions, while SoN represents them with explicit edges. Those explicit edges need to have some kind of runtime representation, which will lead to larger in-memory structures and therefore more cache misses. So I believe it, but also think there is a lot of performance engineering that could be done to mitigate the effects. It seems like the potential code improvement could be worth the compile time overhead, especially for final-tier compilers.

Sea of Nodes destroys any prior scheduling, by construction

We already destroy any existing schedule with our egraph mid-end, so introducing effect dependencies doesn't change that for us. :shrug:

view this post on Zulip Wasmtime GitHub notifications bot (Mar 26 2025 at 18:56):

cfallin commented on issue #10427:

I should add some thoughts here too -- I found the V8 post interesting, and obviously did a lot of thinking about this when initially researching and implementing our aegraphs mid-end (which has many parallels re: "free-floating nodes", as Nick mentions), so I certainly have some Opinions.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 29 2025 at 08:13):

hanna-kruppe commented on issue #10427:

There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a uniform representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like can_move, readonly, notrap, and so on.
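
Roughly, and with made-up names rather than B3's actual notation, the shape of the idea is: every instruction carries a read set and a write set over a small universe of abstract effects, control included, and "may A move past B?" becomes a query over those sets (allowed as long as neither instruction writes something the other reads or writes).

  v2 = icmp_imm ult v1, 4096  # reads {}               writes {}
  brif v2, block1, block2     # reads {}               writes {Control}
  v5 = load.i32 v4            # reads {Heap, Control}  writes {}
  store v6, v7                # reads {Control}        writes {Heap}
  call fn0(v8)                # reads {World}          writes {World}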

It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for cranelift-frontend to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

In terms of the IR that optimization passes work on, only knowing read/write of each instruction is of course less convenient for doing aggressive code motion. You can easily tell which instructions have no effects at all, but moving an effectful instruction from program point A to program point B would be absurdly tricky with raw pairwise "can this instruction be swapped with a predecessor/successor" queries. I don't know if and how B3 solves it, but in my mind this is similar to def-use chains for data flow. You can build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you don't have to. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a clear ground truth for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 18:41):

fitzgen commented on issue #10427:

You _can_ build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you _don't have to_. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a _clear ground truth_ for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.

Agreed: there are multiple viable approaches to representing effects and control in an IR. We just have to try and figure out what is the best fit for Cranelift, and what is realistic given our engineering resources.

There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a _uniform_ representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like can_move, readonly, notrap, and so on.

I see this approach as, roughly, extending our existing alias regions infrastructure from annotating only memory operations to any instruction. I do believe this would be an improvement over what we have today. That said, fully removing our ad-hoc flags would require roughly the same kind and amount of work as adding full effect edges to CLIF, and I don't think it would solve the class of bug that motivated my thinking on this topic (a memory operation being moved past its safety checks) as effectively as adding effect dependencies would.

On the one hand, we could make conditional branches clobber all alias regions, but this is overly conservative and would unnecessarily restrict code motion. We would end up generating worse code than we are able to today. We wouldn't be able to lift a Wasm linear memory's base pointer out of a loop anymore, for example.
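
Concretely, the kind of code this would pessimize is roughly today's CLIF for a Wasm memory access in a loop (a sketch, not real compiler output, and assuming the new can_move flag shows up alongside the other MemFlags): the base-pointer load is read-only and movable, so it currently gets hoisted out of the loop, but if the loop's branches clobbered every alias region it would be stuck inside.

block1(v2: i32):  # loop header; v0 = vmctx, defined in the entry block
  v3 = load.i64 notrap aligned readonly can_move v0+8  # linear-memory base pointer
  v4 = uextend.i64 v2
  v5 = iadd v3, v4
  v6 = load.i32 heap v5
  v7 = iadd_imm v2, 4
  v8 = icmp_imm ult v7, 4096
  brif v8, block1(v7), block2  # if this clobbered every region, v3 could not be hoisted above it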

Alternatively, we could keep track of exactly which operations depend on that conditional branch, and make the branch clobber only those alias regions. This is equivalent to the work necessary to track and manage effect dependencies; however, it has a few downsides:

  1. It is a little weird since you are not adding a dependency to the instruction that relies on that branch having happened, but instead going back to that branch that was already emitted and mutating its alias region clobber set after the fact. If some optimization pass later removes the instruction that added the branch's clobber for a particular alias region, the pass must either do some kind of analysis to see if any instruction still depends on that branch before it can safely remove the clobber, or it must leave the clobber in place, which adds unnecessary constraints and potentially pessimizes our generated code.

  2. It results in each alias region having a linear side effect chain, rather than an arbitrary partial ordering. This, again, imposes artificial constraints on the optimizer.

  3. While it does unify side-effects and control at the IR level, it does not unify with pure data dependencies, which means that we don't ultimately unify and simplify dependency management in our compiler/optimizer: we have to consider an instruction's pure data-flow dependencies, and we must also infer its side-effectful dependencies from which alias regions it uses and which predecessors clobbered those regions.

In particular, I think (3) is the biggest downside of annotating any instruction with alias regions -- as well as our current state of things -- and conversely is the biggest potential benefit of adding effect dependencies. We aren't a large team and we don't have the maintainer-power to write and maintain more and more optimization passes and static analyses, brute-forcing our way to good codegen the traditional way. Therefore, we must pursue techniques that we believe can give us outsized performance-bang for their complexity-buck. With effect dependencies, I see the potential to address a scary class of bugs in a principled manner, simplify our internals, and allow for even more aggressive optimization than we can do today, all at once.

It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for cranelift-frontend to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

Yeah, there is definitely work we should be able to do in cranelift-frontend to help improve things on this front.

Off the top of my head, a good approach for frontends to start with would be to

This approach would be equivalent to, and no more difficult than, annotating every instruction with the alias regions it reads and clobbers in the frontend. We could wrap this pattern up in helper methods in cranelift-frontend.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 21:16):

hanna-kruppe commented on issue #10427:

Reusing the Variable infrastructure for "this instruction reads/writes these heap alias regions" is a very neat unifying idea, and a clear advantage of treating them as just Values. I've started thinking about the implications for control dependencies and other effects that aren't just heap alias regions (e.g., potential trapping). But now I've convinced myself there's something missing even from the description of heap alias regions! The OP said:

Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:

But in this case, we might have a snippet like this: a load from wasm memory followed by a store to the same memory, with a constant address that's guaranteed in-bounds for the wasm memory type (v2 is the base address of that memory).

block1(v0: effect, v1: i8, v2: i64):
  v3: i8 = load.i8 v0, v2+123
  v4: effect = store.i8 v0, v1, v2+123
  ...

What prevents incorrectly moving the store before the load? Certainly not the SSA "defs dominate uses" requirement. This example could be fixed by making everything that reads some effect/alias region also produce a new effect value. But this flies in the face of the intuition I've tried to apply to these effect values, and needs some new tweak to make sure successive read-only operations aren't forced into a total order again by the effect chain.

I suppose this gestures at the question in the back of my mind all this time: why don't the various SoN-style and (R)VSDG-style IRs do what's proposed in this issue? Why do they have separate kinds of edges for effects and for data flow?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 21:26):

cfallin commented on issue #10427:

What prevents incorrectly moving the store before the load?

Ah, this is really interesting! In one sense, the memory-state values are linear (or affine): the store consumes the old memory/effect state (or clobbers it), so it is no longer available for the load to use.

I see several ways to encode this into the IR:

I personally favor the last one, for simplicity reasons, though it is the most verbose encoding (hence the least space-efficient) if we have n readers for every writer, for example.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 21:53):

fitzgen commented on issue #10427:

What prevents incorrectly moving the store before the load?

Thanks for whipping up this example, it is indeed an interesting one.


I had originally been imagining effect values as affine, but at some point Chris convinced me we shouldn't do that. Additionally, it gets messy when we want to fence on a bunch of effects, so we merge them together, do an operation depending on that merged effect, and then unpack them again from the resulting effect value after the operation (e.g. fanning in all effects to a call and then fanning out to recover the new version of each logical alias region). Agreed with Chris that it might have weird implications on various transforms and that we would need a bunch of annoying/specialized CLIF validation rules.
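
For reference, the fan-in/fan-out pattern being described would look something like the following (combine_effects and region_effect are hypothetical instruction names, purely for illustration); under an affine reading it is unclear which of these uses would "consume" v0-v2, which is part of why this gets messy:

  # v0 = heap region state, v1 = table region state, v2 = control
  v10: effect = combine_effects v0, v1, v2  # fan in everything the call fences on
  v11: i32, v12: effect = call fn0(v3, v10)  # v3 = some ordinary argument
  v13: effect = region_effect.heap v12   # fan out: heap state after the call
  v14: effect = region_effect.table v12  # fan out: table state after the call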

So to allow the encoding of dependencies we want purely in SSA values, where the store depends on the load, I think we need all instructions that read from the outside world, even if they don't mutate it, to return new effect values. Then the store can depend on the load's effect, and the desired ordering is encoded in the regular value dependencies again.

block1(v0: effect, v1: i8, v2: i64):
  v3: i8, v4: effect = load.i8 v0, v2+123
  v5: effect = store.i8 v4, v1, v2+123
  ...

I believe this is the third approach Chris enumerated, but making explicit our need for, e.g., loads to return a new effect value.

The anti-def idea is an interesting one, but again it is pretty bespoke and separate from the rest of our SSA infrastructure, which is in opposition to some of our goals here and makes me disinclined to pursue it further.

I would lean towards making any operation that reads from the outside world also return a new effect value.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 22:11):

hanna-kruppe commented on issue #10427:

Regarding affine/linear values, note that there would have to be a notion of a non-consuming use as well for operations which are read-only w.r.t. that effect. Otherwise everything related to the same alias region is forced into a single linear chain, which makes the whole thing much less appealing.

But then this raises interesting questions for merging effects - if a load depends on both control and the last write to a wasm memory, and these are merged with a combine_effects operation, is that a consuming use of the input effects? If yes, you have to artificially introduce order between operations that depend on intersecting but non-identical sets of effects. If not, this is a way to duplicate effect values and must be accounted for when constructing the consuming use(s) of later writes.

(There were already very good reasons given to not go this route, but it seems there’s even more complexity hiding there.)
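
Concretely, the situation in question is something like this (hypothetical syntax as above): two reads whose effect sets intersect but are not identical, each combining control with a different memory effect, so whichever answer we pick for combine_effects has the consequences described above.

block1(v0: effect, v1: effect, v2: effect, v10: i64, v11: i64):
  # v0 = control, v1 = wasm memory state, v2 = some other region's state
  v3: effect = combine_effects v0, v1
  v4: i32 = load.i32 v3, v10+0  # depends on control + wasm memory
  v5: effect = combine_effects v0, v2
  v6: i32 = load.i32 v5, v11+0  # depends on control + the other region
  ...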

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 22:30):

hanna-kruppe commented on issue #10427:

Hmm, while affine effect values lead to two different kinds of uses, an encoding into regular SSA def-use relationships may also need a similar distinction for defs: if you want to DCE a non-trapping load, you need to know that removing its effect result value (or replacing it with the effect inputs to the load?) is valid. In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or, in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.
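
As a sketch of that wrinkle (hypothetical syntax as before): deleting the dead load means rewriting uses of its effect result to its effect input, which is only known to be sound because the load's effect output was "pass through" rather than an actual change.

block1(v0: effect, v1: i64, v2: i8):
  v3: i8, v4: effect = load.i8 v0, v1+0  # v3 is dead, but v4 is still used below
  v5: effect = store.i8 v4, v2, v1+0
  ...

After DCE of the load, the store's use of v4 has to be rewritten to v0:

block1(v0: effect, v1: i64, v2: i8):
  v5: effect = store.i8 v0, v2, v1+0
  ...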

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 22:43):

fitzgen commented on issue #10427:

In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.

I don't think this is actually too bad: we can annotate these instructions in the cranelift-meta crate's instruction definitions/builders to automatically generate predicate methods for use in cranelift-codegen, similar to our side_effects_idempotent flag, for example.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 22:52):

hanna-kruppe commented on issue #10427:

How do you distinguish loads that may trap from those that don’t, if both kinds of load take and produce effect values?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 31 2025 at 23:00):

fitzgen commented on issue #10427:

We will still need to annotate loads with the ir::TrapCode of their trap, so we can use that.


Last updated: Apr 18 2025 at 00:13 UTC