fitzgen added the cranelift:area:clif label to Issue #10427.
fitzgen added the cranelift label to Issue #10427.
fitzgen added the cranelift:discussion label to Issue #10427.
fitzgen added the cranelift:goal:optimize-speed label to Issue #10427.
fitzgen added the cranelift:mid-end label to Issue #10427.
fitzgen opened issue #10427:
Note: pie-in-the-sky ideas incoming. I'm not proposing we stop everything and do this immediately, just that we consider it for Cranelift's long-term evolution. Apologies if anything here isn't super clear, I didn't sleep well last night and I'm really tired.
In https://github.com/bytecodealliance/wasmtime/pull/10340, we fixed a bug where we were doing code motion of a non-trapping and read-only `load` operation such that its explicit dependencies were satisfied. However, CLIF only represents data dependencies between instructions explicitly, not control or effect dependencies. It turns out code motion was not safe because the `load`'s safety and non-trapping-ness implicitly depended upon control flow and its block's location in the CFG: specifically, the `load` was in a block that was dominated by a `br_if` that performed a bounds check.
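Schematically, the pattern looked something like this (a hypothetical sketch; block names, value names, and constants are invented for illustration):

```asm
block0(v0: i64, v1: i32):
    v2 = icmp_imm uge v1, 0x1000      ;; bounds check
    br_if v2, block_trap, block1

block1:
    ;; safe only because block1 is dominated by the bounds check above;
    ;; nothing in this load's operands records that dependency, so
    ;; dependency-driven code motion could hoist it above the br_if
    v3 = load.i32 v0+8
    ...
```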
Our optimizer could not see, and therefore did not abide by, that implicit constraint, and so performed "illegal" code motion leading to a miscompile. This bug could have been a very serious CVE, and basically was not, due to a combination of luck and the fact that it was lurking in less-exercised, off-by-default, not-yet-production-ready features that our fuzzers hadn't hammered on.

We addressed this issue with an initial stop-gap solution by adding a new `can_move` flag to `ir::MemFlags`, effectively conservatively disabling code motion for all memory operations by default, and requiring that a memory operation set this flag to opt into dependency-based code motion in the optimizer. I remain concerned that we have similar misoptimization bugs still lurking, that we annotated a load with `can_move` when it should not have been, or that we will make an innocuous change in the future that, from a distance, invalidates the soundness of a downstream instruction's `can_move` annotation.

A more-principled solution would be to eliminate implicit dependencies and explicitly represent all dependencies in the IR itself. Taken to its limit, where you no longer have a control-flow graph of basic blocks and instead only have control dependency edges between instructions, this is the "sea of nodes" approach to IR design. I believe that, after some offline discussions and brainstorming with Chris and Alex, I may have found somewhat of a sweet spot for Cranelift: one that doesn't require fully boiling the ocean to redefine CLIF and transform it beyond recognition into that full-blown sea-of-nodes limit (we can keep block arguments, for example, which feel very CLIF-y), but that nonetheless makes all dependencies explicit.

Benefits of Explicit Dependencies
First, let me enumerate the benefits we can reap by making all of an IR instruction's dependencies explicit:
We eliminate the whole class of misoptimization bugs that occur because the optimizer fails to satisfy constraints that it isn't privy to. Implicit constraints that the optimizer is ignorant of no longer exist.
We do not need to conservatively deny optimizations by default and require opting into, e.g., code motion. Our defaults can be to optimize exactly as much as the dependencies allow and no more than that. Optimization is more uniform. CLIF frontends no longer must consider whether a given memory operation is hot enough that it is worth opting into more aggressive optimizations for it and, if so, taking the time to verify whether doing so would even be sound.
Implementing the optimizer becomes easier. All IR passes and transformations must already take care to schedule instructions such that their dependencies are satisfied (i.e. that an instruction is dominated by the defs of any values it uses). However, right now we additionally encode an alias region in memory operations, and we must do an alias analysis in order to soundly perform redundant load elimination and store-to-load forwarding optimizations; we also must perform a state-coloring analysis during CLIF-to-MachInst lowering to determine whether fusing a particular memory operation into an instruction's operand (on architectures like `x86_64` where many instructions support both register and memory operands) is sound. By making all dependencies explicit, we can remove alias regions from memory operations and vastly simplify the alias and state-coloring analyses (to the point that, I believe, we would barely even consider them full-blown analyses anymore).

Explicit side-effect dependencies generalize our small number of fixed alias regions into any number of logical alias regions: just depend on this logical alias region's most-recent side effect.
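For example, using the effect-operand instruction forms sketched later in this issue (a hypothetical example; all names are invented), two logical alias regions become two independent effect chains:

```asm
;; one effect chain per logical alias region
effect_heap1 = store.i32 effect_heap0, v1, v10+0     ;; region: linear memory
effect_table1 = store.i32 effect_table0, v2, v11+0   ;; region: a table
;; this load names only the linear-memory chain, so it may be scheduled
;; before or after the table store without any alias analysis
v3 = load.i32 effect_heap1, v10+8
```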
We can more easily implement new, more-aggressive optimizations, such as branch-folding and optimizing side-effectful instructions.
Our current architecture makes it difficult to optimize side-effectful and control-flow instructions in the egraph because the egraph works by rewriting values and organizing values into eclasses; side-effects and control-flow are not represented by values but are instead implicit in their associated instruction. The optimizer lifts everything it can into the egraph and leaves behind the "skeleton" of side-effectful and control-flow instructions in the CFG and its basic blocks. The operands of this skeleton's instructions act to root demanded computations inside the egraph from the outside.
If we represent control-flow and side-effect dependencies as compile-time-only values, then we can rewrite and optimize these values in the egraph. We don't demand computations from the egraph via a side-effectful skeleton, but instead with a set of control-flow and side-effect value dependencies.
We can additionally do things like: if we `call` a function that is known not to modify the `vmctx`, then we can reuse an already-loaded value from the `vmctx` across that `call`.
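Concretely, that might look something like this (a hypothetical sketch using the effect-value forms described below; the callee and all names are invented, and the `call` is assumed not to produce a new effect for the `vmctx`'s logical region):

```asm
v1 = load.i64 effect_vmctx0, v0+16      ;; load a field out of the vmctx
effect_call1 = call effect_call0, %does_not_touch_vmctx(v1)
;; still depends on effect_vmctx0, not on the call's effect, so GVN can
;; simply reuse v1 here rather than re-loading from the vmctx
v2 = load.i64 effect_vmctx0, v0+16
```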
Sketch of the CLIF Changes
With that out of the way, here is a sketch of the changes I am imagining:
We introduce a new `ir::types::Effect`. Values of this type are used to represent side-effect and control dependencies of an instruction at compile time (control flow being a type of side effect, if you squint hard enough). They are not materialized at run time.

The function's entry block's first argument will be an effect value. This represents the control flow and state of the world (e.g. memory) upon entering the function.
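In the text format, the entry block of a function might then look something like this (a hypothetical sketch; whether the effect parameter also appears in the function signature is left open here):

```asm
function %f(i64, i32) -> i32 {
block0(v0: effect, v1: i64, v2: i32):
    ;; v0 represents "everything that happened before this function was
    ;; entered"; memory operations below chain off of it
    ...
}
```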
Instructions that mutate external state, like stores, will both take and produce an effect value:
```asm
effect1 = store.i32 effect0, value, pointer+offset
```
This means that the side-effect that `effect0` represents must have already happened before we perform this `store`, and any instruction (such as a `load`) that might depend on this store having been performed can take `effect1` (or another effect value derived from `effect1`) as an operand.

Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:
```asm
val = load.i32 effect, pointer+offset
```
This means that the side-effect that `effect` represents must have already happened before we perform this `load`.

Similarly, instructions which may trap take and produce effect values:
```asm
effect1 = trapnz effect0, condition, trap_code
```
Unconditional jumps take an effect value operand. Because effect values are just `ir::Value`s, they may also, but are not required to, pass effect values as arguments to the target block.
```asm
jump effect, block(v0, v1, ...)
```
Conditional branches take an effect value operand and additionally define a unique effect value result on each of their outgoing control edges. This value-definition-on-control-edge is similar to how `try_call` instructions define values on their outgoing control edges, and we can use the same technique proposed there to pass these values to their target blocks: by syntactically passing the `ctrl` placeholder as an argument to the block call, rather than an already-defined value.

```asm
block0(v0: effect, v1: i8, v2: i32, v3: i32):
    ...
    br_if v0, v1, block1(ctrl, v2), block2(v3, ctrl)

block1(v4: effect, v5: i32):
    ...

block2(v6: i32, v7: effect):
    ...
```

We will probably want an instruction to combine multiple effect values into a single new effect value:
```asm
effect2 = combine_effects effect0, effect1, ...
```
This can be used to make an instruction depend on both a previous `store` and a `br_if` bounds check.
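For example, the motivating bug's pattern could then be written so that the load explicitly names both the store it must follow and the bounds check that guards it (a hypothetical sketch; all names and constants invented):

```asm
block0(v0: effect, v1: i32, v2: i64):
    effect_heap1 = store.i32 v0, v1, v2+0
    v3 = icmp_imm uge v1, 0x1000
    br_if v0, v3, block_trap(ctrl), block1(ctrl)

block1(v4: effect):
    ;; cannot be hoisted above the br_if (via v4) or the store (via
    ;; effect_heap1): both dependencies are ordinary value operands
    v5 = combine_effects v4, effect_heap1
    v6 = load.i32 v5, v2+0
```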
More Implications
And here are some more random things that fall out from this design:
It still looks like CLIF! It remains familiar. Most instructions are pure, in fact, and therefore unchanged. We still have block args/params instead of phis. We still have a text format that is easy to parse and dump. It isn't just one big spaghetti mess of instructions. Instructions are still organized in blocks, we just have more freedom to do code motion.
That said, the order of instructions in a block, and the whole of `ir::Layout`, represents only a particular scheduling of instructions and does not encode any implicit constraints about where an instruction may or may not be placed. Placement constraints (i.e. how far and where we can code-motion something) are solely defined by the instruction's explicit operands.

Our existing optimizations that operate on values, uses, and defs, like GVN and unnecessary phi removal, will Just Work with effect values and will therefore automatically be extended from pure instructions to side-effectful instructions (no more `side_effects_idempotent` flag needed for instructions!) and this will improve code generation. Removing unnecessary effect value phis, for example, lets us do
[message truncated]
fitzgen added the cranelift:moonshot label to Issue #10427.
bjorn3 commented on issue #10427:
V8 is moving in the exact opposite direction, as sea-of-nodes was found to have several disadvantages in practice, at least for their use case. Not all of them would apply to Cranelift, but it may still be informative. https://v8.dev/blog/leaving-the-sea-of-nodes
fitzgen commented on issue #10427:
Thanks for the link. I'm glad they put their rationale into writing, because it was all tribal knowledge and rumors before this.
I'm going to go through their list of downsides and respond to each of them in the context of CLIF and what I wrote up above. All this should be taken with some amount of salt, though, because they've obviously had _much_ more hands-on experience with effect dependencies in their IR than I have!
> Manually/visually inspecting and understanding a Sea of Nodes graph is hard
I think that what I sketched above can largely sidestep this issue because we would still organize things into blocks, and effectively always have an instruction schedule, rather than a big soup of nodes. We would still have a familiar text format, etc...
Basically, this issue is proposing: "Let's add the control/effect dependencies, but not remove basic blocks. Then our goal is to do as much aggressive code motion in our existing CFG and basic blocks as we can while satisfying dependencies, rather than to reconstruct a CFG and schedule instructions in blocks from scratch." I think they are ultimately equivalent in theory, but I suspect that the former will be easier than the latter in practice (we basically already do the former during egraph elaboration).
> Too many nodes are on the effect chain and/or have a control input
Quite possibly this would still be an issue.
The source program (Wasm or rustc MIR) is going to constrain things somewhat. Some programs just don't have as much wiggle room as others. This is the nature of compiler optimization in general though. Many programs might not benefit from a particular analysis, but the ones that do really benefit. The trick is finding the best bang for your buck and complexity budget.
It is worth mentioning that the source-to-CLIF frontend could also make things worse by unnecessarily introducing dependencies that aren't strictly needed (for example, by linearizing all effects during CLIF construction, even when they could have been a partial ordering) which would constrain the optimizer and give it less freedom for code motion.
> Memory operations do not float easily
Looking at their example, it seems like the difficulty is somewhat incidental and not fundamental.
They have effect dependencies between their two loads, but neither load actually produces an effect AFAICT, and so they should both instead depend on `start`, not on each other. With that change, they should each be able to float to either side of the `if`, since all their dependencies will be satisfied at that point. It seems like this might be an instance of the issue I was just mentioning where the frontend introduces unnecessary dependencies. They talk about doing analyses to determine when they can relax dependencies and how this is no harder on a CFG than SoN. I don't see any reason that is false, but I would prefer to avoid these analyses completely and invest in better dependency construction in the frontend.

> SoN is just CFG where pure nodes are floating
This is a really interesting/revealing statement because Cranelift already has "CFG where pure nodes are floating": that is our egraph and side-effectful skeleton! And what I wrote up above is very much not just that, because we would not impose a total ordering on anything that produces or consumes side effects: these operations would be partially ordered, and this is exactly the property that allows for many different schedulings / more aggressive code motion and ultimately better optimization. I think a key part of avoiding a single effect chain with a total ordering would be the bits I wrote up about generalizing our alias regions to effect dependencies. If you linearize all memory operations into a total order, that is equivalent to using a single alias region, and we know that we get tons of benefit from using multiple alias regions already today.
> Managing the effect and control chains manually is hard
I believe it. Emitting partially-ordered effects puts even more of this kind of management responsibility on the frontend.
This doesn't, by itself, mean the whole endeavor isn't worthwhile.
> The scheduler is too complex
I believe it.
But, as mentioned before, I think we can avoid that complexity by doing code motion on an existing schedule (something we already do!) rather than creating a new schedule from scratch. They use partial redundancy as an example: our egraph elaboration already avoids introducing partial redundancy and I don't see that getting any harder in this new world.
> Finding a good order to visit the graph is difficult
Again, we can avoid this issue since we would retain a CFG and a schedule of instructions.
> Cache unfriendliness
I'm skeptical of how generalizable these numbers are. A lot of this seems tied up in the exact data structures used for the IR.
That said, CFGs and basic blocks have implicit ordering and dependencies between instructions, while SoN represents them with explicit edges. Those explicit edges need to have some kind of runtime representation, which will lead to larger in-memory structures and therefore more cache misses. So I believe it, but also think there is a lot of performance engineering that could be done to mitigate the effects. It seems like the potential code improvement could be worth the compile time overhead, especially for final-tier compilers.
> Sea of Nodes destroys any prior scheduling, by construction
We already destroy any existing schedule with our egraph mid-end, so introducing effect dependencies doesn't change that for us. :shrug:
cfallin commented on issue #10427:
I should add some thoughts here too -- I found the V8 post interesting, and obviously did a lot of thinking about this when initially researching and implementing our aegraphs mid-end (which has many parallels re: "free-floating nodes", as Nick mentions), so I certainly have some Opinions.
The kernel of the writeup seems to be that the SoN approach is too complex to debug or introspect; see e.g. all the notes about visualization. I certainly experienced some of this while implementing and debugging aegraphs, and we'd see more complexity for sure if we added deps to CLIF outside of the aegraph phase, but to some degree I think this is philosophical: either one represents this dimension of the IR or one does not, and if the representation is needed to enable some transform, it's usually better to buy into it (and then focus on making the ergonomics for debugging and development better) than to construct an ad-hoc approximation of it.
The writeup mentions difficulties with code de-duplication / re-duplication and avoiding creating redundant computations; we solved this with our aegraph "elaboration" algorithm, which ends up giving exactly equivalent results to classical CSE/GVN, and it just isn't really an issue. In fact de-duplication helps with compile time because it means that rewrites only need to happen once for a given node.
The writeup states that there was less opportunity than expected to take advantage of the partial-order freedom to reschedule code. I don't doubt that in the context of their JIT backend, but we're motivating this issue in our context with increasingly fine-grained alias information that we know should unlock optimization potential. It seems that messy JavaScript semantics played a large part for V8 here: the "looking at a binary operator wrong could run any code and explode the universe" aspect of the language's dynamism means that code is mostly linearized in practice, whereas with Wasm GC operators' static typing and more constrained semantics, there might be more to do. All we need is a few examples of e.g. LICM-ability of loads unlocked by the partial deps to make the framework worth it.
I suspect there is some "project lifecycle" aspect here too. V8 is a very mature and complex compiler backend, and so they have implemented many of the optimizations that they want to build and will be looking for ways to simplify. On the other hand we are still relatively young in some ways and are looking for ways to enable more performance. That doesn't mean that we have to pursue complexity, or can't learn from others; but it does mean that the calculus is a bit different, and simple implementations of ideas like this one (without going all-in or removing blocks completely -- again as Nick describes above) might be appropriate ways to find this performance, or at least to experiment with.
hanna-kruppe commented on issue #10427:
There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a _uniform_ representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like `can_move`, `readonly`, `notrap`, and so on.

It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for `cranelift-frontend` to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

In terms of the IR that optimization passes work on, only knowing the reads/writes of each instruction is of course less convenient for doing aggressive code motion. You can easily tell which instructions have no effects at all, but moving an effectful instruction from program point A to program point B would be absurdly tricky with raw pairwise "can this instruction be swapped with a predecessor/successor" queries. I don't know if and how B3 solves it, but in my mind this is similar to def-use chains for data flow. You _can_ build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you _don't have to_. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a _clear ground truth_ for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.
fitzgen commented on issue #10427:
> You _can_ build your entire IR around an explicit, sparse representation of which "writes" are relevant for which "reads" (i.e., SSA for data flow, SoN-style for effects/control), but you _don't have to_. It can also be handled by a supplementary analysis that's computed only in the part of the pipeline where it's most useful. For data flow dependencies, everyone seems to agree that SSA is a very good trade-off. For effects and control dependencies, no such consensus has emerged yet. But having a _clear ground truth_ for "what effects/dependencies does this instruction have" seems good regardless of if/how you materialize the implied relations between different instructions.
Agreed: there are multiple viable approaches to representing effects and control in an IR. We just have to try and figure out what is the best fit for Cranelift, and what is realistic given our engineering resources.
> There are ways to have some of the "explicit, principled effect representation" cake without wrangling explicit SoN-style control and effect dependency edges through the entire pipeline. As described by Filip Pizlo (and implemented e.g. in WebKit's B3 JIT), one can pick a _uniform_ representation for the effects of any instruction, including reads/writes of heap alias regions and control flow (branches "write control", control-dependent instructions "read control") and anything else that matters for code motion. This gives a more principled answer to "am I allowed to move this instruction past that one?" than a collection of ad-hoc flags like `can_move`, `readonly`, `notrap`, and so on.

I see this approach as, roughly, extending our existing alias-regions infrastructure from annotating only memory operations to any instruction. I do believe this would be an improvement over what we have today. That said, fully removing our ad-hoc flags would require roughly the same kind and amount of work as adding full effect edges to CLIF, and I don't think it would solve the class of bug we hit that motivated my thinking on this topic, where a memory operation was moved past its safety checks, as effectively as adding effect dependencies would.
On the one hand, we could make conditional branches clobber all alias regions, but this is overly conservative and would unnecessarily restrict code motion. We would end up generating worse code than we are able to today. We wouldn't be able to lift a Wasm linear memory's base pointer out of a loop anymore, for example.
Alternatively, we could keep track of exactly which operations depend on that conditional branch, and make the branch clobber only those alias regions. This is equivalent to the work necessary to track and manage effect dependencies; however, it has a few downsides:
1. It is a little weird, since you are not adding a dependency to the instruction that relies on that branch having happened, but are instead going back to that branch that was already emitted and mutating its alias-region clobber set after the fact. If some optimization pass later removes the instruction that added the branch's clobber for a particular alias region, the pass must either do some kind of analysis to see if any instruction still depends on that branch before it can safely remove the clobber, or it must leave the clobber in place, which adds unnecessary constraints and potentially pessimizes our generated code.

2. It results in each alias region having a linear side-effect chain, rather than an arbitrary partial ordering. This, again, imposes artificial constraints on the optimizer.

3. While it does unify side effects and control at the IR level, it does not unify them with pure data dependencies, which means that we don't ultimately unify and simplify dependency management in our compiler/optimizer: we have to consider both an instruction's pure data-flow dependencies and we must also infer its side-effectful dependencies from which alias regions it uses and which predecessors clobbered those regions.
In particular, I think (3) is the biggest downside of annotating any instruction with alias regions -- as well as our current state of things -- and conversely is the biggest potential benefit of adding effect dependencies. We aren't a large team and we don't have the maintainer-power to write and maintain more and more optimization passes and static analyses, brute-forcing our way to good codegen the traditional way. Therefore, we must pursue techniques that we believe can give us outsized performance-bang for their complexity-buck. With effect dependencies, I see the potential to address a scary class of bugs in a principled manner, simplify our internals, and allow for even more aggressive optimization than we can do today, all at once.
> It's probably also easier for frontends to get such effect read/write sets right because they are mostly about what each instruction does (and some flat context like whether a load from wasm linear memory is gated under an explicit bounds check). Constructing the fully explicit def-use chains for each kind of effect seems harder to get right, similar to using mutable variables during CLIF construction vs. directly constructing everything in SSA form. Even if CLIF wanted explicit "effect values", it may be useful for `cranelift-frontend` to construct these from per-instruction read/write sets in a generic fashion, instead of leaving producers entirely to their own devices.

Yeah, there is definitely work we should be able to do in `cranelift-frontend` to help improve things on this front.

Off the top of my head, a good approach for frontends to start with would be to

- define a `cranelift_frontend::Variable` for each logical alias region they have,
- initialize those variables to the function's initial, starting effect inside the entry block as the first thing they do before emitting any instructions,
- and then a store to Wasm memory `i` (for example) would get the value of the `i`th Wasm memory's effect `Variable`, depend on that value when emitting the `store` instruction, and then redefine that variable with the new effect value returned from the inserted `store` instruction.

This approach would be equivalent to, and no more difficult than, annotating every instruction with the alias regions it reads and clobbers in the frontend. We could wrap this pattern up in helper methods in `cranelift-frontend`.
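For two back-to-back stores to the same Wasm memory, the CLIF falling out of this pattern might look something like the following (a hypothetical sketch; the memory's effect `Variable` starts out holding the entry effect `v0` and is redefined after each store):

```asm
block0(v0: effect, v1: i64, v2: i32):
    v3 = store.i32 v0, v2, v1+0    ;; the Variable now holds v3
    v4 = store.i32 v3, v2, v1+8    ;; the Variable now holds v4
    ...
```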
hanna-kruppe commented on issue #10427:
Reusing the `Variable` infrastructure for "this instruction reads/writes these heap alias regions" is a very neat unifying idea, and a clear advantage of treating them as just `Value`s. I've started thinking about the implications for control dependencies and other effects that aren't just heap alias regions (e.g., potential trapping). But now I've convinced myself there's something missing even from the description of heap alias regions! The OP said:

> Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:

But in this case, we might have a snippet like this: a load from wasm memory followed by a store to the same memory, with a constant address that's guaranteed in-bounds for the wasm memory type (v2 is the base address of that memory):

```asm
block1(v0: effect, v1: i8, v2: i64):
    v3: i8 = load.i8 v0, v2+123
    v4: effect = store.i8 v0, v1, v2+123
    ...
```
What prevents incorrectly moving the store before the load? Certainly not the SSA "defs dominate uses" requirement. This example could be fixed by making everything that reads some effect/alias region also produce a new effect value. But this flies in the face of the intuition I've tried to apply to these effect values, and needs some new tweak to make sure successive read-only operations aren't forced into a total order again by the effect chain.
I suppose this gestures at the question in the back of my mind all this time: why don't the various SoN-style and (R)VSDG-style IRs do what's proposed in this issue? Why do they have separate kinds of edges for effects and for data flow?
cfallin commented on issue #10427:
> What prevents incorrectly moving the store before the load?
Ah, this is really interesting! In one sense, the memory-state values are linear (or affine): the store consumes the old memory/effect state (or clobbers it), so it is no longer available for the load to use.
I see several ways to encode this into the IR:
- Create a notion of linear/affine values and consume them (thus destroying them) when the memory state is overwritten. Not too hard to check, but may have weird/unknown implications on other transforms, because that is definitely not a usual feature of SSA. So the failure mode here is that we have a long tail of IR verification failures.
- Create a notion of "clobbers" -- a variation on the above -- at the SSA level. An op is allowed to indicate that some SSA value is no longer available in points dominated. It's kind of an "anti-def".
- When lowering from the "variable" level to SSA, track all readers of a given version of an effect, and fan-in deps to the next update. This encodes the deps in a classical SSA way with explicit edges.
I personally favor the last one, for simplicity reasons, though it is the most verbose encoding (hence the least space-efficient) if we have n readers for every writer, for example.
fitzgen commented on issue #10427:
> What prevents incorrectly moving the store before the load?
Thanks for whipping up this example, it is indeed an interesting one.
I had originally been imagining effect values as affine, but at some point Chris convinced me we shouldn't do that; additionally, it gets messy when we want to fence on a bunch of effects, so we merge them together, do an operation depending on that merged effect, and then unpack them again from the resulting effect value after the operation (e.g. fanning in all effects to a `call` and then fanning out to recover the new version of each logical alias region). Agreed with Chris that it might have weird implications on various transforms and that we would need a bunch of annoying/specialized CLIF validation rules.

So to allow the encoding of dependencies we want purely in SSA values, where the store depends on the load, I think we need all instructions that read from the outside world, even if they don't mutate it, to return new effect values. Then the store can depend on the load's effect, and the desired ordering is encoded in the regular value dependencies again.

```asm
block1(v0: effect, v1: i8, v2: i64):
    v3: i8, v4: effect = load.i8 v0, v2+123
    v5: effect = store.i8 v4, v1, v2+123
    ...
```
I believe this is the third approach Chris enumerated, but making explicit our need for, e.g., loads to return a new effect value.
The anti-def idea is an interesting one, but again it is pretty bespoke and separate from the rest of our SSA infrastructure, which is in opposition to some of our goals here and makes me disinclined to pursue it further.
I would lean towards making any operation that reads from the outside world also return a new effect value.
hanna-kruppe commented on issue #10427:
Regarding affine/linear values, note that there would have to be a notion of a non-consuming use as well for operations which are read-only w.r.t. that effect. Otherwise everything related to the same alias region is forced into a single linear chain, which makes the whole thing much less appealing.
But then this raises interesting questions for merging effects: if a load depends on both control and the last write to a wasm memory, and these are merged with a `combine_effects` operation, is that a consuming use of the input effects? If yes, you have to artificially introduce order between operations that depend on intersecting but non-identical sets of effects. If not, this is a way to duplicate effect values and must be accounted for when constructing the consuming use(s) of later writes.

(There were already very good reasons given to not go this route, but it seems there's even more complexity hiding there.)
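To make the dilemma concrete, a hypothetical sketch (all names invented; `v1` is the effect of the last write to a wasm memory, `v2` is the incoming control effect, and the second load reads a different region):

```asm
v3 = combine_effects v1, v2
v4 = load.i32 v3, v10+0    ;; depends on the memory write and on control
v5 = load.i32 v2, v11+0    ;; depends on control only
;; if combine_effects consumes v1 and v2, then v5 can no longer use v2
;; directly and must route through v3, ordering it after the memory
;; write even though it never reads that memory
```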
hanna-kruppe commented on issue #10427:
Hmm, while affine effect values lead to two different kinds of uses, an encoding into regular SSA def-use relationships may also need a similar distinction for defs: if you want to DCE a non-trapping load, you need to know that removing its effect result value (or replacing it with the effect inputs to the load?) is valid. In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or, in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.
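For instance, dead-code-eliminating a non-trapping load would mean rewriting uses of its effect result to its effect input, which is only valid because the load's effect result is known to be "pass through" (a hypothetical sketch; names invented):

```asm
;; before: v3 is unused, but the effect result v4 still has a use
v3, v4 = load.i32 v0, v2+0
v5 = store.i32 v4, v1, v2+8
;; after DCE: the load is gone and v4's use is rewritten to v0
v5 = store.i32 v0, v1, v2+8
```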
fitzgen commented on issue #10427:
> In other words, you need to know which effect values represent actual changes and which ones are “pass through” for the sake of wiring up a read to relevant later writes. This seems to drag in a significant chunk of “explicit effect chain manipulation” in SoN terms, or in CLIF terms, a significant amount of special knowledge about the semantics of specific instructions that doesn’t fall out from the SSA data flow alone.
I don't think this is actually too bad: we can annotate these instructions in the `cranelift-meta` crate's instruction definitions/builders to automatically generate predicate methods for use in `cranelift-codegen`, similar to our `side_effects_idempotent` flag, for example.
hanna-kruppe commented on issue #10427:
How do you distinguish loads that may trap from those that don't, if both kinds of load take and produce effect values?
fitzgen commented on issue #10427:
We will still need to annotate loads with the `ir::TrapCode` of their trap, so we can use that.