Stream: git-wasmtime

Topic: wasmtime / issue #6097 cranelift: Sink loads after lowering


view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2023 at 01:36):

jameysharp opened issue #6097:

Feature

On x86-64, unlike RISC-style instruction sets, many instructions have the ability to do a memory load. We can "sink" a load instruction into one of these other instructions, under the right conditions. One of those conditions is that the result of the load must be used exactly once. If it's used multiple times, then sinking the load means that the load is duplicated.

Our current mechanism for checking whether load-sinking is safe has been quite error-prone. We check for unique uses in the input CLIF, but during lowering, a single use of a CLIF Value may turn into multiple uses of the corresponding register. To deal with that, we try to ensure that lowering rules convert a Value into a register only once, even if they need to use the register multiple times. However, this has been difficult to guarantee because there's an implicit conversion from Value to types like GprMem and XmmMem.

This would all be much easier if we could count the number of times that the value is used _after_ lowering, since that's what we actually want to know.

Benefit

The primary benefit is to eliminate this class of bugs from the x86-64 backend. (In principle other architectures could do load-sinking too, but we don't currently support any that do.)

Also, I think we can delete a bunch of code in cranelift/codegen/src/machinst/lower.rs, and possibly one linear-time pass over the input CLIF, by specializing this.

Implementation

Besides the single-use rule, the other rule is that a load must not be moved past another side-effecting instruction. The machinst::lower module currently uses "instruction coloring" to enforce this. As far as I can tell, there are no other ways for instructions to be re-ordered during lowering, so this is the only use of these colors.

It's important to note that machinst lowers basic blocks in post-order, so all blocks which could have used a value are lowered before the block which defines that value. In addition, it lowers the instructions in each block in reverse order, so again the uses are seen before their definition.

So here's the approach I have in mind:

  1. In enums like GprMem, have the Mem variant keep not only the address mode, but also the Value that we got the input from (a rough sketch of this shape follows this list).
  2. For each Value which is the result of a load, track whether it is used more than once during lowering. Currently machinst does this in value_lowered_uses, but I think it might not need that after this.
  3. When converting a Value produced by a load in the current block into e.g. GprMem, assume that we're going to sink the load, unless we've already seen other uses of this value. (We won't try to sink loads across block boundaries.) Add the Value to a queue of tentatively-sunk loads.
  4. When we're about to lower a side-effecting instruction, check its opcode:

    • If it's a load, then pop Values from the queue until we find the result of _this_ load. If it is still only used once, then we can forget about it: we've successfully sunk the load. Afterward we can forget about how many times this load is used, too: now that we've found its definition, we won't see any more uses.
    • If it isn't a load, then pop every Value from the queue. We can't sink any loads past this instruction.
  5. For every Value popped from the queue without proving that sinking that load was safe, add it to a set of failed attempts so we can clean it up later.
  6. During register allocation and again while emitting machine code, any time a Mem variant comes up, check it against the set of failed attempts. If it's there, treat it as the equivalent Reg variant instead. Alternatively, rewrite the current block when we're done lowering it, so we can discard the failed-attempt set early.
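Here's a minimal Rust sketch of what step 1's operand shape could look like. All names here (Reg, Amode, Value, from_value) are simplified stand-ins for illustration, not the x64 backend's actual definitions:

```rust
// Minimal sketch only; the real GprMem/Amode/Reg types in the x64
// backend are more involved.
struct Reg(u8); // stand-in for a virtual/physical register
struct Amode { base: Reg, offset: i32 } // stand-in for an x64 addressing mode
type Value = u32; // stand-in for a CLIF value handle

enum GprMem {
    /// The operand has already been put in a register.
    Reg(Reg),
    /// The operand is a (tentatively) sunk load: keep the address mode
    /// *and* the CLIF value it came from, so a failed sink can be
    /// rewritten back to the Reg form later (step 6).
    Mem { addr: Amode, from_value: Value },
}
```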

The queue is necessary because we may have tentatively sunk multiple loads by the time we see the actual load instructions, in which case we need to make sure that they're all in the right order. If we wanted to allow reordering loads with respect to each other but not with any other side effects, this could be a set instead of a queue. We could even keep a queue for atomic loads and a set for other loads if we want to implement a hybrid policy…

Note that the queue must be empty by the time we finish lowering a block, because we only add loads to it which are in the current block.
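As a rough illustration of steps 2 through 5, here is a Rust sketch of the per-block state this could require, using the same stand-in types as the sketch above. SinkTracker and its method names are hypothetical, not existing machinst APIs, and the case of reaching a load whose result was never tentatively sunk (which the steps above don't spell out) is handled conservatively here, like any other side effect:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type Value = u32; // stand-in for a CLIF value handle, as above

/// Hypothetical per-block bookkeeping for tentative load sinking.
#[derive(Default)]
struct SinkTracker {
    /// Step 3: loads we have tentatively sunk, in the order their uses
    /// were seen while lowering the block bottom-up.
    tentative: VecDeque<Value>,
    /// Step 2: how many lowered uses we have seen for each load result.
    lowered_uses: HashMap<Value, u32>,
    /// Step 5: loads whose sinking turned out to be unsafe; step 6
    /// rewrites their Mem operands back to the Reg form.
    failed: HashSet<Value>,
}

impl SinkTracker {
    /// Step 3: a load result defined in this block is being converted
    /// into a Mem-capable operand. Returns true if we tentatively sink it.
    fn try_sink(&mut self, val: Value) -> bool {
        let uses = self.lowered_uses.entry(val).or_insert(0);
        *uses += 1;
        if *uses == 1 {
            self.tentative.push_back(val);
            true
        } else {
            false
        }
    }

    /// Step 2: any other lowered use of a load result (one that goes into
    /// a plain register, say) also bumps the count, which can retroactively
    /// invalidate an earlier tentative sink once its definition is reached.
    fn note_other_use(&mut self, val: Value) {
        *self.lowered_uses.entry(val).or_insert(0) += 1;
    }

    /// Step 4, load case: we reached the load that defines `val`.
    fn reached_load(&mut self, val: Value) {
        if !self.tentative.contains(&val) {
            // This load was never tentatively sunk. Assumption: treat it
            // like any other side effect, blocking all pending sinks.
            self.reached_side_effect();
            return;
        }
        // Pop until we find this load's result; anything popped first
        // would be reordered past this load, so those sinks fail.
        while let Some(v) = self.tentative.pop_front() {
            if v == val {
                // Safe only if it is still used exactly once.
                if self.lowered_uses.get(&v).copied().unwrap_or(0) > 1 {
                    self.failed.insert(v);
                }
                return;
            }
            self.failed.insert(v);
        }
    }

    /// Step 4, non-load case: any other side-effecting instruction
    /// blocks every pending sink.
    fn reached_side_effect(&mut self) {
        while let Some(v) = self.tentative.pop_front() {
            self.failed.insert(v);
        }
    }
}
```

In this sketch, popping from the front means the pending sinks get reconciled against program order as their defining loads are reached: anything whose sinking would land it out of order relative to an earlier load is rejected rather than sunk.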

I don't think any of the above needs any machinery in machinst beyond the guarantee that we always lower uses before definitions. So other backends can avoid paying the cost of tracking colors and unique uses. If/when we add another backend that supports sinking loads, we can factor it out to common code then.


view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2023 at 02:13):

cfallin commented on issue #6097:

I do like this approach -- thank you very much for filling out some details here!

I, too, dislike the current unique/multi-use analysis and the way it interacts with lowering rules; I'd love to see something better. I'll inject some general caution that this is an extremely subtle problem space and I've been surprised by unexpected interactions, and there are a lot of hard-won (as in "learned the hard way") correctness conditions in the current code; all that to say, we should just be very careful with the replacement too!

One general principle I'd want to try to stick to is that to whatever degree possible, I'd like to see the infrastructure for this built in a way that is not backend-specific. While it's true that we're thinking about load-merging now and that is mostly relevant for x86, there may be other kinds of side-effect motion that we want to do later (perhaps with traps, for example); so a system that reasons generally about sinking/hoisting and merging has value beyond just the x86 RegMem case. That doesn't have to mean that we have state in machinst::Lower as we do today -- maybe it's a separate SinkingTracker or somesuch.

I suspect that your system can be generalized such that items in the queue are any side-effecting ops that we want to try to sink, is that right?

It'd be worthwhile trying to write a (semi-formal) correctness argument for this, if you haven't yet. (It feels like the right kind of correctness property that is both extremely important, and tractable to actually show holds.) I think I can see how you'd argue that the total order of side-effecting ops (including loads) is maintained. The items in the queue represent a sliding window, but the order in the queue does not necessarily correspond to program order (it corresponds to use-order). Rather, we reconcile with program order as we pop, and discard (reject the sinking) anything that would have come out-of-order. So basically the queue only ever holds the in-(reverse)-use-order list of sinkable ops within a single inter-side-effect-window region of code. Is that right?
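To make the queue/program-order reconciliation concrete, here is a tiny trace, assuming the hypothetical SinkTracker sketched in the issue above (this is an illustration of that sketch, not of existing code):

```rust
// Block (program order):
//   v1 = load ...   ; defines v1
//   v2 = load ...   ; defines v2
//   ...  = op v2    ; sole use of v2
//   ...  = op v1    ; sole use of v1
fn main() {
    let mut t = SinkTracker::default();
    // Lowering walks the block bottom-up, so uses come first:
    t.try_sink(1); // use of v1: tentatively sunk, queue = [v1]
    t.try_sink(2); // use of v2: tentatively sunk, queue = [v1, v2]
    // Reaching the load that defines v2: v1 is popped first and rejected,
    // because sinking it would move it below this load and reverse the
    // original v1/v2 order; then v2 is found and its sink succeeds.
    t.reached_load(2);
    // Reaching the load that defines v1: it is no longer pending, so the
    // (already empty) queue is just flushed; v1's Mem operand later gets
    // rewritten back to a Reg (step 6).
    t.reached_load(1);
    assert!(t.failed.contains(&1));
    assert!(!t.failed.contains(&2));
}
```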

(Also, I think we need to consider start-of-basic-block a "side-effecting point" that flushes the queue as well, unless you can make an argument about how the traversal order makes that unnecessary but my head is starting to hurt now :-) )

Anyway, I look forward to seeing this work out!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 24 2023 at 22:48):

jameysharp commented on issue #6097:

A brief (because I'm taking today off and totally not here right now) note on one small point: In my first draft of this I had it emptying the queue at basic block boundaries like you suggest. Then I changed it so things don't go into the queue in the first place if they come from another basic block, so we can just assert that the queue is empty at block boundaries.

