jameysharp opened issue #5623:
Feature
On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.
For example, if a function evaluates both
udiv v0, v1
andurem v0, v1
, then on x86 both results are produced with the samediv
instruction.Currently, Cranelift emits the same
div
instruction twice in that case (see #5573) but we'd like to deduplicate this.I believe the same optimization is relevant for
imul
/umulhi
,imul
/smulhi
, and probably other pairs of instructions.Benefit
Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.
Implementation
@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as
udivrem
representing the combined operation with multiple results. We'd then have egraph rules indicating thatudiv
can be replaced by the first result ofudivrem
, but at the same time that the second result ofudivrem
can be replaced byurem
. (And also _vice versa_.)That way, if we encounter a
udiv
, then we'll add the correspondingurem
as an available expression; if we later encounter an equivalenturem
, we'll find that it's already available and de-duplicate it.Alternatives
We could try to deduplicate machine instructions after lowering but before register allocation.
jameysharp labeled issue #5623:
Feature
On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.
For example, if a function evaluates both
udiv v0, v1
andurem v0, v1
, then on x86 both results are produced with the samediv
instruction.Currently, Cranelift emits the same
div
instruction twice in that case (see #5573) but we'd like to deduplicate this.I believe the same optimization is relevant for
imul
/umulhi
,imul
/smulhi
, and probably other pairs of instructions.Benefit
Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.
Implementation
@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as
udivrem
representing the combined operation with multiple results. We'd then have egraph rules indicating thatudiv
can be replaced by the first result ofudivrem
, but at the same time that the second result ofudivrem
can be replaced byurem
. (And also _vice versa_.)That way, if we encounter a
udiv
, then we'll add the correspondingurem
as an available expression; if we later encounter an equivalenturem
, we'll find that it's already available and de-duplicate it.Alternatives
We could try to deduplicate machine instructions after lowering but before register allocation.
jameysharp labeled issue #5623:
Feature
On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.
For example, if a function evaluates both
udiv v0, v1
andurem v0, v1
, then on x86 both results are produced with the samediv
instruction.Currently, Cranelift emits the same
div
instruction twice in that case (see #5573) but we'd like to deduplicate this.I believe the same optimization is relevant for
imul
/umulhi
,imul
/smulhi
, and probably other pairs of instructions.Benefit
Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.
Implementation
@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as
udivrem
representing the combined operation with multiple results. We'd then have egraph rules indicating thatudiv
can be replaced by the first result ofudivrem
, but at the same time that the second result ofudivrem
can be replaced byurem
. (And also _vice versa_.)That way, if we encounter a
udiv
, then we'll add the correspondingurem
as an available expression; if we later encounter an equivalenturem
, we'll find that it's already available and de-duplicate it.Alternatives
We could try to deduplicate machine instructions after lowering but before register allocation.
jameysharp labeled issue #5623:
Feature
On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.
For example, if a function evaluates both
udiv v0, v1
andurem v0, v1
, then on x86 both results are produced with the samediv
instruction.Currently, Cranelift emits the same
div
instruction twice in that case (see #5573) but we'd like to deduplicate this.I believe the same optimization is relevant for
imul
/umulhi
,imul
/smulhi
, and probably other pairs of instructions.Benefit
Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.
Implementation
@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as
udivrem
representing the combined operation with multiple results. We'd then have egraph rules indicating thatudiv
can be replaced by the first result ofudivrem
, but at the same time that the second result ofudivrem
can be replaced byurem
. (And also _vice versa_.)That way, if we encounter a
udiv
, then we'll add the correspondingurem
as an available expression; if we later encounter an equivalenturem
, we'll find that it's already available and de-duplicate it.Alternatives
We could try to deduplicate machine instructions after lowering but before register allocation.
cfallin commented on issue #5623:
I think this is potentially an important optimization opportunity in some kinds of code (bignum libraries, as Jamey has been investigating, for one!).
To add a little more detail to my thoughts: in the abstract, what we want to do is encounter one of the ops (say,
udiv
) and then note that if we see aurem
later, it can use the value that we have already computed "for free". This needs to be a forward pass ideally, unless we add more complexity, because we can always generate a value and save it for later but it's much harder to hoist the single instruction that produces both results if we recognize the opportunity at the last use. (Consider if e.g. theudiv
is GVN'd and shared from one block to several others, while theurem
is only used in one place: if we see theurem
first in a backward pass, we'd somehow have to know about the GVN'ing and hoist theudivrem
.)Given that it's best if this is a forward pass, and given that it interacts with GVN, it makes sense to me to do it as part of the mid-end optimization pass. And the primitive that makes sense to me is what we might call
provide
: when computing one value, we indicate that another value is also available. The idea is that we can rewriteudiv
to the first result ofudivrem
, and at the same time, (i) hashcons a(urem ...)
node on the right-hand side of the rule, and (ii) note that it is available ("provide" it) with the second result ofudivrem
. If theurem
exists elsewhere, GVN will find it during hashconsing, and we'll "union" it with theudivrem
result, which we can then always prefer during extraction.The backend can then lower
udivrem
to the one machine instructionDIV
on x86. Note also that the rewrite toudivrem
should be machine-specific (selected only for some architectures) because e.g. aarch64 does not have an instruction that gives div and rem results for free. (In fact it doesn't even haverem
: this is computed by dividing, multiplying, and subtracting!)To sketch what the mid-end rule might look like (exact form and combinators aren't so important here, the key idea is the "provides"):
(rule (udiv ty x y) (if-let (two_results div rem) (udivrem ty x y)) (with_provides rem (urem ty x y)) div)
and vice-versa for
urem
.This has a few prerequisites:
- Handling of multi-result nodes in the mid-end. (I sketched some plumbing around this with thetwo_results
etor as well above.)
- Machine-specific enabling of rules in the mid-end. Since it's possible to build Cranelift to support multiple targets at once, this may require either dynamic checks ((if (target_x64))
) or monomorphizing the mid-end per target (less desirable).
cfallin edited a comment on issue #5623:
I think this is potentially an important optimization opportunity in some kinds of code (bignum libraries, as Jamey has been investigating, for one!).
To add a little more detail to my thoughts: in the abstract, what we want to do is encounter one of the ops (say,
udiv
) and then note that if we see aurem
later, it can use the value that we have already computed "for free". This needs to be a forward pass ideally, unless we add more complexity, because we can always generate a value and save it for later but it's much harder to hoist the single instruction that produces both results if we recognize the opportunity at the last use. (Consider if e.g. theudiv
is GVN'd and shared from one block to several others, while theurem
is only used in one place: if we see theurem
first in a backward pass, we'd somehow have to know about the GVN'ing and hoist theudivrem
.)Given that it's best if this is a forward pass, and given that it interacts with GVN, it makes sense to me to do it as part of the mid-end optimization pass. And the primitive that makes sense to me is what we might call
provide
: when computing one value, we indicate that another value is also available. The idea is that we can rewriteudiv
to the first result ofudivrem
, and at the same time, (i) hashcons a(urem ...)
node on the right-hand side of the rule, and (ii) note that it is available ("provide" it) with the second result ofudivrem
. If theurem
exists elsewhere, GVN will find it during hashconsing, and we'll "union" it with theudivrem
result, which we can then always prefer during extraction.The backend can then lower
udivrem
to the one machine instructionDIV
on x86. Note also that the rewrite toudivrem
should be machine-specific (selected only for some architectures) because e.g. aarch64 does not have an instruction that gives div and rem results for free. (In fact it doesn't even haverem
: this is computed by dividing, multiplying, and subtracting!)To sketch what the mid-end rule might look like (exact form and combinators aren't so important here, the key idea is the "provides"):
(rule (udiv ty x y) (if-let (two_results div rem) (udivrem ty x y)) (with_provides rem (urem ty x y) div))
and vice-versa for
urem
.This has a few prerequisites:
- Handling of multi-result nodes in the mid-end. (I sketched some plumbing around this with thetwo_results
etor as well above.)
- Machine-specific enabling of rules in the mid-end. Since it's possible to build Cranelift to support multiple targets at once, this may require either dynamic checks ((if (target_x64))
) or monomorphizing the mid-end per target (less desirable).
jameysharp commented on issue #5623:
There's a refinement of this that Chris and I discussed but I hadn't written down.
For instructions with multiple fixed value results, I think we can generate a separate term for each result. (For example,
udivrem_1
andudivrem_2
, although it would be nice if we named the results instead of numbering them.)
- When used as a constructor, this term builds the instruction if it isn't already in the GVN map, then returns the requested result.
- When used as an extractor, this term matches if some value in the given e-class is the requested result of a matching instruction.
The key implementation challenge here is that the GVN map is not currently usable this way. Right now it maps an
InstructionData
to aValue
, but we need it to map to anInst
instead. That may require changes throughout the equality saturation phase; I haven't dug into it enough to work out the details.If we have all of that, then I don't think we need any new ISLE terms like
two_results
orwith_provides
. Instead, rewrite(udiv ty x y)
to(udivrem_1 ty x y)
(and perhaps _vice versa_), and let elaboration sort out whether the more complex instruction is worth-while. In particular I think we need to ensure that the cost model satisfies these constraints:
- cost(
udiv
) < cost(udivrem
)- cost(
urem
) < cost(udivrem
)- cost(
udiv
) + cost(urem
) > cost(udivrem
)One more thing, regarding whether this
udivrem
optimization is machine-specific: The aarch64 example is interesting because it shows that this optimization is worth-while even on targets which don't have a dedicated instruction for this. Since the lowering ofurem
has to compute the result ofudiv
first, we should reuse that result if it's needed elsewhere. I think this fused instruction will turn out to be worth having on all targets, because it describes an arithmetic identity that is target-independent.
jameysharp commented on issue #5623:
@cfallin, @fitzgen, @elliottt and I discussed this today. We identified a problem: the cost-function can't currently accurately account for instructions with multiple results.
If two results from the same instruction are both used, the cost of that instruction ideally would only be counted once. Unfortunately, the individual results may each be used through an arbitrarily long chain of other instructions before being demanded in some context where we must have both results. So the current dynamic-programming approach to computing the cost function can't see that it's ever worth using
udivrem
ifudiv
is cheaper.We suspect fixing this is NP-hard, though we haven't bothered figuring out which problem it reduces to. Chris mentioned it felt like set cover, maybe. On further thought I guess a zero-one integer linear programming solver would be a natural fit for this problem, and that's one of Karp's original NP-complete problems. On the other hand apparently there are known classes of ILP problems solvable in polynomial time and who knows, maybe we fall in one of those?
Last updated: Jan 24 2025 at 00:11 UTC