jameysharp opened issue #5623:
Feature

On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once. For example, if a function evaluates both `udiv v0, v1` and `urem v0, v1`, then on x86 both results are produced with the same `div` instruction. Currently, Cranelift emits the same `div` instruction twice in that case (see #5573) but we'd like to deduplicate this. I believe the same optimization is relevant for `imul`/`umulhi`, `imul`/`smulhi`, and probably other pairs of instructions.

Benefit

Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.

Implementation

@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as `udivrem` representing the combined operation with multiple results. We'd then have egraph rules indicating that `udiv` can be replaced by the first result of `udivrem`, but at the same time that the second result of `udivrem` can be replaced by `urem`. (And also _vice versa_.) That way, if we encounter a `udiv`, then we'll add the corresponding `urem` as an available expression; if we later encounter an equivalent `urem`, we'll find that it's already available and de-duplicate it.

Alternatives

We could try to deduplicate machine instructions after lowering but before register allocation.
cfallin commented on issue #5623:

I think this is potentially an important optimization opportunity in some kinds of code (bignum libraries, as Jamey has been investigating, for one!).

To add a little more detail to my thoughts: in the abstract, what we want to do is encounter one of the ops (say, `udiv`) and then note that if we see a `urem` later, it can use the value that we have already computed "for free". This ideally needs to be a forward pass, unless we add more complexity, because we can always generate a value and save it for later, but it's much harder to hoist the single instruction that produces both results if we recognize the opportunity at the last use. (Consider if e.g. the `udiv` is GVN'd and shared from one block to several others, while the `urem` is only used in one place: if we see the `urem` first in a backward pass, we'd somehow have to know about the GVN'ing and hoist the `udivrem`.)

Given that it's best if this is a forward pass, and given that it interacts with GVN, it makes sense to me to do it as part of the mid-end optimization pass. And the primitive that makes sense to me is what we might call `provide`: when computing one value, we indicate that another value is also available. The idea is that we can rewrite `udiv` to the first result of `udivrem`, and at the same time, (i) hashcons a `(urem ...)` node on the right-hand side of the rule, and (ii) note that it is available ("provide" it) with the second result of `udivrem`. If the `urem` exists elsewhere, GVN will find it during hashconsing, and we'll "union" it with the `udivrem` result, which we can then always prefer during extraction.

The backend can then lower `udivrem` to the one machine instruction `DIV` on x86. Note also that the rewrite to `udivrem` should be machine-specific (selected only for some architectures) because e.g. aarch64 does not have an instruction that gives div and rem results for free. (In fact it doesn't even have `rem`: this is computed by dividing, multiplying, and subtracting!)

To sketch what the mid-end rule might look like (the exact form and combinators aren't so important here; the key idea is the "provides"):

`(rule (udiv ty x y) (if-let (two_results div rem) (udivrem ty x y)) (with_provides rem (urem ty x y) div))`

and vice-versa for `urem`.

This has a few prerequisites:

- Handling of multi-result nodes in the mid-end. (I sketched some plumbing around this with the `two_results` extractor as well above.)
- Machine-specific enabling of rules in the mid-end. Since it's possible to build Cranelift to support multiple targets at once, this may require either dynamic checks (`(if (target_x64))`) or monomorphizing the mid-end per target (less desirable).
jameysharp commented on issue #5623:

There's a refinement of this that Chris and I discussed but I hadn't written down.

For instructions with multiple fixed value results, I think we can generate a separate term for each result. (For example, `udivrem_1` and `udivrem_2`, although it would be nice if we named the results instead of numbering them.)

- When used as a constructor, this term builds the instruction if it isn't already in the GVN map, then returns the requested result.
- When used as an extractor, this term matches if some value in the given e-class is the requested result of a matching instruction.
The key implementation challenge here is that the GVN map is not currently usable this way. Right now it maps an `InstructionData` to a `Value`, but we need it to map to an `Inst` instead. That may require changes throughout the equality saturation phase; I haven't dug into it enough to work out the details.

If we have all of that, then I don't think we need any new ISLE terms like `two_results` or `with_provides`. Instead, rewrite `(udiv ty x y)` to `(udivrem_1 ty x y)` (and perhaps _vice versa_), and let elaboration sort out whether the more complex instruction is worthwhile. In particular I think we need to ensure that the cost model satisfies these constraints:

- cost(`udiv`) < cost(`udivrem`)
- cost(`urem`) < cost(`udivrem`)
- cost(`udiv`) + cost(`urem`) > cost(`udivrem`)

One more thing, regarding whether this `udivrem` optimization is machine-specific: the aarch64 example is interesting because it shows that this optimization is worthwhile even on targets which don't have a dedicated instruction for this. Since the lowering of `urem` has to compute the result of `udiv` first, we should reuse that result if it's needed elsewhere. I think this fused instruction will turn out to be worth having on all targets, because it describes an arithmetic identity that is target-independent.
jameysharp commented on issue #5623:

@cfallin, @fitzgen, @elliottt and I discussed this today. We identified a problem: the cost function can't currently accurately account for instructions with multiple results.

If two results from the same instruction are both used, the cost of that instruction ideally would only be counted once. Unfortunately, the individual results may each be used through an arbitrarily long chain of other instructions before being demanded in some context where we must have both results. So the current dynamic-programming approach to computing the cost function can't see that it's ever worth using `udivrem` if `udiv` is cheaper.

We suspect fixing this is NP-hard, though we haven't bothered figuring out which problem it reduces to. Chris mentioned it felt like set cover, maybe. On further thought I guess a zero-one integer linear programming solver would be a natural fit for this problem, and that's one of Karp's original NP-complete problems. On the other hand, apparently there are known classes of ILP problems solvable in polynomial time, and who knows, maybe we fall in one of those?
Last updated: Dec 13 2025 at 19:03 UTC