cfallin labeled issue #4123:
The code generation for (i) branches on booleans, and (ii) branches on integer values that come from compares but are not directly observable (e.g. in a different basic block), is suboptimal. We often see:
- Masking, like
AND reg, 1
, because we pessimistically assume that the upper bits of a boolean value are undefined;cmp
,setcc
(x64) /cset
(aarch64) into a register followed by a conditional branch sometime later;- A combination of the above two.
The root causes are:
- We do not pattern-match far enough back, in some cases, to fuse the
brz
andicmp
at the Cranelift level, and this is exacerbated by GVN and LICM that hoist icmps earlier in the function;- We do not have combination patterns that recognize when some producers of bools-as-integers will actually define the high bits.
Some combination of more aggressive pattern matching and demanded-bits analysis could improve the codegen in these cases.
cfallin labeled issue #4123:
The code generation for (i) branches on booleans, and (ii) branches on integer values that come from compares but are not directly observable (e.g. in a different basic block), is suboptimal. We often see:
- Masking, like
AND reg, 1
, because we pessimistically assume that the upper bits of a boolean value are undefined;cmp
,setcc
(x64) /cset
(aarch64) into a register followed by a conditional branch sometime later;- A combination of the above two.
The root causes are:
- We do not pattern-match far enough back, in some cases, to fuse the
brz
andicmp
at the Cranelift level, and this is exacerbated by GVN and LICM that hoist icmps earlier in the function;- We do not have combination patterns that recognize when some producers of bools-as-integers will actually define the high bits.
Some combination of more aggressive pattern matching and demanded-bits analysis could improve the codegen in these cases.
cfallin labeled issue #4123:
The code generation for (i) branches on booleans, and (ii) branches on integer values that come from compares but are not directly observable (e.g. in a different basic block), is suboptimal. We often see:
- Masking, like
AND reg, 1
, because we pessimistically assume that the upper bits of a boolean value are undefined;cmp
,setcc
(x64) /cset
(aarch64) into a register followed by a conditional branch sometime later;- A combination of the above two.
The root causes are:
- We do not pattern-match far enough back, in some cases, to fuse the
brz
andicmp
at the Cranelift level, and this is exacerbated by GVN and LICM that hoist icmps earlier in the function;- We do not have combination patterns that recognize when some producers of bools-as-integers will actually define the high bits.
Some combination of more aggressive pattern matching and demanded-bits analysis could improve the codegen in these cases.
cfallin opened issue #4123:
The code generation for (i) branches on booleans, and (ii) branches on integer values that come from compares but are not directly observable (e.g. in a different basic block), is suboptimal. We often see:
- Masking, like
AND reg, 1
, because we pessimistically assume that the upper bits of a boolean value are undefined;cmp
,setcc
(x64) /cset
(aarch64) into a register followed by a conditional branch sometime later;- A combination of the above two.
The root causes are:
- We do not pattern-match far enough back, in some cases, to fuse the
brz
andicmp
at the Cranelift level, and this is exacerbated by GVN and LICM that hoist icmps earlier in the function;- We do not have combination patterns that recognize when some producers of bools-as-integers will actually define the high bits.
Some combination of more aggressive pattern matching and demanded-bits analysis could improve the codegen in these cases.
sparker-arm commented on issue #4123:
Would preventing the legalizing of
br_icmp
also help here? And possibly even do the opposite and combinebr
andicmp
, when theicmp
has single user?
cfallin commented on issue #4123:
Possibly... I'm a bit torn because our general principle is to decompose ops at the CLIF level, to allow for better optimization (this has certainly been our rule for SIMD for example). I could imagine a case where a
cset
produces a bool hoisted out of a loop, and the loop branches on that rather than a fresh compare, which reduces the loop-carried live set by one register. (This actually makes me think we might eventually want to have a notion of "hoisted for perf reasons, do not re-merge" feed from the mid-end into isel pattern matching, but that's a separate conversation!)An ad-hoc fusing pass is also somewhat brittle; imho it's better to have one place where we reason about macro-op matching (namely isel).
I think we should be able to do OK with better pattern matching, at least to remove the masking to start; but this is still pretty open to investigation!
Last updated: Dec 23 2024 at 12:05 UTC