bnjbvr opened PR #2146 from x64-isel to main:
This implements a few instruction selection optimizations for x64:
- optimize select/brz/brnz when the input is a comparison, to avoid materializing the comparison result into a register. Float comparisons are a bit tricky because of Equal/NotEqual (and other float CCs that don't convert to a single int CC), but the rest is pretty simple.
- commute the operands of commutative integer arithmetic operations when one operand is an immediate, to avoid materializing the immediate
- use more addressing modes for loads/stores. Unfortunately, the impact is limited because of the way heap_addr is legalized (a uext64 of the 32-bit operator), preventing the use of e.g. the shift bits in the SIB amode, or folding integer immediates into the offset altogether. Range analysis could help with this...
- fix the pinned register hack, which was not taking effect; see the commit message and patch. I'm not married to the longer name I've used for catching side-effectful instructions.
On embenchen in Spidermonkey, this gives a 3 to 7% speedup in compilation time, and I see generated-code runtime speedups of up to 25%.
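The operand-commutation item above can be sketched with a tiny standalone model (hypothetical types, not the actual Cranelift lowering API): for a commutative op such as iadd, swapping so the constant ends up on the right lets x64 encode it as an immediate instead of first materializing it into a register.

```rust
// Hypothetical, simplified model (not the actual Cranelift API): for a
// commutative op like `iadd`, put the constant on the right so the x64
// lowering can encode it as an immediate instead of loading it into a
// register first.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    Reg(u32),
    Imm(i64),
}

// If the left operand is an immediate and the right is a register, swap.
fn commute_for_imm(lhs: Operand, rhs: Operand) -> (Operand, Operand) {
    match (lhs, rhs) {
        (Operand::Imm(_), Operand::Reg(_)) => (rhs, lhs),
        _ => (lhs, rhs),
    }
}

fn main() {
    // `42 + v3` becomes `v3 + 42`, which can fold into `add r3, $42`.
    assert_eq!(
        commute_for_imm(Operand::Imm(42), Operand::Reg(3)),
        (Operand::Reg(3), Operand::Imm(42))
    );
    // Already in the right shape: left alone.
    assert_eq!(
        commute_for_imm(Operand::Reg(1), Operand::Imm(5)),
        (Operand::Reg(1), Operand::Imm(5))
    );
    println!("ok");
}
```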
bnjbvr requested cfallin for a review on PR #2146.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Idle thought (not necessary for this PR): we should consolidate these tree-matching helpers among the backends (i.e. with aarch64) as we're doing for the ABI support; there's nothing machine-specific in the function below.
cfallin created PR Review Comment:
I think I'd prefer to lift the `op != Opcode::GetPinnedReg` condition out of this helper and to its callsite; otherwise it's an oddly specific predicate to have as a general-purpose helper, IMHO.
cfallin created PR Review Comment:
`small_cst` => `small_constant` for clarity?
cfallin created PR Review Comment:
code-style nit, but `match swap_operands { FcmpOperands::Swap => ..., FcmpOperands::DontSwap => ... }` would be a bit more idiomatic I think.
cfallin created PR Review Comment:
This helper works but will call `ctx.get_input()` multiple times; could we factor this a bit more so that we pass one `LowerInput` around? Maybe `lowerinput_to_*()` and then wrap these with `input_to_*()`?
cfallin created PR Review Comment:
`match input_to_imm(...) { Some(shift_amt) if shift_amt <= 3 => Some(...), _ => None }`
cfallin created PR Review Comment:
I'm not sure exactly what the right factoring is, but there seems to be a lot of duplication across all the points at which an fcmp / icmp is lowered. I think it'd be much better to centralize the common bits (e.g. the match on `FloatCC` below) into a helper. At a high level, it seems what you want is, for any input that's a bool, "lower this somehow to the machine's condition flags and give me one or more conditions that indicate true". This would allow you to factor the input side (icmp or fcmp) and output side (csel for int or float, cond branch) into separate pieces.
cfallin created PR Review Comment:
We should clarify an important rule/invariant here: the target of the branch must also be a successor of the actual terminator (otherwise the CFG is incorrect, which would confuse the RA). Also perhaps we should note when this instruction might be used (a sequence of multiple cond branches deriving from one comparison) and point to an example (fcmp lowering).
cfallin created PR Review Comment:
Comment here for why the reg may not be virtual: we may directly return a machine register when we know that register holds the result of an opcode (by definition).
cfallin created PR Review Comment:
Why only `size == 1`? Actually, I'm not sure I understand the need for any sort of extension: surely the output of a select, having the same type as the inputs, has the same property of "upper bits don't matter"? So we can always do a full-register cmove, no matter the type of value held by the register.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
I'm ok to do this, but since the helper is unused in other contexts that don't want to check for get_pinned_reg, this would be over-generalizing the helper without any clear benefit. Should we instead move this helper to the `machinst` directory, and rename it `has_lowering_side_effect` or something like this instead?
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Yeah, I've thought about this and realized that such a helper would have to return a condition code, or a conjunction (AND, for Equal) or disjunction (OR, for NotEqual) of those. And then, if you look at select, we can't properly handle the result for `Equal`, so we'd have to use something else. This is one of those cases where code duplication seemed better than overly complicated helpers that would have to shoehorn in many edge cases. I'll try something...
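One way to picture the return type such a helper might need (a sketch with made-up names, not Cranelift's actual code): on x64, `ucomiss`/`ucomisd` set ZF, PF and CF, and the unordered result sets all three to 1, so Equal must test "ZF=1 AND PF=0" and NotEqual "ZF=0 OR PF=1", while e.g. ordered greater-than fits a single condition code.

```rust
// Hypothetical sketch of the helper's result type: most FloatCC values map
// to one x86 condition code after ucomiss/ucomisd, but Equal needs a
// conjunction and NotEqual a disjunction, because the unordered case sets
// ZF, PF and CF all to 1.
#[derive(Debug, PartialEq)]
enum FlagTest {
    One(&'static str),
    Both(&'static str, &'static str),   // conjunction: both must hold
    Either(&'static str, &'static str), // disjunction: either suffices
}

fn float_cc_to_flag_test(cc: &str) -> FlagTest {
    match cc {
        // ZF=1 alone also matches unordered, so additionally require PF=0.
        "eq" => FlagTest::Both("z", "np"),
        // Unordered must compare as not-equal, so PF=1 also means "true".
        "ne" => FlagTest::Either("nz", "p"),
        "gt" => FlagTest::One("nbe"), // CF=0 and ZF=0
        "ge" => FlagTest::One("nb"),  // CF=0 (unordered has CF=1, excluded)
        other => panic!("unhandled FloatCC: {}", other),
    }
}

fn main() {
    assert_eq!(float_cc_to_flag_test("eq"), FlagTest::Both("z", "np"));
    assert_eq!(float_cc_to_flag_test("ne"), FlagTest::Either("nz", "p"));
    assert_eq!(float_cc_to_flag_test("gt"), FlagTest::One("nbe"));
    println!("ok");
}
```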
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Because cmove can handle 16-, 32- and 64-bit operands, but not 8-bit operands. But you're probably right about upper bits not mattering here.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, yeah, that sounds fine. I think the thing that stuck out the most to me was that the function name essentially had the same complexity as its body :-) "Lowering side effect" is much better.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, you would definitely have a better handle on this than I would -- it just seemed to want some factoring somewhere. I'm OK with landing this bit as-is, for what it's worth, if nothing easy occurs to us; we can always come back and refactor later.
bjorn3 submitted PR Review.
bjorn3 created PR Review Comment:
You can pretend that it is a 32-bit int, as all bits above the size of the value are undefined, right? Also, doing a 32-bit cmov for 16-bit ints would save a prefix byte.
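A quick sanity check of this point (purely illustrative, not the lowering code): doing the conditional move at 32 bits and then looking only at the low 8 bits gives the same answer as an 8-bit conditional move, regardless of whatever junk sits in the upper bits of the source registers.

```rust
// Model a 32-bit cmov and check that the low 8 bits of its result match
// what an 8-bit cmov would have produced, even with "undefined" upper bits.
fn cmov32(cond: bool, a: u32, b: u32) -> u32 {
    if cond { a } else { b }
}

fn main() {
    // Upper bits are undefined: model that with arbitrary junk.
    let junk_a = 0xdead_be00u32 | 0x42; // low byte 0x42
    let junk_b = 0x1234_5600u32 | 0x7f; // low byte 0x7f
    for &cond in &[true, false] {
        let wide = cmov32(cond, junk_a, junk_b) as u8;
        let narrow = if cond { 0x42u8 } else { 0x7f };
        assert_eq!(wide, narrow);
    }
    println!("ok");
}
```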
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
+1 for Chris' initial comment (produce a sequence of insns, the last one of which sets the condition codes somehow, and return the required test (Z, NZ, C, NB, etc) as the result of the lowering). However you refactor this, it might be helpful to have a design that makes it easy to take advantage of the fact that most integer insns set the condition codes, at least the Z and S flags, in some useful way. So for example then you could do the transformations

```
(x ^ y) == 0       -->  xor %x, %y ; jz ..
(x << 1) != 0      -->  shl $1, %x ; jnz ..
(x + y) < 0        -->  add %x, %y ; js ..   [because S is the top bit of the result]
(x & $0b1001) != 0 -->  tst $9, %x ; jnz ..  [important for efficient bitfield tests]
```

GCC and LLVM both do this kind of trickery, nowadays quite extensively.

That would seem to be easy(er) if the only thing the helper returns is the final condition code -- in particular, that the helper doesn't assume that the next instruction will be a `cmp`. Also [maybe obvious] once you have such a helper, you can use it also to generate the insns + test for a CLIF conditional move (select?) and simply emit a `cmov<test>` rather than a `j<test>`.

Maybe having different helpers for conditions derived from FP and non-FP inputs might be clearer? The set of tricks you can do in the integer-input case seems quite different from the FP-input case.
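Two of these flag-reuse patterns rest on simple identities that are easy to sanity-check (a quick illustration, separate from any actual lowering code): `xor ; jz` works because `(x ^ y) == 0` exactly when `x == y`, and `add ; js` works because the sign flag is the top bit of the wrapping sum.

```rust
// Check the identities behind two of the peephole patterns above.
fn main() {
    let samples = [i32::MIN, -7, -1, 0, 1, 9, i32::MAX];
    for &x in &samples {
        for &y in &samples {
            // xor %x, %y ; jz  -- tests x == y
            assert_eq!((x ^ y) == 0, x == y);
            // add %x, %y ; js  -- S flag is the top bit of the wrapping sum
            let sum = x.wrapping_add(y);
            assert_eq!(sum < 0, (sum as u32) >> 31 == 1);
        }
    }
    println!("ok");
}
```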
julian-seward1 edited PR Review Comment.
julian-seward1 edited PR Review Comment.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Not to mention it would avoid a potential partial register use stall. Basically 16-bit integer insns of any kind are bad. +1 for @bjorn3's proposal of doing the cmov at 32 bits even for the 8- and 16-bit sizes.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Clearer might be `lower_to_amode`?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
All of the above assumes, of course, that neither operand is in memory. If one is, then we'd have to do it at the specified size, or else risk unexpected traps at page boundaries.
bjorn3 submitted PR Review.
bjorn3 submitted PR Review.
bjorn3 created PR Review Comment:
Maybe also add the expected index of the return param used for `input`?
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Good idea, added to todo list.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Good point, added a note!
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
I'm not sure what you mean by this?
bjorn3 submitted PR Review.
bjorn3 created PR Review Comment:
You are currently ignoring the return param index part of `inputs.inst`. It would be better to check it against a value given by the caller and return `None` if they don't match.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Ah, great point; the aarch64 backend suffers from the same issue. I'll fix it when refactoring helpers.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Nice, it reduces the amount of code even for regular select.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Actually, there might be a bug somewhere else, because I'm seeing an `Icmpif` with two results, and this seems to cause failures. I wonder if sometimes a result is replaced and the result index isn't getting updated correctly. Saving this for another yak-shaving episode.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Found a way to factor this out that's not too terrible, and also implemented it correctly for `Trapff` lowering (probably not tested through Spidermonkey). Thanks for insisting on this!
bnjbvr updated PR #2146 from x64-isel to main.
bnjbvr requested cfallin for a review on PR #2146.
cfallin submitted PR Review.
bnjbvr updated PR #2146 from x64-isel to main.
bnjbvr updated PR #2146 from x64-isel to main.
bnjbvr merged PR #2146.
Last updated: Nov 22 2024 at 17:03 UTC