wasmtime / PR #2146 Enhance machinst x64 instruction sele... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #2146 Enhance machinst x64 instruction sele...

Wasmtime GitHub notifications bot (Aug 19 2020 at 14:02):

bnjbvr opened PR #2146 from x64-isel to main:

This implements a few instruction selection optimizations for x64:

optimize select/brz/brnz when the input is a comparison, to avoid the materialization of the flag register. Float comparisons are a bit tricky because of Equal/NotEqual (and other float CC that don't convert to a single int CC), but the rest is pretty simple.

commuting operands of integer arithmetic commutative operations when one operand is an immediate, to avoid materializing the immediate

use more addressing modes for loads/stores. Unfortunately, impact is limited because of the way heap_addr is legalized (an uext64 of the 32-bit operator), preventing the use of e.g. the shift bits in the SIB amode, or folding integer immediates into the offset altogether. Range analysis could help with this...

fix the pinned register hack, which was not effectful; see commit message and patch. I'm not married to the longer name I've used for catching side-effectful instructions.

On embenchen in Spidermonkey, this is between a 3 to 7% speedup in compilation time, and i see generated code runtime speedups up to 25%.

Wasmtime GitHub notifications bot (Aug 19 2020 at 14:02):

bnjbvr requested cfallin for a review on PR #2146.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

Idle thought (not necessary for this PR): we should consolidate these tree-matching helpers among the backends (i.e. with aarch64) as we're doing for the ABI support; there's nothing machine-specific in the function below.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

I think I'd prefer to lift the op != Opcode::GetPinnedReg condition out of this helper and to its callsite; otherwise it's an oddly specific predicate to have as a general-purpose helper, IMHO.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

small_cst => small_constant for clarity?

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

code-style nit, but match swap_operands { FcmpOperands::Swap => ..., FcmpOperands::DontSwap => ... } would be a bit more idiomatic I think.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

This helper works but will call ctx.get_input() multiple times; could we factor this a bit more so that we pass one LowerInput around? Maybe lowerinput_to_*() and then wrap these with input_to_*()?

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

match input_to_imm(...) { Some(shift_amt) if shift_amt <= 3 => Some(...), _ => None }

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

I'm not sure exactly what the right factoring is, but there seems to be a lot of duplication across all the points at which an fcmp / icmp is lowered. I think it'd be much better to centralize the common bits (e.g. the match on FloatCC below) into a helper.

At a high level, it seems what you want is, for any input that's a bool, "lower this somehow to the machine's condition flags and give me one or more conditions that indicate true". This would allow you to factor the input side (icmp or fcmp) and output side (csel for int or float, cond branch) into separate pieces.

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

We should clarify an important rule/invariant here: hte target of the branch must also be a successor of the actual terminator (otherwise the CFG is incorrect, which would confuse the RA). Also perhaps we should note when this instruction might be used (sequence of multiple cond branches deriving form one comparison) and pointing to an example (fcmp lowering).

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

Comment here for why the reg may not be virtual: we may directly return a machine register when we know that register holds the result of an opcode (by definition).

Wasmtime GitHub notifications bot (Aug 20 2020 at 00:14):

cfallin created PR Review Comment:

Why only size == 1? Actually, I'm not sure I understand the need for any sort of extension: surely the output of a select, having the same type as the inputs, has the same property of "upper bits don't matter"? So we can always do a full-register cmove, no matter the type of value held by the register.

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:48):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:49):

bnjbvr created PR Review Comment:

I'm ok to do this, but since the helper is unused in other contexts that don't want to check for get_pinned_reg, this would be over-generalizing the helper without any clear benefits. Should we instead move this helper to the machinst directory, and rename it has_lowering_side_effect or something like this instead?

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:55):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:55):

bnjbvr created PR Review Comment:

Yeah, I've thought about this and realized that such an helper would have to return a condition code, or a conjunction (AND, for Equal), or disjunction (OR, for NotEqual) of those. And then, if you look at select, we can't handle properly the result for Equal, so we'd have to use something else. This is one of these cases where the code duplication seemed better than having overly complicated helpers that would have to shoehorn many edge cases. I'll try something...

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:57):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 15:57):

bnjbvr created PR Review Comment:

Because cmove can handle 16, 32 and 64-bits operands, but no 8-bits operands. But you're probably right about upper bits not mattering here.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:43):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:43):

cfallin created PR Review Comment:

Hmm, yeah, that sounds fine. I think the thing that stuck out the most to me was the the function name essentially had the same complexity as its body :-) "Lowering side effect" is much better.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:46):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:46):

cfallin created PR Review Comment:

Hmm, you would definitely have a better handle on this than I would -- it just seemed to want some factoring somewhere. I'm OK with landing this bit as-is, for what it's worth, if nothing easy occurs to us; we can always come back and refactor later.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:47):

bjorn3 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 20 2020 at 18:47):

bjorn3 created PR Review Comment:

You can pretend that it is a 32bit int, as all bits above the size of the value are undefined, right? Also doing 32bit cmov for 16bit ints would safe a prefix byte.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:28):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:28):

julian-seward1 created PR Review Comment:

+1 for Chris' initial comment (produce a sequence of insns, the last one of which sets the condition codes somehow, and return the required test (Z, NZ, C, NB, etc) as the result of the lowering. However you refactor this, it might be helpful to have a design that makes it easy to take advantage of the fact that most integer insns set the condition codes, at least the Z and S flags, in some useful way. So for example then you could do the transformations

(x ^ y) == 0 --> xor %x, %y ; jz ..

(x << 1) != 0 --> shl $1, %x ; jnz ..

(x + y) < 0 --> add %x, %y ; js .. [because S is the top bit of the result]

(x & $0b1001) != 0 --> tst $9, %x ; jnz .. [important for efficient bitfield tests]

GCC and LLVM both do this kind of trickery, nowadays quite extensively.

That would seem to be easy(er) if the only thing the helper returns is the final condition code -- in particular, that the helper doesn't assume that the next instruction will be a cmp. Also [maybe obvious] once you have such a helper, you can use it also to generate the insns + test for a CLIF conditional move (select?) and simply emit a cmov<test> rather than j<test>.

Maybe having different helpers for conditions derived from FP and non-FP inputs might be clearer? The set of tricks you can do in the integer-input case seems quite different from the FP-input case.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:29):

julian-seward1 edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:30):

julian-seward1 edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:46):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:46):

julian-seward1 created PR Review Comment:

Not to mention it would avoid a potential partial register use stall. Basically 16-bit integer insns of any kind are bad. +1 for @bjorn3's proposal of doing the cmov at 32 bits even for the 8- and 16-bit sizes.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:51):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:51):

julian-seward1 created PR Review Comment:

Clearer might be lower_to_amode ?

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:59):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 05:59):

julian-seward1 created PR Review Comment:

All of the above assuming of course that neither operand is in memory. If that is the case then we'd have to do it at the specified size else risk having unexpected traps at page boundaries.

Wasmtime GitHub notifications bot (Aug 21 2020 at 06:26):

bjorn3 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 06:26):

bjorn3 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 06:26):

bjorn3 created PR Review Comment:

Maybe also add the expected index of the return param used for input?

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:27):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:27):

bnjbvr created PR Review Comment:

Good idea, added to todo list.

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:51):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:51):

bnjbvr created PR Review Comment:

Good point, added a note!

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:52):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 09:52):

bnjbvr created PR Review Comment:

I'm not sure what do you mean by this?

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:30):

bjorn3 submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:30):

bjorn3 created PR Review Comment:

You are currently ignoring the return param index part of inputs.inst. It would be better to check it against a value given by the caller and return None if they don't match.

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:40):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:40):

bnjbvr created PR Review Comment:

Ah, great point; the aarch64 backend suffers from the same issue. I'll fix it when refactoring helpers.

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:40):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 10:40):

bnjbvr created PR Review Comment:

Nice, it reduces the amount of code even for regular select.

Wasmtime GitHub notifications bot (Aug 21 2020 at 13:41):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 13:41):

bnjbvr created PR Review Comment:

Actually, there might be a bug somewhere else, because i'm seeing a Icmpif with two results, and this seems to cause failures. I wonder if sometimes a result is replaced and the result index isn't getting updated correctly. Saving this for another yak shaving episode.

Wasmtime GitHub notifications bot (Aug 21 2020 at 13:59):

bnjbvr submitted PR Review.

Wasmtime GitHub notifications bot (Aug 21 2020 at 13:59):

bnjbvr created PR Review Comment:

Found a way to factor this out that's not too terrible, and also implemented it correctly for Trapff lowering (probably not tested through Spidermonkey). Thanks for insisting on this!

Wasmtime GitHub notifications bot (Aug 21 2020 at 13:59):

bnjbvr updated PR #2146 from x64-isel to main:

This implements a few instruction selection optimizations for x64:

optimize select/brz/brnz when the input is a comparison, to avoid the materialization of the flag register. Float comparisons are a bit tricky because of Equal/NotEqual (and other float CC that don't convert to a single int CC), but the rest is pretty simple.

commuting operands of integer arithmetic commutative operations when one operand is an immediate, to avoid materializing the immediate

use more addressing modes for loads/stores. Unfortunately, impact is limited because of the way heap_addr is legalized (an uext64 of the 32-bit operator), preventing the use of e.g. the shift bits in the SIB amode, or folding integer immediates into the offset altogether. Range analysis could help with this...

fix the pinned register hack, which was not effectful; see commit message and patch. I'm not married to the longer name I've used for catching side-effectful instructions.

On embenchen in Spidermonkey, this is between a 3 to 7% speedup in compilation time, and i see generated code runtime speedups up to 25%.

Wasmtime GitHub notifications bot (Aug 21 2020 at 14:02):

bnjbvr requested cfallin for a review on PR #2146.

Wasmtime GitHub notifications bot (Aug 21 2020 at 16:17):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (Aug 24 2020 at 14:21):

bnjbvr updated PR #2146 from x64-isel to main:

This implements a few instruction selection optimizations for x64:

optimize select/brz/brnz when the input is a comparison, to avoid the materialization of the flag register. Float comparisons are a bit tricky because of Equal/NotEqual (and other float CC that don't convert to a single int CC), but the rest is pretty simple.

commuting operands of integer arithmetic commutative operations when one operand is an immediate, to avoid materializing the immediate

use more addressing modes for loads/stores. Unfortunately, impact is limited because of the way heap_addr is legalized (an uext64 of the 32-bit operator), preventing the use of e.g. the shift bits in the SIB amode, or folding integer immediates into the offset altogether. Range analysis could help with this...

fix the pinned register hack, which was not effectful; see commit message and patch. I'm not married to the longer name I've used for catching side-effectful instructions.

On embenchen in Spidermonkey, this is between a 3 to 7% speedup in compilation time, and i see generated code runtime speedups up to 25%.

Wasmtime GitHub notifications bot (Aug 24 2020 at 14:21):

bnjbvr updated PR #2146 from x64-isel to main:

This implements a few instruction selection optimizations for x64:

optimize select/brz/brnz when the input is a comparison, to avoid the materialization of the flag register. Float comparisons are a bit tricky because of Equal/NotEqual (and other float CC that don't convert to a single int CC), but the rest is pretty simple.

commuting operands of integer arithmetic commutative operations when one operand is an immediate, to avoid materializing the immediate

use more addressing modes for loads/stores. Unfortunately, impact is limited because of the way heap_addr is legalized (an uext64 of the 32-bit operator), preventing the use of e.g. the shift bits in the SIB amode, or folding integer immediates into the offset altogether. Range analysis could help with this...

fix the pinned register hack, which was not effectful; see commit message and patch. I'm not married to the longer name I've used for catching side-effectful instructions.

On embenchen in Spidermonkey, this is between a 3 to 7% speedup in compilation time, and i see generated code runtime speedups up to 25%.

Wasmtime GitHub notifications bot (Aug 24 2020 at 15:00):

bnjbvr merged PR #2146.

Last updated: Apr 18 2025 at 16:03 UTC