alexcrichton commented on issue #6876:
This might be good for a rule like on x64 where much of the logic of bitselect can be skipped if the mask is known to have a particular pattern? (e.g. generated by icmp or fcmp)
afonso360 commented on issue #6876:
Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that
vconst_all_ones_or_zeros
pattern into an egraphs rule, since it looks like we should be able to const propagate that
afonso360 edited a comment on issue #6876:
Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that
vconst_all_ones_or_zeros
pattern into an egraphs rule, since it looks like we should be able to const propagate that.
afonso360 edited a comment on issue #6876:
Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that
vconst_all_ones_or_zeros
pattern into an egraphs rule, since it looks like we should be able to const propagate that.Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a
icmp
mask andfcmp
mask.
afonso360 edited a comment on issue #6876:
Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that
vconst_all_ones_or_zeros
pattern into an egraphs rule, since it looks like we should be able to const propagate that.Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a
icmp
mask andfcmp
mask.Edit2: Wait, now I'm confused. We're already sort of skipping the main bitselect lowering by using
vmerge
instead, that merges whole lanes instead of bits. We can't skip generating theicmp
/fcmp
mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?
alexcrichton commented on issue #6876:
Oh this may not be applicable for riscv64, but on x64 at least there's no bitselect instruction so it uses the and/or/not/etc combo by default, but x64 does have lane-based select so if a bitselect's input mask is a comparison then the and/or/not/etc combo can be skipped in favor of a single lane-based instruction. I figured the same might be possible for riscv64 where bitselect could skip the combo of instructions and generate a single
vmerge*
(which I interpreted, perhaps mistakenly, as a lane-based instruction) after the normal instruction generated for the comparison. That way the rules forbitselect
wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly intovmerge
.since we have different ways of generating a icmp mask and fcmp mask.
On x64 I believe this is handled by
bitselect
not actually generating the icmp/fcmp, instead basically callingput_in_reg
for the output of the icmp/fcmp which then will later use other lowering rules to generate the icmp/fcmp.We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?
Right yeah, my thinking was that like on x64 the lowering for
bitselect
wouldn't generate the cmp, only thevmerge
, but because the mask of thebitselect
is only pattern-matched on but we don't use the sub-operands that it would still generate the cmp naturally via other lowering rules.
afonso360 commented on issue #6876:
Oh, okay so it's pretty much the same on RISC-V. We don't have a direct
bitselect
equivalent, we lower toand/or/not
combo as well by default.vmerge
is a lanewise select as you mentioned.That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.
Here's where this diverges.
vmerge
does not support the regular masks that are fed intobitselect
. For ai8x16
we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask fori16x8
, etc...)In the regular
icmp
andfcmp
instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.In this PR
gen_icmp_mask
andgen_fcmp_mask
only generate the small mask, that is compatible withvmerge
. But we never expand it, which is what would happen if we let the regularicmp
andfcmp
instructions get lowered.Does that make sense? I think I might be missing something, sorry if I am.
afonso360 edited a comment on issue #6876:
Oh, okay so it's pretty much the same on RISC-V. We don't have a direct
bitselect
equivalent, we lower toand/or/not
combo as well by default.vmerge
is a lanewise select as you mentioned.That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.
Here's where this diverges.
vmerge
does not support the regular masks that are fed intobitselect
. For ai8x16
we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask fori16x8
, etc...)In the regular
icmp
andfcmp
instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.In this PR
gen_icmp_mask
andgen_fcmp_mask
only generate the small mask, that is compatible withvmerge
. But we never expand it, which is what would happen if we let the regularicmp
andfcmp
instructions get lowered.Does that make sense? I think I might be missing something, sorry if I am.
--
Edit:Yeah I think this is where x64 and RISC-V are different. Reading the spec for the
blend
instruction it looks like it makes a decision based on the MSB on each lane, which allows it to directly use the input forbitselect
, if we know its all the same for the entire lane.On RISC-V our masks are completely different we only use the lowest
n
bits of the entire register (wheren
is the lane count). So they are sort of "compressed" masks. But that makes them incompatible with WASM and Cranelift'sicmp/fcmp
, so we need to pretend that they are the same by expanding them, and that prevents just generatingvmerge
as part ofbitselect
even if know that the input is aicmp
/fcmp
.
jameysharp commented on issue #6876:
A different way to deduplicate these rules would be along the lines of
maybe_uextend
; riscv could use amaybe_bitcast
extractor to have just one bitselect rule for each of icmp and fcmp, instead of two for each.But I'm not sure that would be more broadly useful so I'm not sure it's worth doing. Just a thought in case it helps with something later.
Last updated: Nov 22 2024 at 16:03 UTC