Stream: git-wasmtime

Topic: wasmtime / issue #6876 riscv64: Optimize `bitselect+{i,f}...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 14:22):

alexcrichton commented on issue #6876:

This might be good for a rule like on x64 where much of the logic of bitselect can be skipped if the mask is known to have a particular pattern? (e.g. generated by icmp or fcmp)

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 14:50):

afonso360 commented on issue #6876:

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 14:50):

afonso360 edited a comment on issue #6876:

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 14:52):

afonso360 edited a comment on issue #6876:

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that.

Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a icmp mask and fcmp mask.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 14:58):

afonso360 edited a comment on issue #6876:

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that.

Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a icmp mask and fcmp mask.

Edit2: Wait, now I'm confused. We're already sort of skipping the main bitselect lowering by using vmerge instead, that merges whole lanes instead of bits. We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 15:17):

alexcrichton commented on issue #6876:

Oh this may not be applicable for riscv64, but on x64 at least there's no bitselect instruction so it uses the and/or/not/etc combo by default, but x64 does have lane-based select so if a bitselect's input mask is a comparison then the and/or/not/etc combo can be skipped in favor of a single lane-based instruction. I figured the same might be possible for riscv64 where bitselect could skip the combo of instructions and generate a single vmerge* (which I interpreted, perhaps mistakenly, as a lane-based instruction) after the normal instruction generated for the comparison. That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

since we have different ways of generating a icmp mask and fcmp mask.

On x64 I believe this is handled by bitselect not actually generating the icmp/fcmp, instead basically calling put_in_reg for the output of the icmp/fcmp which then will later use other lowering rules to generate the icmp/fcmp.

We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

Right yeah, my thinking was that like on x64 the lowering for bitselect wouldn't generate the cmp, only the vmerge, but because the mask of the bitselect is only pattern-matched on but we don't use the sub-operands that it would still generate the cmp naturally via other lowering rules.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 16:19):

afonso360 commented on issue #6876:

Oh, okay so it's pretty much the same on RISC-V. We don't have a direct bitselect equivalent, we lower to and/or/not combo as well by default. vmerge is a lanewise select as you mentioned.

That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

Here's where this diverges. vmerge does not support the regular masks that are fed into bitselect. For a i8x16 we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask for i16x8, etc...)

In the regular icmp and fcmp instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.

In this PR gen_icmp_mask and gen_fcmp_mask only generate the small mask, that is compatible with vmerge. But we never expand it, which is what would happen if we let the regular icmp and fcmp instructions get lowered.

Does that make sense? I think I might be missing something, sorry if I am.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 16:30):

afonso360 edited a comment on issue #6876:

Oh, okay so it's pretty much the same on RISC-V. We don't have a direct bitselect equivalent, we lower to and/or/not combo as well by default. vmerge is a lanewise select as you mentioned.

That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

Here's where this diverges. vmerge does not support the regular masks that are fed into bitselect. For a i8x16 we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask for i16x8, etc...)

In the regular icmp and fcmp instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.

In this PR gen_icmp_mask and gen_fcmp_mask only generate the small mask, that is compatible with vmerge. But we never expand it, which is what would happen if we let the regular icmp and fcmp instructions get lowered.

Does that make sense? I think I might be missing something, sorry if I am.

--
Edit:

Yeah I think this is where x64 and RISC-V are different. Reading the spec for the blend instruction it looks like it makes a decision based on the MSB on each lane, which allows it to directly use the input for bitselect, if we know its all the same for the entire lane.

On RISC-V our masks are completely different we only use the lowest n bits of the entire register (where n is the lane count). So they are sort of "compressed" masks. But that makes them incompatible with WASM and Cranelift's icmp/fcmp, so we need to pretend that they are the same by expanding them, and that prevents just generating vmerge as part of bitselect even if know that the input is a icmp/fcmp.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 22 2023 at 17:21):

jameysharp commented on issue #6876:

A different way to deduplicate these rules would be along the lines of maybe_uextend; riscv could use a maybe_bitcast extractor to have just one bitselect rule for each of icmp and fcmp, instead of two for each.

But I'm not sure that would be more broadly useful so I'm not sure it's worth doing. Just a thought in case it helps with something later.


Last updated: Dec 23 2024 at 12:05 UTC