abrown opened issue #3045:
Feature
As discussed in several places (e.g. here and here), several Wasm SIMD instructions can be translated to multiple, simpler CLIF instructions and then pattern-matched in the backend lowering.
Benefit
Removing specialized CLIF instructions would make things simpler for users of the builder API and somewhat simpler for maintainers. I feel this feature is more about making the CLIF ISA "nicer," which is a bit relative, but not unreasonable. The risk is that it actually complicates things ("one CLIF instruction per Wasm instruction" is also "nice") or that it slows down lowering.
Implementation
This would involve:
- reviewing CLIF SIMD instructions to identify specialized instructions that can be replaced by pattern-matching: e.g., any instruction that involves widening/narrowing, conversion from signed to unsigned or vice versa, pairwise or low/high access, conversion from to from floats, etc.
- replace those instructions in the lowerings with the pattern-matched version
- remove unused instructions
Alternatives
Leave things as-is: some mixture of the "specialized CLIF instruction for a Wasm SIMD instruction" paradigm along with some of the "pattern-match several simpler CLIF instructions" paradigm.
abrown commented on issue #3045:
The more I think about this, the more ambivalent I feel about whether it should be done or not: I think I remember talking to @sunfishcode early on and the plan was to just go ahead and add a CLIF instruction for every Wasm SIMD instruction. So perhaps the first step would be to decide whether this is the way to go. IIRC, the following people might be likely to encourage this approach: @akirilov-arm, @julian-seward1, @cfallin?
cfallin commented on issue #3045:
This is an interesting design question -- thank you for opening the issue @abrown! I agree that we've discussed it in ad-hoc ways and danced around the issue for a while, and it would benefit us now to decide on a specific approach and stick to it.
Personally, I'm of the general opinion (as you allude to) that a simpler IR ISA tends to be better, but within reason. That is, if we can break down O(n^2) "A+B opcodes" into O(n) A opcodes and B opcodes, and then pattern-match some (maybe sparse) subset of the O(n^2) space back into specialized machine instructions, that seems to be a win in terms of complexity. But as the dimensionality increases, i.e. more ops combined into one, the tradeoff changes and there is more value in naming the specific bundle of ops and handling it as a unit.
As I see it, the advantages of a "break down into fundamental ops" policy are:
- Simpler to understand the set of opcodes. A person targeting CLIF will be less overwhelmed and more able to generate appropriate code if they see a smaller set of primitives. That is, this avoids the cost of having an IR where one needs to know to use exactly the right intrinsic to get optimal performance.
- Simpler to develop alternative executors, and analyses, for IR. if someone wants to write an interpreter (other than ours), or a translator to some other IR, or write an analysis that needs to understand the effects of opcodes, it is easier to do this with a smaller set of opcodes, and there are less likely to be corner-case bugs. Especially as opcode semantics become very specialized, we lock ourselves into existing implementations more and more.
- Easier to get started in developing a new backend.
On the other hand, the clear advantages of "one Wasm-SIMD op remains one CLIF op" are:
- Less lowering overhead (we already have the unit recognized as such when it enters the system; no need to deconstruct. and reverse engineer it)
- Less actual implementation complexity when one knows exactly how to lower a particular Wasm op to a particular sequence of machine instructions.
One could summarize this as "the concept exists as a unit, so name it and treat it as a unit".
In some ideal future world, we might consider having a "multi-level IR" where we could have both the high-level concept and one particular lowering into fundamental ops. In some sense, this is what legalization did -- it had fallback rules that would break down more complex ops into simpler ones only when the more complex ops didn't have a direct encoding. In the hopefully near-to-mid-term future, I'll have some infrastructure built to let us do this again, in a more efficient way (statically rather than in-place edits dynamically). In the meantime, though, we need to decide one or the other.
So I might suggest a rule that balances the two: any combo-op that can be expressed as two existing opcodes should be broken down as such -- this avoids the explosion of opcodes for all of the "load+op", "extend+op", and similar cases -- while more complex behaviors that enter as a unit (one Wasm instruction), and for which we have a specific lowering in mind that differs from the naive sum-of-parts, can be reified as a single CLIF opcode. This also feels more acceptable to me when the opcodes are "arithmetic" in nature (i.e., pure, no loads/stores) because then analyses need not have special cases and can just ignore them for many purposes. (This was one reason I was happy to see
LoadSplat
split back into a load + separate op.)Does that seem reasonable? I'm especially interested in any thoughts @sunfishcode might have from his experience here.
cfallin edited a comment on issue #3045:
This is an interesting design question -- thank you for opening the issue @abrown! I agree that we've discussed it in ad-hoc ways and danced around the issue for a while, and it would benefit us now to decide on a specific approach and stick to it.
Personally, I'm of the general opinion (as you allude to) that a simpler IR ISA tends to be better, but within reason. That is, if we can break down O(n^2) "A+B opcodes" into O(n) A opcodes and B opcodes, and then pattern-match some (maybe sparse) subset of the O(n^2) space back into specialized machine instructions, that seems to be a win in terms of complexity. But as the dimensionality increases, i.e. more ops combined into one, the tradeoff changes and there is more value in naming the specific bundle of ops and handling it as a unit.
As I see it, the advantages of a "break down into fundamental ops" policy are:
- Simpler to understand the set of opcodes. A person targeting CLIF will be less overwhelmed and more able to generate appropriate code if they see a smaller set of primitives. That is, this avoids the cost of having an IR where one needs to know to use exactly the right intrinsic to get optimal performance.
- Simpler to develop alternative executors, and analyses, for IR. if someone wants to write an interpreter (other than ours), or a translator to some other IR, or write an analysis that needs to understand the effects of opcodes, it is easier to do this with a smaller set of opcodes, and there are less likely to be corner-case bugs. Especially as opcode semantics become very specialized, we lock ourselves into existing implementations more and more.
- Easier to get started in developing a new backend.
On the other hand, the clear advantages of "one Wasm-SIMD op remains one CLIF op" are:
- Less lowering overhead (we already have the unit recognized as such when it enters the system; no need to deconstruct and reverse engineer it)
- Less actual implementation complexity when one knows exactly how to lower a particular Wasm op to a particular sequence of machine instructions.
One could summarize this as "the concept exists as a unit, so name it and treat it as a unit".
In some ideal future world, we might consider having a "multi-level IR" where we could have both the high-level concept and one particular lowering into fundamental ops. In some sense, this is what legalization did -- it had fallback rules that would break down more complex ops into simpler ones only when the more complex ops didn't have a direct encoding. In the hopefully near-to-mid-term future, I'll have some infrastructure built to let us do this again, in a more efficient way (statically rather than in-place edits dynamically). In the meantime, though, we need to decide one or the other.
So I might suggest a rule that balances the two: any combo-op that can be expressed as two existing opcodes should be broken down as such -- this avoids the explosion of opcodes for all of the "load+op", "extend+op", and similar cases -- while more complex behaviors that enter as a unit (one Wasm instruction), and for which we have a specific lowering in mind that differs from the naive sum-of-parts, can be reified as a single CLIF opcode. This also feels more acceptable to me when the opcodes are "arithmetic" in nature (i.e., pure, no loads/stores, no traps) because then analyses need not have special cases and can just ignore them for many purposes. (This was one reason I was happy to see
LoadSplat
split back into a load + separate op.)Does that seem reasonable? I'm especially interested in any thoughts @sunfishcode might have from his experience here.
abrown commented on issue #3045:
Seems reasonable to me; your additional detail on the pros and cons is good.
jlb6740 commented on issue #3045:
"Simpler to understand the set of opcodes. A person targeting CLIF will be less overwhelmed and more able to generate appropriate code if they see a smaller set of primitives. That is, this avoids the cost of having an IR where one needs to know to use exactly the right intrinsic to get optimal performance."
@cfallin By the point above by simpler I assume you mean the CLIF instruction are more primitive, but what do you mean by "avoids an IR where one needs to know to use exactly the right intrinsic to get optimal performance" given the CLI F is a IR target that is expected to be later lowered?
I have always favored the simplicity and reduced overhead and complexity of having one CLIF instruction map directly to one WASM instruction if in fact CLIF is purpose to be a target for WASM. However, as has been pointed out in the other issues referenced this has been discussed before. In very clear cases I think your final suggestion sounds reasonable, especially considering those WASM instructions that themselves were explained and introduced as being the combo of two or more WASM instructions.
cfallin commented on issue #3045:
@jlb6740 in the above, I was thinking of the case where CLIF is the target of some other backend or code generator (e.g.: someone trying to use Cranelift to generate fast JIT'd numeric code). Simplicity also matters when it's the source but the IR is processed by code other than ours; e.g. someone writes an analysis that evaluates security taintedness, or value ranges, or numeric precision, or ..., and needs to understand semantics to some degree. Basically, there is a cost (which may be worth paying in many cases, to be clear!) to adding new opcodes that is mainly paid when we try to move beyond the current "Wasm -> CLIF -> new backends" pipeline.
One thing that I'm hearing, though, and that definitely has value, is the emphasis that there should be some level at which Wasm semantics specifically are 1-to-1-mapped; so when I get the new isel backend-generator online and have the rule-based design working, we will be in a place where we could have Wasm-opcode-specific CLIF opcodes, but in a separate "stratum" of sorts that is always broken down by another rule unless there's a specific match in the consumer. Basically what I'm saying is I think that this tension is temporary and I'll work as fast as I can to build the thing that fixes it :-)
jlb6740 edited a comment on issue #3045:
"Simpler to understand the set of opcodes. A person targeting CLIF will be less overwhelmed and more able to generate appropriate code if they see a smaller set of primitives. That is, this avoids the cost of having an IR where one needs to know to use exactly the right intrinsic to get optimal performance."
@cfallin By the point above by simpler I assume you mean the CLIF instruction are more primitive, but what do you mean by "avoids an IR where one needs to know to use exactly the right intrinsic to get optimal performance" given the CLI F is a IR target that is expected to be later lowered?
I have always favored the simplicity and reduced overhead and complexity of having one CLIF instruction map directly to one WASM instruction if in fact CLIF is purpose to be a target for WASM. However, as has been pointed out in the other issues referenced, this has been discussed before. In very clear cases I think your final suggestion sounds reasonable, especially considering those WASM instructions that themselves were explained and introduced as being the combo of two or more WASM instructions.
sparker-arm commented on issue #3045:
I have my own two versions of extend pairwise add support for arm64, and my knee jerk reaction is to favour a reduced IR to make mid-end work easier... But is it intended that cranelift will have many mid-end components to make the burden on the multiple backends really worth it? I feel everyone doing their own 'extend + arith' pattern matching, as well as being tedious, is equally likely to produce bugs than the benefits of a more simple IR. Also, is instruction selection performed at the basic block level, or is it global? If it's performed locally, then we also run the risk of losing the opportunity to recombine operations in the backends if instructions are moved around and this could cripple SIMD performance.
sparker-arm commented on issue #3045:
I'd follow up by saying that I'm currently implemented ext mul operations too, and I have a sneakingly feeling that using the intermediate extensions are causing some form of type legalization/expansion trouble which is making 'correct' isel difficult for i32x4 inputs.
sparker-arm edited a comment on issue #3045:
I have my own two versions of extend pairwise add support for arm64, and my knee jerk reaction is to favour a reduced IR to make mid-end work easier... But is it intended that cranelift will have many mid-end components to make the burden on the multiple backends really worth it? I feel everyone doing their own 'extend + arith' pattern matching, as well as being tedious, is equally likely to produce bugs than the benefits of a more simple IR. Also, is instruction selection performed at the basic block level, or is it global? If it's performed locally, then we also run the risk of losing the opportunity to recombine operations in the backends if instructions are moved around and this could cripple SIMD performance.
There is the option of having a 1-1 mapping in the IR and also a target independent layer that can perform expansion into simpler instructions if the backend requests it.
cfallin commented on issue #3045:
is instruction selection performed at the basic block level, or is it global?
It's global, at least for pure opcodes (we use a coloring scheme to prevent illegal movement of effectful ops, and color changes on every BB boundary for simplicity). In this case (SIMD ops) I think it should still work even if e.g. LICM lifts half the code out of a loop or something like that.
There is the option of having a 1-1 mapping in the IR and also a target independent layer that can perform expansion into simpler instructions if the backend requests it.
Yep, that's basically what I want to build :-)
akirilov-arm commented on issue #3045:
... any combo-op that can be expressed as two existing opcodes should be broken down as such -- this avoids the explosion of opcodes for all of the "load+op", "extend+op", and similar cases -- while more complex behaviors that enter as a unit (one Wasm instruction), _and_ for which we have a specific lowering in mind that differs from the naive sum-of-parts, can be reified as a single CLIF opcode. [...] Does that seem reasonable?
Yes, it does, in fact that was the rule I pretty much followed with pull requests #3035 and #3044, the former concerning a more complicated Wasm instruction that makes sense to map 1:1 to an IR operation, while the latter relating to instructions that have simple decompositions into existing IR operations.
I don't have anything to add to the quite exhaustive lists of pros and cons above except specific examples - the extended integer multiplication instructions are an interesting one because both the AArch64 and x64 backends already have everything they need to implement them (that's basic support, not optimal code generation), so only changes to the translator from Wasm to CLIF are truly necessary. Also, the very recent fuzzer work tracked in issue #3050 (which covers SIMD operations as well) seems to benefit from a somewhat smaller number of operations to deal with.
akirilov-arm labeled issue #3045:
Feature
As discussed in several places (e.g. here and here), several Wasm SIMD instructions can be translated to multiple, simpler CLIF instructions and then pattern-matched in the backend lowering.
Benefit
Removing specialized CLIF instructions would make things simpler for users of the builder API and somewhat simpler for maintainers. I feel this feature is more about making the CLIF ISA "nicer," which is a bit relative, but not unreasonable. The risk is that it actually complicates things ("one CLIF instruction per Wasm instruction" is also "nice") or that it slows down lowering.
Implementation
This would involve:
- reviewing CLIF SIMD instructions to identify specialized instructions that can be replaced by pattern-matching: e.g., any instruction that involves widening/narrowing, conversion from signed to unsigned or vice versa, pairwise or low/high access, conversion from to from floats, etc.
- replace those instructions in the lowerings with the pattern-matched version
- remove unused instructions
Alternatives
Leave things as-is: some mixture of the "specialized CLIF instruction for a Wasm SIMD instruction" paradigm along with some of the "pattern-match several simpler CLIF instructions" paradigm.
cfallin labeled issue #3045:
Feature
As discussed in several places (e.g. here and here), several Wasm SIMD instructions can be translated to multiple, simpler CLIF instructions and then pattern-matched in the backend lowering.
Benefit
Removing specialized CLIF instructions would make things simpler for users of the builder API and somewhat simpler for maintainers. I feel this feature is more about making the CLIF ISA "nicer," which is a bit relative, but not unreasonable. The risk is that it actually complicates things ("one CLIF instruction per Wasm instruction" is also "nice") or that it slows down lowering.
Implementation
This would involve:
- reviewing CLIF SIMD instructions to identify specialized instructions that can be replaced by pattern-matching: e.g., any instruction that involves widening/narrowing, conversion from signed to unsigned or vice versa, pairwise or low/high access, conversion from to from floats, etc.
- replace those instructions in the lowerings with the pattern-matched version
- remove unused instructions
Alternatives
Leave things as-is: some mixture of the "specialized CLIF instruction for a Wasm SIMD instruction" paradigm along with some of the "pattern-match several simpler CLIF instructions" paradigm.
cfallin labeled issue #3045:
Feature
As discussed in several places (e.g. here and here), several Wasm SIMD instructions can be translated to multiple, simpler CLIF instructions and then pattern-matched in the backend lowering.
Benefit
Removing specialized CLIF instructions would make things simpler for users of the builder API and somewhat simpler for maintainers. I feel this feature is more about making the CLIF ISA "nicer," which is a bit relative, but not unreasonable. The risk is that it actually complicates things ("one CLIF instruction per Wasm instruction" is also "nice") or that it slows down lowering.
Implementation
This would involve:
- reviewing CLIF SIMD instructions to identify specialized instructions that can be replaced by pattern-matching: e.g., any instruction that involves widening/narrowing, conversion from signed to unsigned or vice versa, pairwise or low/high access, conversion from to from floats, etc.
- replace those instructions in the lowerings with the pattern-matched version
- remove unused instructions
Alternatives
Leave things as-is: some mixture of the "specialized CLIF instruction for a Wasm SIMD instruction" paradigm along with some of the "pattern-match several simpler CLIF instructions" paradigm.
alexcrichton commented on issue #3045:
AFAIK this largely ended up as-is in the issue description where there are not a lot of SIMD ops specialized just for wasm simd, but rather most opcodes are pretty general and many wasm simd instructions are broken down into their components. In that sense I feel like the design here has (since the time of this opening) settled on "let's stick with simple ops" and it's been working out well-enough, so is it perhaps time to close this?
abrown closed issue #3045:
Feature
As discussed in several places (e.g. here and here), several Wasm SIMD instructions can be translated to multiple, simpler CLIF instructions and then pattern-matched in the backend lowering.
Benefit
Removing specialized CLIF instructions would make things simpler for users of the builder API and somewhat simpler for maintainers. I feel this feature is more about making the CLIF ISA "nicer," which is a bit relative, but not unreasonable. The risk is that it actually complicates things ("one CLIF instruction per Wasm instruction" is also "nice") or that it slows down lowering.
Implementation
This would involve:
- reviewing CLIF SIMD instructions to identify specialized instructions that can be replaced by pattern-matching: e.g., any instruction that involves widening/narrowing, conversion from signed to unsigned or vice versa, pairwise or low/high access, conversion from to from floats, etc.
- replace those instructions in the lowerings with the pattern-matched version
- remove unused instructions
Alternatives
Leave things as-is: some mixture of the "specialized CLIF instruction for a Wasm SIMD instruction" paradigm along with some of the "pattern-match several simpler CLIF instructions" paradigm.
Last updated: Dec 23 2024 at 13:07 UTC