wasmtime / issue #3249 Cranelift: simplify opcode set by ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #3249 Cranelift: simplify opcode set by ...

Wasmtime GitHub notifications bot (Aug 26 2021 at 17:16):

In #3233 we discussed various simplifications we could perform to reduce the number of CLIF opcodes that exist. In general, we should strive for simplicity -- fewer opcodes mean less cost for every consumer of the IR -- as long as simplifications don't impose other overheads, such as unnecessary expansion and pattern-matching.

The "flags"-related opcodes, such as ifcmp / ffcmp, brif / brff, trapif / trapff, and those that communicate carry/borrow flags via the full flags value, seem to be good candidates for removal. This is for at least a few reasons:

A flag value behaves differently than most others in the IR, and as such, imposes complexity cost. For example, only one flag value may be live at a time. This constraint, enforced by the validator, is meant to allow the flags value to be directly lowered onto a machine's CPU flags register, but our backends no longer work this way. For another example, a flag value cannot be stored/loaded to/from memory.

The flags do not have a well-defined value but rather serve as a sort of opaque connection between producer (e.g., a compare) and consumer (e.g., a branch). For example, a bitcast/raw-bitcast from a flag value to an integer is undefined (and disallowed by the validator).

There is redundancy in the instruction set, as some conditions can also be materialized as bools. For example, we have both iadd_carry, which produces the carry flag as a bool, and iadd_ifcout, which produces a flags value with the carry embedded in it.

With better pattern-matching in the backends (both as they stand today and potentially with a DSL to help as proposed in bytecodealliance/rfcs#15), it is less important for the CLIF to directly correspond to machine code; we can pattern-match a producer and consumer (e.g. a compare and a branch) that communicate via a b1 (bool) just as well as we can those that communicate via opaque flags values.

Thus, we should look into removing all uses of "flags" values, instead consuming the flags where they are produced to generate a bool condition as appropriate.

One downside of this approach is that we cannot directly express reuse of flags for more than one condition (e.g. ifcmp then multiple trueifs with multiple flag conditions). Potentially we could instead pattern-match on multiple compare instructions with the same arguments and merge them instead, so e.g. icmp ge v0, v1 and icmp eq v0, v1 become a comopare-setcc-setcc three-instruction sequence.

cc @abrown @afonso360 @bjorn3 from earlier discussion

Wasmtime GitHub notifications bot (Aug 26 2021 at 17:32):

sunfishcode commented on issue #3249:

A consideration here is that there's only one flags register and it's expensive to store and load, at least on x86. With a special flags type, it's easy for things like gvn and licm to "do the right thing" and avoid breaking up compare+branch patterns and inducing flags spills/reloads which are usually much more expensive than computing them redundantly. And, the constraints on the flags type, that at most one value be live at a time, and that it not be live across instructions which might clobber it, mean that the backend doesn't need code for spilling and restoring flags values.

Wasmtime GitHub notifications bot (Aug 26 2021 at 17:44):

cfallin commented on issue #3249:

That's a good point! The approach that we use today is to recompute the compare wherever we need it; so the compare always comes just before its use in the final machine code, by construction and without spills/reloads. I think we see the equivalent of this problem though when we do the pattern-matching, in that if we can't look back far enough (if the compare has been hoisted way up) then we just error out -- so actually, we are relying on the validator's single-flags-live constraint for that too, now that I think about it (with the constraint in place we can never hit the panic).

I was about to propose all-in-one ("compare+branch") instructions -- actually I think we already have bricmp, but we'd want the same for trueif, selectif, trapif, ... -- but then this gives us an unfortunate O(n^2) explosion of all producers * all consumers. Essentially the single-flags-live constraint is a way of joining the two halves in a way that they're factored, but can't stray too far away.

Needs more thought...

Wasmtime GitHub notifications bot (Aug 26 2021 at 17:49):

cfallin commented on issue #3249:

A bit of refinement of the above: if we can't look back far enough to see a bool condition, the worst that happens is that we materialize it (setcc on x86, cset on aarch64) and then use that later. The panic-if-we-can't-look-back-far-enough case specifically is for matching flags producers with consumers (e.g. ifcmp + brif).

So I could see a design where we (i) standardize on bools, (ii) have some notion of "cheap to recompute, expensive to keep value live" that inhibits LICM / GVN on icmp, fcmp, iadd_carry, and friends, and (iii) hit the happy path most of the time in matching the bool producer to consumer. This seems like an OK compromise between IR semantics/simplicity and performance. Thoughts?

Wasmtime GitHub notifications bot (Aug 26 2021 at 18:40):

sunfishcode commented on issue #3249:

At a high level, that sounds reasonable to me :-).

Wasmtime GitHub notifications bot (Sep 01 2021 at 12:59):

sparker-arm commented on issue #3249:

+1 to standardizing on bools, as they're not hard to materialize into a register and it makes the IR much cleaner. Also, with existing nodes for min/max and saturating arithmetic, I think that a lot of important cases where the producer and consumer could be, unfortunately, separated have been negated.

Wasmtime GitHub notifications bot (Sep 21 2021 at 02:08):

chc4 commented on issue #3249:

I'm writing a dynamic recompilation engine targetting Cranelift. For part of that, it requires tracking processor flags for proper lifting, so that e.g. add x, y; jz ... works correctly. This is technically representational in Cranelift (though doesn't currently work, see https://github.com/bytecodealliance/wasmtime/issues/2860#issuecomment-922574641), but wouldn't map to the proposed reduced instruction set; the explicit carry-out B1 is only the CF flag, and so doesn't allow a way for me to directly wire other processor flags (ZF, OF, PF) from producers to consumers.

Wasmtime GitHub notifications bot (Sep 21 2021 at 03:58):

chc4 edited a comment on issue #3249:

I'm writing a dynamic recompilation engine targetting Cranelift. For part of that, it requires tracking processor flags for proper lifting, so that e.g. add x, y; jz ... works correctly. This is technically representational in Cranelift (though doesn't currently work, see https://github.com/bytecodealliance/wasmtime/issues/2860#issuecomment-922574641), but wouldn't map to the proposed reduced instruction set; the explicit carry-out B1 is only the CF flag, and so doesn't allow a way for me to directly wire other processor flags (ZF, OF, PF) from producers to consumers.

B1 flags may work here if I can just emit "manual" processor flags recomputation, that are removed by Cranelift in cases where the values can be pulled from an internal live iflags value at selection time; some kind of zf(val) /cf(x, y) intrinsic might be better for that, so that the flag recomputation is a clear candidate for live range inclusion?

Wasmtime GitHub notifications bot (Oct 01 2021 at 21:28):

akirilov-arm labeled issue #3249:

In #3233 we discussed various simplifications we could perform to reduce the number of CLIF opcodes that exist. In general, we should strive for simplicity -- fewer opcodes mean less cost for every consumer of the IR -- as long as simplifications don't impose other overheads, such as unnecessary expansion and pattern-matching.

The "flags"-related opcodes, such as ifcmp / ffcmp, brif / brff, trapif / trapff, and those that communicate carry/borrow flags via the full flags value, seem to be good candidates for removal. This is for at least a few reasons:

A flag value behaves differently than most others in the IR, and as such, imposes complexity cost. For example, only one flag value may be live at a time. This constraint, enforced by the validator, is meant to allow the flags value to be directly lowered onto a machine's CPU flags register, but our backends no longer work this way. For another example, a flag value cannot be stored/loaded to/from memory.

The flags do not have a well-defined value but rather serve as a sort of opaque connection between producer (e.g., a compare) and consumer (e.g., a branch). For example, a bitcast/raw-bitcast from a flag value to an integer is undefined (and disallowed by the validator).

There is redundancy in the instruction set, as some conditions can also be materialized as bools. For example, we have both iadd_carry, which produces the carry flag as a bool, and iadd_ifcout, which produces a flags value with the carry embedded in it.

With better pattern-matching in the backends (both as they stand today and potentially with a DSL to help as proposed in bytecodealliance/rfcs#15), it is less important for the CLIF to directly correspond to machine code; we can pattern-match a producer and consumer (e.g. a compare and a branch) that communicate via a b1 (bool) just as well as we can those that communicate via opaque flags values.

Thus, we should look into removing all uses of "flags" values, instead consuming the flags where they are produced to generate a bool condition as appropriate.

One downside of this approach is that we cannot directly express reuse of flags for more than one condition (e.g. ifcmp then multiple trueifs with multiple flag conditions). Potentially we could instead pattern-match on multiple compare instructions with the same arguments and merge them instead, so e.g. icmp ge v0, v1 and icmp eq v0, v1 become a comopare-setcc-setcc three-instruction sequence.

cc @abrown @afonso360 @bjorn3 from earlier discussion

Wasmtime GitHub notifications bot (May 04 2022 at 19:59):

cfallin labeled issue #3249:

In #3233 we discussed various simplifications we could perform to reduce the number of CLIF opcodes that exist. In general, we should strive for simplicity -- fewer opcodes mean less cost for every consumer of the IR -- as long as simplifications don't impose other overheads, such as unnecessary expansion and pattern-matching.

The "flags"-related opcodes, such as ifcmp / ffcmp, brif / brff, trapif / trapff, and those that communicate carry/borrow flags via the full flags value, seem to be good candidates for removal. This is for at least a few reasons:

A flag value behaves differently than most others in the IR, and as such, imposes complexity cost. For example, only one flag value may be live at a time. This constraint, enforced by the validator, is meant to allow the flags value to be directly lowered onto a machine's CPU flags register, but our backends no longer work this way. For another example, a flag value cannot be stored/loaded to/from memory.

The flags do not have a well-defined value but rather serve as a sort of opaque connection between producer (e.g., a compare) and consumer (e.g., a branch). For example, a bitcast/raw-bitcast from a flag value to an integer is undefined (and disallowed by the validator).

There is redundancy in the instruction set, as some conditions can also be materialized as bools. For example, we have both iadd_carry, which produces the carry flag as a bool, and iadd_ifcout, which produces a flags value with the carry embedded in it.

With better pattern-matching in the backends (both as they stand today and potentially with a DSL to help as proposed in bytecodealliance/rfcs#15), it is less important for the CLIF to directly correspond to machine code; we can pattern-match a producer and consumer (e.g. a compare and a branch) that communicate via a b1 (bool) just as well as we can those that communicate via opaque flags values.

Thus, we should look into removing all uses of "flags" values, instead consuming the flags where they are produced to generate a bool condition as appropriate.

One downside of this approach is that we cannot directly express reuse of flags for more than one condition (e.g. ifcmp then multiple trueifs with multiple flag conditions). Potentially we could instead pattern-match on multiple compare instructions with the same arguments and merge them instead, so e.g. icmp ge v0, v1 and icmp eq v0, v1 become a comopare-setcc-setcc three-instruction sequence.

cc @abrown @afonso360 @bjorn3 from earlier discussion

Wasmtime GitHub notifications bot (May 04 2022 at 19:59):

cfallin labeled issue #3249:

In #3233 we discussed various simplifications we could perform to reduce the number of CLIF opcodes that exist. In general, we should strive for simplicity -- fewer opcodes mean less cost for every consumer of the IR -- as long as simplifications don't impose other overheads, such as unnecessary expansion and pattern-matching.

The "flags"-related opcodes, such as ifcmp / ffcmp, brif / brff, trapif / trapff, and those that communicate carry/borrow flags via the full flags value, seem to be good candidates for removal. This is for at least a few reasons:

A flag value behaves differently than most others in the IR, and as such, imposes complexity cost. For example, only one flag value may be live at a time. This constraint, enforced by the validator, is meant to allow the flags value to be directly lowered onto a machine's CPU flags register, but our backends no longer work this way. For another example, a flag value cannot be stored/loaded to/from memory.

The flags do not have a well-defined value but rather serve as a sort of opaque connection between producer (e.g., a compare) and consumer (e.g., a branch). For example, a bitcast/raw-bitcast from a flag value to an integer is undefined (and disallowed by the validator).

There is redundancy in the instruction set, as some conditions can also be materialized as bools. For example, we have both iadd_carry, which produces the carry flag as a bool, and iadd_ifcout, which produces a flags value with the carry embedded in it.

With better pattern-matching in the backends (both as they stand today and potentially with a DSL to help as proposed in bytecodealliance/rfcs#15), it is less important for the CLIF to directly correspond to machine code; we can pattern-match a producer and consumer (e.g. a compare and a branch) that communicate via a b1 (bool) just as well as we can those that communicate via opaque flags values.

Thus, we should look into removing all uses of "flags" values, instead consuming the flags where they are produced to generate a bool condition as appropriate.

One downside of this approach is that we cannot directly express reuse of flags for more than one condition (e.g. ifcmp then multiple trueifs with multiple flag conditions). Potentially we could instead pattern-match on multiple compare instructions with the same arguments and merge them instead, so e.g. icmp ge v0, v1 and icmp eq v0, v1 become a comopare-setcc-setcc three-instruction sequence.

cc @abrown @afonso360 @bjorn3 from earlier discussion

Last updated: Feb 28 2025 at 03:10 UTC