Stream: git-wasmtime

Topic: wasmtime / Issue #2340 Allow loads to merge into other op...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 30 2020 at 00:17):

cfallin opened Issue #2340:

In MachInst-based backends, instruction selection works by pattern-matching on a root opcode and potentially the opcodes that produce its inputs (the "operand tree"). Sometimes, a particular machine backend is able to pattern-match several opcodes in a certain arrangement and generate one machine instruction that covers them all (e.g.: multiply-add).

Some machine architectures have the ability to load one operand of an operation from memory. For example, on x86, we can do add rax, [rbx], which is a combination load and add (in that order).

Unfortunately, because of a (sound but conservative) design in our pattern-matching system, we cannot match patterns where the load is an operand rather than the root, because we do not allow operations that have side-effects or touch memory to move in the program. This is because reasoning about when it is safe to do so is complex. (Instruction coloring is the framework that does this reasoning, but because the color changes after the load or side-effecting op, its own color differs from that of every program point below it, so it will never merge.)

Consider the following test-case as an interesting example:

block1(v0: i32, v1: i32):
v2 = sload.i32 v0
v3 = sload.i32 v1
v4 = iadd.i32 v2, v0
v5 = iadd.i32 v3, v1

We want this to lower to two add-from-memory instructions on x86. Working bottom-up (as VCode lowering does), we first see v5 and perhaps reason that we can sink the load v3 because it does not move across any other memory ops. We would also like to do this for v2 into v4, and in this example we can, but only after we sink v3; the load producing v2 moves across where v3 used to be. Hence, a naïve coloring scheme won't necessarily work.

A solution that @julian-seward1 and I have converged on involves numbering only side-effecting ops with colors (or sequence numbers), and while scanning backward through a basic block, tracking the "current sequence number", or in other words, the next number whose op we expect to see. We permit a merge of any load with this sequence number if (i) it's in the same BB and (ii) this use is its only use; then we decrement the sequence number.

This implements the very simple rule that a load or store never moves over another load or store; but as we sink one, we create space for the one above it to sink, as well.

(Later we might want to extend this with multiple sequence numbers, one per disjoint unit of state, sort of in the spirit of a vector clock; but details there are TBD.)

view this post on Zulip Wasmtime GitHub notifications bot (Oct 30 2020 at 00:19):

cfallin commented on Issue #2340:

cc @akirilov-arm and @abrown: this should enable us to properly handle the load-merging that LoadSplat addresses in #2278.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 30 2020 at 00:19):

cfallin assigned Issue #2340:

In MachInst-based backends, instruction selection works by pattern-matching on a root opcode and potentially the opcodes that produce its inputs (the "operand tree"). Sometimes, a particular machine backend is able to pattern-match several opcodes in a certain arrangement and generate one machine instruction that covers them all (e.g.: multiply-add).

Some machine architectures have the ability to load one operand of an operation from memory. For example, on x86, we can do add rax, [rbx], which is a combination load and add (in that order).

Unfortunately, because of a (sound but conservative) design in our pattern-matching system, we cannot match patterns where the load is an operand rather than the root, because we do not allow operations that have side-effects or touch memory to move in the program. This is because reasoning about when it is safe to do so is complex. (Instruction coloring is the framework that does this reasoning, but because the color changes after the load or side-effecting op, its own color differs from that of every program point below it, so it will never merge.)

Consider the following test-case as an interesting example:

block1(v0: i32, v1: i32):
v2 = sload.i32 v0
v3 = sload.i32 v1
v4 = iadd.i32 v2, v0
v5 = iadd.i32 v3, v1

We want this to lower to two add-from-memory instructions on x86. Working bottom-up (as VCode lowering does), we first see v5 and perhaps reason that we can sink the load v3 because it does not move across any other memory ops. We would also like to do this for v2 into v4, and in this example we can, but only after we sink v3; the load producing v2 moves across where v3 used to be. Hence, a naïve coloring scheme won't necessarily work.

A solution that @julian-seward1 and I have converged on involves numbering only side-effecting ops with colors (or sequence numbers), and while scanning backward through a basic block, tracking the "current sequence number", or in other words, the next number whose op we expect to see. We permit a merge of any load with this sequence number if (i) it's in the same BB and (ii) this use is its only use; then we decrement the sequence number.

This implements the very simple rule that a load or store never moves over another load or store; but as we sink one, we create space for the one above it to sink, as well.

(Later we might want to extend this with multiple sequence numbers, one per disjoint unit of state, sort of in the spirit of a vector clock; but details there are TBD.)

view this post on Zulip Wasmtime GitHub notifications bot (Oct 30 2020 at 00:20):

cfallin labeled Issue #2340 (assigned to cfallin):

In MachInst-based backends, instruction selection works by pattern-matching on a root opcode and potentially the opcodes that produce its inputs (the "operand tree"). Sometimes, a particular machine backend is able to pattern-match several opcodes in a certain arrangement and generate one machine instruction that covers them all (e.g.: multiply-add).

Some machine architectures have the ability to load one operand of an operation from memory. For example, on x86, we can do add rax, [rbx], which is a combination load and add (in that order).

Unfortunately, because of a (sound but conservative) design in our pattern-matching system, we cannot match patterns where the load is an operand rather than the root, because we do not allow operations that have side-effects or touch memory to move in the program. This is because reasoning about when it is safe to do so is complex. (Instruction coloring is the framework that does this reasoning, but because the color changes after the load or side-effecting op, its own color differs from that of every program point below it, so it will never merge.)

Consider the following test-case as an interesting example:

block1(v0: i32, v1: i32):
v2 = sload.i32 v0
v3 = sload.i32 v1
v4 = iadd.i32 v2, v0
v5 = iadd.i32 v3, v1

We want this to lower to two add-from-memory instructions on x86. Working bottom-up (as VCode lowering does), we first see v5 and perhaps reason that we can sink the load v3 because it does not move across any other memory ops. We would also like to do this for v2 into v4, and in this example we can, but only after we sink v3; the load producing v2 moves across where v3 used to be. Hence, a naïve coloring scheme won't necessarily work.

A solution that @julian-seward1 and I have converged on involves numbering only side-effecting ops with colors (or sequence numbers), and while scanning backward through a basic block, tracking the "current sequence number", or in other words, the next number whose op we expect to see. We permit a merge of any load with this sequence number if (i) it's in the same BB and (ii) this use is its only use; then we decrement the sequence number.

This implements the very simple rule that a load or store never moves over another load or store; but as we sink one, we create space for the one above it to sink, as well.

(Later we might want to extend this with multiple sequence numbers, one per disjoint unit of state, sort of in the spirit of a vector clock; but details there are TBD.)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 16 2020 at 23:31):

cfallin closed Issue #2340 (assigned to cfallin):

In MachInst-based backends, instruction selection works by pattern-matching on a root opcode and potentially the opcodes that produce its inputs (the "operand tree"). Sometimes, a particular machine backend is able to pattern-match several opcodes in a certain arrangement and generate one machine instruction that covers them all (e.g.: multiply-add).

Some machine architectures have the ability to load one operand of an operation from memory. For example, on x86, we can do add rax, [rbx], which is a combination load and add (in that order).

Unfortunately, because of a (sound but conservative) design in our pattern-matching system, we cannot match patterns where the load is an operand rather than the root, because we do not allow operations that have side-effects or touch memory to move in the program. This is because reasoning about when it is safe to do so is complex. (Instruction coloring is the framework that does this reasoning, but because the color changes after the load or side-effecting op, its own color differs from that of every program point below it, so it will never merge.)

Consider the following test-case as an interesting example:

block1(v0: i32, v1: i32):
v2 = sload.i32 v0
v3 = sload.i32 v1
v4 = iadd.i32 v2, v0
v5 = iadd.i32 v3, v1

We want this to lower to two add-from-memory instructions on x86. Working bottom-up (as VCode lowering does), we first see v5 and perhaps reason that we can sink the load v3 because it does not move across any other memory ops. We would also like to do this for v2 into v4, and in this example we can, but only after we sink v3; the load producing v2 moves across where v3 used to be. Hence, a naïve coloring scheme won't necessarily work.

A solution that @julian-seward1 and I have converged on involves numbering only side-effecting ops with colors (or sequence numbers), and while scanning backward through a basic block, tracking the "current sequence number", or in other words, the next number whose op we expect to see. We permit a merge of any load with this sequence number if (i) it's in the same BB and (ii) this use is its only use; then we decrement the sequence number.

This implements the very simple rule that a load or store never moves over another load or store; but as we sink one, we create space for the one above it to sink, as well.

(Later we might want to extend this with multiple sequence numbers, one per disjoint unit of state, sort of in the spirit of a vector clock; but details there are TBD.)


Last updated: Dec 23 2024 at 12:05 UTC