alexcrichton opened issue #6154:
At a high level this is a pretty simple issue where this module:
(module
  (memory 1)
  (func (param i32) (result v128)
    local.get 0
    v128.load32_splat))
generates this code by default on x86_64:
$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm

foo.cwasm:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_wasm_function_0>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   4c 8b 5f 50             mov    0x50(%rdi),%r11
   8:   8b f2                   mov    %edx,%esi
   a:   45 8b 5c 33 00          mov    0x0(%r11,%rsi,1),%r11d
   f:   c4 41 79 6e c3          vmovd  %r11d,%xmm8
  14:   c4 c1 79 70 c0 00       vpshufd $0x0,%xmm8,%xmm0
  1a:   48 89 ec                mov    %rbp,%rsp
  1d:   5d                      pop    %rbp
  1e:   c3                      retq
        ...
Naively, this generated code matches the cranelift-wasm lowering of the v128.load32_splat instruction, which is to generate a load (the instruction at 0xa) followed by a splat (the instructions at 0xf and 0x14). With AVX, however, this should lower to a single vbroadcastss instruction. This is where the issue gets odd. If egraphs are disabled, then everything works ok:

$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm

foo.cwasm:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_wasm_function_0>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   44 8b ca                mov    %edx,%r9d
   7:   4c 8b 57 50             mov    0x50(%rdi),%r10
   b:   c4 82 79 18 44 0a 00    vbroadcastss 0x0(%r10,%r9,1),%xmm0
  12:   48 89 ec                mov    %rbp,%rsp
  15:   5d                      pop    %rbp
  16:   c3                      retq
        ...
So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
function u0:0(i64 vmctx, i64, i32) -> i8x16 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    stack_limit = gv2

block0(v0: i64, v1: i64, v2: i32):
@0020    v5 = load.i64 notrap aligned readonly v0+80
@0020    v4 = uextend.i64 v2
@0020    v6 = iadd v5, v4
@0020    v7 = load.i32 little heap v6
@0024    jump block1

block1:
@0020    v8 = splat.i32x4 v7
@0024    v9 = bitcast.i8x16 little v8
         v3 -> v9
@0024    return v9
}
I believe that the issue here is that the splat has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:

function u0:0(i64 vmctx, i64, i32) -> i8x16 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+80
    stack_limit = gv2

block0(v0: i64, v1: i64, v2: i32):
@0020    v4 = uextend.i64 v2
@0020    v5 = load.i64 notrap aligned readonly v0+80
@0020    v6 = iadd v5, v4
@0020    v7 = load.i32 little heap v6
@0020    v8 = splat.i32x4 v7
@0024    v9 = bitcast.i8x16 little v8
         v3 -> v9
@0024    jump block1

block1:
@0024    return v3
}
where it can clearly be seen that egraphs are moving the splat and bitcast instructions across basic blocks.
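To make the constraint concrete: the load sinking that fails here is, roughly, the backend's check during lowering that a load may be merged into its single consumer. The following is a minimal, hypothetical sketch of that condition over made-up types (not the real cranelift-codegen API), just to illustrate why elaborating the splat into block1 defeats the pattern:

```rust
// Hypothetical simplification of the "sinkable load" condition used during
// lowering: a load can only be merged into the instruction consuming it if
// both sit in the same basic block and the load's result has exactly one
// use. None of these types exist in cranelift-codegen; they only model the
// check described in the issue text.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Block(u32);

struct LoadInfo {
    block: Block,   // block containing the load
    num_uses: u32,  // number of instructions consuming the loaded value
}

fn can_sink_load(load: &LoadInfo, consumer_block: Block) -> bool {
    load.block == consumer_block && load.num_uses == 1
}

fn main() {
    let load = LoadInfo { block: Block(0), num_uses: 1 };
    // Before the egraph pass: the splat sits in block0 next to the load,
    // so the load/splat pair can be matched and vbroadcastss can be emitted.
    assert!(can_sink_load(&load, Block(0)));
    // After the egraph pass: the splat was elaborated into block1, so the
    // same-block requirement fails and the pattern no longer matches.
    assert!(!can_sink_load(&load, Block(1)));
    println!("sinking only fires when load and consumer share a block");
}
```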
So that's a basic description of the problem! How best to fix this, though, depends. This was briefly discussed at today's Cranelift meeting, and some ways that it could be tackled are:
- Technically there's no need for cranelift-wasm to generate the block1 block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for the return instruction.
- Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly block1 has one predecessor, which ends with a jump instruction, so there's no need for block1 to exist and the block0 and block1 blocks could be fused (a sketch of this kind of fusion follows after this list). This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due to MachBuffer optimizations.
- Perhaps even more fancifully, the load sinking could be souped up to work in this case. Given the complications this is not likely to be a viable solution.
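For what it's worth, the second option is small enough to sketch. Below is a self-contained toy version of that kind of jump-threading/block-fusion, written against made-up CFG types rather than cranelift_codegen's Function and Layout (every name here is hypothetical): any block ending in an unconditional jump to a block with exactly one predecessor absorbs that successor.

```rust
// Toy CFG for illustration only; a real pass would operate on
// cranelift_codegen::ir::Function, its Layout, and its DataFlowGraph, and
// would also have to forward block parameters and handle conditional
// branches. None of that is modeled here.
#[derive(Clone, Debug, PartialEq)]
enum Inst {
    Op(String),  // any non-terminator instruction
    Jump(usize), // unconditional jump to the block at this index
    Return,
}

#[derive(Clone, Debug)]
struct Block {
    insts: Vec<Inst>,
}

/// Fuse every jump-to-single-predecessor edge: the successor's instructions
/// are spliced over the jump and the successor block is removed.
fn fuse_single_pred_blocks(blocks: &mut Vec<Option<Block>>) {
    // Count predecessors of every block (only `Jump` terminators exist here).
    let mut preds = vec![0usize; blocks.len()];
    for block in blocks.iter().flatten() {
        if let Some(Inst::Jump(target)) = block.insts.last() {
            preds[*target] += 1;
        }
    }

    for i in 0..blocks.len() {
        // Does block `i` end in a jump to a block with exactly one predecessor?
        let target = match blocks[i].as_ref().and_then(|b| b.insts.last()) {
            Some(Inst::Jump(t)) if preds[*t] == 1 && *t != i => *t,
            _ => continue,
        };
        // Splice the successor's body over the jump and delete the successor.
        let succ = blocks[target].take().expect("fused block exists");
        let block = blocks[i].as_mut().expect("current block exists");
        block.insts.pop(); // drop the `jump`
        block.insts.extend(succ.insts);
    }
}

fn main() {
    // Mirrors the shape of the CLIF above: block0 ends in `jump block1`,
    // and block1 only returns.
    let mut blocks = vec![
        Some(Block {
            insts: vec![Inst::Op("v8 = splat.i32x4 v7".to_string()), Inst::Jump(1)],
        }),
        Some(Block { insts: vec![Inst::Return] }),
    ];

    fuse_single_pred_blocks(&mut blocks);

    // block1 is gone and its `return` now sits at the end of block0, so the
    // splat and the load it consumes end up in the same block again.
    assert!(blocks[1].is_none());
    assert_eq!(blocks[0].as_ref().unwrap().insts.last(), Some(&Inst::Return));
    println!("{:?}", blocks[0].as_ref().unwrap());
}
```

In real CLIF the fused-in block's parameters would also need to be rewritten in terms of the jump's arguments, which is where most of the actual work in such a pass would be.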
At a basic level I think this is an issue worth fixing, but at a higher level I think it showcases the lack of jump-threading in Cranelift and the need for it now that egraphs are moving instructions around. Hence the title of this issue and the predicted way to fix it: a jump-threading pass of sorts in Cranelift.
tschneidereit commented on issue #6154:
Is this still relevant with #6818 landed?
alexcrichton commented on issue #6154:
Unfortunately, yes. The reproductions are now:
- Uses vbroadcastss: wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx -O opt-level=0 && objdump -S foo.cwasm
- Does not use vbroadcastss: wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx && objdump -S foo.cwasm
cfallin commented on issue #6154:
@jameysharp and I talked about this briefly this morning, and with a fresh look, we think that an explicit jump-threading opt that runs in the mid-end (probably before e-graphs because it exposes additional opportunities) is the simplest approach. #6818's discussion was more about whether we could remove the backend's branch opts, which include jump-threading but at a lower level, and the conclusion was that we shouldn't; but this doesn't preclude us from also doing this at the IR level (both give us different things).
It's probably a relatively small pass (50-100 lines); I'll see if I can get to this at some point soon!
dimitris-aspetakis commented on issue #6154:
I can't help but note my feeling, from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically.
In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly) elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass.
I understand that my comments are a bit too abstract — what I mostly seek is feedback (more like a vibe-check) before I dive deeper with possible implementation details.
As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.
cfallin commented on issue #6154:
@dimitris-aspetakis I think you're talking about two different things here:
I think that if we could integrate lowering more directly with the sea-of-nodes aegraph representation, that would probably uncover some additional opportunity -- the most direct one being that lowering rules could act as a sort of natural cost function that replaces the ad-hoc one in aegraph extraction. That's probably quite a difficult thing to do (open design/research questions, maybe 6 months or so of work) but would be interesting to see the results of.
Integrating regalloc with instruction selection and/or scheduling (basically, altering the code we produce based on regalloc signals) is I think far more difficult however. There is at least one PhD thesis (David Koes, CMU, 2009) I'm aware of on the topic. It's a combinatorial constraint problem, everything affects everything else, and it's basically asking for a redesign of the entire compiler backend. I'm not saying it isn't possible but the scope is likely multiple years and there's no guarantee we'll come up with something that lands at a new interesting point in the design space in terms of runtime and compile time. Please do feel free to play with this more and propose designs and experiment with it (I'd recommend starting with some canonical examples that "don't work well" currently), just be aware of the difficulty.