alexcrichton labeled issue #6154:
At a high level this is a pretty simple issue where this module:
(module (memory 1) (func (param i32) (result v128) local.get 0 v128.load32_splat ) )
generates this code by default on x86_64:
$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 5f 50 mov 0x50(%rdi),%r11 8: 8b f2 mov %edx,%esi a: 45 8b 5c 33 00 mov 0x0(%r11,%rsi,1),%r11d f: c4 41 79 6e c3 vmovd %r11d,%xmm8 14: c4 c1 79 70 c0 00 vpshufd $0x0,%xmm8,%xmm0 1a: 48 89 ec mov %rbp,%rsp 1d: 5d pop %rbp 1e: c3 retq ...
Naively this instruction lowering pattern matches to the cranelift-wasm lowering of the
v128.load32_splat
instruction, is to generate a load (the instruction at0xa
) followed by a splat (the instructions at0xf
and0x14
). With AVX, however, this instruction should lower to a singlevbroadcastss
instruction. This is where this issue gets odd. If egraphs are disabled, then everything works ok:$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 44 8b ca mov %edx,%r9d 7: 4c 8b 57 50 mov 0x50(%rdi),%r10 b: c4 82 79 18 44 0a 00 vbroadcastss 0x0(%r10,%r9,1),%xmm0 12: 48 89 ec mov %rbp,%rsp 15: 5d pop %rbp 16: c3 retq ...
So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v4 = uextend.i64 v2 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0024 jump block1 block1: @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 return v9 }
I believe that the issue here is that the
splat
has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v4 = uextend.i64 v2 @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 jump block1 block1: @0024 return v3 }
where it can clearly be seen that egraphs are moving the
splat
andbitcast
instructions across basic blocks.
So that's a basic description of the problem! How best to fix this, though, depends. This was talked briefly about at today's Cranelift meeting but some ways that this could be tackled are:
- Technically there's no need for
cranelift-wasm
to generate theblock1
block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for thereturn
instruction.- Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly
block1
has one predecessor which has ajump
instruction, so there's no need forblock1
to exist and theblock0
andblock1
blocks could be fused. This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due toMachBuffer
optimizations.- Perhaps even more fancifully the load sinking could be souped up to work in this case. Given the complications this is not likely to be a viable solution.
At a basic level this I think is an issue worth fixing, but at a higher level I think that this issue showcases the lack of jump-threading in Cranelift and the need for it as egraphs are moving around instructions. Hence the title of this issue and the predicted way to fix it, which would be a jump-threading pass of sorts in Cranelift.
alexcrichton opened issue #6154:
At a high level this is a pretty simple issue where this module:
(module (memory 1) (func (param i32) (result v128) local.get 0 v128.load32_splat ) )
generates this code by default on x86_64:
$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 5f 50 mov 0x50(%rdi),%r11 8: 8b f2 mov %edx,%esi a: 45 8b 5c 33 00 mov 0x0(%r11,%rsi,1),%r11d f: c4 41 79 6e c3 vmovd %r11d,%xmm8 14: c4 c1 79 70 c0 00 vpshufd $0x0,%xmm8,%xmm0 1a: 48 89 ec mov %rbp,%rsp 1d: 5d pop %rbp 1e: c3 retq ...
Naively this instruction lowering pattern matches to the cranelift-wasm lowering of the
v128.load32_splat
instruction, is to generate a load (the instruction at0xa
) followed by a splat (the instructions at0xf
and0x14
). With AVX, however, this instruction should lower to a singlevbroadcastss
instruction. This is where this issue gets odd. If egraphs are disabled, then everything works ok:$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 44 8b ca mov %edx,%r9d 7: 4c 8b 57 50 mov 0x50(%rdi),%r10 b: c4 82 79 18 44 0a 00 vbroadcastss 0x0(%r10,%r9,1),%xmm0 12: 48 89 ec mov %rbp,%rsp 15: 5d pop %rbp 16: c3 retq ...
So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v4 = uextend.i64 v2 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0024 jump block1 block1: @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 return v9 }
I believe that the issue here is that the
splat
has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v4 = uextend.i64 v2 @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 jump block1 block1: @0024 return v3 }
where it can clearly be seen that egraphs are moving the
splat
andbitcast
instructions across basic blocks.
So that's a basic description of the problem! How best to fix this, though, depends. This was talked briefly about at today's Cranelift meeting but some ways that this could be tackled are:
- Technically there's no need for
cranelift-wasm
to generate theblock1
block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for thereturn
instruction.- Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly
block1
has one predecessor which has ajump
instruction, so there's no need forblock1
to exist and theblock0
andblock1
blocks could be fused. This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due toMachBuffer
optimizations.- Perhaps even more fancifully the load sinking could be souped up to work in this case. Given the complications this is not likely to be a viable solution.
At a basic level this I think is an issue worth fixing, but at a higher level I think that this issue showcases the lack of jump-threading in Cranelift and the need for it as egraphs are moving around instructions. Hence the title of this issue and the predicted way to fix it, which would be a jump-threading pass of sorts in Cranelift.
alexcrichton labeled issue #6154:
At a high level this is a pretty simple issue where this module:
(module (memory 1) (func (param i32) (result v128) local.get 0 v128.load32_splat ) )
generates this code by default on x86_64:
$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 5f 50 mov 0x50(%rdi),%r11 8: 8b f2 mov %edx,%esi a: 45 8b 5c 33 00 mov 0x0(%r11,%rsi,1),%r11d f: c4 41 79 6e c3 vmovd %r11d,%xmm8 14: c4 c1 79 70 c0 00 vpshufd $0x0,%xmm8,%xmm0 1a: 48 89 ec mov %rbp,%rsp 1d: 5d pop %rbp 1e: c3 retq ...
Naively this instruction lowering pattern matches to the cranelift-wasm lowering of the
v128.load32_splat
instruction, is to generate a load (the instruction at0xa
) followed by a splat (the instructions at0xf
and0x14
). With AVX, however, this instruction should lower to a singlevbroadcastss
instruction. This is where this issue gets odd. If egraphs are disabled, then everything works ok:$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 44 8b ca mov %edx,%r9d 7: 4c 8b 57 50 mov 0x50(%rdi),%r10 b: c4 82 79 18 44 0a 00 vbroadcastss 0x0(%r10,%r9,1),%xmm0 12: 48 89 ec mov %rbp,%rsp 15: 5d pop %rbp 16: c3 retq ...
So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v4 = uextend.i64 v2 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0024 jump block1 block1: @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 return v9 }
I believe that the issue here is that the
splat
has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v4 = uextend.i64 v2 @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 jump block1 block1: @0024 return v3 }
where it can clearly be seen that egraphs are moving the
splat
andbitcast
instructions across basic blocks.
So that's a basic description of the problem! How best to fix this, though, depends. This was talked briefly about at today's Cranelift meeting but some ways that this could be tackled are:
- Technically there's no need for
cranelift-wasm
to generate theblock1
block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for thereturn
instruction.- Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly
block1
has one predecessor which has ajump
instruction, so there's no need forblock1
to exist and theblock0
andblock1
blocks could be fused. This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due toMachBuffer
optimizations.- Perhaps even more fancifully the load sinking could be souped up to work in this case. Given the complications this is not likely to be a viable solution.
At a basic level this I think is an issue worth fixing, but at a higher level I think that this issue showcases the lack of jump-threading in Cranelift and the need for it as egraphs are moving around instructions. Hence the title of this issue and the predicted way to fix it, which would be a jump-threading pass of sorts in Cranelift.
alexcrichton labeled issue #6154:
At a high level this is a pretty simple issue where this module:
(module (memory 1) (func (param i32) (result v128) local.get 0 v128.load32_splat ) )
generates this code by default on x86_64:
$ cargo -q run compile foo.wat --cranelift-enable has_avx && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 4c 8b 5f 50 mov 0x50(%rdi),%r11 8: 8b f2 mov %edx,%esi a: 45 8b 5c 33 00 mov 0x0(%r11,%rsi,1),%r11d f: c4 41 79 6e c3 vmovd %r11d,%xmm8 14: c4 c1 79 70 c0 00 vpshufd $0x0,%xmm8,%xmm0 1a: 48 89 ec mov %rbp,%rsp 1d: 5d pop %rbp 1e: c3 retq ...
Naively this instruction lowering pattern matches to the cranelift-wasm lowering of the
v128.load32_splat
instruction, is to generate a load (the instruction at0xa
) followed by a splat (the instructions at0xf
and0x14
). With AVX, however, this instruction should lower to a singlevbroadcastss
instruction. This is where this issue gets odd. If egraphs are disabled, then everything works ok:$ cargo -q run compile foo.wat --cranelift-enable has_avx --cranelift-set use_egraphs=false && objdump -S foo.cwasm foo.cwasm: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 44 8b ca mov %edx,%r9d 7: 4c 8b 57 50 mov 0x50(%rdi),%r10 b: c4 82 79 18 44 0a 00 vbroadcastss 0x0(%r10,%r9,1),%xmm0 12: 48 89 ec mov %rbp,%rsp 15: 5d pop %rbp 16: c3 retq ...
So there's an issue here where egraphs are transforming the code into something that can't be pattern-matched by instruction selection. According to logging, when egraphs are enabled, this is the input CLIF into lowering (after egraphs):
function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v4 = uextend.i64 v2 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0024 jump block1 block1: @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 return v9 }
I believe that the issue here is that the
splat
has moved across basic blocks. This means that the load sinking can't fire. The input function to egraphs, however, was:function u0:0(i64 vmctx, i64, i32) -> i8x16 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx gv4 = load.i64 notrap aligned readonly gv3+80 stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32): @0020 v4 = uextend.i64 v2 @0020 v5 = load.i64 notrap aligned readonly v0+80 @0020 v6 = iadd v5, v4 @0020 v7 = load.i32 little heap v6 @0020 v8 = splat.i32x4 v7 @0024 v9 = bitcast.i8x16 little v8 v3 -> v9 @0024 jump block1 block1: @0024 return v3 }
where it can clearly be seen that egraphs are moving the
splat
andbitcast
instructions across basic blocks.
So that's a basic description of the problem! How best to fix this, though, depends. This was talked briefly about at today's Cranelift meeting but some ways that this could be tackled are:
- Technically there's no need for
cranelift-wasm
to generate theblock1
block in the first place. This is likely done for convenience of translation, but it may be possible to make translation "fancier" and not eagerly allocate a basic block for thereturn
instruction.- Today Cranelift does not have any sort of jump-threading/branch-folding pass. Abstractly
block1
has one predecessor which has ajump
instruction, so there's no need forblock1
to exist and theblock0
andblock1
blocks could be fused. This optimization pass could happen before egraphs, for example. Note that this optimization does already happen to a degree during lowering due toMachBuffer
optimizations.- Perhaps even more fancifully the load sinking could be souped up to work in this case. Given the complications this is not likely to be a viable solution.
At a basic level this I think is an issue worth fixing, but at a higher level I think that this issue showcases the lack of jump-threading in Cranelift and the need for it as egraphs are moving around instructions. Hence the title of this issue and the predicted way to fix it, which would be a jump-threading pass of sorts in Cranelift.
tschneidereit commented on issue #6154:
Is this still relevant with #6818 landed?
alexcrichton commented on issue #6154:
Unfortunately, yes. The reproductions are now:
- Uses
vbroadcastss
:wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx -O opt-level=0 && objdump -S foo.cwasm
- Does not use
vbroadcastss
:wasmtime compile foo.wat --target x86_64 -Ccranelift-has-avx && objdump -S foo.cwasm
cfallin commented on issue #6154:
@jameysharp and I talked about this briefly this morning, and with a fresh look, we think that an explicit jump-threading opt that runs in the mid-end (probably before e-graphs because it exposes additional opportunities) is the simplest approach. #6818's discussion was more about whether we could remove the backend's branch opts, which include jump-threading but at a lower level, and the conclusion was that we shouldn't; but this doesn't preclude us from also doing this at the IR level (both give us different things).
It's probably a relatively small pass (50-100 lines); I'll see if I can get to this at some point soon!
dimitris-aspetakis commented on issue #6154:
I can't help but note here my feeling from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically.
In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly), elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass.
I understand that my comments are a bit too abstract — what I mostly seek is feedback before I dive deeper with possible implementation details.
As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.
dimitris-aspetakis edited a comment on issue #6154:
I can't help but note here my feeling from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically.
In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly), elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass.
I understand that my comments are a bit too abstract — what I mostly seek is feedback (more like a _vibe-check_) before I dive deeper with possible implementation details.
As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.
dimitris-aspetakis edited a comment on issue #6154:
I can't help but note here my feeling from trying to implement mid-end instruction scheduling in #6260, that instruction scheduling should be more integrated with the register allocation and lowering stages. I believe the heuristics we use for register spilling might benefit from some sort of integration with the register allocator specifically.
In that spirit, I tend to support the first suggestion from @alexcrichton, where (if I understood correctly), elaborating instructions back to the layout is delayed. I mention our work in #6260 also because it could probably provide a solution to the issues discussed here and in #8787, with special cases in the scheduling pass.
I understand that my comments are a bit too abstract — what I mostly seek is feedback (more like a vibe-check) before I dive deeper with possible implementation details.
As a side-note: would it make sense for pattern-matching fusing optimizations from the instruction lowering stage to work on an AST-like representation directly generated from e-graphs? It sounds more expensive, but it also seems like it could catch more of such optimizations.
cfallin commented on issue #6154:
@dimitris-aspetakis I think you're talking about two different things here:
I think that if we could integrate lowering more directly with the sea-of-nodes aegraph representation, that would probably uncover some additional opportunity -- the most direct one being that lowering rules could act as a sort of natural cost function that replaces the ad-hoc one in aegraph extraction. That's probably quite a difficult thing to do (open design/research questions, maybe 6 months or so of work) but would be interesting to see the results of.
Integrating regalloc with instruction selection and/or scheduling (basically, altering the code we produce based on regalloc signals) is I think far more difficult however. There is at least one PhD thesis (David Koes, CMU, 2009) I'm aware of on the topic. It's a combinatorial constraint problem, everything affects everything else, and it's basically asking for a redesign of the entire compiler backend. I'm not saying it isn't possible but the scope is likely multiple years and there's no guarantee we'll come up with something that lands at a new interesting point in the design space in terms of runtime and compile time. Please do feel free to play with this more and propose designs and experiment with it (I'd recommend starting with some canonical examples that "don't work well" currently), just be aware of the difficulty.
Last updated: Dec 23 2024 at 12:05 UTC