wasmtime / PR #5926 cranelift: Merge `bitcasts` into `loads` · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #5926 cranelift: Merge `bitcasts` into `loads`

Wasmtime GitHub notifications bot (Mar 04 2023 at 12:44):

afonso360 opened PR #5926 from delete-bitcasts to main:

:wave: Hey,

I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a bitcast after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.

The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!

For this example CLIF:
function %b() -> i32x4 {
    ss0 = explicit_slot 16

block0:
    v0 = stack_addr.i64 ss0
    v1 = load.i8x16 little notrap aligned v0
    v2 = bitcast.i32x4 little v1
    v3 = iabs v2
    return v3
}
We now merge the load into the iabs:
 10:   66 0f 38 1e 00          pabsd   (%rax), %xmm0
Where we used to do a load + pabsd:
   c:   f3 0f 6f 10             movdqu  (%rax), %xmm2
  10:   66 0f 38 1e c2          pabsd   %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it does for VEX/EVEX ops.

Wasmtime GitHub notifications bot (Mar 04 2023 at 12:48):

afonso360 edited PR #5926 from delete-bitcasts to main:

:wave: Hey,

I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a bitcast after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.

The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!

For this example CLIF:
function %b() -> i32x4 {
    ss0 = explicit_slot 16

block0:
    v0 = stack_addr.i64 ss0
    v1 = load.i8x16 little notrap aligned v0
    v2 = bitcast.i32x4 little v1
    v3 = iabs v2
    return v3
}
We now merge the load into the iabs:
 10:   66 0f 38 1e 00          pabsd   (%rax), %xmm0
Where we used to do a load + pabsd:
   c:   f3 0f 6f 10             movdqu  (%rax), %xmm2
  10:   66 0f 38 1e c2          pabsd   %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it should for VEX/EVEX ops.

Wasmtime GitHub notifications bot (Mar 04 2023 at 12:59):

afonso360 edited PR #5926 from delete-bitcasts to main:

:wave: Hey,

I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a bitcast after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.

The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!

For this example CLIF:
function %b() -> i32x4 {
    ss0 = explicit_slot 16

block0:
    v0 = stack_addr.i64 ss0
    v1 = load.i8x16 little notrap aligned v0
    v2 = bitcast.i32x4 little v1
    v3 = iabs v2
    return v3
}
We now merge the load into the iabs:
 10:   66 0f 38 1e 00          pabsd   (%rax), %xmm0
Where we used to do a load + pabsd:
   c:   f3 0f 6f 10             movdqu  (%rax), %xmm2
  10:   66 0f 38 1e c2          pabsd   %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it should for VEX/EVEX ops.

Wasmtime GitHub notifications bot (Mar 04 2023 at 20:04):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Mar 04 2023 at 20:04):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Mar 04 2023 at 20:04):

alexcrichton created PR review comment:

I think i8x16 here is a typo (and what CI is tripping over), but CI also indicates:
    filecheck failed for function on line 32:
    #0 check: block0(v0: i8x16):
    #1 check: v3 = load.i64x2 little v0
    #2 check: return v3
    > function %load_bitcast(i64) -> i64x2 fast {
    Missed #0: \bblock0\(v0: i8x16\):
    > block0(v0: i64):
    >     v1 = load.i8x16 little v0
    >     v3 = load.i64x2 little v0
    >     v4 -> v3
    >     return v3
    > }
which look like the original load isn't actually being replaced but rather a second load is added?

If that's the case there may be something special about egraphs trying to preserve effectful operations which may need some extra help here getting bypassed?

Wasmtime GitHub notifications bot (Mar 04 2023 at 20:04):

alexcrichton created PR review comment:

FWIW the reason for this is that the load instruction for wasm is simply v128.load so we don't know what type is necessary for clif, so i8x16 is always chosen (IIRC). Each op in clif, though, is specifically typed so if it flows into something that needs f32x4 then the bitcast is required.

That being said I think this is a clever optimization to work around that limitation of wasm translation!

Wasmtime GitHub notifications bot (Mar 04 2023 at 20:04):

alexcrichton created PR review comment:

I've been sitting here trying to wrap my head around endianness and lane casts and how it affects s390x and I'll admit I still don't fully grok the implications of this. Natively it seems like having the same endianness is what's required here, but I might also natively think that "well just take the bitcast's endianness and use that on the load" although I'm not certin that's right.

To help assist with assurances here, though, could you also add some runtests which trigger this optimization and work before/after your PR?

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:27):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:27):

afonso360 created PR review comment:

Yeah, that makes sense, I'll add that info to the comment.

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:41):

afonso360 updated PR #5926 from delete-bitcasts to main.

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:41):

afonso360 updated PR #5926 from delete-bitcasts to main.

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:52):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (Mar 04 2023 at 21:52):

afonso360 created PR review comment:

I think i8x16 here is a typo

Oops, that's right!

which look like the original load isn't actually being replaced but rather a second load is added?

If that's the case there may be something special about egraphs trying to preserve effectful operations which may need some extra help here getting bypassed?

Yeah, I noticed that but I really don't know how to fix that... I'm not very familiar with the egraphs stuff behind the curtains.

Wasmtime GitHub notifications bot (Mar 06 2023 at 21:52):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 06 2023 at 21:52):

jameysharp submitted PR review.

Wasmtime GitHub notifications bot (Mar 06 2023 at 21:52):

jameysharp created PR review comment:

We can't currently write egraph rules that will replace an instruction unless it has exactly one result Value, which is why store doesn't work here. I recently removed the other instructions from clif_opt.isle because rules using them would never fire. We've been talking about how to do this but don't have a concrete plan yet.

This restriction is primarily because our egraph identifies equivalence classes of Values, rather than identifying equivalent instructions. It's also because it's harder to define what equivalence means for instructions with side effects, such as store.

Wasmtime GitHub notifications bot (Mar 07 2023 at 13:07):

afonso360 closed without merge PR #5926.

Last updated: Apr 18 2025 at 14:03 UTC