afonso360 opened PR #5926 from delete-bitcasts
to main
:
:wave: Hey,
I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a
bitcast
after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!
For this example CLIF:
function %b() -> i32x4 { ss0 = explicit_slot 16 block0: v0 = stack_addr.i64 ss0 v1 = load.i8x16 little notrap aligned v0 v2 = bitcast.i32x4 little v1 v3 = iabs v2 return v3 }
We now merge the load into the iabs:
10: 66 0f 38 1e 00 pabsd (%rax), %xmm0
Where we used to do a
load
+pabsd
:c: f3 0f 6f 10 movdqu (%rax), %xmm2 10: 66 0f 38 1e c2 pabsd %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it does for VEX/EVEX ops.
afonso360 edited PR #5926 from delete-bitcasts
to main
:
:wave: Hey,
I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a
bitcast
after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!
For this example CLIF:
function %b() -> i32x4 { ss0 = explicit_slot 16 block0: v0 = stack_addr.i64 ss0 v1 = load.i8x16 little notrap aligned v0 v2 = bitcast.i32x4 little v1 v3 = iabs v2 return v3 }
We now merge the load into the iabs:
10: 66 0f 38 1e 00 pabsd (%rax), %xmm0
Where we used to do a
load
+pabsd
:c: f3 0f 6f 10 movdqu (%rax), %xmm2 10: 66 0f 38 1e c2 pabsd %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it should for VEX/EVEX ops.
afonso360 edited PR #5926 from delete-bitcasts
to main
:
:wave: Hey,
I was looking at some wasmtime generated CLIF, and noticed that wasmtime emits a
bitcast
after SIMD loads. That's okay, but it prevents the loads being merged into some ops. So I decided to have a go at trying to remove these.The rule works, which is nice, but It doesn't seem to delete the previous load, which I assume is some sort of correctness rule embedded in the mid end? This optimization looks correct to me, but I might be wrong!
For this example CLIF:
function %b() -> i32x4 { ss0 = explicit_slot 16 block0: v0 = stack_addr.i64 ss0 v1 = load.i8x16 little notrap aligned v0 v2 = bitcast.i32x4 little v1 v3 = iabs v2 return v3 }
We now merge the load into the iabs:
10: 66 0f 38 1e 00 pabsd (%rax), %xmm0
Where we used to do a
load
+pabsd
:c: f3 0f 6f 10 movdqu (%rax), %xmm2 10: 66 0f 38 1e c2 pabsd %xmm2, %xmm0
In practice this doesn't help Wasmtime that much for SSE ops, since they require the loads to be aligned, but it should for VEX/EVEX ops.
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I think
i8x16
here is a typo (and what CI is tripping over), but CI also indicates:filecheck failed for function on line 32: #0 check: block0(v0: i8x16): #1 check: v3 = load.i64x2 little v0 #2 check: return v3 > function %load_bitcast(i64) -> i64x2 fast { Missed #0: \bblock0\(v0: i8x16\): > block0(v0: i64): > v1 = load.i8x16 little v0 > v3 = load.i64x2 little v0 > v4 -> v3 > return v3 > }
which look like the original load isn't actually being replaced but rather a second load is added?
If that's the case there may be something special about egraphs trying to preserve effectful operations which may need some extra help here getting bypassed?
alexcrichton created PR review comment:
FWIW the reason for this is that the load instruction for wasm is simply
v128.load
so we don't know what type is necessary for clif, soi8x16
is always chosen (IIRC). Each op in clif, though, is specifically typed so if it flows into something that needsf32x4
then the bitcast is required.That being said I think this is a clever optimization to work around that limitation of wasm translation!
alexcrichton created PR review comment:
I've been sitting here trying to wrap my head around endianness and lane casts and how it affects s390x and I'll admit I still don't fully grok the implications of this. Natively it seems like having the same endianness is what's required here, but I might also natively think that "well just take the bitcast's endianness and use that on the load" although I'm not certin that's right.
To help assist with assurances here, though, could you also add some runtests which trigger this optimization and work before/after your PR?
afonso360 submitted PR review.
afonso360 created PR review comment:
Yeah, that makes sense, I'll add that info to the comment.
afonso360 updated PR #5926 from delete-bitcasts
to main
.
afonso360 updated PR #5926 from delete-bitcasts
to main
.
afonso360 submitted PR review.
afonso360 created PR review comment:
I think i8x16 here is a typo
Oops, that's right!
which look like the original load isn't actually being replaced but rather a second load is added?
If that's the case there may be something special about egraphs trying to preserve effectful operations which may need some extra help here getting bypassed?
Yeah, I noticed that but I really don't know how to fix that... I'm not very familiar with the egraphs stuff behind the curtains.
jameysharp submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
We can't currently write egraph rules that will replace an instruction unless it has exactly one result
Value
, which is whystore
doesn't work here. I recently removed the other instructions fromclif_opt.isle
because rules using them would never fire. We've been talking about how to do this but don't have a concrete plan yet.This restriction is primarily because our egraph identifies equivalence classes of
Value
s, rather than identifying equivalent instructions. It's also because it's harder to define what equivalence means for instructions with side effects, such asstore
.
afonso360 closed without merge PR #5926.
Last updated: Nov 22 2024 at 17:03 UTC