afonso360 opened issue #4870:
:wave: Hey,
Jun Ryung Ju (sorry, I don't know the GitHub user) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented everywhere.
@bjorn3 also mentions that cg_clif does not use them; it does a `bitcast` and then performs the operation on the integer version. This makes them a likely good candidate for simplifying our opcode set.
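For concreteness, here is a minimal hand-written CLIF sketch of that rewrite (the function name is made up for illustration, and the exact `bitcast` syntax may differ between Cranelift versions):

```clif
;; What a frontend can emit today instead of `band` on f32 values:
function %band_f32_via_int(f32, f32) -> f32 {
block0(v0: f32, v1: f32):
    v2 = bitcast.i32 v0      ; reinterpret the float bits as i32
    v3 = bitcast.i32 v1
    v4 = band v2, v3         ; integer bitwise AND
    v5 = bitcast.f32 v4      ; reinterpret the result as f32 again
    return v5
}
```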
Is there any benefit / optimization that we can do on these that would justify keeping them around?
The bitwise ops that I'm referring to are:
* `band`
* `bor`
* `bxor`
* `bnot`
* `bor_not`
* `bxor_not`
cc: @cfallin
yuyang-ok commented on issue #4870:
+1 remove.
jameysharp commented on issue #4870:
Perhaps optimizations like #4803 would do better if we keep bitwise CLIF operations on floats—we could express the optimization in the mid-end without introducing extra bitcast instructions.
That said, looking more closely, I think the best code sequence for that particular optimization is different enough across backends that we may not want to do it in the mid-end anyway. And it's niche enough to be a really weak argument for keeping bitwise operators on floats.
By contrast, I find the opposite argument compelling: they're complicated to lower on probably any architecture with a hardware floating-point unit, and nobody has wanted them yet.
So I'm also in favor of removing them.
akirilov-arm commented on issue #4870:
@afonso360 You are missing `band_not` in your list. Those operations are actually straightforward to implement on AArch64, since there is a hardware floating-point unit iff SIMD is supported; only `bxor_not` is a 2-instruction sequence (instead of 1). However, I probably also lean towards removing them, given the niche use case.
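(For reference, and assuming the operands live in the SIMD/FP register file, these would map roughly to the vector instructions AND, ORR, EOR, NOT/MVN, BIC, and ORN for `band`, `bor`, `bxor`, `bnot`, `band_not`, and `bor_not` respectively; AArch64 has no vector form of EON, which is why `bxor_not` needs an EOR followed by a NOT.)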
afonso360 commented on issue #4870:
Added, Thanks!
afonso360 edited issue #4870:
:wave: Hey,
Jun Ryung Ju (sorry, I don't know the GitHub user) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.
@bjorn3 also mentions that cg_clif does not use them; it does a `bitcast` and then performs the operation on the integer version. This makes them a likely good candidate for simplifying our opcode set.
Is there any benefit / optimization that we can do on these that would justify keeping them around?
The bitwise ops that I'm referring to are:
* `band`
* `bor`
* `bxor`
* `bnot`
* `band_not`
* `bor_not`
* `bxor_not`
cc: @cfallin
cfallin commented on issue #4870:
Agreed, I think we should remove them. A floating-point type imbues the value with some meaning such that it's not meant to be used (or is not typically used) as a "bucket of bits" type, the way integer types are.
I suspect that it would be possible to do reasonable lowerings for these on x86-64 too (since we use XMM regs and would have the integer-vec instructions at our disposal, just as on aarch64), so that angle doesn't hold as much significance for me. However, the "only build what is needed" angle does: if no one actually needs these op/type combinations and they are extra lowerings to add, maintain, and test, then let's not have them IMHO.
jameysharp commented on issue #4870:
It turns out that Wasmtime can actually generate SIMD bitwise ops on floating-point vectors, as in this example (from `tests/misc_testsuite/simd/issue_3327_bnot_lowering.wast` and #3327):

```wat
(module
  (func $v128_not (export "v128_not") (result v128)
    v128.const f32x4 0 0 0 0
    f32x4.abs
    v128.not))

(assert_return (invoke "v128_not") (v128.const i32x4 -1 -1 -1 -1))
```

That currently compiles to this CLIF:

```clif
function u0:0(i64 vmctx) -> i8x16 fast {
    const0 = 0x00000000000000000000000000000000

block0(v0: i64):
@0026   v2 = vconst.i8x16 const0
@0038   v3 = raw_bitcast.f32x4 v2  ; v2 = const0
@0038   v4 = fabs v3
@003b   v5 = bnot v4
@003d   v6 = raw_bitcast.i8x16 v5
@003d   jump block1(v6)

block1(v1: i8x16):
@003d   return v1
}
```
Do we want `cranelift-wasm` to insert a bitcast in between the `fabs` and `bnot` here?
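(For illustration, a hand-written sketch of roughly what that part of the CLIF would look like with such a bitcast inserted; this is not actual `cranelift-wasm` output:)

```clif
block0(v0: i64):
    v2 = vconst.i8x16 const0
    v3 = raw_bitcast.f32x4 v2
    v4 = fabs v3
    v5 = raw_bitcast.i8x16 v4   ; extra bitcast back to an integer vector type
    v6 = bnot v5                ; bnot now operates on i8x16 rather than f32x4
    jump block1(v6)
```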
abrown commented on issue #4870:
I would prefer fewer `raw_bitcast`s if at all possible? I'm not very fond of it...
cfallin commented on issue #4870:
Ah, interesting! That actually does change the calculus a bit in my mind. Given the principle above (only build what is needed) as the higher-priority one, I think it's totally fine to keep this support. Then for consistency it probably makes sense to support scalar floats as well; otherwise we have support for scalar ints, and all kinds of vectors, but not scalar floats, which is an odd and surprising hole.
akirilov-arm labeled issue #4870:
akirilov-arm labeled issue #4870:
jameysharp commented on issue #4870:
I'm closing this issue as we seem to have reached consensus on keeping bitwise operations on floats.
jameysharp closed issue #4870:
ArtBlnd commented on issue #4870:
Ok, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations via bitcasts results in terrible code:

```
VCode {
  Entry block: 0
  v130 := v135
  Block 0: (original IR block: block0) (instruction range: 0 .. 8)
    Inst 0: args %v128=%xmm0 %v129=%xmm1
    Inst 1: movd %v128, %v132l
    Inst 2: movd %v129, %v133l
    Inst 3: andl %v132l, %v133l, %v134l
    Inst 4: movd %v134l, %v135
    Inst 5: movaps %v130, %v131
    Inst 6: movaps %v131, %xmm0
    Inst 7: ret
}
```

I think we need to make F32 and F64 select Gpr instead of Xmm registers.
ArtBlnd edited a comment on issue #4870:
Ok, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations via bitcasts results in terrible code:

```
// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }
VCode {
  Entry block: 0
  v130 := v135
  Block 0: (original IR block: block0) (instruction range: 0 .. 8)
    Inst 0: args %v128=%xmm0 %v129=%xmm1
    Inst 1: movd %v128, %v132l
    Inst 2: movd %v129, %v133l
    Inst 3: andl %v132l, %v133l, %v134l
    Inst 4: movd %v134l, %v135
    Inst 5: movaps %v130, %v131
    Inst 6: movaps %v131, %xmm0
    Inst 7: ret
}
```

I think we need to make F32 and F64 select Gpr instead of Xmm registers.
Is this possible, or should I make an optimization for scalar float memory ops? Ok, nvm; I think I just need to lower scalar bitops to XMM bitops.
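(For comparison, a hand-written sketch of the kind of lowering this is aiming for, keeping the values in XMM registers and using a single SSE bitwise instruction; this is not actual compiler output:)

```
VCode {
  Entry block: 0
  Block 0: (original IR block: block0)
    Inst 0: args %v128=%xmm0 %v129=%xmm1
    Inst 1: andps %v128, %v129, %v130   ; single SSE bitwise AND, no GPR round-trip
    Inst 2: movaps %v130, %xmm0
    Inst 3: ret
}
```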