Stream: git-wasmtime

Topic: wasmtime / issue #4870 Cranelift: simplify opcode set by ...


view this post on Zulip Wasmtime GitHub notifications bot (Sep 05 2022 at 13:37):

afonso360 opened issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented everywhere.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.
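
For reference, a minimal CLIF sketch of that workaround, i.e. what a frontend can emit today instead of a bitwise op on floats. The function name and the exact spelling of the reinterpret cast (bitcast here) are only illustrative:

function %band_f32_via_bitcast(f32, f32) -> f32 {
block0(v0: f32, v1: f32):
    v2 = bitcast.i32 v0  ; reinterpret the f32 bits as an i32
    v3 = bitcast.i32 v1
    v4 = band v2, v3     ; do the bitwise AND on the integer values
    v5 = bitcast.f32 v4  ; reinterpret the result back as an f32
    return v5
}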

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Sep 05 2022 at 13:38):

afonso360 edited issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Sep 06 2022 at 09:05):

yuyang-ok commented on issue #4870:

+1 remove.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 06 2022 at 19:43):

jameysharp commented on issue #4870:

Perhaps optimizations like #4803 would do better if we keep bitwise CLIF operations on floats—we could express the optimization in the mid-end without introducing extra bitcast instructions.

That said, looking more closely, I think the best code sequence for that particular optimization is different enough across backends that we may not want to do it in the mid-end anyway. And it's niche enough to be a really weak argument for keeping bitwise operators on floats.

By contrast, I find the opposite argument compelling: they're complicated to lower on probably any architecture with a hardware floating-point unit, and nobody has wanted them yet.

So I'm also in favor of removing them.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 10:54):

akirilov-arm commented on issue #4870:

@afonso360 You are missing band_not in your list.

Those operations are actually straightforward to implement on AArch64: a hardware floating-point unit is present iff SIMD is supported, so the SIMD bitwise instructions can operate directly on the floating-point registers; only bxor_not needs a 2-instruction sequence (instead of 1). However, I probably also lean towards removing them, given the niche use case.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 11:17):

afonso360 commented on issue #4870:

Added, thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 11:17):

afonso360 edited issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 19:45):

cfallin commented on issue #4870:

Agreed, I think we should remove them. A floating-point type imbues the value with some meaning, such that it's not meant to be used (or is not typically used) as a "bucket of bits" type the way integer types are.

I suspect that it would be possible to do reasonable lowerings for these on x86-64 too (since we use XMM regs and would have the integer-vec instructions at our disposal, just as on aarch64), so that angle doesn't hold as much significance for me. However, the "only build what is needed" angle does: if no one actually needs these op/type combinations and they are extra lowerings to add, maintain, and test, then let's not have them IMHO.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 21:14):

jameysharp commented on issue #4870:

It turns out that Wasmtime can actually generate SIMD bitwise ops on floating-point vectors, as in this example (from tests/misc_testsuite/simd/issue_3327_bnot_lowering.wast and #3327):

(module
  (func $v128_not (export "v128_not") (result v128)
    v128.const f32x4 0 0 0 0
    f32x4.abs
    v128.not)
)

(assert_return (invoke "v128_not") (v128.const i32x4 -1 -1 -1 -1))

That currently compiles to this CLIF:

function u0:0(i64 vmctx) -> i8x16 fast {
    const0 = 0x00000000000000000000000000000000

                                block0(v0: i64):
@0026                               v2 = vconst.i8x16 const0
@0038                               v3 = raw_bitcast.f32x4 v2  ; v2 = const0
@0038                               v4 = fabs v3
@003b                               v5 = bnot v4
@003d                               v6 = raw_bitcast.i8x16 v5
@003d                               jump block1(v6)

                                block1(v1: i8x16):
@003d                               return v1
}

Do we want cranelift-wasm to insert a bitcast in between the fabs and bnot here?
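
For comparison, here is a hand-written sketch (not generated output) of what that CLIF could look like if cranelift-wasm inserted a bitcast so that the bnot operates on an integer vector type:

function u0:0(i64 vmctx) -> i8x16 fast {
    const0 = 0x00000000000000000000000000000000

block0(v0: i64):
    v2 = vconst.i8x16 const0
    v3 = raw_bitcast.f32x4 v2
    v4 = fabs v3
    v5 = raw_bitcast.i8x16 v4  ; inserted cast: bnot now sees an i8x16 instead of an f32x4
    v6 = bnot v5
    jump block1(v6)

block1(v1: i8x16):
    return v1
}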

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 23:00):

abrown commented on issue #4870:

I would prefer fewer raw_bitcasts if at all possible? I'm not very fond of them...

view this post on Zulip Wasmtime GitHub notifications bot (Sep 07 2022 at 23:29):

cfallin commented on issue #4870:

Ah, interesting! That actually does change the calculus a bit in my mind. Given the principle above (only build what is needed) as the higher-priority one, I think it's totally fine to keep this support. Then for consistency it probably makes sense to support scalar floats as well; otherwise we have support for scalar ints, and all kinds of vectors, but not scalar floats, which is an odd and surprising hole.
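
For concreteness, the scalar-float case that currently has no lowering on any backend is a function like the following (the same shape that comes up as a test case later in this thread), while the equivalent band on i32 or on vector types is supported:

function %band_f32(f32, f32) -> f32 {
block0(v0: f32, v1: f32):
    v2 = band v0, v1
    return v2
}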

view this post on Zulip Wasmtime GitHub notifications bot (Sep 12 2022 at 09:50):

akirilov-arm labeled issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Sep 12 2022 at 09:50):

akirilov-arm labeled issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Oct 06 2022 at 16:23):

jameysharp commented on issue #4870:

I'm closing this issue as we seem to have reached consensus on keeping bitwise operations on floats.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 06 2022 at 16:23):

jameysharp closed issue #4870:

:wave: Hey,

Jun Ryung Ju (sorry, I don't know their GitHub username) discovered on Zulip that bitwise operations on floats are unimplemented on AArch64. Upon further investigation, they seem to be unimplemented on all backends.

@bjorn3 also mentions that cg_clif does not use them; it does a bitcast and then performs the operation on the integer version.

This makes them a good candidate for removal, simplifying our opcode set.

Is there any benefit or optimization we could perform on these that would justify keeping them around?

The bitwise ops that I'm referring to are:

cc: @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:27):

ArtBlnd commented on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:31):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

lowered band

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:31):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

lowered band

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:32):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

lowered band

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:37):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

lowered band

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, I think regalloc2 does not allocate GPRs for scalar F32/F64 values. I might need to write a custom lowering rule for bitwise ops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:37):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, I think regalloc2 does not allocate GPRs for scalar F32/F64 values. I might need to write a custom lowering rule for bitwise ops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:37):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }
VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, I think regalloc2 does not allocate GPRs for scalar F32/F64 values. I might need to write a custom lowering rule for bitwise ops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:37):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, I think regalloc2 does not allocate GPRs for scalar F32/F64 values. I might need to write a custom lowering rule for bitwise ops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:40):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, I think regalloc2 does not allocate GPRs for scalar F32/F64 values. I might need to write a custom transform layer for float bitwise ops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 22:44):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, never mind. I think I just need to use XMM bitops to lower the scalar bitops.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 09 2022 at 23:08):

ArtBlnd edited a comment on issue #4870:

OK, I am currently working on implementing float bitops on x86_64. It looks like implementing the bitwise operations using bitcasts results in terrible code:

// function %band_f32(f32, f32) -> f32 {
// block0(v0: f32, v1: f32):
//     v2 = band v0, v1
//     return v2
// }

VCode {
  Entry block: 0
  v130 := v135
Block 0:
    (original IR block: block0)
    (instruction range: 0 .. 8)
  Inst 0: args %v128=%xmm0 %v129=%xmm1
  Inst 1: movd    %v128, %v132l
  Inst 2: movd    %v129, %v133l
  Inst 3: andl    %v132l, %v133l, %v134l
  Inst 4: movd    %v134l, %v135
  Inst 5: movaps  %v130, %v131
  Inst 6: movaps  %v131, %xmm0
  Inst 7: ret
}

I think we need to make F32 and F64 values select GPRs instead of XMM registers.
Is this possible, or should I add an optimization for scalar float memory ops?

OK, never mind. I think I just need to lower the scalar bitops to XMM bitops.

