wasmtime / issue #4424 Craneline vector shift count seman... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4424 Craneline vector shift count seman...

Wasmtime GitHub notifications bot (Jul 09 2022 at 17:38):

uweigand opened issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:31):

akirilov-arm labeled issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:31):

akirilov-arm labeled issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:31):

akirilov-arm labeled issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:31):

akirilov-arm labeled issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 11 2022 at 16:31):

akirilov-arm labeled issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Wasmtime GitHub notifications bot (Jul 18 2022 at 16:34):

abrown commented on issue #4424:

I agree that this should be fixed. This was one of my early my contributions to the project (when things were quite different, https://github.com/bytecodealliance/cranelift/pull/1120); using Cranelift apart from a WebAssembly translation was not top of mind back. Since both main architectures (x64 and aarch64) needed the mod operation, it seemed reasonable then to have the translation do it. Now it makes more sense to move the extra operations down into each backend that needs them.

Wasmtime GitHub notifications bot (Jul 27 2022 at 17:54):

jameysharp closed issue #4424:

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Last updated: Apr 17 2025 at 20:03 UTC