uweigand opened issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
akirilov-arm labeled issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
akirilov-arm labeled issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
akirilov-arm labeled issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
akirilov-arm labeled issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
akirilov-arm labeled issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
abrown commented on issue #4424:
I agree that this should be fixed. This was one of my early my contributions to the project (when things were quite different, https://github.com/bytecodealliance/cranelift/pull/1120); using Cranelift apart from a WebAssembly translation was not top of mind back. Since both main architectures (x64 and aarch64) needed the mod operation, it seemed reasonable then to have the translation do it. Now it makes more sense to move the extra operations down into each backend that needs them.
jameysharp closed issue #4424:
In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.
This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
v0 = iconst.i32 17 ; note that this will shift off the end of each lane v1 = vconst.i16x8 [1 2 4 8 16 32 64 128] v2 = ishl v1, v0
However, looking at the documentation, we have this statement:
Integer shift left. Shift the bits in ``x`` towards the MSB by ``y`` places. Shift in zero bits to the LSB. The shift amount is masked to the size of ``x``. When shifting a B-bits integer type, this instruction computes: s &:= y \pmod B, a &:= x \cdot 2^s \pmod{2^B}.
Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.
And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.
Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => { let (a, b) = state.pop2(); let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder); let bitwidth = i64::from(type_of(op).lane_bits()); // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width // we do `b AND 15`; this means fewer instructions than `iconst + urem`. let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1); state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth)) }
It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.
In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.
FYI @cfallin @afonso360
Last updated: Dec 23 2024 at 12:05 UTC