Stream: git-wasmtime

Topic: wasmtime / issue #6103 cranelift/x64: Use "bit test" inst...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 16:14):

jameysharp opened issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 16:14):

jameysharp labeled issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 16:14):

jameysharp labeled issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 16:14):

jameysharp labeled issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 16:14):

jameysharp labeled issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 27 2023 at 17:18):

jameysharp edited issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for bt, not any of the other instructions which do modify their operands.)

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 14:30):

alexcrichton commented on issue #6103:

Interestingly LLVM only emits bts for this code with -Copt-level=s for Rust code:

#[no_mangle]
pub extern "C" fn foo(a: i64) -> i64 {
    a | (1 << 31)
}

yields:

$ rustc foo.rs --target x86_64-unknown-linux-gnu -O --crate-type cdylib --emit obj && llvm-objdump -S foo.o

foo.o:  file format elf64-x86-64

Disassembly of section .text.foo:

0000000000000000 <foo>:
       0: b8 00 00 00 80                movl    $2147483648, %eax       # imm = 0x80000000
       5: 48 09 f8                      orq     %rdi, %rax
       8: c3                            retq
$ rustc foo.rs --target x86_64-unknown-linux-gnu -Copt-level=s --crate-type cdylib --emit obj && llvm-objdump -S foo.o

foo.o:  file format elf64-x86-64

Disassembly of section .text.foo:

0000000000000000 <foo>:
       0: 48 89 f8                      movq    %rdi, %rax
       3: 48 0f ba e8 1f                btsq    $31, %rax
       8: c3                            retq

Ironically here they're the same size but I think that's just because a register movement was required, I can see how in general this generates slightly more compact code.

Speed-wise though it may be the case that these instructions take more cycles or something like that to prevent LLVM from emitting them by default? I could imagine though that if the instruction was folded into a comparison and elided a comparison instruction that would probably be a speed boost no matter what.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 20:03):

jameysharp commented on issue #6103:

Interesting! I've now checked a few more cases.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 20:16):

cfallin commented on issue #6103:

IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example, AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).

In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 20:18):

cfallin edited a comment on issue #6103:

IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (p. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).

In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 20:18):

cfallin edited a comment on issue #6103:

IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (pp. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).

In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 21:51):

jameysharp commented on issue #6103:

Good to know! I'll close this then.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 28 2023 at 21:51):

jameysharp closed issue #6103:

Feature

x86 has dedicated instructions that we could use for the following patterns:

It's also possible to match any of these patterns with a non-constant mask produced by shl 1, x (plus bnot in the case of the btr pattern). However in that case it's necessary to make sure that x is used modulo N, where N is the number of bits in the result type. See the lowerings for shl for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.

Benefit

Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.

These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for bt, not any of the other instructions which do modify their operands.)

Implementation

This is mostly like adding any other instruction, but as a particular note, we have an extractor called imm64_power_of_two for getting the base-two log of an iconst if that constant is a power of two.


Last updated: Nov 22 2024 at 17:03 UTC