jameysharp opened issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
jameysharp edited issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for
bt
, not any of the other instructions which do modify their operands.)Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
alexcrichton commented on issue #6103:
Interestingly LLVM only emits
bts
for this code with-Copt-level=s
for Rust code:#[no_mangle] pub extern "C" fn foo(a: i64) -> i64 { a | (1 << 31) }
yields:
$ rustc foo.rs --target x86_64-unknown-linux-gnu -O --crate-type cdylib --emit obj && llvm-objdump -S foo.o foo.o: file format elf64-x86-64 Disassembly of section .text.foo: 0000000000000000 <foo>: 0: b8 00 00 00 80 movl $2147483648, %eax # imm = 0x80000000 5: 48 09 f8 orq %rdi, %rax 8: c3 retq $ rustc foo.rs --target x86_64-unknown-linux-gnu -Copt-level=s --crate-type cdylib --emit obj && llvm-objdump -S foo.o foo.o: file format elf64-x86-64 Disassembly of section .text.foo: 0000000000000000 <foo>: 0: 48 89 f8 movq %rdi, %rax 3: 48 0f ba e8 1f btsq $31, %rax 8: c3 retq
Ironically here they're the same size but I think that's just because a register movement was required, I can see how in general this generates slightly more compact code.
Speed-wise though it may be the case that these instructions take more cycles or something like that to prevent LLVM from emitting them by default? I could imagine though that if the instruction was folded into a comparison and elided a comparison instruction that would probably be a speed boost no matter what.
jameysharp commented on issue #6103:
Interesting! I've now checked a few more cases.
opt-level
1-3 all construct the constant mask like you observed, whileopt-level=s
andopt-level=z
both use these instructions.- Even with
opt-level=s
, integers smaller thanu64
don't use these instructions.- The bit index doesn't matter for instruction selection, only for whether a
movl
is sufficient or a fullmovq
is required to load the mask into a register. I thought it might prefer these instructions for bits 32-63 since themovq
encoding is so much larger, but no.- None of these configurations use the
bt
instruction: apparentlybool
is represented as "anything non-zero is true" so forx & (1 << 63) != 0
it emitted a single right-shift at all optimization levels.
cfallin commented on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example, AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
cfallin edited a comment on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (p. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
cfallin edited a comment on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (pp. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
jameysharp commented on issue #6103:
Good to know! I'll close this then.
jameysharp closed issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
bor
with a constant power-of-two:bts
("Bit Test and Set")bxor
with a constant power-of-two:btc
("Bit Test and Complement")band
with the bitwise-complement of a constant power-of-two:btr
("Bit Test and Reset")icmp
ofband
with a constant power-of-two:bt
("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x
(plusbnot
in the case of thebtr
pattern). However in that case it's necessary to make sure thatx
is used moduloN
, whereN
is the number of bits in the result type. See the lowerings forshl
for details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for
bt
, not any of the other instructions which do modify their operands.)Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_two
for getting the base-two log of aniconst
if that constant is a power of two.
Last updated: Dec 23 2024 at 13:07 UTC