jameysharp opened issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
jameysharp labeled issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values.
Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
jameysharp edited issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for
bt, not any of the other instructions which do modify their operands.)Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
alexcrichton commented on issue #6103:
Interestingly LLVM only emits
btsfor this code with-Copt-level=sfor Rust code:#[no_mangle] pub extern "C" fn foo(a: i64) -> i64 { a | (1 << 31) }yields:
$ rustc foo.rs --target x86_64-unknown-linux-gnu -O --crate-type cdylib --emit obj && llvm-objdump -S foo.o foo.o: file format elf64-x86-64 Disassembly of section .text.foo: 0000000000000000 <foo>: 0: b8 00 00 00 80 movl $2147483648, %eax # imm = 0x80000000 5: 48 09 f8 orq %rdi, %rax 8: c3 retq $ rustc foo.rs --target x86_64-unknown-linux-gnu -Copt-level=s --crate-type cdylib --emit obj && llvm-objdump -S foo.o foo.o: file format elf64-x86-64 Disassembly of section .text.foo: 0000000000000000 <foo>: 0: 48 89 f8 movq %rdi, %rax 3: 48 0f ba e8 1f btsq $31, %rax 8: c3 retqIronically here they're the same size but I think that's just because a register movement was required, I can see how in general this generates slightly more compact code.
Speed-wise though it may be the case that these instructions take more cycles or something like that to prevent LLVM from emitting them by default? I could imagine though that if the instruction was folded into a comparison and elided a comparison instruction that would probably be a speed boost no matter what.
jameysharp commented on issue #6103:
Interesting! I've now checked a few more cases.
opt-level1-3 all construct the constant mask like you observed, whileopt-level=sandopt-level=zboth use these instructions.- Even with
opt-level=s, integers smaller thanu64don't use these instructions.- The bit index doesn't matter for instruction selection, only for whether a
movlis sufficient or a fullmovqis required to load the mask into a register. I thought it might prefer these instructions for bits 32-63 since themovqencoding is so much larger, but no.- None of these configurations use the
btinstruction: apparentlyboolis represented as "anything non-zero is true" so forx & (1 << 63) != 0it emitted a single right-shift at all optimization levels.
cfallin commented on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example, AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
cfallin edited a comment on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (p. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
cfallin edited a comment on issue #6103:
IMHO we should answer this question based on known performance characteristics of x86 implementations. In modern cores, the bit-test instructions are known to be slower than plain ALU operations. According to Agner Fog's instruction tables, on IceLake (a recent-ish Intel core) for example (pp. 350-351 in that PDF), AND/OR/XOR with reg/reg or reg/imm forms have a latency of 1 cycle and a throughput of 4/cycle, while BTS/BTR/BTC with reg/reg or reg/imm have a latency of 1 cycle and a throughput of 2/cycle. It gets worse for memory forms: the implementation is microcoded (11 uops; up to a few uops can be emitted by the hardware decoders, anything greater switches the frontend to microcode-sequencing mode with associated penalty).
In general I think the tendency of modern compilers is to stick to the "RISC-like" areas of x86 because the more complex corners of the ISA are unoptimized; IMHO we should do the same as we'd prefer performance over saving a few bytes of code size.
jameysharp commented on issue #6103:
Good to know! I'll close this then.
jameysharp closed issue #6103:
Feature
x86 has dedicated instructions that we could use for the following patterns:
borwith a constant power-of-two:bts("Bit Test and Set")bxorwith a constant power-of-two:btc("Bit Test and Complement")bandwith the bitwise-complement of a constant power-of-two:btr("Bit Test and Reset")icmpofbandwith a constant power-of-two:bt("Bit Test") (this one may be a little trickier)It's also possible to match any of these patterns with a non-constant mask produced by
shl 1, x(plusbnotin the case of thebtrpattern). However in that case it's necessary to make sure thatxis used moduloN, whereNis the number of bits in the result type. See the lowerings forshlfor details, and note that the bit-test instructions use their operands modulo the operand width just like the x86 shift instructions do.Benefit
Testing and modifying individual bits is a common pattern in certain kinds of systems programming, so such programs will produce smaller code and may run faster.
These instructions don't modify their operands, unlike the and/or/xor instructions, so they're potentially eligible for load-sinking, and also register allocation may be able to avoid duplicating values. (edit: this is obviously only true for
bt, not any of the other instructions which do modify their operands.)Implementation
This is mostly like adding any other instruction, but as a particular note, we have an extractor called
imm64_power_of_twofor getting the base-two log of aniconstif that constant is a power of two.
Last updated: Dec 13 2025 at 19:03 UTC