bjorn3 labeled issue #3615:
.clif
Test Casetarget x86_64 nehalem function u0:9(i16) -> i32 system_v { ; symbol _RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros block0(v0: i16): v1 -> v0 jump block1 block1: @0000 v2 = bnot.i16 v1 @000b v3 = popcnt v2 v4 -> v3 @000b jump block2 block2: @000d v5 = uextend.i32 v4 @0011 return v5 }
Expected Results
The popcnt instruction only takes the 16bit part of
v1
into account.Actual Results
The popcnt instruction takes a 32bit register into account when computing the population count despite the upper half having an undefined value.
0000000000027246 <_RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros>: 27246: 55 push %rbp 27247: 48 89 e5 mov %rsp,%rbp 2724a: f7 d7 not %edi 2724c: f3 0f b8 f7 popcnt %edi,%esi 27250: 0f b7 f6 movzwl %si,%esi 27253: 48 89 f0 mov %rsi,%rax 27256: 48 89 ec mov %rbp,%rsp 27259: 5d pop %rbp 2725a: c3 retq
Versions and Environment
Cranelift version or commit: Cranelift 0.79.0
Operating system: Linux
Architecture: x86_64
Extra Info
For reference: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/isle.20performance.20with.20cg_clif/near/265273960
bjorn3 opened issue #3615:
.clif
Test Casetarget x86_64 nehalem function u0:9(i16) -> i32 system_v { ; symbol _RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros block0(v0: i16): v1 -> v0 jump block1 block1: @0000 v2 = bnot.i16 v1 @000b v3 = popcnt v2 v4 -> v3 @000b jump block2 block2: @000d v5 = uextend.i32 v4 @0011 return v5 }
Expected Results
The popcnt instruction only takes the 16bit part of
v1
into account.Actual Results
The popcnt instruction takes a 32bit register into account when computing the population count despite the upper half having an undefined value.
0000000000027246 <_RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros>: 27246: 55 push %rbp 27247: 48 89 e5 mov %rsp,%rbp 2724a: f7 d7 not %edi 2724c: f3 0f b8 f7 popcnt %edi,%esi 27250: 0f b7 f6 movzwl %si,%esi 27253: 48 89 f0 mov %rsi,%rax 27256: 48 89 ec mov %rbp,%rsp 27259: 5d pop %rbp 2725a: c3 retq
Versions and Environment
Cranelift version or commit: Cranelift 0.79.0
Operating system: Linux
Architecture: x86_64
Extra Info
For reference: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/isle.20performance.20with.20cg_clif/near/265273960
bjorn3 labeled issue #3615:
.clif
Test Casetarget x86_64 nehalem function u0:9(i16) -> i32 system_v { ; symbol _RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros block0(v0: i16): v1 -> v0 jump block1 block1: @0000 v2 = bnot.i16 v1 @000b v3 = popcnt v2 v4 -> v3 @000b jump block2 block2: @000d v5 = uextend.i32 v4 @0011 return v5 }
Expected Results
The popcnt instruction only takes the 16bit part of
v1
into account.Actual Results
The popcnt instruction takes a 32bit register into account when computing the population count despite the upper half having an undefined value.
0000000000027246 <_RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros>: 27246: 55 push %rbp 27247: 48 89 e5 mov %rsp,%rbp 2724a: f7 d7 not %edi 2724c: f3 0f b8 f7 popcnt %edi,%esi 27250: 0f b7 f6 movzwl %si,%esi 27253: 48 89 f0 mov %rsi,%rax 27256: 48 89 ec mov %rbp,%rsp 27259: 5d pop %rbp 2725a: c3 retq
Versions and Environment
Cranelift version or commit: Cranelift 0.79.0
Operating system: Linux
Architecture: x86_64
Extra Info
For reference: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/isle.20performance.20with.20cg_clif/near/265273960
cfallin commented on issue #3615:
OK, I was nerdsniped into looking at this briefly; it looks like an issue introduced with #2887 (the initial PR that makes use of
popcnt
instruction) and masked until recently because the upper bits happened to be zero in the register, as you say.The basic issue is that this match arm matches on
I32
, and the impl is intended only for actual 32-bit values; but above that, this code altersty
and computes anext_spec
(extension specification), turningI8
/I16
into anI32
with extension first. But this extension only happens with the fallback (non-popcnt
-instruction implementation) further down, not the "usepopcnt
if we can" early-out.I think we can just move the
let (ext_spec, ty) = ...
further down, below theuse_popcnt
case. If you want to put together a PR for that, I'm happy to review!
cfallin closed issue #3615:
.clif
Test Casetarget x86_64 nehalem function u0:9(i16) -> i32 system_v { ; symbol _RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros block0(v0: i16): v1 -> v0 jump block1 block1: @0000 v2 = bnot.i16 v1 @000b v3 = popcnt v2 v4 -> v3 @000b jump block2 block2: @000d v5 = uextend.i32 v4 @0011 return v5 }
Expected Results
The popcnt instruction only takes the 16bit part of
v1
into account.Actual Results
The popcnt instruction takes a 32bit register into account when computing the population count despite the upper half having an undefined value.
0000000000027246 <_RNvMs_NtCs1K2K8gQcegq_4core3nums11count_zeros>: 27246: 55 push %rbp 27247: 48 89 e5 mov %rsp,%rbp 2724a: f7 d7 not %edi 2724c: f3 0f b8 f7 popcnt %edi,%esi 27250: 0f b7 f6 movzwl %si,%esi 27253: 48 89 f0 mov %rsi,%rax 27256: 48 89 ec mov %rbp,%rsp 27259: 5d pop %rbp 2725a: c3 retq
Versions and Environment
Cranelift version or commit: Cranelift 0.79.0
Operating system: Linux
Architecture: x86_64
Extra Info
For reference: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/isle.20performance.20with.20cg_clif/near/265273960
Last updated: Jan 24 2025 at 00:11 UTC