ggreif opened PR #13332 from ggreif:gabor/clz-ctz-bool-fold to bytecodealliance:main:
Four mid-end ISLE rules in
opts/icmp.islefor the boolean-context cases:ctz(X) == 0 iff (X & 1) != 0 ctz(X) != 0 iff (X & 1) == 0 clz(X) == bw iff X == 0 clz(X) != bw iff X != 0When the result of
ctz/clzis only used to compare against the saturation value, rewrite so the bit-counting instruction is DCE'd. Backend then emits a singletest reg, imm/test reg, reginstead ofTZCNT/BSF/LZCNT/BSR+TEST+JCC— ~3 fewer cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), and proportionally more on JIT-less backends where bit-counting is a loop.Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under
-Osfor byte savings). With these rules, that fold is cycle-neutral on cranelift even when fed unconditionally, taking the air out of the main "but this regresses JITs" objection on the binaryen side.Filetest covers i32/i64 ctz/clz in both eq and ne forms plus a negative case (
clz(X) == 5must not trigger — that's a power-of-two range test on X, a separate rewrite family).
ggreif requested cfallin for a review on PR #13332.
ggreif requested wasmtime-compiler-reviewers for a review on PR #13332.
ggreif converted PR #13332 cranelift: fold ctz/clz comparisons against saturation values into direct LSB / null tests to a draft.
ggreif updated PR #13332.
ggreif edited PR #13332.
ggreif edited PR #13332:
Four mid-end ISLE rules in
opts/icmp.islefor the boolean-context cases — whenctz/clzflows into a comparison against zero (the consumer cares only \"is it zero?\", not the numeric value):\`\`\`
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
\`\`\`The bit-counting instruction is DCE'd; backend emits a single-cycle \`test reg, imm\` (LSB case) or \`test reg, reg; js\` (sign case) instead of \`TZCNT/BSF/LZCNT/BSR\` + \`TEST\` + \`JCC\` — saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), proportionally more on JIT-less backends.
Why this matters in practice
The poster-child workload is the Motoko runtime's discriminator test on every \`Nat\`/\`Int\` operation:
- Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.
- Heap-allocated big integers (via libtommath): low bit set (skew tag).
Every arithmetic op begins with \`(x & 1) == 0\` to pick the path. Once binaryen's LSB→ctz fold (WebAssembly/binaryen#8562, under \`-Os\`) lands, that wasm becomes \`ctz X == 0\` — and lands on cranelift as TZCNT + TEST + JCC on the hot path of every numeric op. With these rules, cranelift collapses it back to a single \`test r, 1\`, restoring the original cost while keeping the wasm byte savings.
The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; it's the structurally parallel rewrite and ships in the same patch for completeness.
Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (\`ctz(X) == 4\` must not trigger — that's a numeric-value test on the count, a different rewrite family).
ggreif edited PR #13332:
Four mid-end ISLE rules in \`opts/icmp.isle\` for the boolean-context cases — when \`ctz\`/\`clz\` flows into a comparison against zero (the consumer cares only \"is it zero?\", not the numeric value):
\`\`\`
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
\`\`\`The bit-counting instruction is DCE'd; backend emits a single-cycle \`test reg, imm\` (LSB case) or \`test reg, reg; js\` (sign case) instead of \`TZCNT/BSF/LZCNT/BSR\` + \`TEST\` + \`JCC\` — saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), proportionally more on JIT-less backends.
Why this matters in practice
The poster-child workload is the Motoko runtime's discriminator test on every \`Nat\`/\`Int\` operation:
- Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.
- Heap-allocated big integers (via libtommath): low bit set (skew tag).
Every arithmetic op begins with this LSB test. The Motoko codegen (\`src/codegen/instrList.ml:97-100\`) already emits the LSB-test-of-AND-1 pattern as \`(ctz X)\` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.
With these rules in place, cranelift collapses the comparison back to a single \`test r, 1\` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).
The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; structurally parallel rewrite, ships in the same patch.
The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under \`-Os\`); landing it there together with this in cranelift gives byte savings without cycle cost.
Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (\`ctz(X) == 4\` must not trigger — that's a numeric-value test on the count, a different rewrite family).
ggreif commented on PR #13332:
Sketched an extension to also catch the wasm-emitted shape
brif (ireduce.i32 (ctz.i64 X))(andclz), which is what frontends like Motoko'smocproduce — wasm'siftakes an i32 condition, so the i64 LSB test always flows throughireduceandbrifdirectly, with noicmpinterposed for the egraph rules in this PR to match.Scope creep: the natural place is each backend's
is_nonzerohelper (x64inst.isle:3806-3826, aarch64inst.isle:4659-4670, plus riscv64 and s390x), where rules like(rule (is_nonzero (ctz val @ (value_type (ty_32_or_64 ty)))) (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z))) (rule (is_nonzero (clz val @ (value_type (ty_32_or_64 ty)))) (CondResult.CC (x64_test ty val val) (CC.NS)))would lower
brif (ctz X)totest X, 1; jzandbrif (clz X)totest X, X; jnsdirectly, plus an(ireduce _ (ctz/clz _))variant for the wasm path.That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.
ggreif has marked PR #13332 as ready for review.
ggreif commented on PR #13332:
Concrete real-world workload for the
clzboolean-context fold: Motoko's classical-persistence backend already emits the barei32.clz; ifshape forInt <-> 0sign tests (e.g.Prim.abs). The compiler-side peephole (shrU 31; if -> clz; if(swap)andand 0x80000000; if -> clz; if(swap)inmotoko/src/codegen/instrList.ml) generates this directly because their target is i32 wasm without a wrap.So the JIT-side fold here is the natural meeting point for classical-Motoko output: clz directly into brif, no icmp, no ireduce. The rules in this PR don't yet catch that shape (it's
brif (clz X), noticmp eq (clz X) 0), but the equivalent backend lowering (per the previous comment) would close that gap end-to-end.
Last updated: Jun 01 2026 at 09:49 UTC