wasmtime / PR #13332 cranelift: fold ctz/clz comparisons ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #13332 cranelift: fold ctz/clz comparisons ...

Wasmtime GitHub notifications bot (May 09 2026 at 15:04):

ggreif opened PR #13332 from ggreif:gabor/clz-ctz-bool-fold to bytecodealliance:main:

Four mid-end ISLE rules in opts/icmp.isle for the boolean-context cases:
ctz(X) == 0   iff  (X & 1) != 0
ctz(X) != 0   iff  (X & 1) == 0
clz(X) == bw  iff  X == 0
clz(X) != bw  iff  X != 0
When the result of ctz/clz is only used to compare against the saturation value, rewrite so the bit-counting instruction is DCE'd. Backend then emits a single test reg, imm / test reg, reg instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC — ~3 fewer cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), and proportionally more on JIT-less backends where bit-counting is a loop.

Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these rules, that fold is cycle-neutral on cranelift even when fed unconditionally, taking the air out of the main "but this regresses JITs" objection on the binaryen side.

Filetest covers i32/i64 ctz/clz in both eq and ne forms plus a negative case (clz(X) == 5 must not trigger — that's a power-of-two range test on X, a separate rewrite family).

Wasmtime GitHub notifications bot (May 09 2026 at 15:04):

ggreif requested cfallin for a review on PR #13332.

Wasmtime GitHub notifications bot (May 09 2026 at 15:04):

ggreif requested wasmtime-compiler-reviewers for a review on PR #13332.

Wasmtime GitHub notifications bot (May 09 2026 at 15:08):

ggreif converted PR #13332 cranelift: fold ctz/clz comparisons against saturation values into direct LSB / null tests to a draft.

Wasmtime GitHub notifications bot (May 09 2026 at 15:14):

ggreif updated PR #13332.

Wasmtime GitHub notifications bot (May 09 2026 at 15:14):

ggreif edited PR #13332.

Wasmtime GitHub notifications bot (May 09 2026 at 15:18):

ggreif edited PR #13332:

Four mid-end ISLE rules in opts/icmp.isle for the boolean-context cases — when ctz/clz flows into a comparison against zero (the consumer cares only \"is it zero?\", not the numeric value):

\`\`\`
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
\`\`\`

The bit-counting instruction is DCE'd; backend emits a single-cycle \`test reg, imm\` (LSB case) or \`test reg, reg; js\` (sign case) instead of \`TZCNT/BSF/LZCNT/BSR\` + \`TEST\` + \`JCC\` — saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), proportionally more on JIT-less backends.

Why this matters in practice

The poster-child workload is the Motoko runtime's discriminator test on every \`Nat\`/\`Int\` operation:

Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.

Heap-allocated big integers (via libtommath): low bit set (skew tag).

Every arithmetic op begins with \`(x & 1) == 0\` to pick the path. Once binaryen's LSB→ctz fold (WebAssembly/binaryen#8562, under \`-Os\`) lands, that wasm becomes \`ctz X == 0\` — and lands on cranelift as TZCNT + TEST + JCC on the hot path of every numeric op. With these rules, cranelift collapses it back to a single \`test r, 1\`, restoring the original cost while keeping the wasm byte savings.

The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; it's the structurally parallel rewrite and ships in the same patch for completeness.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (\`ctz(X) == 4\` must not trigger — that's a numeric-value test on the count, a different rewrite family).

Wasmtime GitHub notifications bot (May 09 2026 at 15:23):

ggreif edited PR #13332:

Four mid-end ISLE rules in \`opts/icmp.isle\` for the boolean-context cases — when \`ctz\`/\`clz\` flows into a comparison against zero (the consumer cares only \"is it zero?\", not the numeric value):

\`\`\`
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
\`\`\`

The bit-counting instruction is DCE'd; backend emits a single-cycle \`test reg, imm\` (LSB case) or \`test reg, reg; js\` (sign case) instead of \`TZCNT/BSF/LZCNT/BSR\` + \`TEST\` + \`JCC\` — saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), proportionally more on JIT-less backends.

Why this matters in practice

The poster-child workload is the Motoko runtime's discriminator test on every \`Nat\`/\`Int\` operation:

Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.

Heap-allocated big integers (via libtommath): low bit set (skew tag).

Every arithmetic op begins with this LSB test. The Motoko codegen (\`src/codegen/instrList.ml:97-100\`) already emits the LSB-test-of-AND-1 pattern as \`(ctz X)\` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.

With these rules in place, cranelift collapses the comparison back to a single \`test r, 1\` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).

The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; structurally parallel rewrite, ships in the same patch.

The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under \`-Os\`); landing it there together with this in cranelift gives byte savings without cycle cost.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (\`ctz(X) == 4\` must not trigger — that's a numeric-value test on the count, a different rewrite family).

Wasmtime GitHub notifications bot (May 09 2026 at 16:47):

ggreif commented on PR #13332:

Sketched an extension to also catch the wasm-emitted shape brif (ireduce.i32 (ctz.i64 X)) (and clz), which is what frontends like Motoko's moc produce — wasm's if takes an i32 condition, so the i64 LSB test always flows through ireduce and brif directly, with no icmp interposed for the egraph rules in this PR to match.

Scope creep: the natural place is each backend's is_nonzero helper (x64 inst.isle:3806-3826, aarch64 inst.isle:4659-4670, plus riscv64 and s390x), where rules like
(rule (is_nonzero (ctz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))

(rule (is_nonzero (clz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val val) (CC.NS)))
would lower brif (ctz X) to test X, 1; jz and brif (clz X) to test X, X; jns directly, plus an (ireduce _ (ctz/clz _)) variant for the wasm path.

That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.

Wasmtime GitHub notifications bot (May 09 2026 at 16:48):

ggreif has marked PR #13332 as ready for review.

Wasmtime GitHub notifications bot (May 09 2026 at 16:50):

ggreif commented on PR #13332:

Concrete real-world workload for the clz boolean-context fold: Motoko's classical-persistence backend already emits the bare i32.clz; if shape for Int <-> 0 sign tests (e.g. Prim.abs). The compiler-side peephole (shrU 31; if -> clz; if(swap) and and 0x80000000; if -> clz; if(swap) in motoko/src/codegen/instrList.ml) generates this directly because their target is i32 wasm without a wrap.

So the JIT-side fold here is the natural meeting point for classical-Motoko output: clz directly into brif, no icmp, no ireduce. The rules in this PR don't yet catch that shape (it's brif (clz X), not icmp eq (clz X) 0), but the equivalent backend lowering (per the previous comment) would close that gap end-to-end.

Last updated: Jul 29 2026 at 05:03 UTC