Currently, if you codegen an operation such as u64::checked_sub(v) to wasm, you get a sequence of operations that first checks the bounds (i64.lt_u in this case) and then performs the actual subtraction inside an if.
This seems to be compiled pretty much verbatim by Cranelift on x86:
mov 0x30(%rdi),%rax  ;; load the global
cmp $0x4d2,%rax      ;; separate bounds check (0x4d2 = 1234)
jb fail
sub $0x4d2,%rax
mov %rax,0x30(%rdi)
fail:
ud2
However, this could just do the sub and get the overflow information directly from the flags. À la:
mov 0x30(%rdi),%rax
sub $0x4d2,%rax
jb fail              ;; CF set = the sub borrowed (unsigned underflow)
mov %rax,0x30(%rdi)
fail:
ud2
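In Rust terms, the difference is between checking before subtracting and subtracting first, then branching on the borrow (a sketch, not the exact source that produced the code above):
// Check-then-subtract: the shape the wasm encodes today.
pub fn sub_check_first(x: u64) -> u64 {
    if x < 1234 { panic!("underflow") } else { x - 1234 }
}

// Subtract-then-branch: overflowing_sub exposes the borrow the CPU
// already computed in its flags, so no separate compare is needed.
pub fn sub_branch_on_borrow(x: u64) -> u64 {
    let (res, borrowed) = x.overflowing_sub(1234);
    if borrowed { panic!("underflow") }
    res
}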
Would such an optimization be desirable? Is there an existing issue for this?
A similarly silly thing happens when using i64.sub128 to test for overflow:
mov 0x30(%rdi),%rax
xor %rdx,%rdx ;; rdx = 0
sub $0x4d2,%rax
sbb $0x0,%rdx ;; rdx = 0 - 0 - borrow (CF)
test %rdx,%rdx ;; rdx == 0?
jne fail
mov %rax,0x30(%rdi)
fail:
ud2
which could equally just branch on the carry flag from the sub (jb fail) instead of materializing it in %rdx and testing it.
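(For reference, i64.sub128 from the wide-arithmetic proposal takes lo/hi pairs and returns a lo/hi pair; a rough Rust analogue of its semantics, showing why a non-zero high half signals a borrow when both input high halves are zero:)
fn sub128(lhs_lo: u64, lhs_hi: u64, rhs_lo: u64, rhs_hi: u64) -> (u64, u64) {
    // Subtract the low halves; `borrow` is the carry-out of the low 64 bits.
    let (lo, borrow) = lhs_lo.overflowing_sub(rhs_lo);
    // Propagate the borrow into the high half: with zero high inputs,
    // `hi` is 0 on success and u64::MAX (i.e. -1) on underflow.
    let hi = lhs_hi.wrapping_sub(rhs_hi).wrapping_sub(borrow as u64);
    (lo, hi)
}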
Sample wat:
(module
  (global $glob (mut i64) (i64.const 12345))
  (func (export "one")
    global.get $glob
    i64.const 1234
    i64.lt_u
    if
      unreachable
    else
      (i64.sub (global.get $glob) (i64.const 1234))
      global.set $glob
    end
  )
  (func (export "two")
    (local i64)
    global.get $glob
    local.tee 0
    i64.const 1234
    i64.lt_u
    if
      unreachable
    else
      (i64.sub (local.get 0) (i64.const 1234))
      global.set $glob
    end
  )
  (func (export "three")
    (local i64)
    global.get $glob
    i64.const 0
    i64.const 1234
    i64.const 0
    i64.sub128
    i64.eqz
    if
    else
      unreachable
    end
    global.set $glob
  )
)
Optimizations are definitely desirable! This is a space that has historically been quite tricky to optimize, but if possible we'd definitely be interested.
it's tricky to match on in Cranelift's mid-end optimization framework because the operation is split across control-flow blocks.
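For illustration, the clif for func one looks roughly like this (a hand-written sketch, not actual --emit-clif output), with the icmp and the isub it guards landing in different blocks:
block0(v0: i64):
    v1 = load.i64 notrap aligned table v0+48  ;; global.get $glob
    v2 = iconst.i64 1234
    v3 = icmp ult v1, v2                      ;; the i64.lt_u bounds check...
    brif v3, block1, block2
block1:
    trap user11                               ;; unreachable
block2:
    v4 = isub v1, v2                          ;; ...and the isub, in another block
    store.i64 notrap aligned table v4, v0+48  ;; global.set $glob
    return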
I think the i64.sub128 case might be possible to improve without going across a control-flow block. The clif currently looks like this:
block0(v0: i64, v1: i64):
    @006c v4 = load.i64 notrap aligned table v0+48
    v17 = uextend.i128 v4
    @0070 v6 = iconst.i64 1234
    v19 = uextend.i128 v6 ; v6 = 1234
    @0075 v10 = isub v17, v19
    @0075 v11, v12 = isplit v10
    @006a v2 = iconst.i64 0
    @0077 v13 = icmp eq v12, v2 ; v2 = 0
    @0077 v14 = uextend.i32 v13
    @007b trapz v14, user11
    @0078 jump block2

block2:
    @007a jump block3

block3:
    @007d store.i64 notrap aligned table v11, v0+48
    @007f jump block1

block1:
    @007f return
The part with the icmp at @0077 in particular could probably be simplified (if trapz supports operands of non-i32 types?) to just
@007b trapz v12, user11
since it just compares the value with 0 and then passes the result along into trapz.
(assuming I'm looking at post-optimization clif as output by wasmtime --emit-clif?)
If nobody takes this on, I may at least take a look at the last thing here over the weekend.
(to trapnz, not trapz…)
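In ISLE that might look something along the lines of (a sketch only; the icmp/iconst extractor spellings here are assumptions, not checked against the real opts rules):
;; trapz on (x == 0) traps exactly when x != 0, i.e. it is trapnz on x.
(rule (simplify_skeleton
        (trapz (uextend _ (icmp _ (IntCC.Equal) x (iconst _ (u64_from_imm64 0)))) code))
      (trapnz x code))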
However, this could just do the sub and get the overflow information directly from the flags.
As a warning, this is going to be very tricky and needs careful design work -- there are a number of tough-to-crack reasons that we don't share flags generation between multiple consumers of those flags.
All this to say, it's not a simple "add an opt rule" thing -- we need deep thought on our core algorithms to give us the tools to do this.
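(One such reason, sketched in asm: the flags are a single implicit register, so anything scheduled between the flag-producing sub and a flag-consuming branch can silently clobber them:)
sub $0x4d2,%rax      ;; produces CF/OF/ZF
mov %rax,0x30(%rdi)  ;; mov happens to leave flags alone, but...
add $0x8,%rsi        ;; ...any interleaved arithmetic overwrites them
jb  fail             ;; this no longer observes the sub's flags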
I started prodding the i64.sub128 case with what I think should be a relatively straightforward rule:
(rule (simplify_skeleton (trapz (uextend _ a) code)) (trapz a code))
and
function %test1(i32) -> i32 {
block0(v0: i32):
    v1 = uextend.i64 v0
    trapz.i64 v1, user1
    return v0
}
as my test case. However, even though the logs suggest that the rule matches and applies (simplify_skeleton: replace inst with inst3: trapz.i32 v0, user1), things end up in a weird state shortly after:
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] Processing inst inst1: trapz.i64 v1, user1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts] new iter from root v1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts] iter: value v1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts] -> value of type i64
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] -> simplify_skeleton: yielded 1 simplification(s)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] -> simplify_skeleton: replace inst with inst3: trapz.i32 v0, user1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] -> mergeable side-effecting op inst1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] -> inserts as new (no GVN)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] Processing inst inst2: return v0
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] rewriting arg v0 of inst inst2 to v0
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] -> simplify_skeleton: yielded 0 simplification(s)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::alias_analysis] alias analysis: scanning at inst2 with state LastStores { heap: None, table: None, vmctx: None, other: None } (MultiAry { opcode: Return, args: EntityList { index: 9, unused: PhantomData<cranelift_codegen::ir::entities::Value> } })
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] egraph built:
function %test1(i32) -> i32 fast {
block0(v0: i32):
    trapz.i64 v1, user1 ;; /!\ why v1? the insn should've been replaced with `trapz.i32 v0`.
    return v0
}
which eventually gets rebuilt back to the original shape with the uextend. Before I dive into this (having read the cautionary message above as well as the comment on simplify_skeleton_inst): should it be possible to make this transformation? And am I right in thinking that the outcome here is weird?
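(For when this works: the fix could be pinned down with a clif filetest roughly like the following sketch, where the final check line is the hoped-for output:)
test optimize
set opt_level=speed
target x86_64

function %test1(i32) -> i32 {
block0(v0: i32):
    v1 = uextend.i64 v0
    trapz.i64 v1, user1
    return v0
}

; check: trapz v0, user1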
Adding a little more logging around the if cost < best_cost logic (line 689 in cranelift/codegen/src/egraph.rs), I see
[2025-10-10T17:22:12Z TRACE cranelift_codegen::egraph] -> cost = Cost::Finite { op_cost: 10, depth: 1 }; best_cost = Cost::Finite { op_cost: 10, depth: 1 }
so I think this is a consequence of Cost::of_skeleton_op not considering the cost of operands; the two trapz instructions appear equally costly, so we prefer not to replace.
I'd be happy to review a PR if you wanted to add operand costs to Cost::of_skeleton_op and then add that simplification rule!
Hm, this is somewhat problematic to do, as the elaborator computes costs for values only after the skeleton passes run. We also wouldn't want to recurse in Cost::of_skeleton_op, as that would lead straight into the pit of non-linearity.
For initial prototyping I'll compute operand costs recursing only 1 level deep, but in the final version the cost computation for skeleton ops may want to become a little more like that of of_pure_op, wherein the caller maintains a list of operand costs…
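Roughly, that of_pure_op-like shape could look as follows (a self-contained sketch with stand-in types; the real Cost in cranelift-codegen also tracks depth and its arithmetic differs):
// Stand-in for cranelift_codegen's internal cost type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Cost(u32);

/// Hypothetical of_pure_op-style skeleton cost: the caller looks up the
/// best cost of each operand and passes them in, so no recursion happens
/// here, and `trapz v0` (cheap block-param operand) beats `trapz v1`
/// (operand fed by a uextend).
fn of_skeleton_op(op_cost: u32, operand_costs: &[Cost]) -> Cost {
    operand_costs
        .iter()
        .fold(Cost(op_cost), |acc, c| Cost(acc.0.saturating_add(c.0)))
}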
Ah, yeah, you're right
An alternative approach is to say that simplify_skeleton always unconditionally takes one rewrite and performs it. That would mean the constructor is no longer "multi", which would cause the ISLE compiler's overlap checking to statically verify that only one rewrite is possible (either because of LHS non-overlap or because of explicit priorities); then we always take that rewrite. Since these ops don't get turned into multi-representation egraph nodes anyway, I don't see that as hugely problematic, though it does mean one has to be a little more deliberate about choosing the correct rewrites.
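For example, two overlapping rules would then need explicit priorities for islec's overlap check to pass (a sketch reusing the rewrites from earlier in the thread, with the same assumed extractor spellings):
;; The more specific rewrite wins by priority; at equal priority islec
;; would reject the overlap.
(rule 1 (simplify_skeleton
          (trapz (uextend _ (icmp _ (IntCC.Equal) x (iconst _ (u64_from_imm64 0)))) code))
        (trapnz x code))
(rule 0 (simplify_skeleton (trapz (uextend _ a) code))
        (trapz a code))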
(cc @fitzgen (he/him) on that one for thoughts; he's out today and will be back next week)
I wasn't able to spend too much time on this, although I did find myself hitting new and exciting bugs in the implementation of simplify_skeleton :frown:
if we wanted to keep the multi aspect and fix the cost computation, then we could create a separate type of eclass for the side-effectful skeleton that stores insts instead of values.
or we could just switch to non-multi and add priorities, although I vaguely remember attempting that at one point and running into islec bugs I didn't want to fix, iirc.
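(A sketch of the first alternative, with hypothetical names: a skeleton eclass that unions whole insts, so cost extraction can compare entire side-effecting alternatives rather than just values:)
// Hypothetical shape only; `Inst` stands in for cranelift_codegen::ir::Inst.
type Inst = u32;

enum SkeletonClass {
    /// A concrete side-effecting instruction in the skeleton.
    Inst(Inst),
    /// Alternatives produced by simplify_skeleton rewrites; extraction
    /// would pick the cheaper branch by cost, as for pure eclasses.
    Union(Box<SkeletonClass>, Box<SkeletonClass>),
}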
Simonas (nagisa) said:
I wasn't able to spend too much time on this, although I did find myself hitting new and exciting bugs in the implementation of simplify_skeleton :frown:
bug reports and fixes are always welcome!