Stream: cranelift

Topic: codegen for overflow-checking operations


view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 13:07):

Currently, if you codegen an operation such as u64::checked_sub(v) to wasm, you get a sequence of operations that first checks the bounds (i64.lt_u in this case) and then performs the actual subtraction inside an if.
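
For reference, a sketch of the Rust-level source that produces this shape (the real code writes a mutable wasm global, per the wat further down; this sketch uses a plain parameter, and the function name is invented):

pub fn checked(glob: u64) -> u64 {
    // `checked_sub` lowers to the `i64.lt_u` guard; the panic path from
    // `unwrap` becomes the `unreachable` trap.
    glob.checked_sub(1234).unwrap()
}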

The wasm, in turn, seems to be compiled pretty much verbatim by Cranelift on x86:

  mov    0x30(%rdi),%rax
  cmp    $0x4d2,%rax
  jb     fail
  sub    $0x4d2,%rax
  mov    %rax,0x30(%rdi)
  ret                              ;; (return path elided in the original paste)
fail:
  ud2

However, this could compute the sub and read the overflow information directly from the flags, à la:

  mov    0x30(%rdi),%rax
  sub    $0x4d2,%rax
  jb     fail                      ;; CF (the unsigned borrow), not OF, signals underflow here
  mov    %rax,0x30(%rdi)
  ret                              ;; (return path elided)
fail:
  ud2

Would such an optimization be desirable? Is there an issue for this?

view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 13:15):

A similarly silly thing happens when using i64.sub128 to test for overflow:

  mov    0x30(%rdi),%rax
  xor    %rdx,%rdx                 ;; rdx = 0
  sub    $0x4d2,%rax
  sbb    $0x0,%rdx                 ;; rdx = 0 - 0 - CF (the borrow from the sub)
  test   %rdx,%rdx                 ;; rdx == 0?
  jne    fail
  mov    %rax,0x30(%rdi)
  ret                              ;; (return path elided)
fail:
  ud2

which could equally just test the carry flag left by the sub.

Sample wat:

(module
  (global $glob (mut i64) (i64.const 12345))

  (func (export "one")
    global.get $glob
    i64.const 1234
    i64.lt_u
    if
      unreachable
    else
      (i64.sub (global.get $glob) (i64.const 1234))
      global.set $glob
    end
  )

  (func (export "two")
    (local i64)
    global.get $glob
    local.tee 0
    i64.const 1234
    i64.lt_u
    if
      unreachable
    else
      (i64.sub (local.get 0) (i64.const 1234))
      global.set $glob
    end
  )

  (func (export "three")
    (local i64)
    global.get $glob
    i64.const 0
    i64.const 1234
    i64.const 0
    i64.sub128
    i64.eqz
    if
    else
      unreachable
    end
    global.set $glob
  )
)
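
For comparison, a Rust-level sketch of the shape behind function "three" (function name invented; whether this actually lowers to i64.sub128 depends on the toolchain and the wide-arithmetic feature):

pub fn three(glob: u64) -> u64 {
    // Widening to u128 makes the borrow visible in the high 64 bits,
    // which is what the wat tests via i64.sub128 + i64.eqz.
    let wide = (glob as u128).wrapping_sub(1234);
    if (wide >> 64) != 0 {
        unreachable!(); // the wat's `unreachable`
    }
    wide as u64
}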

view this post on Zulip Alex Crichton (Oct 09 2025 at 14:35):

Optimizations are definitely desirable! This is a space that has historically been quite tricky to optimize, but if it's possible we'd definitely be interested.

view this post on Zulip fitzgen (he/him) (Oct 09 2025 at 15:30):

it's tricky to match on in Cranelift's mid-end optimization framework because the operation is split across control-flow blocks

view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 15:38):

I think the i64.sub128 case might be possible to improve without going across a control-flow block. The clif currently looks like this:

                                block0(v0: i64, v1: i64):
@006c                               v4 = load.i64 notrap aligned table v0+48
                                    v17 = uextend.i128 v4
@0070                               v6 = iconst.i64 1234
                                    v19 = uextend.i128 v6  ; v6 = 1234
@0075                               v10 = isub v17, v19
@0075                               v11, v12 = isplit v10
@006a                               v2 = iconst.i64 0
@0077                               v13 = icmp eq v12, v2  ; v2 = 0
@0077                               v14 = uextend.i32 v13
@007b                               trapz v14, user11
@0078                               jump block2

                                block2:
@007a                               jump block3

                                block3:
@007d                               store.i64 notrap aligned table v11, v0+48
@007f                               jump block1

                                block1:
@007f                               return

The part with the icmp at @0077 in particular could probably be simplified (assuming trapz supports non-i32-sized values?) to just

@007b                               trapz v12, user11

since it just compares the value with 0 and then passes the result along to trapz.

view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 15:41):

(assuming I'm looking at post-optimization clif as output by wasmtime --emit-clif?)

view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 15:46):

if nobody takes this on, I may at least take a look at the last thing here over the weekend.

view this post on Zulip Simonas (nagisa) (Oct 09 2025 at 15:53):

(that should be trapnz, not trapz…)
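
For the record, a tiny Rust check (not from the thread) that the icmp + uextend + trapz chain above really amounts to trapnz v12:

fn main() {
    for v12 in [0i64, 1, -1, i64::MIN] {
        let v13 = (v12 == 0) as u32;            // icmp eq v12, 0 + uextend
        let trapz_would_fire = v13 == 0;        // trapz v14, user11
        assert_eq!(trapz_would_fire, v12 != 0); // i.e. trapnz v12
    }
}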

view this post on Zulip Chris Fallin (Oct 09 2025 at 17:28):

Simonas (nagisa) said:

However, this could compute the sub and read the overflow information directly from the flags

This is going to be very tricky and needs careful design work, as a warning -- there are a number of tough-to-crack reasons that we don't share one instruction's flags generation with multiple consumers of those flags.

All this to say, it's not a simple "add an opt rule" thing -- we need deep thought on our core algorithms to give us the tools to do this

view this post on Zulip Simonas (nagisa) (Oct 10 2025 at 17:09):

I started prodding at the i64.sub128 case with what I think should be a relatively straightforward rule:

(rule (simplify_skeleton (trapz (uextend _ a) code)) (trapz a code))

and

function %test1(i32) -> i32 {
block0(v0: i32):
    v1 = uextend.i64 v0
    trapz.i64 v1, user1
    return v0
}

as my test case. However, even though the logs suggest that the rule matches and applies (simplify_skeleton: replace inst with inst3: trapz.i32 v0, user1), things end up weird shortly after:

[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] Processing inst inst1: trapz.i64 v1, user1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts] new iter from root v1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts] iter: value v1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::opts]  -> value of type i64
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph]  -> simplify_skeleton: yielded 1 simplification(s)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph]  -> simplify_skeleton: replace inst with inst3: trapz.i32 v0, user1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph]  -> mergeable side-effecting op inst1
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph]  -> inserts as new (no GVN)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] Processing inst inst2: return v0
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] rewriting arg v0 of inst inst2 to v0
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph]  -> simplify_skeleton: yielded 0 simplification(s)
[2025-10-10T16:39:19Z TRACE cranelift_codegen::alias_analysis] alias analysis: scanning at inst2 with state LastStores { heap: None, table: None, vmctx: None, other: None } (MultiAry { opcode: Return, args: EntityList { index: 9, unused: PhantomData<cranelift_codegen::ir::entities::Value> } })
[2025-10-10T16:39:19Z TRACE cranelift_codegen::egraph] egraph built:
    function %test1(i32) -> i32 fast {
    block0(v0: i32):
        trapz.i64 v1, user1 ;; /!\ why v1? the insn should've been replaced with `trapz.i32 v0`.
        return v0
    }

which eventually reverts to the original shape with the uextend. Before I dive into this (and having read the cautionary message above as well as the comment on simplify_skeleton_inst): should it be possible to make this transformation? Am I right in thinking that the outcome here is weird?

view this post on Zulip Chris Fallin (Oct 10 2025 at 17:24):

Adding a little more logging around the if cost < best_cost logic (line 689 in cranelift/codegen/src/egraph.rs), I see

[2025-10-10T17:22:12Z TRACE cranelift_codegen::egraph] -> cost = Cost::Finite { op_cost: 10, depth: 1 }; best_cost = Cost::Finite { op_cost: 10, depth: 1 }

so I think this is a consequence of Cost::of_skeleton_op not considering the cost of operands; it appears that the two trapz's are equally costly so we prefer not to replace.

I'd be happy to review a PR if you wanted to add operand cost into Cost::of_skeleton_op, then add that simplification rule!

view this post on Zulip Simonas (nagisa) (Oct 10 2025 at 22:46):

Hm, this is somewhat problematic to do, as the elaborator computes costs for values only after the skeleton passes run. We also wouldn't want to recurse in Cost::of_skeleton_op, as that would lead straight into the pit of non-linearity.

For initial prototyping I'll compute operand costs recursing only 1 level deep, but in the final version the cost computation for skeleton ops may want to become a little more like that of of_pure_op, wherein the caller maintains a list of operand costs…
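
To illustrate the direction (a minimal sketch with invented names and signatures, not Cranelift's actual Cost API):

// Hypothetical: fold the already-computed operand costs into the
// skeleton op's own base cost, saturating to avoid overflow.
fn of_skeleton_op(base: u32, operand_costs: &[u32]) -> u32 {
    operand_costs.iter().fold(base, |acc, &c| acc.saturating_add(c))
}

fn main() {
    let trapz_v0 = of_skeleton_op(10, &[0]);          // operand is a block param
    let trapz_of_uextend = of_skeleton_op(10, &[10]); // operand is a uextend
    // With operand costs included, the simplified form now wins the
    // `cost < best_cost` comparison from the trace above.
    assert!(trapz_v0 < trapz_of_uextend);
}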

view this post on Zulip Chris Fallin (Oct 10 2025 at 22:49):

Ah, yeah, you're right

view this post on Zulip Chris Fallin (Oct 10 2025 at 22:52):

An alternative approach is to say that simplify_skeleton always takes exactly one rewrite and performs it unconditionally. That would mean the constructor is no longer "multi", which would cause the ISLE compiler's overlap checking to statically verify that only one rewrite is possible (either because of LHS non-overlap or because of explicit priorities); then we always take that rewrite. Since these ops don't get turned into multi-representation egraph nodes anyway, I don't see that as hugely problematic, though it does mean one has to be a little more manual about choosing the correct rewrites.

view this post on Zulip Chris Fallin (Oct 10 2025 at 22:52):

(cc @fitzgen (he/him) on that one for thoughts; he's out today and will be back next week)

view this post on Zulip Simonas (nagisa) (Oct 12 2025 at 17:10):

I wasn't able to spend too much time on this, although I did find myself hitting new and exciting bugs in the implementation of simplify_skeleton :frown:

view this post on Zulip fitzgen (he/him) (Oct 13 2025 at 17:09):

if we wanted to keep the multi aspect and fix the cost computation, we could create a separate type of eclass for the side-effectful skeleton that stores insts instead of values

or we could just switch to non-multi and add priorities, although I vaguely remember attempting that at one point and running into islec bugs I didn't want to fix, iirc

view this post on Zulip fitzgen (he/him) (Oct 13 2025 at 17:10):

Simonas (nagisa) said:

I wasn't able to spend too much time on this, although I did find myself hitting new and exciting bugs in the implementation of simplify_skeleton :frown:

bug reports and fixes are always welcome!

