Stream: git-wasmtime

Topic: wasmtime / issue #12139 Cranelift: ISLE mid-end performan...


view this post on Zulip Wasmtime GitHub notifications bot (Dec 08 2025 at 13:33):

bongjunj opened issue #12139:

Hi,

this is a follow-up of https://github.com/bytecodealliance/wasmtime/issues/12106 .

Although we've removed two sorts of performance-regressing mid-end ISLE rules,
a significant performance degradation still remains, along with other suspected cases.
(There is, of course, a bright side: we see significant performance improvements in many cases!)

Performance Regression:

First, here is the backing data of the performance regression:

Benchmark No Opt Main Main Speedup
blake3-scalar 317,727 317,719 0.00%
blake3-simd 313,115 306,232 2.25%
bz2 87,201,400 86,337,330 1.00%
pulldown-cmark 6,580,174 6,905,992 -4.72%
regex 209,743,816 210,183,175 -0.21%
shootout-ackermann 8,498,140 7,764,439 9.45%
shootout-base64 381,721,177 352,724,661 8.22%
shootout-ctype 830,813,398 796,486,698 4.31%
shootout-ed25519 9,583,747,723 9,395,321,203 2.01%
shootout-fib2 3,009,269,670 3,010,509,565 -0.04%
shootout-gimli 5,338,258 5,401,697 -1.17%
shootout-heapsort 2,382,073,831 2,375,914,107 0.26%
shootout-keccak 25,168,386 21,112,482 19.21%
shootout-matrix 538,696,036 544,739,691 -1.11%
shootout-memmove 36,156,621 36,115,998 0.11%
shootout-minicsv 1,481,713,625 1,291,534,227 14.73%
shootout-nestedloop 449 442 1.43%
shootout-random 630,328,205 439,691,474 43.36%
shootout-ratelimit 39,148,817 39,956,714 -2.02%
shootout-seqhash 8,869,585,125 8,639,110,150 2.67%
shootout-sieve 905,404,028 840,777,681 7.69%
shootout-switch 139,525,474 153,663,682 -9.20%
shootout-xblabla20 2,891,404 2,907,369 -0.55%
shootout-xchacha20 4,384,703 4,395,319 -0.24%
spidermonkey 636,104,785 631,998,404 0.65%
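
For reference, the "Main Speedup" column matches (No Opt - Main) / Main; this is an inference from the rows, not something documented here. A minimal sketch of that formula, checked against two rows above:

fn speedup_pct(no_opt: f64, main: f64) -> f64 {
    (no_opt - main) / main * 100.0
}

fn main() {
    // blake3-simd row: positive means Main is faster.
    println!("{:.2}%", speedup_pct(313_115.0, 306_232.0)); // prints 2.25%
    // pulldown-cmark row: negative means a regression.
    println!("{:.2}%", speedup_pct(6_580_174.0, 6_905_992.0)); // prints -4.72%
}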

Unlike the previous cases, the cause is not obvious.

19245 clif/v-no-opts/shootout-switch/wasm[0]--function[9]--__original_main.clif
19241 clif/v-main/shootout-switch/wasm[0]--function[9]--__original_main.clif

The number of instructions does not increase significantly from no-opt to main.
However, the applied optimizations make the program use a long-lived value:

--- /data/bongjun/clif/v-no-opts/shootout-switch/wasm[0]--function[9]--__original_main.clif 2025-12-08 12:43:58.406738645 +0000
+++ /data/bongjun/clif/v-main/shootout-switch/wasm[0]--function[9]--__original_main.clif    2025-12-08 12:49:01.961085326 +0000

-                                    v8572 = iconst.i32 1066
-                                    v8573 = iconst.i32 0
-@d20b                               v4324 = call fn1(v0, v0, v8572, v8573)  ; v8572 = 1066, v8573 = 0
-                                    v8574 = iadd.i64 v105, v106  ; v106 = 3584
-@d219                               v4333 = load.i32 little heap v8574
-                                    v8575 = iconst.i32 6
-                                    v8576 = icmp uge v4333, v8575  ; v8575 = 6
+                                    v8603 = iconst.i32 1066
+                                    v8604 = iconst.i32 0
+@d20b                               v4324 = call fn1(v0, v0, v8603, v8604)  ; v8603 = 1066, v8604 = 0
+                                    v8605 = iadd.i64 v11, v106  ; v106 = 3584
+@d219                               v4333 = load.i32 little heap v8605
+                                    v8606 = iconst.i32 6
+                                    v8607 = icmp uge v4333, v8606  ; v8606 = 6

Compare v8574 and v8605, which use v105 and v11 respectively.
v11 is defined at the beginning of the function, while v105 is defined much later than v11:

                                block0(v0: i64, v1: i64):
@01f0                               v5 = load.i32 notrap aligned table v0+256
@01f6                               v6 = iconst.i32 16
@01f8                               v7 = isub v5, v6  ; v6 = 16
@01fb                               store notrap aligned table v7, v0+256
@0203                               v9 = iconst.i32 0x2710
@0207                               v11 = load.i64 notrap aligned readonly can_move checked v0+56

...

@02d6                               v105 = iadd.i64 v11, v4439

This might increase register pressure, causing more spills, which can degrade performance.
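
A toy illustration of that claim (made-up live ranges, not Cranelift's actual allocator model): pressure at a program point is the number of values live across it, and rewriting a use to target the early-defined v11 stretches one live range over everything in between.

// Toy model: a value is live from its definition point to its last use;
// pressure at a point is how many values are live there. The ranges are
// hypothetical, chosen only to illustrate the effect.
fn max_pressure(ranges: &[(u32, u32)]) -> usize {
    let end = ranges.iter().map(|r| r.1).max().unwrap_or(0);
    (0..=end)
        .map(|p| ranges.iter().filter(|&&(def, last)| def <= p && p <= last).count())
        .max()
        .unwrap_or(0)
}

fn main() {
    // Before: the address is computed from a value defined late (point 50).
    let before = [(0, 40), (10, 45), (50, 100)];
    // After: it is computed from a value defined at point 0, which now
    // stays live across the two unrelated values.
    let after = [(0, 40), (10, 45), (0, 100)];
    assert!(max_pressure(&after) > max_pressure(&before));
    println!("{} -> {}", max_pressure(&before), max_pressure(&after)); // "2 -> 3"
}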

view this post on Zulip Wasmtime GitHub notifications bot (Dec 08 2025 at 14:56):

alexcrichton added the cranelift label to Issue #12139.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 08 2025 at 14:56):

alexcrichton added the cranelift:goal:optimize-speed label to Issue #12139.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 08 2025 at 14:56):

alexcrichton added the performance label to Issue #12139.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 12 2025 at 11:32):

bongjunj commented on issue #12139:

Investigating the generated CLIF for the shootout-switch case, I noticed that there are over 2,500 unused constants in a single block:
https://gist.github.com/bongjunj/08cf48d4e5827cf8ed270f26442e2604#file-main-clif-L216-L240 .

The constants defined in the block are never referenced anywhere in the function.
Is this the intended behavior for translating switch?

To repro this, run wasmtime compile sightglass/benchmarks/shootout/shootout-switch.wasm --emit-clif <dir> and look up the shootout-switch/wasm[0]--function[9]--__original_main.clif file.
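
For a rough count of such values (a quick sketch, not a precise analysis: it treats any vN token that occurs exactly once in the dump, i.e. only at its definition, as unused):

// Rough scan of a .clif text dump: a value id that occurs exactly once
// (only where it is defined) is never referenced again.
use std::collections::HashMap;
use std::fs;

fn main() {
    let path = std::env::args().nth(1).expect("usage: unused-values <file.clif>");
    let text = fs::read_to_string(path).expect("read clif dump");
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for token in text.split(|c: char| !c.is_ascii_alphanumeric()) {
        if token.len() > 1
            && token.starts_with('v')
            && token[1..].bytes().all(|b| b.is_ascii_digit())
        {
            *counts.entry(token).or_insert(0) += 1;
        }
    }
    let unused = counts.values().filter(|&&n| n == 1).count();
    println!("{unused} value ids occur only at their definition");
}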

cc @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Dec 12 2025 at 11:32):

bongjunj edited a comment on issue #12139:

Investigating the generated CLIF for the shootout-switch case, I noticed that there are over 2,500 unused constants in a single block:
https://gist.github.com/bongjunj/08cf48d4e5827cf8ed270f26442e2604#file-main-clif-L216-L240 .
For example, search for v185 in the CLIF file; its definition is the only match.

The constants defined in the block are never referenced anywhere in the function.
Is this the intended behavior for translating switch?

To repro this, run wasmtime compile sightglass/benchmarks/shootout/shootout-switch.wasm --emit-clif <dir> and look up the shootout-switch/wasm[0]--function[9]--__original_main.clif file.

cc @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Dec 12 2025 at 12:00):

bongjunj edited a comment on issue #12139:

Investigating the generated CLIF for the shootout-switch case, I noticed that there are over 2,500 unused constants in a single block:
https://gist.github.com/bongjunj/08cf48d4e5827cf8ed270f26442e2604#file-main-clif-L216-L240 .
For example, search for v185 in the CLIF file; its definition is the only match.

This happens not only in the block above, but in another block as well:

                                block3:
                                    v4434 = iconst.i32 16
                                    v4435 = iadd.i32 v16, v4434  ; v4434 = 16
                                    v4436 = iconst.i32 0
@0236                               v31 = iconst.i32 7
@0231                               v28 = iconst.i32 12
@0243                               v38 = iconst.i32 6
@023e                               v36 = iconst.i32 8
@0250                               v45 = iconst.i32 5
@024b                               v43 = iconst.i32 4
@0267                               v57 = iconst.i32 3
@0262                               v55 = iconst.i32 -4
@0274                               v64 = iconst.i32 2
@026f                               v62 = iconst.i32 -8
@0281                               v71 = iconst.i32 1
@027c                               v69 = iconst.i32 -12
@0289                               v76 = iconst.i32 -16
                                    v4409 = iconst.i32 9992
@0293                               v81 = iconst.i32 32
@022d                               jump block4(v4435, v4436)  ; v4436 = 0

...

@028e                               v78 = uextend.i64 v4463
@028e                               v80 = iadd.i64 v11, v78
@028e                               store little heap v30, v80
                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992

Surprisingly, none of the constants from v31 to v76 in this block are used anywhere in the function.
Is this the intended behavior for translating switch?

Furthermore, the constant v4409, newly generated as a result of optimization, is not placed close to its use in v4413. I see no valid reason to instantiate such a constant so far from its only use.

To repro this, run wasmtime compile sightglass/benchmarks/shootout/shootout-switch.wasm --emit-clif <dir> and look up the shootout-switch/wasm[0]--function[9]--__original_main.clif file.

cc @cfallin

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:06):

bongjunj commented on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base is -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base is -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base is zero.

Analysis

The rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>
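
For the record, the rewrite itself is sound (adding a constant is a bijection under wrapping arithmetic, so x + 8 == 10000 holds exactly when x == 10000 - 8 == 9992); the damage is purely in how the back-edge gets scheduled. A spot check of the equivalence:

fn main() {
    // `x + 8 != 10000` <=> `x != 9992` under wrapping u32 arithmetic,
    // matching CLIF's iadd/icmp semantics on i32.
    for x in [0u32, 9_991, 9_992, 9_993, u32::MAX] {
        assert_eq!(x.wrapping_add(8) != 10_000, x != 9_992);
    }
    println!("equivalent on all sampled inputs");
}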

I haven't decided yet whether the fix belongs in the mid-end or in code-gen.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:13):

bongjunj edited a comment on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis

The rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

I haven't decided yet whether the fix belongs in the mid-end or in code-gen.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:14):

bongjunj edited a comment on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis

My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

I haven't decided yet whether the fix belongs in the mid-end or in code-gen.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:26):

bongjunj edited a comment on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap in execution time between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis 1

My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

Analysis 2

The rules in (2) and (3) now become interesting: how did they "cure" the performance problem?
In fact, the rules in (3) can interact with these rules in icmp.isle:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))
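
These rules are sound for the same reason: wrapping-add of the same k to both sides preserves (in)equality in both directions. A spot check under assumed wrapping u32 semantics:

fn main() {
    // (a + k) != (b + k) <=> a != b, because adding k is a bijection.
    let samples = [0u32, 1, 9_992, u32::MAX];
    for &a in &samples {
        for &b in &samples {
            for &k in &samples {
                assert_eq!(a.wrapping_add(k) != b.wrapping_add(k), a != b);
            }
        }
    }
    println!("equivalent on all sampled triples");
}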

Then, similarly to Analysis 1, they can cause the same problem when generating jump instructions.

Conclusion

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:27):

bongjunj edited a comment on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap in execution time between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis 1

My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

Analysis 2

The rules in (2) and (3) now become interesting: how did they "cure" the performance problem?
In fact, the rules in (3) can interact with these rules in icmp.isle:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))

Then, similarly to Analysis 1, they can cause the same problem when generating jump instructions.

Conclusion

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 08:29):

bongjunj edited a comment on issue #12139:

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap in execution time between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis 1

My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

Analysis 2

The rules in (2) and (3) now become interesting: how did they "cure" the performance problem?
In fact, the rules in (3) can interact with these rules in icmp.isle:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))

Then, similarly to Analysis 1, they can cause the same problem when generating jump instructions.

Conclusion

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 12:32):

bongjunj edited a comment on issue #12139:

shootout-switch

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap in execution time between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L196-L211

(1) Removing these, the gap between NO and Base was -6.97%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/cprop.isle#L97-L102

(2) Further removing these, the gap between NO and Base was -1.37%.

https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L241

(3) Further removing these, the gap between NO and Base was zero.

Analysis 1

My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

                                 block4(v27: i32, v30: i32):
-                                    v4432 = iadd v30, v4411  ; v4411 = 8
-                                    v4433 = iconst.i32 0x2710
-                                    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3                               v87 = uextend.i32 v4434
-                                    v4435 = iconst.i32 32
-                                    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4                               brif v87, block4(v4436, v4432), block6
+                                    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3                               v87 = uextend.i32 v4413
+                                    v4464 = iconst.i32 32
+                                    v4465 = iadd v27, v4464  ; v4464 = 32
+                                    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4                               brif v87, block4(v4465, v4466), block6

The IR transformation can be simplified into:

- a <- x + 8
- c <- a != 10,000 (0x2710)
- d <- y + 32
- br_if c, block4(d, a), block6
+ c <- x != 9,992
+ d <- y + 32
+ a <- x + 8
+ br_if c, block4(d, a), block6

This degraded the quality of generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.

- add    r8d,0x8
- add    ecx,0x20
- cmp    r8d,0x2710
- jne    fd <__original_main+0x7d>
+ add    ecx,0x20
+ lea    r11d,[r8+0x8]        ; r11d = x + 8 (compute next)
+ cmp    r8d,0x2708           ; compare *old* x to 9992
+ je     181 <__original_main+0x101>
+ mov    r8,r11               ; commit x = x + 8
+ jmp    102 <__original_main+0x82>

Analysis 2

The rules in (2) and (3) now become interesting: how did they "cure" the performance problem?
In fact, the rules in (3) can interact with these rules in icmp.isle:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))

Then, similarly to Analysis 1, they can cause the same problem when generating jump instructions.

Conclusion

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 14:43):

bongjunj commented on issue #12139:

pulldown-cmark

Analysis

The hottest function for this benchmark program is pulldown_cmark::parse::FirstPass::parse_line.
For this function, the latest version of Cranelift (Base) generated 7% larger x86 assembly than the no-opt version (NO).

More "reading from the stack" instructions are there in Base.

This might stem from the IR transformation below (notice the increased use of v8, which can contribute to higher register pressure):

@@ -2600,9 +2586,7 @@
 @892d                               store little heap v3177, v3174
                                     v4298 = isub.i32 v8, v9  ; v9 = 64
...
 @895c                               v3246 = uextend.i64 v3239
 @895c                               v3248 = iadd.i64 v13, v3246
 @895c                               istore8 little heap v3245, v3248
-                                    v4308 = iconst.i32 64
-                                    v4309 = iadd v4298, v4308  ; v4308 = 64
-@8965                               store notrap aligned table v4309, v0+400
+@8965                               store.i32 notrap aligned table v8, v0+400
 @896b                               jump block1

This symptom repeats:

 @752e                               v34 = iadd.i32 v3, v33  ; v33 = 92
-@8917                               v3100 = iconst.i32 40
-                                    v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9                               v2349 = call fn11(v0, v0, v34, v3558)
+                                    v3602 = iconst.i32 -24
+                                    v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9                               v2349 = call fn11(v0, v0, v34, v3603)
 @87c0                               jump block11
-@785a                               v315 = iadd v3567, v314  ; v314 = 44
-@8822                               store little heap v3571, v315
-                                    v3572 = iconst.i64 56
-                                    v3573 = iadd v3567, v3572  ; v3572 = 56
-@8829                               store.i32 little heap v47, v3573
-@8832                               v2402 = iadd v3571, v47
-                                    v3574 = iconst.i64 60
-                                    v3575 = iadd v3567, v3574  ; v3574 = 60
-@8835                               store little heap v2402, v3575
-                                    v3576 = iconst.i32 92
-                                    v3577 = iadd.i32 v3, v3576  ; v3576 = 92
-                                    v3578 = iconst.i32 40
-                                    v3579 = iadd v3565, v3578  ; v3578 = 40
-@883f                               v2413 = call fn11(v0, v0, v3577, v3579)
-@8846                               jump block10(v2387, v47, v2402, v259, v410, v1203, v3571)
+@785a                               v315 = iadd v3720, v314  ; v314 = 44
+@8822                               store little heap v3724, v315
+                                    v3725 = iconst.i64 56
+                                    v3726 = iadd v3720, v3725  ; v3725 = 56
+@8829                               store.i32 little heap v47, v3726
+@8832                               v2402 = iadd v3724, v47
+                                    v3727 = iconst.i64 60
+                                    v3728 = iadd v3720, v3727  ; v3727 = 60
+@8835                               store little heap v2402, v3728
+                                    v3729 = iconst.i32 92
+                                    v3730 = iadd.i32 v3, v3729  ; v3729 = 92
+                                    v3731 = iconst.i32 -24
+                                    v3732 = iadd.i32 v8, v3731  ; v3731 = -24
+@883f                               v2413 = call fn11(v0, v0, v3730, v3732)
+@8846                               jump block10(v2387, v47, v2402, v259, v410, v1203, v3724)

Sidenote

I am not sure whether this transformation is intended as an optimization:

-          v3835 = iconst.i32 1
-          v3836 = iadd.i32 v1156, v3835  ; v3835 = 1
-          v3837 = iconst.i32 6
-@7fd7     v1163 = urem v3836, v3837  ; v3837 = 6
-@7fd8     br_table v1163, block201, [block200, block198, block199, block200, block199]
+          v3988 = iconst.i32 1
+          v3989 = iadd.i32 v1156, v3988  ; v3988 = 1
+          v3685 = iconst.i32 -1431655765
+          v3686 = umulhi v3989, v3685  ; v3685 = -1431655765
+          v3990 = iconst.i32 2
+          v3687 = ushr v3686, v3990  ; v3990 = 2
+          v3991 = iconst.i32 6
+          v3688 = imul v3687, v3991  ; v3991 = 6
+          v3689 = isub v3989, v3688
+          v1163 -> v3689
+@7fd8     br_table v3689, block201, [block200, block198, block199, block200, block199]
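
This looks like the standard unsigned division-by-constant strength reduction (multiply by a "magic" constant, take the high half, shift; the apply_div_const_magic_u32 rule shape appears in the rule list later in this stream), so it is presumably intended, if verbose. A quick check that the expanded sequence still computes x % 6:

// The magic constant 0xAAAAAAAB is -1431655765 interpreted as i32.
// umulhi(x, m) >> 2 computes x / 6, and x - (x / 6) * 6 is x % 6.
fn urem6_expanded(x: u32) -> u32 {
    let hi = ((x as u64 * 0xAAAA_AAAB_u64) >> 32) as u32; // umulhi
    let q = hi >> 2;                                      // ushr 2
    x.wrapping_sub(q.wrapping_mul(6))                     // imul + isub
}

fn main() {
    for x in [0u32, 1, 5, 6, 7, 9_992, 1_000_000, u32::MAX] {
        assert_eq!(urem6_expanded(x), x % 6);
    }
    println!("matches x % 6 on all sampled inputs");
}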

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 14:47):

bongjunj edited a comment on issue #12139:

pulldown-cmark

The execution performance regression from the no-opt version (NO) to the latest version (Base) is -4.38%.

Analysis

The hottest function for this benchmark program is pulldown_cmark::parse::FirstPass::parse_line.
For this function, Base generated 7% larger x86 assembly than NO.
In addition, there are more "read from the stack" instructions in Base, indicating a register allocation problem.

This might stem from the IR transformation below (notice the increased use of v8, which can contribute to higher register pressure):

@@ -2600,9 +2586,7 @@
 @892d                               store little heap v3177, v3174
                                     v4298 = isub.i32 v8, v9  ; v9 = 64
...
 @895c                               v3246 = uextend.i64 v3239
 @895c                               v3248 = iadd.i64 v13, v3246
 @895c                               istore8 little heap v3245, v3248
-                                    v4308 = iconst.i32 64
-                                    v4309 = iadd v4298, v4308  ; v4308 = 64
-@8965                               store notrap aligned table v4309, v0+400
+@8965                               store.i32 notrap aligned table v8, v0+400
 @896b                               jump block1

This symptom repeats:

 @752e                               v34 = iadd.i32 v3, v33  ; v33 = 92
-@8917                               v3100 = iconst.i32 40
-                                    v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9                               v2349 = call fn11(v0, v0, v34, v3558)
+                                    v3602 = iconst.i32 -24
+                                    v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9                               v2349 = call fn11(v0, v0, v34, v3603)
 @87c0                               jump block11
-@785a                               v315 = iadd v3567, v314  ; v314 = 44
-@8822                               store little heap v3571, v315
-                                    v3572 = iconst.i64 56
-                                    v3573 = iadd v3567, v3572  ; v3572 = 56
-@8829                               store.i32 little heap v47, v3573
-@8832                               v2402 = iadd v3571, v47
-                                    v3574 = iconst.i64 60
-                                    v3575 = iadd v3567, v3574  ; v3574 = 60
-@8835                               store little heap v2402, v3575
-                                    v3576 = iconst.i32 92
-                                    v3577 = iadd.i32 v3, v3576  ; v3576 = 92
-                                    v3578 = iconst.i32 40
-                                    v3579 = iadd v3565, v3578  ; v3578 = 40
-@883f                               v2413 = call fn11(v0, v0, v3577, v3579)
-@8846                               jump block10(v2387, v47, v2402, v259, v410, v1203, v3571)
+@785a                               v315 = iadd v3720, v314  ; v314 = 44
+@8822                               store little heap v3724, v315
+                                    v3725 = iconst.i64 56
+                                    v3726 = iadd v3720, v3725  ; v3725 = 56
+@8829                               store.i32 little heap v47, v3726
+@8832                               v2402 = iadd v3724, v47
+                                    v3727 = iconst.i64 60
+                                    v3728 = iadd v3720, v3727  ; v3727 = 60
+@8835                               store little heap v2402, v3728
+                                    v3729 = iconst.i32 92
+                                    v3730 = iadd.i32 v3, v3729  ; v3729 = 92
+                                    v3731 = iconst.i32 -24
+                                    v3732 = iadd.i32 v8, v3731  ; v3731 = -24
+@883f                               v2413 = call fn11(v0, v0, v3730, v3732)
+@8846                               jump block10(v2387, v47, v2402, v259, v410, v1203, v3724)

Sidenote

I am not sure whether this transformation is intended as an optimization:

-          v3835 = iconst.i32 1
-          v3836 = iadd.i32 v1156, v3835  ; v3835 = 1
-          v3837 = iconst.i32 6
-@7fd7     v1163 = urem v3836, v3837  ; v3837 = 6
-@7fd8     br_table v1163, block201, [block200, block198, block199, block200, block199]
+          v3988 = iconst.i32 1
+          v3989 = iadd.i32 v1156, v3988  ; v3988 = 1
+          v3685 = iconst.i32 -1431655765
+          v3686 = umulhi v3989, v3685  ; v3685 = -1431655765
+          v3990 = iconst.i32 2
+          v3687 = ushr v3686, v3990  ; v3990 = 2
+          v3991 = iconst.i32 6
+          v3688 = imul v3687, v3991  ; v3991 = 6
+          v3689 = isub v3989, v3688
+          v1163 -> v3689
+@7fd8     br_table v3689, block201, [block200, block198, block199, block200, block199]

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2025 at 19:16):

cfallin commented on issue #12139:

@bongjunj thanks for this analysis! I'll think about it in more detail on Monday, but I wonder if you could provide one more data point: you have found how to remove a regression on shootout-switch and pulldown-cmark, but with the rules that cause this removed, how does performance look on the rest of the benchmarks?

Basically: I want to think about performance with two metrics of goodness, namely average/overall speedup, and negative outliers. We want to avoid outliers, and they may point the way toward deficiencies in our rules or algorithms. But I also don't want to regress compiler performance as a whole by removing rules -- so it'd be useful to know the overall impact.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 08:44):

bongjunj commented on issue #12139:

I delta-debugged to locate the rules causing the performance regression for pulldown-cmark and found the following list of rules:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b)))
      (subsume (ne ty a b)))

(rule (simplify (isub ty (ishl ty x z) (ishl ty y z))) (ishl ty (isub ty x y) z))

(rule (simplify (uextend ty (uextend _intermediate_ty x)))
      (uextend ty x))
(rule (simplify (sextend ty (sextend _intermediate_ty x)))
      (sextend ty x))

;; Reassociate across `==`/`!=` when we can simplify a constant
;; `x + K1 == K2` --> `x == K2 - K1`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (isub ty2 k2 k1)))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (isub ty2 k2 k1)))
;; `x - K1 == K2` --> `x == K2 + K1`
(rule (simplify (eq ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (iadd ty2 k2 k1)))
(rule (simplify (ne ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (iadd ty2 k2 k1)))
;; `x + K1 == y + K2` --> `x == y + (K2 - K1)`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (eq ty1 x (iadd ty2 y (isub ty3 k2 k1))))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (ne ty1 x (iadd ty2 y (isub ty3 k2 k1))))

(rule (simplify (iadd ty
                      (isub ty x (iconst ty (u64_from_imm64 k1)))
                      (iconst ty (u64_from_imm64 k2))))
      (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty
                      (isub ty (iconst ty (u64_from_imm64 k1)) x)
                      (iconst ty (u64_from_imm64 k2))))
      (isub ty (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2))) x))

;; iconst_[su] support $I128, but do so by extending, so restricting to
;; 64-bit or smaller keeps it from just remaking essentially the same thing.
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k)))
      (subsume (iconst_u wide k)))
(rule (simplify (sextend (fits_in_64 wide) (iconst_s narrow k)))
      (subsume (iconst_s wide k)))

(rule (simplify
       (isub (fits_in_64 ty)
             (iconst ty (u64_from_imm64 k1))
             (iconst ty (u64_from_imm64 k2))))
      (subsume (iconst ty (imm64_masked ty (u64_wrapping_sub k1 k2)))))

;; (x - y) + x --> x
(rule (simplify (iadd ty (isub ty x y) y)) x)
(rule (simplify (iadd ty y (isub ty x y))) x)

;; (x + y) - y --> x
(rule (simplify (isub ty (iadd ty x y) x)) y)
(rule (simplify (isub ty (iadd ty x y) y)) x)

(rule (simplify_skeleton (urem x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Urem) x d))

(rule (simplify_skeleton (udiv x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Udiv) x d))

;; 0-x == (ineg x).
(rule (simplify (isub ty
                      (iconst_u ty 0)
                      x))
      (ineg ty x))

With each of these rules present, a performance regression relative to no-opt Cranelift is observed.
As an experiment, I removed all of these rules; this reduced the perf regression to -2.24% but failed to eliminate it entirely.

Here are the overall data.

Benchmark Fix1 Speedup Fix2 Speedup Main Speedup
blake3-scalar -0.69% -0.27% 0.11%
blake3-simd 0.64% 0.82% 2.51%
bz2 0.36% 0.22% 0.99%
pulldown-cmark -5.08% -2.24% -5.28%
regex -0.51% 0.77% -0.11%
shootout-ackermann -1.39% -1.52% 9.63%
shootout-base64 8.93% 7.04% 8.23%
shootout-ctype 1.07% 1.06% 4.30%
shootout-ed25519 1.83% 0.70% 1.96%
shootout-fib2 0.06% 0.08% 0.03%
shootout-gimli -0.27% -0.16% -1.33%
shootout-heapsort -2.33% -0.76% 0.08%
shootout-keccak 19.10% 19.11% 19.10%
shootout-matrix 0.17% 0.24% -0.19%
shootout-memmove -6.34% -6.27% -0.17%
shootout-minicsv 0.68% 0.69% 14.62%
shootout-nestedloop -11.25% 4.71% -4.75%
shootout-random 43.37% 0.00% 43.35%
shootout-ratelimit -0.39% -0.52% -2.00%
shootout-seqhash -0.85% -1.55% 1.84%
shootout-sieve 0.07% -0.03% 7.77%
shootout-switch 0.03% 0.00% -9.20%
shootout-xblabla20 -3.06% -3.14% -0.34%
shootout-xchacha20 0.15% -0.36% 0.00%
spidermonkey -1.66% -2.58% 0.58%

You can see that the performance is restored completely for shootout-switch and partially for pulldown-cmark.
But we lose ground on shootout-ackermann, shootout-base64, shootout-heapsort, shootout-memmove, spidermonkey, and others.

It is probably time to dig into the backend (codegen / instruction scheduling / register allocation), because a subtle change in the IR is causing a surprising performance regression. For example, the following IR transformation (no-opt to base) in pulldown-cmark

        v3555 = isub.i32 v8, v9
-@8917  v3100 = iconst.i32 40
-       v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9  v2349 = call fn11(v0, v0, v34, v3558)
+       v3602 = iconst.i32 -24
+       v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9  v2349 = call fn11(v0, v0, v34, v3603)

changed

3b9a:   mov    r8,rcx
3b9d:   sub    r8d,0x40
3ba1:   mov    rax,QWORD PTR [rsp+0x30]
3ba6:   mov    DWORD PTR [rbx+r8*1+0x3c],eax
3bab:   mov    rcx,QWORD PTR [rsp+0x48]
3bb0:   mov    DWORD PTR [rbx+r8*1+0x38],ecx
3bb5:   mov    r10d,0x1
3bbb:   mov    BYTE PTR [rbx+r8*1+0x28],r10b
3bc0:   mov    r11,QWORD PTR [rsp+0x20]
3bc5:   lea    edx,[r11+0x5c]
3bc9:   lea    ecx,[r8+0x28]
...

to

3e8b:   sub    r9d,0x40
3e8f:   mov    rdi,QWORD PTR [rsp+0xc8]
3e97:   mov    r8,QWORD PTR [rsp+0x30]
3e9c:   mov    DWORD PTR [rdi+r9*1+0x3c],r8d
3ea1:   mov    rcx,QWORD PTR [rsp+0x48]
3ea6:   mov    DWORD PTR [rdi+r9*1+0x38],ecx
3eab:   mov    r11d,0x1
3eb1:   mov    BYTE PTR [rdi+r9*1+0x28],r11b
3eb6:   mov    r8,QWORD PTR [rsp+0x20]
3ebb:   lea    edx,[r8+0x5c]
3ebf:   mov    r9,QWORD PTR [rsp+0xd0]
3ec7:   lea    ecx,[r9-0x18]
...

Notice that there are more spills/loads.
A similar regression happens for shootout-switch (one jump becomes two) with a slight change in the IR.
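
A crude way to quantify the extra stack traffic across the two disassembly dumps: count rsp-relative memory operands. This is a rough proxy for spill/reload traffic, not an exact measure, and the file names below are hypothetical.

// Usage: count-stack no-opt.dis base.dis
use std::fs;

fn main() {
    for path in std::env::args().skip(1) {
        let text = fs::read_to_string(&path).expect("read disassembly dump");
        let n = text.lines().filter(|l| l.contains("[rsp")).count();
        println!("{path}: {n} rsp-relative memory operands");
    }
}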

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 08:47):

bongjunj edited a comment on issue #12139:

I delta-debugged to locate the rules causing the performance regression for pulldown-cmark and found the following list of rules:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b)))
      (subsume (ne ty a b)))

(rule (simplify (isub ty (ishl ty x z) (ishl ty y z))) (ishl ty (isub ty x y) z))

(rule (simplify (uextend ty (uextend _intermediate_ty x)))
      (uextend ty x))
(rule (simplify (sextend ty (sextend _intermediate_ty x)))
      (sextend ty x))

;; Reassociate across `==`/`!=` when we can simplify a constant
;; `x + K1 == K2` --> `x == K2 - K1`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (isub ty2 k2 k1)))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (isub ty2 k2 k1)))
;; `x - K1 == K2` --> `x == K2 + K1`
(rule (simplify (eq ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (iadd ty2 k2 k1)))
(rule (simplify (ne ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (iadd ty2 k2 k1)))
;; `x + K1 == y + K2` --> `x == y + (K2 - K1)`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (eq ty1 x (iadd ty2 y (isub ty3 k2 k1))))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (ne ty1 x (iadd ty2 y (isub ty3 k2 k1))))

(rule (simplify (iadd ty
                      (isub ty x (iconst ty (u64_from_imm64 k1)))
                      (iconst ty (u64_from_imm64 k2))))
      (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty
                      (isub ty (iconst ty (u64_from_imm64 k1)) x)
                      (iconst ty (u64_from_imm64 k2))))
      (isub ty (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2))) x))

;; iconst_[su] support $I128, but do so by extending, so restricting to
;; 64-bit or smaller keeps it from just remaking essentially the same thing.
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k)))
      (subsume (iconst_u wide k)))
(rule (simplify (sextend (fits_in_64 wide) (iconst_s narrow k)))
      (subsume (iconst_s wide k)))

(rule (simplify
       (isub (fits_in_64 ty)
             (iconst ty (u64_from_imm64 k1))
             (iconst ty (u64_from_imm64 k2))))
      (subsume (iconst ty (imm64_masked ty (u64_wrapping_sub k1 k2)))))

;; (x - y) + x --> x
(rule (simplify (iadd ty (isub ty x y) y)) x)
(rule (simplify (iadd ty y (isub ty x y))) x)

;; (x + y) - y --> x
(rule (simplify (isub ty (iadd ty x y) x)) y)
(rule (simplify (isub ty (iadd ty x y) y)) x)

(rule (simplify_skeleton (urem x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Urem) x d))

(rule (simplify_skeleton (udiv x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Udiv) x d))

;; 0-x == (ineg x).
(rule (simplify (isub ty
                      (iconst_u ty 0)
                      x))
      (ineg ty x))

With each of these rules present, a performance regression relative to no-opt Cranelift is observed.
As an experiment, I removed all of these rules; this reduced the perf regression to -2.24% but failed to eliminate it entirely.

Here are the overall data.

Benchmark Fix1 Speedup Fix2 Speedup Main Speedup
blake3-scalar -0.69% -0.27% 0.11%
blake3-simd 0.64% 0.82% 2.51%
bz2 0.36% 0.22% 0.99%
pulldown-cmark -5.08% -2.24% -5.28%
regex -0.51% 0.77% -0.11%
shootout-ackermann -1.39% -1.52% 9.63%
shootout-base64 8.93% 7.04% 8.23%
shootout-ctype 1.07% 1.06% 4.30%
shootout-ed25519 1.83% 0.70% 1.96%
shootout-fib2 0.06% 0.08% 0.03%
shootout-gimli -0.27% -0.16% -1.33%
shootout-heapsort -2.33% -0.76% 0.08%
shootout-keccak 19.10% 19.11% 19.10%
shootout-matrix 0.17% 0.24% -0.19%
shootout-memmove -6.34% -6.27% -0.17%
shootout-minicsv 0.68% 0.69% 14.62%
shootout-nestedloop -11.25% 4.71% -4.75%
shootout-random 43.37% 0.00% 43.35%
shootout-ratelimit -0.39% -0.52% -2.00%
shootout-seqhash -0.85% -1.55% 1.84%
shootout-sieve 0.07% -0.03% 7.77%
shootout-switch 0.03% 0.00% -9.20%
shootout-xblabla20 -3.06% -3.14% -0.34%
shootout-xchacha20 0.15% -0.36% 0.00%
spidermonkey -1.66% -2.58% 0.58%

You can see that the performance is restored completely for shootout-switch and partially for pulldown-cmark.
But we lose ground on shootout-ackermann, shootout-base64, shootout-heapsort, shootout-memmove, spidermonkey, and others.

It is probably time to dig into the backend (codegen / instruction scheduling / register allocation), because a subtle change in the IR is causing a surprising performance regression. I suspect the codegen does not handle IR transformed by the mid-end optimizations well, and produces quality code only for unoptimized input. For example, the following IR transformation (no-opt to base) in pulldown-cmark

        v3555 = isub.i32 v8, v9
-@8917  v3100 = iconst.i32 40
-       v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9  v2349 = call fn11(v0, v0, v34, v3558)
+       v3602 = iconst.i32 -24
+       v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9  v2349 = call fn11(v0, v0, v34, v3603)

changed

3b9a:   mov    r8,rcx
3b9d:   sub    r8d,0x40
3ba1:   mov    rax,QWORD PTR [rsp+0x30]
3ba6:   mov    DWORD PTR [rbx+r8*1+0x3c],eax
3bab:   mov    rcx,QWORD PTR [rsp+0x48]
3bb0:   mov    DWORD PTR [rbx+r8*1+0x38],ecx
3bb5:   mov    r10d,0x1
3bbb:   mov    BYTE PTR [rbx+r8*1+0x28],r10b
3bc0:   mov    r11,QWORD PTR [rsp+0x20]
3bc5:   lea    edx,[r11+0x5c]
3bc9:   lea    ecx,[r8+0x28]
...

to

3e8b:   sub    r9d,0x40
3e8f:   mov    rdi,QWORD PTR [rsp+0xc8]
3e97:   mov    r8,QWORD PTR [rsp+0x30]
3e9c:   mov    DWORD PTR [rdi+r9*1+0x3c],r8d
3ea1:   mov    rcx,QWORD PTR [rsp+0x48]
3ea6:   mov    DWORD PTR [rdi+r9*1+0x38],ecx
3eab:   mov    r11d,0x1
3eb1:   mov    BYTE PTR [rdi+r9*1+0x28],r11b
3eb6:   mov    r8,QWORD PTR [rsp+0x20]
3ebb:   lea    edx,[r8+0x5c]
3ebf:   mov    r9,QWORD PTR [rsp+0xd0]
3ec7:   lea    ecx,[r9-0x18]
...

Notice that there are more spills/loads.
A similar regression happens for shootout-switch (one jump becomes two) with a slight change in the IR.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 08:53):

bongjunj edited a comment on issue #12139:

I delta-debugged to locate the rules causing the performance regression for pulldown-cmark and found the following list of rules:

(rule (simplify (ne ty (iadd _ a k) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k)))
      (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b)))
      (subsume (ne ty a b)))

(rule (simplify (isub ty (ishl ty x z) (ishl ty y z))) (ishl ty (isub ty x y) z))

(rule (simplify (uextend ty (uextend _intermediate_ty x)))
      (uextend ty x))
(rule (simplify (sextend ty (sextend _intermediate_ty x)))
      (sextend ty x))

;; Reassociate across `==`/`!=` when we can simplify a constant
;; `x + K1 == K2` --> `x == K2 - K1`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (isub ty2 k2 k1)))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (isub ty2 k2 k1)))
;; `x - K1 == K2` --> `x == K2 + K1`
(rule (simplify (eq ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (eq ty1 x (iadd ty2 k2 k1)))
(rule (simplify (ne ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _)))
      (ne ty1 x (iadd ty2 k2 k1)))
;; `x + K1 == y + K2` --> `x == y + (K2 - K1)`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (eq ty1 x (iadd ty2 y (isub ty3 k2 k1))))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _))))
      (ne ty1 x (iadd ty2 y (isub ty3 k2 k1))))

(rule (simplify (iadd ty
                      (isub ty x (iconst ty (u64_from_imm64 k1)))
                      (iconst ty (u64_from_imm64 k2))))
      (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty
                      (isub ty (iconst ty (u64_from_imm64 k1)) x)
                      (iconst ty (u64_from_imm64 k2))))
      (isub ty (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2))) x))

;; iconst_[su] support $I128, but do so by extending, so restricting to
;; 64-bit or smaller keeps it from just remaking essentially the same thing.
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k)))
      (subsume (iconst_u wide k)))
(rule (simplify (sextend (fits_in_64 wide) (iconst_s narrow k)))
      (subsume (iconst_s wide k)))

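;; Constant-fold the subtraction of two constants.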
(rule (simplify
       (isub (fits_in_64 ty)
             (iconst ty (u64_from_imm64 k1))
             (iconst ty (u64_from_imm64 k2))))
      (subsume (iconst ty (imm64_masked ty (u64_wrapping_sub k1 k2)))))

;; (x - y) + y --> x
(rule (simplify (iadd ty (isub ty x y) y)) x)
(rule (simplify (iadd ty y (isub ty x y))) x)

;; (x + y) - x --> y and (x + y) - y --> x
(rule (simplify (isub ty (iadd ty x y) x)) y)
(rule (simplify (isub ty (iadd ty x y) y)) x)

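;; Strength-reduce `urem`/`udiv` by a non-power-of-two constant into a
;; multiply-high "magic number" sequence.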
(rule (simplify_skeleton (urem x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Urem) x d))

(rule (simplify_skeleton (udiv x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d)))))
      (if-let false (u32_is_power_of_two d))
      (apply_div_const_magic_u32 (Opcode.Udiv) x d))

;; 0-x == (ineg x).
(rule (simplify (isub ty
                      (iconst_u ty 0)
                      x))
      (ineg ty x))

A performance regression relative to no-opt Cranelift is observed with each of these rules.
As an experiment, I removed all of these rules, which reduced the pulldown-cmark regression to -2.24% but did not eliminate it entirely.

Here are the overall data.

Benchmark Fix1 Speedup Fix2 Speedup Main Speedup
blake3-scalar -0.69% -0.27% 0.55%
blake3-simd 0.64% 0.82% 2.35%
bz2 0.36% 0.22% 1.05%
pulldown-cmark -5.08% -2.24% 0.59%
regex -0.51% 0.77% -0.22%
shootout-ackermann -1.39% -1.52% 9.74%
shootout-base64 8.93% 7.04% 8.22%
shootout-ctype 1.07% 1.06% 4.27%
shootout-ed25519 1.83% 0.70% 1.96%
shootout-fib2 0.06% 0.08% 0.08%
shootout-gimli -0.27% -0.16% -1.46%
shootout-heapsort -2.33% -0.76% 0.11%
shootout-keccak 19.10% 19.11% 19.12%
shootout-matrix 0.17% 0.24% 0.81%
shootout-memmove -6.34% -6.27% -0.60%
shootout-minicsv 0.68% 0.69% 15.46%
shootout-nestedloop -11.25% 4.71% 3.71%
shootout-random 43.37% 0.00% 43.37%
shootout-ratelimit -0.39% -0.52% -1.89%
shootout-seqhash -0.85% -1.55% 1.86%
shootout-sieve 0.07% -0.03% 7.67%
shootout-switch 0.03% 0.00% -9.16%
shootout-xblabla20 -3.06% -3.14% -0.31%
shootout-xchacha20 0.15% -0.36% 0.07%
spidermonkey -1.66% -2.58% 0.24%

You can see that the performance is restored completely for shootout-switch and partially for pulldown-cmark.
But we lose ground on shootout-ackermann, shootout-base64, shootout-heapsort, shootout-memmove, spidermonkey, and others.

It is probably time to dig into the backend (codegen, instruction scheduling, register allocation), because a subtle change in the IR is causing a surprising performance regression. I suspect the backend does not handle IR transformed by the mid-end optimizations well, and produces quality code only for unoptimized input. For example, the following IR transformation (no-opt to base) in pulldown-cmark

        v3555 = isub.i32 v8, v9
-@8917  v3100 = iconst.i32 40
-       v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9  v2349 = call fn11(v0, v0, v34, v3558)
+       v3602 = iconst.i32 -24
+       v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9  v2349 = call fn11(v0, v0, v34, v3603)

changed

3b9a:   mov    r8,rcx
3b9d:   sub    r8d,0x40
3ba1:   mov    rax,QWORD PTR [rsp+0x30]
3ba6:   mov    DWORD PTR [rbx+r8*1+0x3c],eax
3bab:   mov    rcx,QWORD PTR [rsp+0x48]
3bb0:   mov    DWORD PTR [rbx+r8*1+0x38],ecx
3bb5:   mov    r10d,0x1
3bbb:   mov    BYTE PTR [rbx+r8*1+0x28],r10b
3bc0:   mov    r11,QWORD PTR [rsp+0x20]
3bc5:   lea    edx,[r11+0x5c]
3bc9:   lea    ecx,[r8+0x28]
...

to

3e8b:   sub    r9d,0x40
3e8f:   mov    rdi,QWORD PTR [rsp+0xc8]
3e97:   mov    r8,QWORD PTR [rsp+0x30]
3e9c:   mov    DWORD PTR [rdi+r9*1+0x3c],r8d
3ea1:   mov    rcx,QWORD PTR [rsp+0x48]
3ea6:   mov    DWORD PTR [rdi+r9*1+0x38],ecx
3eab:   mov    r11d,0x1
3eb1:   mov    BYTE PTR [rdi+r9*1+0x28],r11b
3eb6:   mov    r8,QWORD PTR [rsp+0x20]
3ebb:   lea    edx,[r8+0x5c]
3ebf:   mov    r9,QWORD PTR [rsp+0xd0]
3ec7:   lea    ecx,[r9-0x18]
...

Notice that there are more spills/loads.
A similar regression happens for shootout-switch (one jump becomes two jumps) after a slight change in the IR.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 20:18):

fitzgen commented on issue #12139:

Thanks (as always) for investigating and sharing your results with us, @bongjunj!

My theory is that the rules in (1) interfere with the loop back-edge.

For (1) I suspect we just need to either

Probably the latter? Because it doesn't grow the number of enodes and lets us continue to do the most canonicalizing we can (and therefore the most const propagation we can).

Anyways, that CLIF snippet you isolated seems like it would be a great little filetest to add (even before we have a fix) and then we can have confidence that (a) whatever fix we implement does indeed fix that example code and (b) the codegen for this input pattern won't regress in the future.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 20:30):

fitzgen commented on issue #12139:

I am not sure this is an intended optimizing transformation:

- v3835 = iconst.i32 1
- v3836 = iadd.i32 v1156, v3835  ; v3835 = 1
- v3837 = iconst.i32 6
-@7fd7  v1163 = urem v3836, v3837  ; v3837 = 6
-@7fd8  br_table v1163, block201, [block200, block198, block199, block200, block199]
+ v3988 = iconst.i32 1
+ v3989 = iadd.i32 v1156, v3988  ; v3988 = 1
+ v3685 = iconst.i32 -1431655765
+ v3686 = umulhi v3989, v3685  ; v3685 = -1431655765
+ v3990 = iconst.i32 2
+ v3687 = ushr v3686, v3990  ; v3990 = 2
+ v3991 = iconst.i32 6
+ v3688 = imul v3687, v3991  ; v3991 = 6
+ v3689 = isub v3989, v3688
+ v1163 -> v3689
+@7fd8  br_table v3689, block201, [block200, block198, block199, block200, block199]

This is almost definitely the result of our division/remainder to magic-sequence rules.

It produces more native instructions, but they should take fewer cycles in practice because there is no division/mod/rem operation anymore. Unless you can observe actual slowdowns that motivate otherwise, I think we do want to keep this rewrite.
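
For the curious, here is a minimal sketch of why that sequence computes x % 6 without a hardware divide (Rust used purely for illustration; the constants are the ones from the CLIF diff above). The multiplier 0xAAAAAAAB, printed as -1431655765 in the CLIF, equals ceil(2^34 / 6), so umulhi(x, m) >> 2 yields x / 6 for every u32 x:

// Sketch of the magic-number urem-by-6 sequence from the diff above.
// m = 0xAAAA_AAAB == ceil(2^34 / 6), so floor((x * m) / 2^34) == x / 6
// for all u32 x, and the remainder falls out as x - (x / 6) * 6.
fn urem6_magic(x: u32) -> u32 {
    let m: u32 = 0xAAAA_AAAB;                      // v3685 (-1431655765 as i32)
    let hi = ((x as u64 * m as u64) >> 32) as u32; // umulhi v3989, v3685
    let q = hi >> 2;                               // ushr v3686, 2   => x / 6
    x - q * 6                                      // imul + isub     => x % 6
}

fn main() {
    // Spot-check against the hardware remainder, including the edge cases.
    for x in [0u32, 1, 5, 6, 7, 1_000_000, u32::MAX] {
        assert_eq!(urem6_magic(x), x % 6);
    }
}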

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 20:37):

fitzgen commented on issue #12139:

This might be from the IR transformation below (notice the increased usage of v8, which can contribute to higher register pressure):

Addressing this specifically requires support for rematerializing values. The best place to do this is probably in the register allocator (where we can keep reusing values until we hit register pressure, and can then rematerialize rather than spill and reload), although we can also do it in the mid-end (and in fact the mid-end will already rematerialize constants once per block, IIRC). The problem with doing more remat in the mid-end is that it doesn't know about register pressure, or even the number of registers the backend has available. Seems like the kind of thing that would be best done during regalloc.

Anyways, we have a couple issues related to rematerialization on file:

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 21:00):

fitzgen commented on issue #12139:

It is probably time to dig into the backend (codegen, instruction scheduling, register allocation), because a subtle change in the IR is causing a surprising performance regression. I suspect the backend does not handle IR transformed by the mid-end optimizations well, and produces quality code only for unoptimized input. For example, the following IR transformation (no-opt to base) in pulldown-cmark

An interesting performance/codegen fuzzing experiment would be to do something like:

I suspect that we would find a lot of stuff this way, probably more than we realistically have spare cycles to evaluate. We would need to put effort into prioritizing bugs (so extracting snippets from benchmarks / existing lowering rules is nice, because it would implicitly surface higher-priority bugs than starting from arbitrary sequences would).

We would also have to take care to keep the snippets small enough that we don't run into https://github.com/bytecodealliance/wasmtime/issues/12156, as that would lead to uninteresting results (we already know that our cost function has issues with large expressions that have lots of shared structure and know we need to fix that, and it is a separate issue from our actual rewrite/lowering rule set).

view this post on Zulip Wasmtime GitHub notifications bot (Dec 15 2025 at 21:53):

cfallin commented on issue #12139:

Thanks, @bongjunj, for this analysis. To add my take on the immediate issue first: your data makes clear that fixing one outlier creates several more, and on balance it seems better to keep the rules as they are. Regressing SpiderMonkey by 2% is a far bigger deal than regressing one of the shootout benchmarks -- the latter are very small "kernels" (single hot loops doing one thing), so they are sensitive to particular cases of heuristics going bad, while the SpiderMonkey case is more representative of real workloads.

I think it is worth setting expectations here: while it would be ideal for the compiler never to have optimizations that make results worse, in practice, once we widen the scope of optimization far enough, the NP-complete, multidimensional optimization problem we are solving will inevitably produce outliers. Studying those outliers is helpful, but they are not the same as, say, correctness bugs that must be fixed.

I'll echo @fitzgen's point as well that more canonicalization is better, as a general principle. Removing rules that canonicalize might avoid some case in some outlier but could have many unseen effects later -- we aren't measuring the opportunity cost. It'd be better to "complete the work" and make the rest of the pipeline work well with the canonicalized form.

Rematerialization in regalloc is a holy grail if we can achieve it, but it is very hard to do because it crosses abstraction layers in the compiler: we would somehow need to represent the multiple possibilities to compute (or reuse) a value in VCode, and we'd need to reason about the indirect effects of live ranges on register pressure. The simplest case is a pure instruction that can be re-executed, but even that is nontrivial from a regalloc point of view, because it could extend the live ranges of that rematerialized instruction's inputs, force more register pressure elsewhere, and cascade into evictions and further rematerialization decisions. The most principled way to deal with all of this is to encode the whole thing -- lowering, code scheduling, and regalloc -- as an ILP (integer linear programming) problem and solve it, but that's essentially a rewrite of Cranelift's whole backend, and also incompatible with the compile-time-performance corner of the design space we inhabit. All of that to say: we can point in that direction and call it an eventual possible solution, but it's not clear to me that it's practical or even solvable.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 16 2025 at 05:42):

bongjunj commented on issue #12139:

add missing x64 lowering rules that recognize the canonicalized form

I'll echo @fitzgen's point as well that more canonicalization is better, as a general principle. Removing rules that canonicalize might avoid some case in some outlier but could have many unseen effects later -- we aren't measuring the opportunity cost. It'd be better to "complete the work" and make the rest of the pipeline work well with the canonicalized form.

Yes, I also think this is the right direction. Some of the rules I removed are very general, such as (x - y) + y => x, so getting rid of them does not make much sense; we have to work on using them properly to boost performance. The downside of removing such rules is already visible in the spidermonkey row of the data above.

An interesting performance/codegen fuzzing experiment would be to do something like:

I think this could make a good microbenchmark for tracking code generation quality. Asserting equality between the before- and after-mutation outputs may be too strong; instead, we could estimate CPU cycle counts with a tool like llvm-mca (https://llvm.org/docs/CommandGuide/llvm-mca.html). That would automatically detect performance regressions in the generated code while withstanding minor changes from future backend updates.


P.S.

I was investigating this problem because it was surprising that some rules I had submitted here were degrading performance in some cases. I think it is clear that optimized IR can degrade performance by affecting the later code-gen stages. But it seems the problem can be solved later with improved regalloc and codegen!

view this post on Zulip Wasmtime GitHub notifications bot (Dec 16 2025 at 21:46):

fitzgen commented on issue #12139:

Asserting equality between the before- and after-mutation outputs may be too strong; instead, we could estimate CPU cycle counts with a tool like llvm-mca (https://llvm.org/docs/CommandGuide/llvm-mca.html). That would automatically detect performance regressions in the generated code while withstanding minor changes from future backend updates.

Indeed, I was originally thinking something similar, but then realized that (so long as we don't introduce any null side effects in the mutation, like reading a value from memory and writing it back to the same place, which Cranelift will never be able to optimize away, even in the limit) it doesn't really matter whether the codegen for the initial or the mutated snippet is better: we should be generating whichever code is better when given either snippet. llvm-mca would be good for determining which version of the code we should generate for both snippets, once we have found a divergence between them.

P.S.

I was investigating this problem because it was surprising that some rules I had submitted here were degrading performance in some cases. I think it is clear that optimized IR can degrade performance by affecting the later code-gen stages. But it seems the problem can be solved later with improved regalloc and codegen!

It is true that these problems can and often should be solved at the regalloc or instruction selection levels, rather than in the mid-end, but that doesn't mean we shouldn't worry about performance regressions at all when adding new mid-end rules. Sometimes fixing the problem in the right place is pretty difficult and unlikely to happen soon (e.g. regalloc rematerialization). In general, we should strive not to regress the "default" suite of Sightglass benchmarks, which represent Real World programs; it can be okay, on the other hand, to regress the micro-benchmarks, like shootout, if we understand why the regression is happening and have other, higher-priority motivations to outweigh the micro-benchmark regression.

