bongjunj opened issue #12139:
Hi,
this is a follow-up to https://github.com/bytecodealliance/wasmtime/issues/12106.
Although we've removed two sorts of performance-regressing mid-end ISLE rules,
a significant performance degradation still remains, along with other suspected cases.
(There is, of course, a bright side: we have significant performance improvements in many cases!)

Performance regressions:
- shootout-switch
- pulldown-cmark
First, here is the backing data of the performance regression:
| Benchmark | No Opt | Main | Main Speedup |
|---|---:|---:|---:|
| blake3-scalar | 317,727 | 317,719 | 0.00% |
| blake3-simd | 313,115 | 306,232 | 2.25% |
| bz2 | 87,201,400 | 86,337,330 | 1.00% |
| pulldown-cmark | 6,580,174 | 6,905,992 | -4.72% |
| regex | 209,743,816 | 210,183,175 | -0.21% |
| shootout-ackermann | 8,498,140 | 7,764,439 | 9.45% |
| shootout-base64 | 381,721,177 | 352,724,661 | 8.22% |
| shootout-ctype | 830,813,398 | 796,486,698 | 4.31% |
| shootout-ed25519 | 9,583,747,723 | 9,395,321,203 | 2.01% |
| shootout-fib2 | 3,009,269,670 | 3,010,509,565 | -0.04% |
| shootout-gimli | 5,338,258 | 5,401,697 | -1.17% |
| shootout-heapsort | 2,382,073,831 | 2,375,914,107 | 0.26% |
| shootout-keccak | 25,168,386 | 21,112,482 | 19.21% |
| shootout-matrix | 538,696,036 | 544,739,691 | -1.11% |
| shootout-memmove | 36,156,621 | 36,115,998 | 0.11% |
| shootout-minicsv | 1,481,713,625 | 1,291,534,227 | 14.73% |
| shootout-nestedloop | 449 | 442 | 1.43% |
| shootout-random | 630,328,205 | 439,691,474 | 43.36% |
| shootout-ratelimit | 39,148,817 | 39,956,714 | -2.02% |
| shootout-seqhash | 8,869,585,125 | 8,639,110,150 | 2.67% |
| shootout-sieve | 905,404,028 | 840,777,681 | 7.69% |
| shootout-switch | 139,525,474 | 153,663,682 | -9.20% |
| shootout-xblabla20 | 2,891,404 | 2,907,369 | -0.55% |
| shootout-xchacha20 | 4,384,703 | 4,395,319 | -0.24% |
| spidermonkey | 636,104,785 | 631,998,404 | 0.65% |
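For reference, the "Main Speedup" column appears to be the ratio of the two measurements minus one. A minimal Rust sketch (my own reconstruction of the metric, checked against a few rows above, not part of the original report):

```rust
// Reconstruct the "Main Speedup" column: speedup = no_opt / main - 1, in percent.
// This is an assumption inferred from the table, not Sightglass's own code.
fn speedup_percent(no_opt: f64, main: f64) -> f64 {
    (no_opt / main - 1.0) * 100.0
}

fn main() {
    // blake3-simd: 313,115 vs 306,232 -> ~+2.25%
    println!("blake3-simd:     {:+.2}%", speedup_percent(313_115.0, 306_232.0));
    // pulldown-cmark: 6,580,174 vs 6,905,992 -> ~-4.72%
    println!("pulldown-cmark:  {:+.2}%", speedup_percent(6_580_174.0, 6_905_992.0));
    // shootout-switch: 139,525,474 vs 153,663,682 -> ~-9.20%
    println!("shootout-switch: {:+.2}%", speedup_percent(139_525_474.0, 153_663_682.0));
}
```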
Unlike the previous cases, the cause is not obvious.

```
19245 clif/v-no-opts/shootout-switch/wasm[0]--function[9]--__original_main.clif
19241 clif/v-main/shootout-switch/wasm[0]--function[9]--__original_main.clif
```

The number of instructions does not increase significantly from no-opt to main.
However, the applied optimizations make the program use a long-lived value:

```diff
--- /data/bongjun/clif/v-no-opts/shootout-switch/wasm[0]--function[9]--__original_main.clif	2025-12-08 12:43:58.406738645 +0000
+++ /data/bongjun/clif/v-main/shootout-switch/wasm[0]--function[9]--__original_main.clif	2025-12-08 12:49:01.961085326 +0000
-    v8572 = iconst.i32 1066
-    v8573 = iconst.i32 0
-@d20b    v4324 = call fn1(v0, v0, v8572, v8573)  ; v8572 = 1066, v8573 = 0
-    v8574 = iadd.i64 v105, v106  ; v106 = 3584
-@d219    v4333 = load.i32 little heap v8574
-    v8575 = iconst.i32 6
-    v8576 = icmp uge v4333, v8575  ; v8575 = 6
+    v8603 = iconst.i32 1066
+    v8604 = iconst.i32 0
+@d20b    v4324 = call fn1(v0, v0, v8603, v8604)  ; v8603 = 1066, v8604 = 0
+    v8605 = iadd.i64 v11, v106  ; v106 = 3584
+@d219    v4333 = load.i32 little heap v8605
+    v8606 = iconst.i32 6
+    v8607 = icmp uge v4333, v8606  ; v8606 = 6
```

See `v8574` and `v8605`, which use `v105` and `v11` respectively.
`v11` is defined at the beginning, but `v105` is defined much later than `v11`:

```
block0(v0: i64, v1: i64):
@01f0    v5 = load.i32 notrap aligned table v0+256
@01f6    v6 = iconst.i32 16
@01f8    v7 = isub v5, v6  ; v6 = 16
@01fb    store notrap aligned table v7, v0+256
@0203    v9 = iconst.i32 0x2710
@0207    v11 = load.i64 notrap aligned readonly can_move checked v0+56
...
@02d6    v105 = iadd.i64 v11, v4439
```

This might increase register pressure, causing more spills, which can degrade performance.
alexcrichton added the cranelift label to Issue #12139.
alexcrichton added the cranelift:goal:optimize-speed label to Issue #12139.
alexcrichton added the performance label to Issue #12139.
bongjunj commented on issue #12139:
Investigating the generated CLIF for the `shootout-switch` case, I noticed there are over 2,500 unused constants in a block:
https://gist.github.com/bongjunj/08cf48d4e5827cf8ed270f26442e2604#file-main-clif-L216-L240.
For example, look up `v185` in the CLIF file: nothing matches it.

This happens not only in that block but in another block too:

```
block3:
    v4434 = iconst.i32 16
    v4435 = iadd.i32 v16, v4434  ; v4434 = 16
    v4436 = iconst.i32 0
@0236    v31 = iconst.i32 7
@0231    v28 = iconst.i32 12
@0243    v38 = iconst.i32 6
@023e    v36 = iconst.i32 8
@0250    v45 = iconst.i32 5
@024b    v43 = iconst.i32 4
@0267    v57 = iconst.i32 3
@0262    v55 = iconst.i32 -4
@0274    v64 = iconst.i32 2
@026f    v62 = iconst.i32 -8
@0281    v71 = iconst.i32 1
@027c    v69 = iconst.i32 -12
@0289    v76 = iconst.i32 -16
    v4409 = iconst.i32 9992
@0293    v81 = iconst.i32 32
@022d    jump block4(v4435, v4436)  ; v4436 = 0
...
@028e    v78 = uextend.i64 v4463
@028e    v80 = iadd.i64 v11, v78
@028e    store little heap v30, v80
    v4413 = icmp ne v30, v4409  ; v4409 = 9992
```

Surprisingly, none of the constants from `v31` to `v76` defined in this block are referenced anywhere in the function. Is this intended behavior when translating `switch`?

Furthermore, the constant `v4409` that is newly generated as a result of optimization is not placed close to its only use, `v4413`. I see no valid reason to instantiate such a constant so far from its single use.

To repro this, run

```
wasmtime compile sightglass/benchmarks/shootout/shootut-switch.wasm --emit-clif <dir>
```

and look at the `shootout-switch/wasm[0]--function[9]--__original_main.clif` file.

cc @cfallin
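A rough way to check this mechanically: the Rust sketch below scans an emitted `.clif` dump for values that are defined but never appear as an operand. It is a heuristic text scan written for illustration, not a real CLIF parser and not part of the original investigation.

```rust
use std::collections::{HashMap, HashSet};
use std::fs;

fn record_uses(s: &str, used: &mut HashSet<String>) {
    // Any token shaped like `v<digits>` counts as a use (this also counts the
    // `; vNNN = ...` comment annotations, which is fine for this heuristic).
    for tok in s.split(|c: char| !c.is_ascii_alphanumeric()) {
        if tok.len() > 1 && tok.starts_with('v') && tok[1..].chars().all(|c| c.is_ascii_digit()) {
            used.insert(tok.to_string());
        }
    }
}

fn main() {
    let path = std::env::args().nth(1).expect("usage: unused-values <file.clif>");
    let text = fs::read_to_string(&path).expect("failed to read the CLIF file");

    let mut defined: HashMap<String, usize> = HashMap::new();
    let mut used: HashSet<String> = HashSet::new();

    for (idx, line) in text.lines().enumerate() {
        if let Some(eq) = line.find(" = ") {
            // Definitions look like `v185 = iconst.i32 3`, possibly after an `@`-offset.
            let lhs = line[..eq].split_whitespace().last().unwrap_or("");
            if lhs.len() > 1 && lhs.starts_with('v') && lhs[1..].chars().all(|c| c.is_ascii_digit()) {
                defined.entry(lhs.to_string()).or_insert(idx + 1);
            }
            record_uses(&line[eq + 3..], &mut used);
        } else {
            record_uses(line, &mut used);
        }
    }

    let mut unused: Vec<_> = defined.iter().filter(|(v, _)| !used.contains(*v)).collect();
    unused.sort_by_key(|(_, line)| **line);
    println!("{} defined-but-unused values", unused.len());
    for (value, line) in unused {
        println!("  {value} (defined on line {line})");
    }
}
```

Running it over the emitted `__original_main.clif` should flag `v185` and its neighbours if they really have no uses.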
bongjunj commented on issue #12139:
`shootout-switch`

Performance-regressing rules

By removing the following rules, I observed that the performance regression can be fixed.
The performance gap in execution time between the no-opt version (NO) and the latest version (Base) of Cranelift is -9.20% before this removal and zero after it.

(1) Removing these, the gap between NO and Base was -6.97%.
(2) Further removing these, the gap between NO and Base was -1.37%.
(3) Further removing these, the gap between NO and Base was zero.
Analysis 1
My theory is that the rules in (1) interfere with the loop back-edge.
Let's take a detailed look at the generated IR and x86 machine code.

```diff
 block4(v27: i32, v30: i32):
-    v4432 = iadd v30, v4411  ; v4411 = 8
-    v4433 = iconst.i32 0x2710
-    v4434 = icmp ne v4432, v4433  ; v4433 = 0x2710
-@02a3    v87 = uextend.i32 v4434
-    v4435 = iconst.i32 32
-    v4436 = iadd v27, v4435  ; v4435 = 32
-@02a4    brif v87, block4(v4436, v4432), block6
+    v4413 = icmp ne v30, v4409  ; v4409 = 9992
+@02a3    v87 = uextend.i32 v4413
+    v4464 = iconst.i32 32
+    v4465 = iadd v27, v4464  ; v4464 = 32
+    v4466 = iadd v30, v4443  ; v4443 = 8
+@02a4    brif v87, block4(v4465, v4466), block6
```

The IR transformation can be simplified into:
```diff
-a <- x + 8
-c <- a != 10,000 (0x2710)
-d <- y + 32
-br_if c, block4(d, a), block6
+c <- x != 9,992
+d <- y + 32
+a <- x + 8
+br_if c, block4(d, a), block6
```

This degraded the quality of the generated machine code. Although the comparison is simplified, the addition is now two-phase (compute and commit) and there are now two jumps.
```diff
-add    r8d,0x8
-add    ecx,0x20
-cmp    r8d,0x2710
-jne    fd <__original_main+0x7d>
+add    ecx,0x20
+lea    r11d,[r8+0x8]   ; r11d = x + 8 (compute next)
+cmp    r8d,0x2708      ; compare *old* x to 9992
+je     181 <__original_main+0x101>
+mov    r8,r11          ; commit x = x + 8
+jmp    102 <__original_main+0x82>
```
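To make the "compute and commit" point concrete, here is a rough source-level picture of the two loop shapes. This is my own illustration, assuming the back-edge behaves like a simple counted loop; it is not the actual benchmark source.

```rust
// Before the rewrite: increment, then compare the *new* induction value.
fn back_edge_before(mut x: u32, mut y: u32) -> (u32, u32) {
    loop {
        // ... loop body using x and y ...
        x = x.wrapping_add(8);
        y = y.wrapping_add(32);
        if x == 10_000 {
            break; // add + add + cmp + jne on the machine-code side
        }
    }
    (x, y)
}

// After the rewrite: compare the *old* value against 10_000 - 8 = 9_992, so the
// updated x has to be computed separately and committed before the back-edge.
fn back_edge_after(mut x: u32, mut y: u32) -> (u32, u32) {
    loop {
        // ... loop body using x and y ...
        let next_x = x.wrapping_add(8); // "compute" phase (the lea above)
        y = y.wrapping_add(32);
        let done = x == 9_992;          // compare the old value
        x = next_x;                     // "commit" phase (the extra mov above)
        if done {
            break;
        }
    }
    (x, y)
}

fn main() {
    // Both shapes compute the same result; only the schedule around the branch differs.
    assert_eq!(back_edge_before(0, 0), back_edge_after(0, 0));
    println!("both loop shapes agree");
}
```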
Analysis 2

The rules in (2) and (3) are now the interesting part: how did removing them "cure" the performance problem?
In fact, the rules in (3) can interact with these rules in `icmp.isle`:

```
(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))
```

Then, similar to Analysis 1, they can cause the same problem when generating jump instructions.
Conclusion
- I don't yet have a theory for how removing the rules in (2) improved performance.
- I couldn't decide where the fix belongs: the mid-end or code-gen? The mid-end rules definitely made the code-gen worse, but those rules look fine on their own; they make sense and reduce the number of IR instructions. We should probably be more careful around "jumping" instructions, assuming there is a reason the Wasm assembly is produced the way it is.
bongjunj commented on issue #12139:
`pulldown-cmark`

Execution performance regressed by 4.38% (a -4.38% speedup) from the no-opt version (NO) to the latest version (Base).

Analysis

The hottest function for this benchmark program is `pulldown_cmark::parse::FirstPass::parse_line`.
For this function, Base generated 7% larger x86 assembly than NO.
In addition, there are more "read from the stack" instructions in Base, indicating a register allocation problem.
- NO: `mov.*QWORD PTR.*rsp.*r.*` matches 511 lines
- Base: `mov.*QWORD PTR.*rsp.*r.*` matches 637 lines

This might be from the IR transformation below (notice the increased use of `v8`, which can contribute to higher register pressure):

```diff
@@ -2600,9 +2586,7 @@
 @892d    store little heap v3177, v3174
     v4298 = isub.i32 v8, v9  ; v9 = 64
 ...
 @895c    v3246 = uextend.i64 v3239
 @895c    v3248 = iadd.i64 v13, v3246
 @895c    istore8 little heap v3245, v3248
-    v4308 = iconst.i32 64
-    v4309 = iadd v4298, v4308  ; v4308 = 64
-@8965    store notrap aligned table v4309, v0+400
+@8965    store.i32 notrap aligned table v8, v0+400
 @896b    jump block1
```

This symptom repeats:
```diff
 @752e    v34 = iadd.i32 v3, v33  ; v33 = 92
-@8917    v3100 = iconst.i32 40
-    v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9    v2349 = call fn11(v0, v0, v34, v3558)
+    v3602 = iconst.i32 -24
+    v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9    v2349 = call fn11(v0, v0, v34, v3603)
 @87c0    jump block11
```

```diff
-@785a    v315 = iadd v3567, v314  ; v314 = 44
-@8822    store little heap v3571, v315
-    v3572 = iconst.i64 56
-    v3573 = iadd v3567, v3572  ; v3572 = 56
-@8829    store.i32 little heap v47, v3573
-@8832    v2402 = iadd v3571, v47
-    v3574 = iconst.i64 60
-    v3575 = iadd v3567, v3574  ; v3574 = 60
-@8835    store little heap v2402, v3575
-    v3576 = iconst.i32 92
-    v3577 = iadd.i32 v3, v3576  ; v3576 = 92
-    v3578 = iconst.i32 40
-    v3579 = iadd v3565, v3578  ; v3578 = 40
-@883f    v2413 = call fn11(v0, v0, v3577, v3579)
-@8846    jump block10(v2387, v47, v2402, v259, v410, v1203, v3571)
+@785a    v315 = iadd v3720, v314  ; v314 = 44
+@8822    store little heap v3724, v315
+    v3725 = iconst.i64 56
+    v3726 = iadd v3720, v3725  ; v3725 = 56
+@8829    store.i32 little heap v47, v3726
+@8832    v2402 = iadd v3724, v47
+    v3727 = iconst.i64 60
+    v3728 = iadd v3720, v3727  ; v3727 = 60
+@8835    store little heap v2402, v3728
+    v3729 = iconst.i32 92
+    v3730 = iadd.i32 v3, v3729  ; v3729 = 92
+    v3731 = iconst.i32 -24
+    v3732 = iadd.i32 v8, v3731  ; v3731 = -24
+@883f    v2413 = call fn11(v0, v0, v3730, v3732)
+@8846    jump block10(v2387, v47, v2402, v259, v410, v1203, v3724)
```

Sidenote
I am not sure whether this is an intended optimizing transformation:

```diff
-    v3835 = iconst.i32 1
-    v3836 = iadd.i32 v1156, v3835  ; v3835 = 1
-    v3837 = iconst.i32 6
-@7fd7    v1163 = urem v3836, v3837  ; v3837 = 6
-@7fd8    br_table v1163, block201, [block200, block198, block199, block200, block199]
+    v3988 = iconst.i32 1
+    v3989 = iadd.i32 v1156, v3988  ; v3988 = 1
+    v3685 = iconst.i32 -1431655765
+    v3686 = umulhi v3989, v3685  ; v3685 = -1431655765
+    v3990 = iconst.i32 2
+    v3687 = ushr v3686, v3990  ; v3990 = 2
+    v3991 = iconst.i32 6
+    v3688 = imul v3687, v3991  ; v3991 = 6
+    v3689 = isub v3989, v3688
+    v1163 -> v3689
+@7fd8    br_table v3689, block201, [block200, block198, block199, block200, block199]
```
cfallin commented on issue #12139:
@bongjunj thanks for this analysis! I'll think about it in more detail on Monday, but I wonder if you could provide one more data point: you have found how to remove the regressions on `shootout-switch` and `pulldown-cmark`, but with the rules that cause them removed, how does performance look on the rest of the benchmarks?
shootout-switchandpulldown-cmarkbut with the rules that cause this removed, how does performance look on the rest of the benchmarks?Basically: I want to think about performance with two metrics of goodness, namely average/overall speedup, and negative outliers. We want to avoid outliers, and they may point the way toward deficiencies in our rules or algorithms. But I also don't want to regress compiler performance as a whole by removing rules -- so it'd be useful to know the overall impact.
bongjunj commented on issue #12139:
I delta-debugged to locate the rules causing the performance regression for `pulldown-cmark` and found the following list of rules:

```
(rule (simplify (ne ty (iadd _ a k) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ a k) (iadd _ k b))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ b k))) (subsume (ne ty a b)))
(rule (simplify (ne ty (iadd _ k a) (iadd _ k b))) (subsume (ne ty a b)))

(rule (simplify (isub ty (ishl ty x z) (ishl ty y z))) (ishl ty (isub ty x y) z))

(rule (simplify (uextend ty (uextend _intermediate_ty x))) (uextend ty x))
(rule (simplify (sextend ty (sextend _intermediate_ty x))) (sextend ty x))

;; Reassociate across `==`/`!=` when we can simplify a constant
;; `x + K1 == K2` --> `x == K2 - K1`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _))) (eq ty1 x (isub ty2 k2 k1)))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) k2@(iconst _ _))) (ne ty1 x (isub ty2 k2 k1)))
;; `x - K1 == K2` --> `x == K2 + K1`
(rule (simplify (eq ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _))) (eq ty1 x (iadd ty2 k2 k1)))
(rule (simplify (ne ty1 (isub ty2 x k1@(iconst _ _)) k2@(iconst _ _))) (ne ty1 x (iadd ty2 k2 k1)))
;; `x + K1 == y + K2` --> `x == y + (K2 - K1)`
(rule (simplify (eq ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _)))) (eq ty1 x (iadd ty2 y (isub ty3 k2 k1))))
(rule (simplify (ne ty1 (iadd ty2 x k1@(iconst _ _)) (iadd ty3 y k2@(iconst _ _)))) (ne ty1 x (iadd ty2 y (isub ty3 k2 k1))))

(rule (simplify (iadd ty (isub ty x (iconst ty (u64_from_imm64 k1))) (iconst ty (u64_from_imm64 k2)))) (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty (isub ty (iconst ty (u64_from_imm64 k1)) x) (iconst ty (u64_from_imm64 k2)))) (isub ty (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2))) x))

;; iconst_[su] support $I128, but do so by extending, so restricting to
;; 64-bit or smaller keeps it from just remaking essentially the same thing.
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k))) (subsume (iconst_u wide k)))
(rule (simplify (sextend (fits_in_64 wide) (iconst_s narrow k))) (subsume (iconst_s wide k)))

(rule (simplify (isub (fits_in_64 ty) (iconst ty (u64_from_imm64 k1)) (iconst ty (u64_from_imm64 k2)))) (subsume (iconst ty (imm64_masked ty (u64_wrapping_sub k1 k2)))))

;; (x - y) + x --> x
(rule (simplify (iadd ty (isub ty x y) y)) x)
(rule (simplify (iadd ty y (isub ty x y))) x)
;; (x + y) - y --> x
(rule (simplify (isub ty (iadd ty x y) x)) y)
(rule (simplify (isub ty (iadd ty x y) y)) x)

(rule (simplify_skeleton (urem x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d))))) (if-let false (u32_is_power_of_two d)) (apply_div_const_magic_u32 (Opcode.Urem) x d))
(rule (simplify_skeleton (udiv x (iconst_u $I32 (u64_extract_non_zero (u32_from_u64 d))))) (if-let false (u32_is_power_of_two d)) (apply_div_const_magic_u32 (Opcode.Udiv) x d))

;; 0-x == (ineg x).
(rule (simplify (isub ty (iconst_u ty 0) x)) (ineg ty x))
```

A performance regression relative to the no-opt Cranelift is observed with each of these rules.
As an experiment, I removed all of these rules, which reduced the perf regression to -2.24% but did not eliminate it entirely.

Here are the overall data:

- Fix1 is the Cranelift version without the perf-regressing rules for `shootout-switch` (see https://github.com/bytecodealliance/wasmtime/issues/12139#issuecomment-3650453592)
- Fix2 is the Cranelift version without the perf-regressing rules for `pulldown-cmark` (see above)
| Benchmark | Fix1 Speedup | Fix2 Speedup | Main Speedup |
|---|---:|---:|---:|
| blake3-scalar | -0.69% | -0.27% | 0.55% |
| blake3-simd | 0.64% | 0.82% | 2.35% |
| bz2 | 0.36% | 0.22% | 1.05% |
| pulldown-cmark | -5.08% | -2.24% | 0.59% |
| regex | -0.51% | 0.77% | -0.22% |
| shootout-ackermann | -1.39% | -1.52% | 9.74% |
| shootout-base64 | 8.93% | 7.04% | 8.22% |
| shootout-ctype | 1.07% | 1.06% | 4.27% |
| shootout-ed25519 | 1.83% | 0.70% | 1.96% |
| shootout-fib2 | 0.06% | 0.08% | 0.08% |
| shootout-gimli | -0.27% | -0.16% | -1.46% |
| shootout-heapsort | -2.33% | -0.76% | 0.11% |
| shootout-keccak | 19.10% | 19.11% | 19.12% |
| shootout-matrix | 0.17% | 0.24% | 0.81% |
| shootout-memmove | -6.34% | -6.27% | -0.60% |
| shootout-minicsv | 0.68% | 0.69% | 15.46% |
| shootout-nestedloop | -11.25% | 4.71% | 3.71% |
| shootout-random | 43.37% | 0.00% | 43.37% |
| shootout-ratelimit | -0.39% | -0.52% | -1.89% |
| shootout-seqhash | -0.85% | -1.55% | 1.86% |
| shootout-sieve | 0.07% | -0.03% | 7.67% |
| shootout-switch | 0.03% | 0.00% | -9.16% |
| shootout-xblabla20 | -3.06% | -3.14% | -0.31% |
| shootout-xchacha20 | 0.15% | -0.36% | 0.07% |
| spidermonkey | -1.66% | -2.58% | 0.24% |
You can see that the performance is restored completely for `shootout-switch` and partially for `pulldown-cmark`.
But we are losing ground on `shootout-ackermann`, `shootout-base64`, `shootout-heapsort`, `shootout-memmove`, `spidermonkey`, and others.

It is probably time to look into the backend code-gen/instruction scheduling/register allocation, because a subtle change in the IR is causing a surprising performance regression. I suspect the code-gen does not handle IR transformed by the mid-end optimizations well and only produces quality code for non-optimized input. For example, the following IR transformation (no-opt to Base) in `pulldown-cmark`

```diff
     v3555 = isub.i32 v8, v9
-@8917    v3100 = iconst.i32 40
-    v3558 = iadd v3555, v3100  ; v3100 = 40
-@87b9    v2349 = call fn11(v0, v0, v34, v3558)
+    v3602 = iconst.i32 -24
+    v3603 = iadd.i32 v8, v3602  ; v3602 = -24
+@87b9    v2349 = call fn11(v0, v0, v34, v3603)
```

changed
```
3b9a: mov    r8,rcx
3b9d: sub    r8d,0x40
3ba1: mov    rax,QWORD PTR [rsp+0x30]
3ba6: mov    DWORD PTR [rbx+r8*1+0x3c],eax
3bab: mov    rcx,QWORD PTR [rsp+0x48]
3bb0: mov    DWORD PTR [rbx+r8*1+0x38],ecx
3bb5: mov    r10d,0x1
3bbb: mov    BYTE PTR [rbx+r8*1+0x28],r10b
3bc0: mov    r11,QWORD PTR [rsp+0x20]
3bc5: lea    edx,[r11+0x5c]
3bc9: lea    ecx,[r8+0x28]
...
```

to

```
3e8b: sub    r9d,0x40
3e8f: mov    rdi,QWORD PTR [rsp+0xc8]
3e97: mov    r8,QWORD PTR [rsp+0x30]
3e9c: mov    DWORD PTR [rdi+r9*1+0x3c],r8d
3ea1: mov    rcx,QWORD PTR [rsp+0x48]
3ea6: mov    DWORD PTR [rdi+r9*1+0x38],ecx
3eab: mov    r11d,0x1
3eb1: mov    BYTE PTR [rdi+r9*1+0x28],r11b
3eb6: mov    r8,QWORD PTR [rsp+0x20]
3ebb: lea    edx,[r8+0x5c]
3ebf: mov    r9,QWORD PTR [rsp+0xd0]
3ec7: lea    ecx,[r9-0x18]
...
```

Notice that there are more spills/loads.
A similar regression happens for `shootout-switch` (one jump becoming two) with a slight change in the IR.
fitzgen commented on issue #12139:
Thanks (as always) for investigating and sharing your results with us, @bongjunj!
> My theory is that the rules in (1) interfere with the loop back-edge.
For (1) I suspect we just need to either
- remove the `subsume` from those canonicalizing rules, or
- add missing x64 lowering rules that recognize the canonicalized form
Probably the latter? Because it doesn't grow the number of enodes and lets us continue to do the most canonicalizing we can (and therefore the most const propagation we can).
Anyways, that CLIF snippet you isolated seems like it would be a great little filetest to add (even before we have a fix) and then we can have confidence that (a) whatever fix we implement does indeed fix that example code and (b) the codegen for this input pattern won't regress in the future.
fitzgen commented on issue #12139:
> I am not sure whether this is an intended optimizing transformation:

```diff
-    v3835 = iconst.i32 1
-    v3836 = iadd.i32 v1156, v3835  ; v3835 = 1
-    v3837 = iconst.i32 6
-@7fd7    v1163 = urem v3836, v3837  ; v3837 = 6
-@7fd8    br_table v1163, block201, [block200, block198, block199, block200, block199]
+    v3988 = iconst.i32 1
+    v3989 = iadd.i32 v1156, v3988  ; v3988 = 1
+    v3685 = iconst.i32 -1431655765
+    v3686 = umulhi v3989, v3685  ; v3685 = -1431655765
+    v3990 = iconst.i32 2
+    v3687 = ushr v3686, v3990  ; v3990 = 2
+    v3991 = iconst.i32 6
+    v3688 = imul v3687, v3991  ; v3991 = 6
+    v3689 = isub v3989, v3688
+    v1163 -> v3689
+@7fd8    br_table v3689, block201, [block200, block198, block199, block200, block199]
```

This is almost definitely the result of our division/remainder-to-magic-sequence rules:
- https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/arithmetic.isle#L113-L127
- https://github.com/bytecodealliance/wasmtime/blob/ab030780b67ce79c8567cd245353a0e7f1e05d27/cranelift/codegen/src/opts/div_const.rs
It produces more native instructions, but they should take fewer cycles in practice because there is no division/mod/rem operation anymore. Unless you can observe actual slowdowns to motivate otherwise, I think we do want to keep this rewrite.
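For intuition, the emitted sequence is the standard unsigned-division-by-constant trick: `umulhi` with 0xAAAA_AAAB followed by a shift by 2 computes `x / 6`, and the `imul`/`isub` pair recovers the remainder. A small Rust check written for this writeup (not part of the thread) that mirrors the CLIF above:

```rust
// Mirror the rewritten CLIF: urem-by-6 via umulhi/ushr/imul/isub with the magic
// constant 0xAAAA_AAAB (printed as -1431655765 when viewed as i32).
fn urem6_via_magic(x: u32) -> u32 {
    let hi = ((x as u64 * 0xAAAA_AAAB_u64) >> 32) as u32; // umulhi x, 0xAAAAAAAB
    let quot = hi >> 2;                                   // ushr hi, 2  == x / 6
    x.wrapping_sub(quot.wrapping_mul(6))                  // x - (x / 6) * 6
}

fn main() {
    // Spot-check edge values plus a strided sweep of the u32 range.
    for x in [0u32, 1, 5, 6, 7, 9_992, 10_000, u32::MAX - 1, u32::MAX]
        .into_iter()
        .chain((0..u32::MAX).step_by(12_345))
    {
        assert_eq!(urem6_via_magic(x), x % 6, "mismatch at x = {x}");
    }
    println!("magic urem-by-6 agrees with x % 6 on all sampled inputs");
}
```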
fitzgen commented on issue #12139:
> This might be from the IR transformation below (notice the increased use of `v8`, which can contribute to higher register pressure):

Addressing this specifically requires support for rematerializing values. The best place to do this is probably in the register allocator (where we can keep reusing values until we hit register pressure and can then rematerialize rather than spilling and reloading), although we can also do it in the mid-end (and in fact we will rematerialize constants once per block in the mid-end, IIRC). The problem with doing more remat in the mid-end is that it doesn't know about register pressure or even the number of available registers that the backend has. It seems like the kind of thing that would be best done during regalloc.
Anyways, we have a couple issues related to rematerialization on file:
fitzgen commented on issue #12139:
> It is probably time to look into the backend code-gen/instruction scheduling/register allocation, because a subtle change in the IR is causing a surprising performance regression. I suspect the code-gen does not handle IR transformed by the mid-end optimizations well and only produces quality code for non-optimized input. For example, the following IR transformation (no-opt to Base) in `pulldown-cmark`

An interesting performance/codegen fuzzing experiment would be to do something like:
- Start with some small initial snippet of CLIF/Wasm.
  - Ideally something like a single value that flows into a store, call, branch condition, etc., instead of a whole function body consisting of many side-effectful instructions and lots of control flow.
  - Perhaps by looking at the LHSes of our lowering rules (similar to this paper)
  - Perhaps by extracting snippets from Sightglass benchmarks.
  - Perhaps by generating arbitrary snippets (e.g. via `wasm-smith`).
- Make a semantics-preserving mutation to that input snippet.
- Compile both the original and mutated snippet to native code
- Assert that the generated code is the same
- If they are not the same, then either
- we are missing some kind of rewrite/simplification in our mid-end, or
- we are missing lowering rules for canonicalized IR patterns in our backend (or they are not general enough or something)
I suspect that we would find a lot of stuff this way. Probably more than we realistically have spare cycles to evaluate. Would need to put effort into prioritizing bugs (so extracting from benchmarks / existing lowering rules is nice because it would implicitly find higher-priority bugs than starting from an arbitrary sequence would).
We would also have to take care to keep the snippets small enough that we don't run into https://github.com/bytecodealliance/wasmtime/issues/12156, as that would lead to uninteresting results (we already know that our cost function has issues with large expressions that have lots of shared structure and know we need to fix that, and it is a separate issue from our actual rewrite/lowering rule set).
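A skeletal Rust sketch of the differential loop described above, just to pin down the shape of the harness. Every helper here (`generate_snippet`, `mutate_preserving_semantics`, `compile_to_native`) is a hypothetical placeholder made up for illustration, not a real Wasmtime/Cranelift API; a real harness would wire them to wasm-smith, a semantics-preserving mutator, and `wasmtime compile` or the Cranelift API.

```rust
// Hypothetical placeholders for illustration only; none of these are real APIs.

fn generate_snippet(seed: u64) -> Vec<u8> {
    // e.g. a wasm-smith module, or a CLIF snippet extracted from a Sightglass benchmark.
    seed.to_le_bytes().to_vec()
}

fn mutate_preserving_semantics(snippet: &[u8]) -> Vec<u8> {
    // e.g. rewrite `x + 8 != 10000` into `x != 9992` without changing behaviour.
    snippet.to_vec()
}

fn compile_to_native(snippet: &[u8]) -> Vec<u8> {
    // e.g. shell out to `wasmtime compile` and return the hot function's machine code.
    snippet.to_vec()
}

fn main() {
    for seed in 0..1_000u64 {
        let original = generate_snippet(seed);
        let mutated = mutate_preserving_semantics(&original);
        let code_a = compile_to_native(&original);
        let code_b = compile_to_native(&mutated);
        if code_a != code_b {
            // A divergence is a lead: either a missing mid-end simplification or a
            // missing (or insufficiently general) lowering rule in the backend.
            println!(
                "seed {seed}: codegen differs ({} vs {} bytes)",
                code_a.len(),
                code_b.len()
            );
        }
    }
}
```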
cfallin commented on issue #12139:
Thanks, @bongjunj, for this analysis. To add my take on the immediate issue first: your data makes clear that fixing one outlier creates several more; and on balance it seems better to keep the rules the way they are. Regressing SpiderMonkey by 2% is a far bigger deal than one of the shootout benchmarks -- the latter are very small "kernels" (single hot loops doing one thing) so are sensitive to particular cases of heuristics gone bad, while the SpiderMonkey case is more representative of real workloads.
I think it is worth setting expectations here: while it would be ideal for the compiler to never have optimizations that make results worse, in practice once we widen the scope of optimizations large enough, with the NP-complete multidimensional optimization problem, there will inevitably be outliers. Studying those outliers is helpful, but they are not the same as, say, correctness bugs that must be fixed.
I'll echo @fitzgen's point as well that more canonicalization is better, as a general principle. Removing rules that canonicalize might avoid some case in some outlier but could have many unseen effects later -- we aren't measuring the opportunity cost. It'd be better to "complete the work" and make the rest of the pipeline work well with the canonicalized form.
Rematerialization in regalloc is a holy-grail if we can achieve it but is very hard to do because it crosses abstraction layers in the compiler: it would mean that we would somehow need to represent the multiple possibilities to compute (or reuse) a value in VCode, and we'd need to reason about the indirect effects of liveranges on register pressure. The simplest case is a pure instruction that can be re-executed, but even that is nontrivial from a regalloc point of view because it could extend the liveranges of the inputs of that rematerialized instruction, force more register pressure elsewhere, and cascade into evictions and other rematerialization decisions. The most principled way to deal with all of this is to encode the whole thing -- lowering, code scheduling, and regalloc -- into an ILP (integer linear programming) problem and solve it, but that's essentially a rewrite of Cranelift's whole backend and also not compatible with the compiler-performance corner we inhabit. All of that to say: we can point in that direction and say that that is an eventual possible solution, but it's not clear to me that it's practical or solvable.
bongjunj commented on issue #12139:
> add missing x64 lowering rules that recognize the canonicalized form

> I'll echo @fitzgen's point as well that more canonicalization is better, as a general principle. Removing rules that canonicalize might avoid some case in some outlier but could have many unseen effects later -- we aren't measuring the opportunity cost. It'd be better to "complete the work" and make the rest of the pipeline work well with the canonicalized form.
Yes, I also think this is the right direction. Some of the rules I removed are very general, such as `(x - y) + y => x`, so getting rid of them does not make much sense; we have to work on using them properly to boost performance. The downside of removing such rules is already visible in the spidermonkey numbers in the data above.

> An interesting performance/codegen fuzzing experiment would be to do something like:
>
> - Start with some small initial snippet of CLIF/Wasm.
>   - Ideally something like a single value that flows into a store, call, branch condition, etc., instead of a whole function body consisting of many side-effectful instructions and lots of control flow.
>   - Perhaps by looking at the LHSes of our lowering rules (similar to this paper)
>   - Perhaps by extracting snippets from Sightglass benchmarks.
>   - Perhaps by generating arbitrary snippets (e.g. via `wasm-smith`).
> - Make a semantics-preserving mutation to that input snippet.
I think this can be a good microbenchmark to track code generation quality. Asserting equality between the before- and after-mutation outputs may be too strong, though. We could instead calculate the estimated number of CPU cycles using a tool like `llvm-mca` (https://llvm.org/docs/CommandGuide/llvm-mca.html). This would automatically detect performance regressions in the generated code while tolerating minor changes from future backend updates.
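A rough sketch of that comparison in Rust (my own illustration of the idea): run `llvm-mca` on two assembly snippets and compare the reported "Total Cycles". The file names and the `-mcpu` value are assumptions about the local setup, and the summary-line parsing is deliberately naive.

```rust
// Compare llvm-mca's estimated cycle counts for two assembly snippets.
use std::process::Command;

fn estimated_cycles(asm_path: &str) -> Option<u64> {
    let out = Command::new("llvm-mca")
        .args(["-mcpu=skylake", asm_path]) // assumed target CPU
        .output()
        .ok()?;
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .find(|l| l.trim_start().starts_with("Total Cycles:"))
        .and_then(|l| l.split(':').nth(1))
        .and_then(|n| n.trim().parse().ok())
}

fn main() {
    // `before.s` / `after.s` are hypothetical snippet files for the two versions.
    let before = estimated_cycles("before.s").expect("llvm-mca failed on before.s");
    let after = estimated_cycles("after.s").expect("llvm-mca failed on after.s");
    println!("estimated cycles: before = {before}, after = {after}");
    if after > before {
        println!("possible codegen regression for this snippet");
    }
}
```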
P.S.
I was investigating this problem because it was surprising that some rules I had submitted here were degrading performance in some cases. I think it is clear that optimized IR can degrade performance by affecting the later code-gen stages, but it seems like the problem can be solved later with improved reg-alloc and code-gen!
fitzgen commented on issue #12139:
> Asserting equality between the before- and after-mutation outputs may be too strong, though. We could instead calculate the estimated number of CPU cycles using a tool like `llvm-mca` (https://llvm.org/docs/CommandGuide/llvm-mca.html). This would automatically detect performance regressions in the generated code while tolerating minor changes from future backend updates.

Indeed, I was originally thinking something similar, but then realized that (so long as we don't introduce any null side effects in the mutation, like reading a value from memory and writing it back to the same place, which we won't ever be able to optimize away in the limit in Cranelift) it doesn't really matter whether the codegen for the initial or mutated snippet is better: we should be generating whichever code is better when given either the initial or the mutated snippet.
`llvm-mca` would be good for determining which version of the code we should generate for both snippets, once we have already found a divergence between them.

> P.S.
> I was investigating this problem because it was surprising that some rules I had submitted here were degrading performance in some cases. I think it is clear that optimized IR can degrade performance by affecting the later code-gen stages, but it seems like the problem can be solved later with improved reg-alloc and code-gen!
It is true that these problems can and often should be solved at the regalloc or instruction selection levels, rather than in the mid-end, but that doesn't mean we shouldn't worry about performance regressions at all when adding new mid-end rules. Sometimes fixing the problem in the right place is pretty difficult and unlikely to happen soon (e.g. regalloc rematerialization). In general, we should strive not to regress the "default" suite of Sightglass benchmarks, which represent Real World programs; it can be okay, on the other hand, to regress the micro-benchmarks, like shootout, if we understand why the regression is happening and have other, higher-priority motivations to outweigh the micro-benchmark regression.
Last updated: Jan 10 2026 at 20:04 UTC