Stream: git-wasmtime

Topic: wasmtime / issue #13325 Cranelift: major performance regr...


Wasmtime GitHub notifications bot (May 08 2026 at 13:08):

bongjunj opened issue #13325:

Tested with sightglass for spidermonkey-json: https://github.com/bytecodealliance/sightglass/blob/main/benchmarks/spidermonkey/spidermonkey-json.wasm

| Phase | Base | Upstream | Relative Performance |
|---|---|---|---|
| Compilation | 30125453533 | 29900200416 | -0.75% |
| Instantiation | 811322 | 803595.3 | +0.96% |
| Execution | 544499791.7 | 640038909.6 | -14.93% |
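
The report does not define how the Relative Performance column is computed. As a minimal sketch, assuming it is (Base - Upstream) / Upstream on the raw figures above, the following reproduces the Instantiation and Execution rows. Under that same assumption the Compilation row comes out as +0.75% rather than the -0.75% shown, so either the convention differs for that row or its sign is off. The function name is hypothetical.

```python
def relative_performance(base: float, upstream: float) -> float:
    """Percent difference of Base relative to Upstream (assumed convention)."""
    return (base - upstream) / upstream * 100.0

# Raw (Base, Upstream) figures copied from the table above.
rows = {
    "Compilation": (30125453533, 29900200416),
    "Instantiation": (811322, 803595.3),
    "Execution": (544499791.7, 640038909.6),
}

results = {phase: relative_performance(b, u) for phase, (b, u) in rows.items()}
for phase, pct in results.items():
    print(f"{phase}: {pct:+.2f}%")
```

A negative value here means Upstream took more cycles than Base, which is the regression reported for Execution.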

Expected Results

SpiderMonkey represents a major workload for Wasmtime.
Mid-end optimizations should not regress its performance.

Actual Results

Execution performance is severely regressed: the table above reports -14.93% for Upstream relative to Base.

Versions and Environment

Cranelift version or commit: 8315a90ced0d01bdddfea92af514a6cd30da4abf

Operating system: Linux, x64

Architecture

<details>
<summary>lscpu</summary>

Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             46 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      64
  On-line CPU(s) list:       0-63
Vendor ID:                   GenuineIntel
  Model name:                Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
    CPU family:              6
    Model:                   85
    Thread(s) per core:      2
    Core(s) per socket:      16
    Socket(s):               2
    Stepping:                7
    CPU(s) scaling MHz:      33%
    CPU max MHz:             3900.0000
    CPU min MHz:             1200.0000
    BogoMIPS:                5800.00
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities ibpb_exit_to_user
Virtualization features:
  Virtualization:            VT-x
Caches (sum of all):
  L1d:                       1 MiB (32 instances)
  L1i:                       1 MiB (32 instances)
  L2:                        32 MiB (32 instances)
  L3:                        44 MiB (2 instances)
NUMA:
  NUMA node(s):              2
  NUMA node0 CPU(s):         0-15,32-47
  NUMA node1 CPU(s):         16-31,48-63
Vulnerabilities:
  Gather data sampling:      Mitigation; Microcode
  Indirect target selection: Mitigation; Aligned branch/return thunks
  Itlb multihit:             KVM: Mitigation: VMX disabled
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling:    Not affected
  Retbleed:                  Mitigation; Enhanced IBRS
  Spec rstack overflow:      Not affected
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Mitigation; TSX disabled
  Vmscape:                   Mitigation; IBPB before exit to userspace

</details>

Extra Info

Wasmtime GitHub notifications bot (May 08 2026 at 13:08):

bongjunj added the bug label to Issue #13325.

Wasmtime GitHub notifications bot (May 08 2026 at 13:08):

bongjunj added the cranelift label to Issue #13325.

Wasmtime GitHub notifications bot (May 12 2026 at 14:55):

tschneidereit commented on issue #13325:

cc @cfallin @fitzgen

Wasmtime GitHub notifications bot (May 12 2026 at 15:24):

cfallin commented on issue #13325:

Sorry, we had a long weekend (Fri/Mon off) during which this came in and I'm just catching up!

@bongjunj did you happen to bisect where the no-opt-vs-opt slowdown started? I can examine this if not, but thought I would ask first. (And, by the way, since this report looks auto-generated: if you're building tooling to do this, a bisection is the most useful bit of information for such issues -- thanks!)

Wasmtime GitHub notifications bot (May 13 2026 at 02:58):

bongjunj commented on issue #13325:

@cfallin Thanks for the comment!

I ran DDMin (delta-debugging minimization) on the ruleset, which takes a long time. It found that removing the following rules not only restores the performance of Opt relative to Base but makes Opt slightly faster:

Iteration: 10 times each
Base: 549317765.0 cycles
Opt: 547874348.3 cycles

(rule (simplify (iadd (fits_in_64 ty) (iconst ty (u64_from_imm64 k1)) (iconst ty (u64_from_imm64 k2)))) (subsume (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2)))))
(rule (simplify (iadd ty (isub ty x (iconst ty (u64_from_imm64 k1))) (iconst ty (u64_from_imm64 k2)))) (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty k @ (iconst ty _) x)) (iadd ty x k))
(rule (simplify (sshr ty x (uextend _ y))) (sshr ty x y))
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k))) (subsume (iconst_u wide k)))

You can see our base version here: https://github.com/prosyslab/wasmtime/commit/3149daa696be7f47b3464b82e8a14beea88c4a46,
and the compared version at https://github.com/prosyslab/wasmtime/commit/9b3e094ab56a3432e142408d258466b5d0596221
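
As a rough illustration of what the first two rules listed above compute, here is a Python sketch of the constant arithmetic. It assumes `u64_wrapping_add` and `u64_wrapping_sub` are 64-bit wrapping operations and that `imm64_masked` truncates the result to the bit width of `ty`, which is how the names read. The Python helpers are hypothetical stand-ins, not Cranelift APIs.

```python
MASK64 = (1 << 64) - 1

def u64_wrapping_add(a: int, b: int) -> int:
    return (a + b) & MASK64

def u64_wrapping_sub(a: int, b: int) -> int:
    return (a - b) & MASK64

def imm64_masked(bits: int, value: int) -> int:
    """Keep only the low `bits` bits, as a constant of that width would."""
    return value & ((1 << bits) - 1)

# Rule 1: iadd(iconst k1, iconst k2) -> iconst(k1 + k2), wrapped to the type width.
def fold_iadd_consts(bits: int, k1: int, k2: int) -> int:
    return imm64_masked(bits, u64_wrapping_add(k1, k2))

# Rule 2: (x - k1) + k2 -> x + (k2 - k1), with the new constant wrapped likewise.
def reassociated_const(bits: int, k1: int, k2: int) -> int:
    return imm64_masked(bits, u64_wrapping_sub(k2, k1))

# Compare rule 2 against direct evaluation for a 32-bit value.
bits, x, k1, k2 = 32, 7, 10, 3
direct = imm64_masked(bits, (x - k1) + k2)
rewritten = imm64_masked(bits, x + reassociated_const(bits, k1, k2))
print(direct == rewritten)
```

Both rewrites preserve the value modulo the type width. Any performance effect would therefore come from how the reshaped expression is lowered, not from a change in semantics.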

Wasmtime GitHub notifications bot (May 13 2026 at 03:02):

bongjunj edited a comment on issue #13325:

@cfallin Thanks for the comment!

I ran DDMin (delta-debugging minimization) on the mid-end ruleset, which took a long time. It found that removing the following rules not only restores the performance of Opt relative to Base but makes Opt slightly faster:

Iteration: 10 times each
Base: 549317765.0 cycles
Opt: 547874348.3 cycles

(rule (simplify (iadd (fits_in_64 ty) (iconst ty (u64_from_imm64 k1)) (iconst ty (u64_from_imm64 k2)))) (subsume (iconst ty (imm64_masked ty (u64_wrapping_add k1 k2)))))
(rule (simplify (iadd ty (isub ty x (iconst ty (u64_from_imm64 k1))) (iconst ty (u64_from_imm64 k2)))) (iadd ty x (iconst ty (imm64_masked ty (u64_wrapping_sub k2 k1)))))
(rule (simplify (iadd ty k @ (iconst ty _) x)) (iadd ty x k))
(rule (simplify (sshr ty x (uextend _ y))) (sshr ty x y))
(rule (simplify (uextend (fits_in_64 wide) (iconst_u narrow k))) (subsume (iconst_u wide k)))

You can see our base version here: https://github.com/prosyslab/wasmtime/commit/3149daa696be7f47b3464b82e8a14beea88c4a46,
and the compared version at https://github.com/prosyslab/wasmtime/commit/9b3e094ab56a3432e142408d258466b5d0596221

My guess is that constant folding can sometimes break assumptions that the backends make (for example, about what the IR should look like in order to emit ideal x64 assembly). Furthermore, since the suspected rules involve add/sub, the regression could be related to address or loop induction variable computations, which could in turn affect branch/jump machine instructions.
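
For readers unfamiliar with DDMin, the following is a minimal sketch of the classic delta-debugging minimization loop, not the tooling used for this report. The rule names and the `regresses` predicate are hypothetical. In a real setup the predicate would rebuild Cranelift with only the given rules enabled and benchmark it, which is why each step is expensive.

```python
from typing import Callable, List, Sequence

def ddmin(items: Sequence[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `items` to a small subset for which `fails` still holds."""
    current = list(items)
    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        # First try each chunk on its own.
        for subset in subsets:
            if fails(subset):
                current, n, reduced = subset, 2, True
                break
        # Otherwise try removing one chunk at a time (assumes unique items).
        if not reduced:
            for subset in subsets:
                complement = [x for x in current if x not in subset]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # No progress: refine the granularity, or stop at single items.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current

# Toy predicate: the "regression" appears only when r3 and r7 are both present.
rules = [f"r{i}" for i in range(10)]
regresses = lambda subset: "r3" in subset and "r7" in subset
culprits = ddmin(rules, regresses)
print(culprits)
```

The number of predicate evaluations grows with the size of the ruleset, and here each evaluation would be a full build plus benchmark run.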

Wasmtime GitHub notifications bot (May 17 2026 at 06:25):

cfallin commented on issue #13325:

@bongjunj I wasn't able to reproduce this on main. It looks like your tested commits branch off of upstream around March 23. Would you be able to check on latest main and see if you still observe the issue?

Wasmtime GitHub notifications bot (May 17 2026 at 12:30):

bongjunj commented on issue #13325:

I'm still observing this issue.
I tested the upstream version (https://github.com/bytecodealliance/wasmtime/commit/3df8ce1b6ea12db4f3946a226e41b87a58a50d9d) and the no-opt version (https://github.com/bongjunj/wasmtime/commit/43c5ff3dab0b7a12ed2755a71341e4f9e59cc6ab) with sightglass (789ac095):

> cargo run --release benchmark benchmarks/spidermonkey/spidermonkey-json.wasm \
  -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so \
   v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so \
  --iterations-per-process=20

    Finished `release` profile [optimized] target(s) in 0.18s
     Running `target/release/sightglass-cli benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=20`

Running 400 total iterations (2 engines * 1 benchmarks * 10 processes * 20 iterations per process)

[Done] [Elapsed    ] [Est. Rem.  ]
[  5%] [00h:00m:16s] [00h:05m:04s] ..........
[ 55%] [00h:02m:59s] [00h:02m:26s] ..........

Finished benchmarking in 00h:05m:26s

execution :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 37018177.31 ± 13458245.59 (confidence = 99%)

  v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.05x to 1.11x faster than v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [448576882 478089691.72 654247940] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [485508612 515107869.03 659251240] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 39425286.83 ± 22027872.08 (confidence = 99%)

  v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.01x to 1.04x faster than v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [1367308154 1609211474.82 1915779638] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [1370695568 1569786187.99 1861128120] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [647484 731867.46 1145284] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [657732 734917.52 1119248] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

Compared to the past versions, the regression is reduced, but this experiment shows that it still persists.

Wasmtime GitHub notifications bot (May 17 2026 at 12:37):

bongjunj edited a comment on issue #13325:

@cfallin I'm still observing this issue.
I tested the upstream version (https://github.com/bytecodealliance/wasmtime/commit/3df8ce1b6ea12db4f3946a226e41b87a58a50d9d) and the no-opt version (https://github.com/bongjunj/wasmtime/commit/43c5ff3dab0b7a12ed2755a71341e4f9e59cc6ab) with sightglass (789ac095):

> cargo run --release benchmark benchmarks/spidermonkey/spidermonkey-json.wasm \
  -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so \
   v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so \
  --iterations-per-process=20

    Finished `release` profile [optimized] target(s) in 0.18s
     Running `target/release/sightglass-cli benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=20`

Running 400 total iterations (2 engines * 1 benchmarks * 10 processes * 20 iterations per process)

[Done] [Elapsed    ] [Est. Rem.  ]
[  5%] [00h:00m:16s] [00h:05m:04s] ..........
[ 55%] [00h:02m:59s] [00h:02m:26s] ..........

Finished benchmarking in 00h:05m:26s

execution :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 37018177.31 ± 13458245.59 (confidence = 99%)

  v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.05x to 1.11x faster than v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [448576882 478089691.72 654247940] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [485508612 515107869.03 659251240] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 39425286.83 ± 22027872.08 (confidence = 99%)

  v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.01x to 1.04x faster than v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [1367308154 1609211474.82 1915779638] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [1370695568 1569786187.99 1861128120] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [647484 731867.46 1145284] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [657732 734917.52 1119248] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

> cargo run --release benchmark benchmarks/spidermonkey/spidermonkey-json.wasm \
  -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so \
  v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so \
  --iterations-per-process=100 --processes=1

execution :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 37230369.24 ± 16851855.78 (confidence = 99%)

  v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.04x to 1.12x faster than v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [448962628 469879647.64 604456052] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [485956342 507110016.88 655267842] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [654016 704229.52 1043212] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [645674 687170.24 839154] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [1394066524 1576138899.08 1912092254] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [1384130466 1559159070.84 1685777032] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

Compared to the past versions, the regression is reduced, but this experiment shows that it still persists.

Wasmtime GitHub notifications bot (May 17 2026 at 12:46):

bongjunj edited a comment on issue #13325:

@cfallin I'm still observing this issue.
I tested the upstream version (https://github.com/bytecodealliance/wasmtime/commit/3df8ce1b6ea12db4f3946a226e41b87a58a50d9d) and the no-opt version (https://github.com/bongjunj/wasmtime/commit/43c5ff3dab0b7a12ed2755a71341e4f9e59cc6ab) with sightglass (789ac095):

> cargo run --release benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=10
    Finished `release` profile [optimized] target(s) in 0.18s
     Running `target/release/sightglass-cli benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e base-260517/wasmtime/target/release/libwasmtime_bench_api.so ntal/v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=10`

Running 200 total iterations (2 engines * 1 benchmarks * 10 processes * 10 iterations per process)

[Done] [Elapsed    ] [Est. Rem.  ]
[  5%] [00h:00m:08s] [00h:02m:32s] ..........
[ 55%] [00h:01m:32s] [00h:01m:15s] ..........

Finished benchmarking in 00h:02m:48s

execution :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 47061945.96 ± 7156475.56 (confidence = 99%)

  v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.07x to 1.09x faster than v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [581745126 593119831.98 690791320] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [631312126 640181777.94 739371462] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 33776664.00 ± 26492776.81 (confidence = 99%)

  v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.00x to 1.04x faster than v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [1311399220 1534236565.90 1730050064] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [1344967908 1500459901.90 1699803632] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [735834 808589.04 1171376] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [722858 795706.08 2101592] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

> cargo run --release benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=100 --processes=1
    Finished `release` profile [optimized] target(s) in 0.18s
     Running `target/release/sightglass-cli benchmark benchmarks/spidermonkey/spidermonkey-json.wasm -e base-260517/wasmtime/target/release/libwasmtime_bench_api.so ntal/v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so --iterations-per-process=100 --processes=1`

execution :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 46004063.16 ± 2685004.67 (confidence = 99%)

  v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.07x to 1.08x faster than v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [581724878 585888810.80 645912744] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [627163556 631892873.96 652945300] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

compilation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  Δ = 45072400.40 ± 25749426.66 (confidence = 99%)

  v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so is 1.01x to 1.05x faster than v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so!

  [1333486888 1503053131.78 1810715182] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [1299260868 1457980731.38 1840198928] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

instantiation :: cycles :: benchmarks/spidermonkey/spidermonkey-json.wasm

  No difference in performance.

  [732578 768323.40 1182792] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
  [751076 777308.76 969436] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so

Compared to the past versions, the regression is reduced, but this experiment shows that it still persists.
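
The "1.07x to 1.08x faster" ranges in the output above appear consistent with dividing the interval Δ ± half-width by the faster engine's mean. The sketch below uses that formula, which is inferred from the printed numbers and not taken from the sightglass source. The function name is hypothetical.

```python
def speedup_interval(delta: float, half_width: float, faster_mean: float):
    """Return (low, high) speedup factors of the faster engine (assumed formula)."""
    low = 1.0 + (delta - half_width) / faster_mean
    high = 1.0 + (delta + half_width) / faster_mean
    return low, high

# Execution figures from the --processes=1 run above:
# delta, half-width of the 99% interval, and the v-base mean.
low, high = speedup_interval(46004063.16, 2685004.67, 585888810.80)
print(f"{low:.2f}x to {high:.2f}x")
```

With these inputs the interval rounds to the reported 1.07x to 1.08x. The 10-process run above (Δ = 47061945.96 ± 7156475.56, mean 593119831.98) rounds to 1.07x to 1.09x under the same formula.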

Wasmtime GitHub notifications bot (May 17 2026 at 18:36):

cfallin commented on issue #13325:

@bongjunj on my machine (AMD Zen 5 core, Ryzen 9 9950X), I see current main run spidermonkey-json.wasm in 252M cycles vs. with no simplify rules at all, 257M cycles.

Digging a bit further into the compiled-code instruction mix, one hypothesis: there are more LEAs in the version compiled with all rules, and the above rules do shuffle adds/subs around in a way that could allow us to use more complex LEAs in place of raw (destructive-source / two-register) add/sub. On your CPU it's possible that not all of these are fast. Could you try running perf with the following events: perf stat -e cycles,instructions,uops_retired.slow_lea and see how slow_lea differs before and after?

You could also try removing this x64 backend rule (I think that's the only relevant one) and then test with/without simplify rules and see if that does anything interesting. It's possible that our use of LEA could use tuning, if it does.
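
The before/after counter comparison suggested above can be scripted. This is a minimal sketch that assumes `perf stat` prints comma-grouped counts followed by the event name on each line. The helper names are hypothetical, and the sample text is made up purely to exercise the parser. It is not measured data.

```python
import re

def parse_perf_stat(text: str) -> dict:
    """Map event name -> count from lines like '  1,234,567      cycles'."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\d,]+)\s+([\w.\-:]+)", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts

def relative_change(before: dict, after: dict) -> dict:
    """Percent change per event that appears in both runs."""
    return {ev: (after[ev] - before[ev]) / before[ev] * 100.0
            for ev in before if ev in after and before[ev]}

# Made-up sample output, for illustration only.
before_text = """
     1,000,000      cycles
       900,000      instructions
        10,000      uops_retired.slow_lea
"""
after_text = """
     1,100,000      cycles
       900,000      instructions
        15,000      uops_retired.slow_lea
"""

changes = relative_change(parse_perf_stat(before_text), parse_perf_stat(after_text))
for event, pct in changes.items():
    print(f"{event}: {pct:+.2f}%")
```

Comparing the slow_lea change against the cycles change gives a quick sense of whether the LEA count moves enough to plausibly account for the slowdown.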

Wasmtime GitHub notifications bot (May 18 2026 at 04:29):

bongjunj edited a comment on issue #13325:

@cfallin

Removing the rule directly causes a compilation error:

Caused by:
    Compilation error: Unsupported feature: should be implemented in ISLE: inst = `v199 = iadd.i32 v197, v198  ; v198 = -1`, type = `Some(types::I32)`

So instead I changed the rule to emit an x64 add rather than an lea, producing v-main-260518-fix-lea as shown below:

 (rule iadd_base_case_32_or_64_lea -5 (lower (has_type (ty_32_or_64 ty) (iadd _ x y)))
-      (x64_lea ty (to_amode_add (mem_flags_trusted) x y (zero_offset))))
+         (x64_add ty x y))
+;;       (x64_lea ty (to_amode_add (mem_flags_trusted) x y (zero_offset))))

Now we have three Cranelift variants: v-base-260517, v-main-260517, and v-main-260518-fix-lea.

The results are:

Running 300 total iterations (3 engines * 1 benchmarks * 10 processes * 10 iterations per process)

[Done] [Elapsed    ] [Est. Rem.  ]
[  3%] [00h:00m:09s] [00h:04m:21s] ..........
[ 37%] [00h:01m:39s] [00h:02m:51s] ..........
[ 70%] [00h:03m:12s] [00h:01m:22s] ..........

Finished benchmarking in 00h:04m:36s
compilation
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [1489826616 1706746668.00 1826139786] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [1472520810 1679407701.04 1870553120] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [1470006216 1687113788.56 1859469330] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so
instantiation
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [805008 838953.82 890184] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [801530 856876.24 925478] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [789654 846866.08 925892] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so
execution
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [640451418 649789176.14 687099432] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [692710300 701275360.92 749310260] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [590349128 595251905.74 634437796] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so

The performance of Main is now faster.

Plus, the performance counter shows that the slow_lea count increases in Main compare to Base:
Note: I could not find the slow_lea retired counter. I used issued counter instead.

Base

 5,312,391,306,049      cycles
 5,038,474,154,796      instructions                     #    0.95  insn per cycle
    50,718,194,166      uops_issued.slow_lea

Main

 5,363,341,529,486      cycles
 5,046,667,045,928      instructions                     #    0.94  insn per cycle
    50,791,375,446      uops_issued.slow_lea

MainFIxLea

 5,378,993,344,357      cycles
 5,143,450,545,303      instructions                     #    0.96  insn per cycle
    42,277,753,592      uops_issued.slow_lea

view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 04:30):

bongjunj edited a comment on issue #13325:

@cfallin

Removing the rule directly causes a compilation error:

Caused by:
    Compilation error: Unsupported feature: should be implemented in ISLE: inst = `v199 = iadd.i32 v197, v198  ; v198 = -1`, type = `Some(types::I32)`

So I changed the rule to produce an x64 add instead of lea, which gives v-main-260518-fix-lea as below:

 (rule iadd_base_case_32_or_64_lea -5 (lower (has_type (ty_32_or_64 ty) (iadd _ x y)))
-      (x64_lea ty (to_amode_add (mem_flags_trusted) x y (zero_offset))))
+         (x64_add ty x y))
+;;       (x64_lea ty (to_amode_add (mem_flags_trusted) x y (zero_offset))))

Now we have three Cranelift variants: Base (v-base-260517), Main (v-main-260517), and MainFixLea (v-main-260518-fix-lea).

The results are now:

Running 300 total iterations (3 engines * 1 benchmarks * 10 processes * 10 iterations per process)

[Done] [Elapsed    ] [Est. Rem.  ]
[  3%] [00h:00m:09s] [00h:04m:21s] ..........
[ 37%] [00h:01m:39s] [00h:02m:51s] ..........
[ 70%] [00h:03m:12s] [00h:01m:22s] ..........

Finished benchmarking in 00h:04m:36s
compilation
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [1489826616 1706746668.00 1826139786] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [1472520810 1679407701.04 1870553120] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [1470006216 1687113788.56 1859469330] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so
instantiation
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [805008 838953.82 890184] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [801530 856876.24 925478] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [789654 846866.08 925892] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so
execution
  benchmarks/spidermonkey/spidermonkey-json.wasm
    cycles
      [640451418 649789176.14 687099432] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [692710300 701275360.92 749310260] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
      [590349128 595251905.74 634437796] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so

With the fix applied (v-main-260518-fix-lea), Main now executes faster than both Base and unmodified Main.
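
To make the comparison above easier to read, here is a minimal Python sketch that parses the three execution lines and prints each variant's mean cycle count relative to Base. It assumes each bracket holds `[min mean max]` cycle counts; the helper names `mean_cycles` and `change_vs` are made up for illustration and are not part of Sightglass.

```python
import re

# Execution-phase lines copied from the benchmark output above.
# Assumption: each bracket holds [min mean max] cycle counts.
RAW = """\
[640451418 649789176.14 687099432] v-base-260517/wasmtime/target/release/libwasmtime_bench_api.so
[692710300 701275360.92 749310260] v-main-260517/wasmtime/target/release/libwasmtime_bench_api.so
[590349128 595251905.74 634437796] v-main-260518-fix-lea/wasmtime/target/release/libwasmtime_bench_api.so
"""

# Captures min, mean, max, and the leading directory name of the engine path.
LINE = re.compile(r"\[(\d+) ([\d.]+) (\d+)\] ([^/]+)/")


def mean_cycles(raw):
    """Map each variant directory name to its mean cycle count."""
    means = {}
    for line in raw.splitlines():
        m = LINE.match(line.strip())
        if m:
            means[m.group(4)] = float(m.group(2))
    return means


def change_vs(means, baseline):
    """Percent change in mean cycles relative to `baseline` (negative = fewer cycles)."""
    base = means[baseline]
    return {name: 100.0 * (value - base) / base for name, value in means.items()}


means = mean_cycles(RAW)
changes = change_vs(means, "v-base-260517")
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}% cycles vs base")
```

By this arithmetic, unmodified Main spends roughly 7.9% more cycles than Base in execution, while the fix-lea variant spends roughly 8.4% fewer.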

Also, the performance counters show that the slow_lea count is higher in Main than in Base:
Note: I could not find a retired-µop counter for slow_lea, so I used the issued counter (uops_issued.slow_lea) instead.

Base

 5,312,391,306,049      cycles
 5,038,474,154,796      instructions                     #    0.95  insn per cycle
    50,718,194,166      uops_issued.slow_lea

Main

 5,363,341,529,486      cycles
 5,046,667,045,928      instructions                     #    0.94  insn per cycle
    50,791,375,446      uops_issued.slow_lea

MainFixLea

 5,378,993,344,357      cycles
 5,143,450,545,303      instructions                     #    0.96  insn per cycle
    42,277,753,592      uops_issued.slow_lea

I changed the rule so that it never produces lea, so that the effect shows up clearly in the MainFixLea result.
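
The counter readings above can be put on a common scale with a few lines of Python. This is only a sketch over the numbers quoted in this comment; the dictionary layout and the helper names (`ipc`, `slow_lea_per_kilo_insn`, `pct_change`) are assumptions made for illustration.

```python
# Counter values copied from the perf output above.
COUNTERS = {
    "Base": {
        "cycles": 5_312_391_306_049,
        "instructions": 5_038_474_154_796,
        "slow_lea": 50_718_194_166,
    },
    "Main": {
        "cycles": 5_363_341_529_486,
        "instructions": 5_046_667_045_928,
        "slow_lea": 50_791_375_446,
    },
    "MainFixLea": {
        "cycles": 5_378_993_344_357,
        "instructions": 5_143_450_545_303,
        "slow_lea": 42_277_753_592,
    },
}


def ipc(c):
    """Instructions per cycle, as printed by perf next to `instructions`."""
    return c["instructions"] / c["cycles"]


def slow_lea_per_kilo_insn(c):
    """Issued slow-LEA uops per 1000 instructions."""
    return 1000.0 * c["slow_lea"] / c["instructions"]


def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


base = COUNTERS["Base"]
for name, c in COUNTERS.items():
    delta = pct_change(c["slow_lea"], base["slow_lea"])
    print(
        f"{name}: ipc={ipc(c):.2f} "
        f"slow_lea/1k insn={slow_lea_per_kilo_insn(c):.2f} "
        f"slow_lea vs Base={delta:+.2f}%"
    )
```

On these numbers the slow_lea count in Main is only about 0.14% above Base, while MainFixLea is about 16.6% below Base, so the counter separates the ADD variant from the other two much more clearly than it separates Main from Base.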

view this post on Zulip Wasmtime GitHub notifications bot (May 18 2026 at 19:19):

cfallin commented on issue #13325:

Interesting! I've had suspicions around our use of lea for adds in the past -- this is pretty good evidence that we need to be more careful with our microarchitectural performance assumptions. LEA is attractive from a regalloc perspective (it does not clobber either source) and can also fold in a 1/2/4/8-wise scale, so the tradeoff isn't completely one-sided. But if you want to run more widespread benchmarks (i.e., all of Sightglass) on your no-LEA branch, and that data shows an overall improvement, I'd be happy to approve a PR for it. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 04:30):

bongjunj commented on issue #13325:

I've noticed that the effect of the fix depends on the CPU.
On the Xeon 6226, the fix recovers the performance regression, as previously discussed,
whereas on the Xeon 6326 the fix makes performance worse than unmodified upstream.

| Server | Benchmark | Base | Upstream | Upstream Ratio | Upstream Fix | Upstream Fix Ratio |
|---|---|---|---|---|---|---|
| Xeon 6226 | spidermonkey/spidermonkey-json.wasm | 646362269.2 | 698644064.9 | -8.09% | 595386696.3 | 7.89% |
| Xeon 6326 | spidermonkey/spidermonkey-json.wasm | 535991987.7 | 513818633.3 | 4.14% | 521378279.5 | 2.73% |

(sorry that the table format changes with every comment, I'm juggling my tooling at the moment)

I think the data shows that the current fix (just overriding lea with add) is too blunt.
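
The ratio columns in the table appear to follow `(base - value) / base`, so a positive ratio means fewer cycles than Base. The Python sketch below recomputes them under that assumption and also compares Upstream Fix directly against Upstream, which is the comparison the "too blunt" conclusion rests on. The values are assumed to be execution cycle counts (lower is better), and the function names are illustrative only.

```python
# Values copied from the table above; assumed to be execution cycles.
ROWS = {
    "Xeon 6226": {"base": 646362269.2, "upstream": 698644064.9, "fix": 595386696.3},
    "Xeon 6326": {"base": 535991987.7, "upstream": 513818633.3, "fix": 521378279.5},
}


def ratio(reference, value):
    """Percent improvement of `value` over `reference` (positive = fewer cycles)."""
    return 100.0 * (reference - value) / reference


for server, row in ROWS.items():
    upstream_vs_base = ratio(row["base"], row["upstream"])
    fix_vs_base = ratio(row["base"], row["fix"])
    fix_vs_upstream = ratio(row["upstream"], row["fix"])
    print(
        f"{server}: upstream {upstream_vs_base:+.2f}%, "
        f"fix {fix_vs_base:+.2f}% (vs base), "
        f"fix {fix_vs_upstream:+.2f}% (vs upstream)"
    )
```

Computed this way, replacing LEA with ADD saves roughly 14.8% of cycles relative to Upstream on the Xeon 6226 but costs roughly 1.5% on the Xeon 6326.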

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 12:13):

bongjunj commented on issue #13325:

Interesting fact: Xeon 6326 has lower cycles per instruction (CPI) for LEA (base + index + displacement).
https://uops.info/html-instr/LEA_B_I_D32_R64.html?utm_source=chatgpt.com#ICL shows the difference of the LEA µop between Ice Lake (Xeon 6326) and Cascade Lake (Xeon 6226). Ice Lake has reportedly much better performance for the µop, which explains the performance drop when ADD is generated instead of LEA.
Side note: there is no difference on LEA (base + index * scale) -- see https://uops.info/html-instr/LEA_B_IS_R64.html?utm_source=chatgpt.com#ICL

And probably (I do not know the exact µop executed on your CPU), the reason your CPU did not reproduce my case is that the CPI is similar for LEA and ADD:

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 15:57):

cfallin commented on issue #13325:

So I guess the next major compiler-evolution step, to optimize to this level, would be microarchitectural modeling to take into account these differing latencies. There are a few levels of that: simply altering cost depending on target microarchitecture, or possibly even modeling the "pipeline fill", cycles when data will become available, etc. LLVM for example has detailed microarchitectural models it will use for instruction selection and scheduling if you specify.

I don't think it's within scope for Cranelift at the moment to develop similarly detailed microarchitectural models; we don't have a large team of folks to develop and maintain such infrastructure.

So the choice that is left is: use LEA for adds wherever possible, or use ADD for adds. I agree that's pretty blunt, but note the data could be used both ways depending on our desired policy outcome: we could choose to optimize for "the most modern possible CPU" (conclusion: keep LEA), "a reasonable baseline x86-64" (conclusion: choose ADD), or some weighted average of a population of users and take whichever based on the objective tradeoff.

IMHO, without detailed microarchitectural modeling, it's risky to use "weird" instruction choices, i.e. outside of their usual purpose, because they happen to be faster on some modern CPUs. It's also an unfortunate bias to make Cranelift fast on modern and expensive CPUs at the expense of a large existing user population. Those modern and expensive CPUs can also tolerate small (few-percent) slowdowns on some benchmarks relative to the best possible code for that CPU, because the CPU is overall faster. So I think all of this biases toward the "straightforward" instruction selection, that is, using ADD for adds always.
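
As a rough illustration of the "weighted average of a population of users" option, the Python sketch below weights the per-CPU effect of switching from LEA to ADD by a share of users on each CPU. The measurements are the Upstream and Upstream Fix values from the earlier per-CPU table (assumed to be execution cycles for a single benchmark); the population shares are entirely hypothetical and exist only to show that the conclusion can flip with the weighting.

```python
# Upstream (LEA) and Upstream Fix (ADD) values from the earlier per-CPU table.
# Assumption: these are execution cycles, so lower is better.
MEASURED = {
    "Xeon 6226": {"lea": 698644064.9, "add": 595386696.3},
    "Xeon 6326": {"lea": 513818633.3, "add": 521378279.5},
}


def add_vs_lea_change(m):
    """Percent change in cycles when ADD replaces LEA (negative = ADD is faster)."""
    return 100.0 * (m["add"] - m["lea"]) / m["lea"]


def weighted_change(weights):
    """Population-weighted mean of the per-CPU change; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * add_vs_lea_change(MEASURED[cpu]) for cpu, w in weights.items())


def policy(weights):
    """Pick the instruction that lowers the weighted cycle count."""
    return "ADD" if weighted_change(weights) < 0 else "LEA"


# Hypothetical user populations, NOT measured data.
EVEN_SPLIT = {"Xeon 6226": 0.5, "Xeon 6326": 0.5}
MOSTLY_NEW = {"Xeon 6226": 0.05, "Xeon 6326": 0.95}

for label, weights in (("even split", EVEN_SPLIT), ("mostly new", MOSTLY_NEW)):
    print(f"{label}: {weighted_change(weights):+.2f}% -> {policy(weights)}")
```

With an even split the weighted result favors ADD; with a population that is almost entirely on the newer CPU it favors LEA. A real decision would need more than one benchmark and more than two CPUs.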

Curious what others think though too -- cc @fitzgen @alexcrichton

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 16:20):

fitzgen commented on issue #13325:

> I don't think it's within scope for Cranelift at the moment to develop similarly detailed microarchitectural models; we don't have a large team of folks to develop and maintain such infrastructure.

:100:

> IMHO, without detailed microarchitectural modeling, it's risky to use "weird" instruction choices, i.e. outside of their usual purpose, because they happen to be faster on some modern CPUs. It's also an unfortunate bias to make Cranelift fast on modern and expensive CPUs at the expense of a large existing user population. Those modern and expensive CPUs can also tolerate small (few-percent) slowdowns on some benchmarks relative to the best possible code for that CPU, because the CPU is overall faster. So I think all of this biases toward the "straightforward" instruction selection, that is, using ADD for adds always.

Agreed.

... But I also wonder if there are some simple heuristics we could use to still emit lea beneficially sometimes? I know that going too far down this road brings us back to microarch models, which is definitely too far for us with our present maintainership, but maybe there is something simple we could do? I know at one point we emitted add at encoding time when one of the operand registers was the same as the destination register, otherwise lea. Perhaps there is something similarly simple but less aggressive in emitting lea?
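
The earlier encoding-time heuristic mentioned above (emit add when one operand register is also the destination, otherwise lea) can be sketched in a few lines of Python. This is only a model of the decision as described in this comment, not Cranelift's actual emission code; the function name and the string-based register names are hypothetical.

```python
def select_add_encoding(dst, src1, src2):
    """Choose between a two-operand add and a three-operand lea.

    `add` overwrites its first operand, so it only fits when the destination
    register already holds one of the sources. Otherwise `lea` computes
    src1 + src2 into a separate destination without clobbering either source.
    """
    if dst == src1:
        return f"add {dst}, {src2}"
    if dst == src2:
        return f"add {dst}, {src1}"
    return f"lea {dst}, [{src1} + {src2}]"


print(select_add_encoding("rax", "rax", "rbx"))
print(select_add_encoding("rax", "rbx", "rax"))
print(select_add_encoding("rcx", "rax", "rbx"))
```

Because the decision depends on which registers the allocator assigned, it has to be made late, at encoding time, rather than in the lowering rules.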

I guess maybe this is just adding rules (which I assumed we had but it seems we don't AFAICT?) to fold combinations of 64-bit adds and multiplies/shifts and such together into a single lea? That is, something like

;; Match when we can turn a `base + scale * index + displacement` into an `lea`, e.g.:
;;
;;     lea     dest, dword ptr [base + scale * index + displacement]
(rule (iadd $I64 base
                 (iadd $I64 (imul $I64 (iconst_u $I64 scale) index)
                            (iconst_u $I64 displacement)))
      (if let $true (encodeable_as_lea ...))
      (x64_lea ...))
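
The arguments to `encodeable_as_lea` are elided in the sketch above, so the following Python is only one guess at what such a check might test: that the scale is one an x86-64 addressing mode can express (1, 2, 4, or 8) and that the displacement fits in a signed 32-bit field. The function signature is hypothetical.

```python
def encodable_as_lea(scale, displacement):
    """Return True if `base + scale * index + displacement` fits one x86-64
    addressing mode: scale in {1, 2, 4, 8} and a signed 32-bit displacement."""
    return scale in (1, 2, 4, 8) and -(2**31) <= displacement < 2**31


print(encodable_as_lea(4, 16))      # scale and displacement both fit
print(encodable_as_lea(3, 16))      # 3 is not an encodable scale
print(encodable_as_lea(8, 2**31))   # displacement is one past the signed 32-bit maximum
```

Anything that fails this check would have to fall through to separate multiply/shift and add instructions.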

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 16:36):

tschneidereit commented on issue #13325:

Importing LLVM's models isn't a reasonable thing to do, I assume?

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2026 at 16:57):

cfallin commented on issue #13325:

> I know at one point we emitted add at encoding time when one of the operand registers was the same as the destination register, otherwise lea. Perhaps there is something similarly simple but less aggressive in emitting lea?

Perhaps yeah -- less a dynamic late decision wrt regalloc and perhaps more related to the exact shape of a "slow LEA". E.g. I think what's going on is that LEAs with larger displacements only get to use one dispatch port on less-recent CPUs (versus four in parallel for ADDs and simple LEAs) which is why they bottleneck. I'm not certain about that though.
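
To see why a single dispatch port would bottleneck, here is a toy Python bound under the hypothesis stated above (one port for a "slow LEA" versus four for ADD). The port counts are taken from this comment, which is itself uncertain about them, and the model ignores dependencies, latency, and everything else a real pipeline does.

```python
import math


def min_issue_cycles(num_ops, ports):
    """Lower bound on cycles to issue `num_ops` independent single-uop
    operations when each can go to any of `ports` ports, one per port per cycle."""
    return math.ceil(num_ops / ports)


N = 1000  # arbitrary number of independent operations
one_port = min_issue_cycles(N, 1)    # hypothesized slow-LEA case
four_ports = min_issue_cycles(N, 4)  # hypothesized ADD / simple-LEA case
print(one_port, four_ports)
```

Under this toy model the one-port case needs four times as many cycles for the same independent work, which is the shape of bottleneck the comment is describing.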

> Importing LLVM's models isn't a reasonable thing to do, I assume?

Definitely not -- (i) it's in TableGen, so now we have two problems; (ii) it's tied to LLVM's instruction definitions, which won't align with ours and are anyway in a different framework; (iii) a lot of it is algorithm that interprets the data, not just raw data, i.e. we'd still be signing up for building an out-of-order pipeline model. (Very fun work, for sure, but not something I think we can reasonably commit to!)


Last updated: Jun 01 2026 at 09:49 UTC