bongjunj opened issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`
....................

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin`
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
bongjunj edited issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin` is used.
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
cfallin commented on issue #12106:
Hi @bongjunj -- thanks for reporting this! The place I would start is profiling to find the hottest inner loops, then comparing the disassembly of these loops before-and-after. That should help us triage the overall kind of problem -- whether it is code scheduling, or bad regalloc, or poor choice of instruction, or something else...
bongjunj commented on issue #12106:
@cfallin thanks! working on the triage.
alexcrichton commented on issue #12106:
I can pretty easily reproduce this myself cherry-picking the commit @bongjunj mentioned. Sightglass is producing surprising results for me though. One run yields
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [165008 555680.24 1110872] less-isle.so
  [371539 548436.46 1682729] main.so
```
a few runs later is
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 147722.45 ± 92286.49 (confidence = 99%)
  main.so is 1.13x to 1.55x faster than less-isle.so!
  [167456 580352.29 1502815] less-isle.so
  [363903 432629.84 590110] main.so
```
passing `--pin` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 311980.36 ± 2354.27 (confidence = 99%)
  less-isle.so is 2.18x to 2.19x faster than main.so!
  [259525 263320.66 327692] less-isle.so
  [569582 575301.02 595577] main.so
```
running with `taskset 0x1` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 200594.18 ± 2696.25 (confidence = 99%)
  less-isle.so is 2.18x to 2.21x faster than main.so!
  [162200 167555.71 209294] less-isle.so
  [363482 368149.89 400345] main.so
```
I don't really know what to make of this. My machine has P and E cores which seems like it might be factoring in.
I made custom builds with jitdump turned on and ran with `perf record -k mono` plus `--iterations-per-process` cranked way up. All I got though was that `wasm[0]::function[8]` was a hot function with ~45% in the less-isle.so profile and ~61% in the main.so profile. It's a massive function with no apparent loops.
I don't know what's going on here myself.
alexcrichton edited a comment on issue #12106:
I can pretty easily reproduce this myself cherry-picking the commit @bongjunj mentioned. Sightglass is producing surprising results for me though. One run yields
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [165008 555680.24 1110872] less-isle.so
  [371539 548436.46 1682729] main.so
```
a few runs later is
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 147722.45 ± 92286.49 (confidence = 99%)
  main.so is 1.13x to 1.55x faster than less-isle.so!
  [167456 580352.29 1502815] less-isle.so
  [363903 432629.84 590110] main.so
```
passing `--pin` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 311980.36 ± 2354.27 (confidence = 99%)
  less-isle.so is 2.18x to 2.19x faster than main.so!
  [259525 263320.66 327692] less-isle.so
  [569582 575301.02 595577] main.so
```
running with `taskset 0x1` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 200594.18 ± 2696.25 (confidence = 99%)
  less-isle.so is 2.18x to 2.21x faster than main.so!
  [162200 167555.71 209294] less-isle.so
  [363482 368149.89 400345] main.so
```
I don't really know what to make of this. My machine has P and E cores which seems like it might be factoring in.
I made custom builds with jitdump turned on and ran with `perf record -k mono` plus `--iterations-per-process` cranked way up. All I got though was that `wasm[0]::function[8]` was a hot function with ~41% in the less-isle.so profile and ~61% in the main.so profile. It's a massive function with no apparent loops.
I don't know what's going on here myself.
alexcrichton commented on issue #12106:
Hm, well, the hot function is this one, which is a massive function with zero control flow. That being said, I could see egraphs deoptimizing this function in terms of instruction scheduling, since we've seen pessimistic results in the past, so this might be an example of overly-pessimistic instruction scheduling.
bongjunj commented on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
I think I got this, thanks!
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
Just running `perf` doesn't seem to work.
```
perf record -k mono target/release/sightglass-cli benchmark --engine=engines/wasmtime/libengine.so --engine-flags=" --profile=jitdump" --iterations-per-process=1 --measure=perf-counters benchmarks/blake3-scalar/benchmark.wasm
```
bongjunj commented on issue #12106:
I removed the sightglass_api dependency of the blake3-scalar benchmark,
and measured jitdump/perf following https://docs.wasmtime.dev/examples-profiling-perf.html runningwasmtime runcommand for each.perf record -k mono wasmtime run --dir=. --profile=jitdump rust-benchmark/target/wasm32-wasip1/release/benchmark.wasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.dataperf diff /data/blake3-scalar-main.perf.jit.data /data/blake3-scalar-noopts.perf.jit.data # Event 'cycles:P' # # Baseline Delta Abs Shared Object Symbol # ........ ......... .................... ............................................................................................ # 12.87% +7.98% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Context::run 6.50% -4.99% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath.part.0 3.91% +4.13% [kernel.kallsyms] [k] smp_call_function_many_cond 1.13% +2.82% ld-linux-x86-64.so.2 [.] 0x0000000000009f0c 5.75% -2.52% libc.so.6 [.] 0x000000000009c724 0.57% +1.74% [kernel.kallsyms] [k] next_uptodate_page 3.66% +1.73% wasmtime [.] std::sys::sync::mutex::futex::Mutex::lock_contended 0.59% +1.62% [kernel.kallsyms] [k] __handle_mm_fault 0.44% +1.52% [kernel.kallsyms] [k] native_write_msr 1.84% -1.46% [kernel.kallsyms] [k] _find_next_bit +1.44% [kernel.kallsyms] [k] __mod_lruvec_page_state 1.47% -1.39% [kernel.kallsyms] [k] zap_pte_range +1.21% [kernel.kallsyms] [k] update_curr +1.20% [kernel.kallsyms] [k] __remove_hrtimer +1.17% [kernel.kallsyms] [k] osq_unlock 1.27% -1.16% [kernel.kallsyms] [k] futex_wait +1.13% [kernel.kallsyms] [k] unmap_page_range 0.97% +1.10% [kernel.kallsyms] [k] restore_fpregs_from_fpstate 1.21% -1.09% [kernel.kallsyms] [k] llist_reverse_order +1.08% [kernel.kallsyms] [k] do_user_addr_fault 2.85% -1.05% [kernel.kallsyms] [k] osq_lock 0.03% +0.96% [kernel.kallsyms] [k] __x64_sys_futex +0.89% [kernel.kallsyms] [k] newidle_balance +0.89% [kernel.kallsyms] [k] rwsem_mark_wake 0.40% +0.89% [kernel.kallsyms] [k] delayacct_add_tsk +0.84% [kernel.kallsyms] [k] rb_first +0.74% [kernel.kallsyms] [k] memcg_slab_post_alloc_hook +0.73% [kernel.kallsyms] [k] sync_regs +0.72% [kernel.kallsyms] [k] change_protection_range +0.72% [kernel.kallsyms] [k] __mem_cgroup_charge 1.24% -0.71% [kernel.kallsyms] [k] update_sg_wakeup_stats +0.69% [kernel.kallsyms] [k] psi_group_change +0.69% [kernel.kallsyms] [k] try_charge_memcg +0.69% [kernel.kallsyms] [k] __memcg_kmem_charge_page +0.62% [kernel.kallsyms] [k] x64_sys_call 0.07% +0.60% [kernel.kallsyms] [k] find_vma +0.60% [kernel.kallsyms] [k] kfree +0.54% [kernel.kallsyms] [k] vm_area_dup 0.19% +0.53% [kernel.kallsyms] [k] __vma_adjust 0.59% -0.53% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.52% [kernel.kallsyms] [k] __srcu_read_unlock 0.62% -0.51% [kernel.kallsyms] [k] rwsem_down_write_slowpath 0.95% -0.50% libc.so.6 [.] malloc +0.49% wasmtime [.] tokio::runtime::blocking::pool::Spawner::spawn_task 0.83% +0.48% [kernel.kallsyms] [k] perf_event_alloc +0.47% [kernel.kallsyms] [k] __alloc_pages +0.46% wasmtime [.] HUF_decompress4X2_usingDTable_internal_fast +0.46% [kernel.kallsyms] [k] update_blocked_averages +0.46% wasmtime [.] <&[u8] as object::read::read_ref::ReadRef>::read_bytes_at_until +0.46% libc.so.6 [.] cfree +0.46% [kernel.kallsyms] [k] refill_obj_stock +0.46% [kernel.kallsyms] [k] ext4_do_update_inode.isra.0 +0.46% [kernel.kallsyms] [k] __ext4_get_inode_loc +0.45% [kernel.kallsyms] [k] kmem_cache_free +0.45% [kernel.kallsyms] [k] syscall_exit_to_user_mode +0.45% jitted-438731-20.so [.] 
blake3[e80d2cadb56cf0e0]::portable::compress_in_place +0.45% [kernel.kallsyms] [k] insert_work +0.43% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0 1.33% -0.41% [kernel.kallsyms] [k] llist_add_batch 0.92% +0.35% [kernel.kallsyms] [k] native_irq_return_iret 0.62% +0.31% wasmtime [.] ZSTD_decompressSequences_bmi2.constprop.0 0.39% +0.29% [kernel.kallsyms] [k] native_sched_clock 1.45% -0.29% [kernel.kallsyms] [k] clear_page_erms 0.95% -0.24% [kernel.kallsyms] [k] handle_mm_fault 0.47% -0.22% wasmtime [.] tokio::runtime::blocking::pool::Inner::run +0.21% [kernel.kallsyms] [k] update_load_avg 0.22% -0.20% [kernel.kallsyms] [k] exit_to_user_mode_loop 3.70% -0.19% [kernel.kallsyms] [k] update_sg_lb_stats +0.19% [kernel.kallsyms] [k] find_vma_prev 0.98% -0.19% [kernel.kallsyms] [k] its_return_thunk 0.53% -0.18% [kernel.kallsyms] [k] nmi_restore +0.17% [kernel.kallsyms] [k] __rseq_handle_notify_resume +0.16% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait 0.20% +0.15% [kernel.kallsyms] [k] __get_user_8 0.08% +0.15% [kernel.kallsyms] [k] __irqentry_text_end 0.14% -0.12% [kernel.kallsyms] [k] ghes_notify_nmi +0.12% [vdso] [.] 0x00000000000006e5 +0.12% [kernel.kallsyms] [k] flush_signal_handlers +0.11% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Core::maintenance +0.10% [kernel.kallsyms] [k] _raw_spin_lock +0.10% [kernel.kallsyms] [k] do_madvise.part.0 0.02% +0.10% [kernel.kallsyms] [k] futex_wait_queue_me +0.09% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait_timeout +0.09% [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.32% +0.08% [kernel.kallsyms] [k] finish_task_switch.isra.0 0.19% +0.08% [kernel.kallsyms] [k] exit_to_user_mode_prepare 0.03% +0.07% [kernel.kallsyms] [k] default_do_nmi 0.08% +0.07% [kernel.kallsyms] [k] switch_fpu_return 0.08% +0.07% libc.so.6 [.] syscall +0.07% [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack 0.64% -0.06% [kernel.kallsyms] [k] mutex_unlock +0.06% [kernel.kallsyms] [k] hrtimer_try_to_cancel.part.0 +0.06% [kernel.kallsyms] [k] sysvec_call_function +0.05% [kernel.kallsyms] [k] set_pte_vaddr_p4d +0.05% [kernel.kallsyms] [k] blkcg_maybe_throttle_current +0.05% [kernel.kallsyms] [k] native_set_fixmap 0.02% +0.04% [kernel.kallsyms] [k] nmi_handle.part.0 0.30% -0.04% [kernel.kallsyms] [k] native_flush_tlb_one_user 0.07% -0.04% [kernel.kallsyms] [k] native_flush_tlb_local 0.04% -0.04% [kernel.kallsyms] [k] native_apic_msr_write 0.11% +0.03% [kernel.kallsyms] [k] __perf_event_task_sched_in 0.01% +0.03% [kernel.kallsyms] [k] intel_pmu_handle_irq +0.02% [kernel.kallsyms] [k] do_futex 0.01% +0.01% [kernel.kallsyms] [k] sched_clock 0.88% +0.01% [kernel.kallsyms] [k] try_to_wake_up +0.01% [kernel.kallsyms] [k] perf_sample_event_took 0.04% -0.01% [kernel.kallsyms] [k] intel_bts_enable_local +0.01% [kernel.kallsyms] [k] end_repeat_nmi 0.01% +0.00% [kernel.kallsyms] [k] perf_event_nmi_handler 0.04% -0.00% [kernel.kallsyms] [k] acpi_os_read_memory 2.38% [kernel.kallsyms] [k] idle_cpu 1.40% [kernel.kallsyms] [k] kmem_cache_alloc 1.33% [kernel.kallsyms] [k] copy_page 1.23% wasmtime [.] core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2abf5669ee3b7a52 1.23% [kernel.kallsyms] [k] page_mapping 1.15% [kernel.kallsyms] [k] vma_link 1.13% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 1.10% [kernel.kallsyms] [k] native_apic_msr_eoi_write 1.09% [kernel.kallsyms] [k] clear_bhb_loop 1.06% wasmtime [.] 
<std::thread::spawnhook::SpawnHooks as core::ops::drop::Drop>::drop 1.00% [kernel.kallsyms] [k] load_balance 0.89% libc.so.6 [.] __madvise 0.88% [kernel.kallsyms] [k] alloc_thread_stack_node 0.85% wasmtime [.] std::sys::thread_local::guard::key::enable::run 0.82% [kernel.kallsyms] [k] task_h_load 0.79% [message truncated]
cfallin commented on issue #12106:
There is definitely something going on with function 8: the stack frame (grep for `sub $N, %rsp` in the disassembly) is 0x130 bytes if I disable all opt rewrites (remove the call to `constructor_simplify` in the egraph pass) and 0x6b0 bytes in the unmodified run. The result is that the function body has a whole bunch more spills/reloads. Something is causing a lot more register pressure; isel is a plausible hypothesis there. I'll dig a bit more...
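(Editorial aside: a hedged sketch of that frame-size check -- the helper and sample prologue below are made up for illustration; they are not code from this issue or from Cranelift.)
```rust
// Pull the frame-allocation size out of a disassembly dump by finding the
// prologue's `sub $N, %rsp`, the same thing the grep above looks for.
// This parser is deliberately simplistic and only handles AT&T-style `sub`.
fn frame_size(disasm: &str) -> Option<u64> {
    for line in disasm.lines() {
        let Some(rest) = line.trim().strip_prefix("sub") else { continue };
        let Some(imm) = rest.trim().strip_prefix('$') else { continue };
        let Some((imm, reg)) = imm.split_once(',') else { continue };
        if reg.trim() != "%rsp" {
            continue;
        }
        let imm = imm.trim();
        return match imm.strip_prefix("0x") {
            Some(hex) => u64::from_str_radix(hex, 16).ok(),
            None => imm.parse().ok(),
        };
    }
    None
}

fn main() {
    // Hypothetical prologue; with the opt rules disabled the frame was 0x130
    // bytes, versus 0x6b0 on unmodified main.
    let prologue = "push %rbp\nmov %rsp, %rbp\nsub $0x130, %rsp\n";
    assert_eq!(frame_size(prologue), Some(0x130));
    println!("frame size: {:#x} bytes", frame_size(prologue).unwrap());
}
```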
bongjunj deleted a comment on issue #12106:
I removed the sightglass_api dependency of the blake3-scalar benchmark,
and measured jitdump/perf following https://docs.wasmtime.dev/examples-profiling-perf.html runningwasmtime runcommand for each.perf record -k mono wasmtime run --dir=. --profile=jitdump rust-benchmark/target/wasm32-wasip1/release/benchmark.wasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.dataperf diff /data/blake3-scalar-main.perf.jit.data /data/blake3-scalar-noopts.perf.jit.data # Event 'cycles:P' # # Baseline Delta Abs Shared Object Symbol # ........ ......... .................... ............................................................................................ # 12.87% +7.98% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Context::run 6.50% -4.99% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath.part.0 3.91% +4.13% [kernel.kallsyms] [k] smp_call_function_many_cond 1.13% +2.82% ld-linux-x86-64.so.2 [.] 0x0000000000009f0c 5.75% -2.52% libc.so.6 [.] 0x000000000009c724 0.57% +1.74% [kernel.kallsyms] [k] next_uptodate_page 3.66% +1.73% wasmtime [.] std::sys::sync::mutex::futex::Mutex::lock_contended 0.59% +1.62% [kernel.kallsyms] [k] __handle_mm_fault 0.44% +1.52% [kernel.kallsyms] [k] native_write_msr 1.84% -1.46% [kernel.kallsyms] [k] _find_next_bit +1.44% [kernel.kallsyms] [k] __mod_lruvec_page_state 1.47% -1.39% [kernel.kallsyms] [k] zap_pte_range +1.21% [kernel.kallsyms] [k] update_curr +1.20% [kernel.kallsyms] [k] __remove_hrtimer +1.17% [kernel.kallsyms] [k] osq_unlock 1.27% -1.16% [kernel.kallsyms] [k] futex_wait +1.13% [kernel.kallsyms] [k] unmap_page_range 0.97% +1.10% [kernel.kallsyms] [k] restore_fpregs_from_fpstate 1.21% -1.09% [kernel.kallsyms] [k] llist_reverse_order +1.08% [kernel.kallsyms] [k] do_user_addr_fault 2.85% -1.05% [kernel.kallsyms] [k] osq_lock 0.03% +0.96% [kernel.kallsyms] [k] __x64_sys_futex +0.89% [kernel.kallsyms] [k] newidle_balance +0.89% [kernel.kallsyms] [k] rwsem_mark_wake 0.40% +0.89% [kernel.kallsyms] [k] delayacct_add_tsk +0.84% [kernel.kallsyms] [k] rb_first +0.74% [kernel.kallsyms] [k] memcg_slab_post_alloc_hook +0.73% [kernel.kallsyms] [k] sync_regs +0.72% [kernel.kallsyms] [k] change_protection_range +0.72% [kernel.kallsyms] [k] __mem_cgroup_charge 1.24% -0.71% [kernel.kallsyms] [k] update_sg_wakeup_stats +0.69% [kernel.kallsyms] [k] psi_group_change +0.69% [kernel.kallsyms] [k] try_charge_memcg +0.69% [kernel.kallsyms] [k] __memcg_kmem_charge_page +0.62% [kernel.kallsyms] [k] x64_sys_call 0.07% +0.60% [kernel.kallsyms] [k] find_vma +0.60% [kernel.kallsyms] [k] kfree +0.54% [kernel.kallsyms] [k] vm_area_dup 0.19% +0.53% [kernel.kallsyms] [k] __vma_adjust 0.59% -0.53% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.52% [kernel.kallsyms] [k] __srcu_read_unlock 0.62% -0.51% [kernel.kallsyms] [k] rwsem_down_write_slowpath 0.95% -0.50% libc.so.6 [.] malloc +0.49% wasmtime [.] tokio::runtime::blocking::pool::Spawner::spawn_task 0.83% +0.48% [kernel.kallsyms] [k] perf_event_alloc +0.47% [kernel.kallsyms] [k] __alloc_pages +0.46% wasmtime [.] HUF_decompress4X2_usingDTable_internal_fast +0.46% [kernel.kallsyms] [k] update_blocked_averages +0.46% wasmtime [.] <&[u8] as object::read::read_ref::ReadRef>::read_bytes_at_until +0.46% libc.so.6 [.] cfree +0.46% [kernel.kallsyms] [k] refill_obj_stock +0.46% [kernel.kallsyms] [k] ext4_do_update_inode.isra.0 +0.46% [kernel.kallsyms] [k] __ext4_get_inode_loc +0.45% [kernel.kallsyms] [k] kmem_cache_free +0.45% [kernel.kallsyms] [k] syscall_exit_to_user_mode +0.45% jitted-438731-20.so [.] 
blake3[e80d2cadb56cf0e0]::portable::compress_in_place +0.45% [kernel.kallsyms] [k] insert_work +0.43% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0 1.33% -0.41% [kernel.kallsyms] [k] llist_add_batch 0.92% +0.35% [kernel.kallsyms] [k] native_irq_return_iret 0.62% +0.31% wasmtime [.] ZSTD_decompressSequences_bmi2.constprop.0 0.39% +0.29% [kernel.kallsyms] [k] native_sched_clock 1.45% -0.29% [kernel.kallsyms] [k] clear_page_erms 0.95% -0.24% [kernel.kallsyms] [k] handle_mm_fault 0.47% -0.22% wasmtime [.] tokio::runtime::blocking::pool::Inner::run +0.21% [kernel.kallsyms] [k] update_load_avg 0.22% -0.20% [kernel.kallsyms] [k] exit_to_user_mode_loop 3.70% -0.19% [kernel.kallsyms] [k] update_sg_lb_stats +0.19% [kernel.kallsyms] [k] find_vma_prev 0.98% -0.19% [kernel.kallsyms] [k] its_return_thunk 0.53% -0.18% [kernel.kallsyms] [k] nmi_restore +0.17% [kernel.kallsyms] [k] __rseq_handle_notify_resume +0.16% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait 0.20% +0.15% [kernel.kallsyms] [k] __get_user_8 0.08% +0.15% [kernel.kallsyms] [k] __irqentry_text_end 0.14% -0.12% [kernel.kallsyms] [k] ghes_notify_nmi +0.12% [vdso] [.] 0x00000000000006e5 +0.12% [kernel.kallsyms] [k] flush_signal_handlers +0.11% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Core::maintenance +0.10% [kernel.kallsyms] [k] _raw_spin_lock +0.10% [kernel.kallsyms] [k] do_madvise.part.0 0.02% +0.10% [kernel.kallsyms] [k] futex_wait_queue_me +0.09% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait_timeout +0.09% [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.32% +0.08% [kernel.kallsyms] [k] finish_task_switch.isra.0 0.19% +0.08% [kernel.kallsyms] [k] exit_to_user_mode_prepare 0.03% +0.07% [kernel.kallsyms] [k] default_do_nmi 0.08% +0.07% [kernel.kallsyms] [k] switch_fpu_return 0.08% +0.07% libc.so.6 [.] syscall +0.07% [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack 0.64% -0.06% [kernel.kallsyms] [k] mutex_unlock +0.06% [kernel.kallsyms] [k] hrtimer_try_to_cancel.part.0 +0.06% [kernel.kallsyms] [k] sysvec_call_function +0.05% [kernel.kallsyms] [k] set_pte_vaddr_p4d +0.05% [kernel.kallsyms] [k] blkcg_maybe_throttle_current +0.05% [kernel.kallsyms] [k] native_set_fixmap 0.02% +0.04% [kernel.kallsyms] [k] nmi_handle.part.0 0.30% -0.04% [kernel.kallsyms] [k] native_flush_tlb_one_user 0.07% -0.04% [kernel.kallsyms] [k] native_flush_tlb_local 0.04% -0.04% [kernel.kallsyms] [k] native_apic_msr_write 0.11% +0.03% [kernel.kallsyms] [k] __perf_event_task_sched_in 0.01% +0.03% [kernel.kallsyms] [k] intel_pmu_handle_irq +0.02% [kernel.kallsyms] [k] do_futex 0.01% +0.01% [kernel.kallsyms] [k] sched_clock 0.88% +0.01% [kernel.kallsyms] [k] try_to_wake_up +0.01% [kernel.kallsyms] [k] perf_sample_event_took 0.04% -0.01% [kernel.kallsyms] [k] intel_bts_enable_local +0.01% [kernel.kallsyms] [k] end_repeat_nmi 0.01% +0.00% [kernel.kallsyms] [k] perf_event_nmi_handler 0.04% -0.00% [kernel.kallsyms] [k] acpi_os_read_memory 2.38% [kernel.kallsyms] [k] idle_cpu 1.40% [kernel.kallsyms] [k] kmem_cache_alloc 1.33% [kernel.kallsyms] [k] copy_page 1.23% wasmtime [.] core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2abf5669ee3b7a52 1.23% [kernel.kallsyms] [k] page_mapping 1.15% [kernel.kallsyms] [k] vma_link 1.13% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 1.10% [kernel.kallsyms] [k] native_apic_msr_eoi_write 1.09% [kernel.kallsyms] [k] clear_bhb_loop 1.06% wasmtime [.] 
<std::thread::spawnhook::SpawnHooks as core::ops::drop::Drop>::drop 1.00% [kernel.kallsyms] [k] load_balance 0.89% libc.so.6 [.] __madvise 0.88% [kernel.kallsyms] [k] alloc_thread_stack_node 0.85% wasmtime [.] std::sys::thread_local::guard::key::enable::run 0.82% [kernel.kallsyms] [k] task_h_load 0.7 [message truncated]
cfallin commented on issue #12106:
Here are the no-opt-rewrites (but still GVN/alias/LICM) and full `main`-branch opt CLIF dumps for function 8 (`wasmtime compile --emit-clif dir foo.wat`, then take `wasm[0]--function[8].clif` from the directory). The "optimized" variant is about twice as long (!).
There's a repeating rewrite pattern that seems to be a sort of reassociation of a 3-input add sequence -- in the baseline we have
```
v34 = iadd v27, v33
... [ lots of intermediate insts ] ...
v43 = iadd v34, v42
v44 = ... v43 ...
```
and in opt we have
```
... [ lots of intermediate insts ] ...
v1242 = iadd v33, v42
v1243 = iadd v27, v1242
v44 = ... v1243 ...
```
So we've (i) needlessly reassociated `(v27 + v33) + v42` into `v27 + (v33 + v42)` (none of these are iconsts) and (ii) sunk one of the adds. The former is weird but "shrug". The latter is painful for register pressure: it means we have three live values over "lots of intermediate insts" rather than two. Better inst scheduling integrated with regalloc would solve that (but that's an NP-hard problem about which at least one PhD thesis has been written); or alternately, we could "not fix what's not broken" and just avoid the rewrite.
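(Editorial illustration of the live-range point above -- a toy Rust model, not Cranelift code; it assumes v42 is produced somewhere inside the gap.)
```rust
use std::collections::BTreeSet;

/// Values live at the bottom of the gap: defined at or above that point and
/// still needed by the instructions below it.
fn live_at_gap_end<'a>(defined_above: &[&'a str], used_below: &[&'a str]) -> BTreeSet<&'a str> {
    let defined: BTreeSet<_> = defined_above.iter().copied().collect();
    used_below.iter().copied().filter(|v| defined.contains(v)).collect()
}

fn main() {
    // Baseline: v34 = iadd v27, v33 is computed *before* the gap, so only v34
    // (plus v42, produced inside the gap) survives to feed v43 = iadd v34, v42.
    let baseline = live_at_gap_end(&["v27", "v33", "v34", "v42"], &["v34", "v42"]);
    // Rewritten: both adds are sunk below the gap (v1242 = iadd v33, v42;
    // v1243 = iadd v27, v1242), so v27 and v33 must also survive it.
    let rewritten = live_at_gap_end(&["v27", "v33", "v42"], &["v27", "v33", "v42"]);
    println!("baseline:  {} live values across the gap {:?}", baseline.len(), baseline);   // 2
    println!("rewritten: {} live values across the gap {:?}", rewritten.len(), rewritten); // 3
}
```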
The reassociation seems to go more haywire further down; looking at a diff (`diff -u base.clif opt.clif`), things are more-or-less 1-for-1 trades early on, but lots of extra `iadd`s later.
I'll dig into the reassociation rules we have and see if I can figure out what's happening!
cfallin commented on issue #12106:
OK, it seems that the reassociation rules added in #7466 are the issue. If I comment those out, I get this CLIF, and the compiled function body has a stack frame size of 0x130 bytes again.
cc @fitzgen -- it looks like there wasn't any data posted on that PR; did you have benchmarking results at the time? Mind benchmarking with/without those rules (and I guess the similar followups in #8518 as well)?
alexcrichton commented on issue #12106:
> Can you share your workflow for using jitdump and perf?
I applied this diff:
```diff
diff --git a/crates/bench-api/src/lib.rs b/crates/bench-api/src/lib.rs
index 86c99c8c98..8a5180ca5e 100644
--- a/crates/bench-api/src/lib.rs
+++ b/crates/bench-api/src/lib.rs
@@ -423,6 +423,7 @@ impl BenchState {
         let mut config = options.config(None)?;
         // NB: always disable the compilation cache.
         config.cache(None);
+        config.profiler(wasmtime::ProfilingStrategy::JitDump);
         let engine = Engine::new(&config)?;
         let mut linker = Linker::<HostState>::new(&engine);
```
and then I ran:
```
$ taskset 0x1 perf record -k mono ./target/release/sightglass-cli benchmark -e ./less-isle.so --benchmark-phase execution ./benchmarks/blake3-simd/benchmark.wasm --iterations-per-process 100000
```
After a while I ctrl-c'd it and followed these instructions to process the profile.
fitzgen commented on issue #12106:
Reverted the offending rewrite in https://github.com/bytecodealliance/wasmtime/pull/12116 -- could someone re-run these benchmarks on `main` and report whether or not everything is now resolved? Thanks!
cfallin commented on issue #12106:
Here are the benchmarks that had the biggest drops from no-rules to rules above:
```
execution :: cycles :: benchmarks/shootout/shootout-keccak.wasm
  Δ = 4097364.22 ± 837090.89 (confidence = 99%)
  no-rules.so is 1.14x to 1.21x faster than main.so!
  [26507651 27605359.73 36644170] main.so
  [22595511 23507995.51 29631085] no-rules.so

execution :: cycles :: benchmarks/shootout/shootout-xchacha20.wasm
  Δ = 16237.66 ± 10867.13 (confidence = 99%)
  main.so is 1.00x to 1.01x faster than no-rules.so!
  [2634524 2662033.25 2766405] main.so
  [2645145 2678270.91 2835592] no-rules.so

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  No difference in performance.
  [207174 209371.73 231598] main.so
  [207346 210394.27 257140] no-rules.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [208507 212511.59 232974] main.so
  [208421 212331.42 290723] no-rules.so
```
So `keccak` is still weird, but "only" 14-21% slower rather than 48%. The giant drops in blake3-* went away entirely.
alexcrichton commented on issue #12106:
I also see (consistently) a 20% slowdown of shootout-keccak. That is, deleting all the ISLE bits makes shootout-keccak faster. Local profiling shows that in both profiles 98% of the time is spent in the `keccak_core` function, which looks similar to the blake3 one.
bongjunj commented on issue #12106:
Performance evaluation (Execution) after removing the reassoc rules:
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-keccak | 25,157,389 | 29,537,793 | -14.83% |
| shootout-switch | 139,546,912 | 153,616,699 | -9.16% |
| shootout-ratelimit | 39,101,368 | 39,894,279 | -1.99% |
| shootout-gimli | 5,360,983 | 5,418,553 | -1.06% |
| shootout-xchacha20 | 4,376,708 | 4,389,354 | -0.29% |
| regex | 209,934,736 | 210,346,186 | -0.20% |
| shootout-memmove | 36,313,845 | 36,325,481 | -0.03% |
| shootout-fib2 | 3,009,938,907 | 3,010,271,487 | -0.01% |
| shootout-matrix | 540,314,088 | 540,332,499 | 0.00% |
| shootout-heapsort | 2,378,752,173 | 2,374,493,320 | 0.18% |
| spidermonkey | 635,879,627 | 634,094,137 | 0.28% |
| blake3-scalar | 318,837 | 316,866 | 0.62% |
| bz2 | 87,275,956 | 86,405,779 | 1.01% |
| pulldown-cmark | 6,729,037 | 6,629,027 | 1.51% |
| shootout-ed25519 | 9,583,751,426 | 9,412,570,875 | 1.82% |
| shootout-xblabla20 | 2,974,201 | 2,902,069 | 2.49% |
| blake3-simd | 313,009 | 305,049 | 2.61% |
| shootout-seqhash | 8,971,675,191 | 8,663,650,423 | 3.56% |
| shootout-nestedloop | 468 | 451 | 3.89% |
| shootout-ctype | 830,650,139 | 796,881,603 | 4.24% |
| shootout-sieve | 905,646,691 | 841,504,133 | 7.62% |
| shootout-base64 | 381,752,475 | 352,928,420 | 8.17% |
| shootout-ackermann | 8,647,754 | 7,767,390 | 11.33% |
| shootout-minicsv | 1,475,692,343 | 1,306,403,978 | 12.96% |
| shootout-random | 630,367,818 | 439,901,877 | 43.30% |
bongjunj commented on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch):
```diff
-@4cfc v6912 = bxor v6792, v158 ; v158 = -1
-@4cfd v6913 = band v6775, v6912
+ v10066 = bnot v6787
+ v10067 = bnot v6789
+ v10068 = band v10066, v10067
+ v10070 = bor v10068, v6991 ; v6991 = -4294967296
+@4cfd v6913 = band v6775, v10070
 @4d00 v6914 = bxor v6913, v6803
 @4d01 store little heap v6914, v105
-@4d0c v6921 = bxor v6823, v158 ; v158 = -1
-@4d0d v6922 = band v6656, v6921
+ v10075 = bnot v6818
+ v10076 = bnot v6820
+ v10077 = band v10075, v10076
+ v10079 = bor v10077, v6991 ; v6991 = -4294967296
+@4d0d v6922 = band v6656, v10079
 @4d10 v6923 = bxor v6922, v6631
 @4d11 store little heap v6923, v124
-@4d1c v6930 = bxor v6841, v158 ; v158 = -1
-@4d1d v6931 = band v6830, v6930
+ v10084 = bnot v6836
+ v10085 = bnot v6838
+ v10086 = band v10084, v10085
+ v10088 = bor v10086, v6991 ; v6991 = -4294967296
+@4d1d v6931 = band v6830, v10088
```
bongjunj edited a comment on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch) in the keccak core function, once the instruction count exceeds 1,000.
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
bongjunj edited a comment on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch) in the keccak core function, once the instruction count exceeds 1,000.
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
This transformation is defined at https://github.com/bytecodealliance/wasmtime/blob/058a9bc16bc204fec05993ce6a0b013cc3db55e6/cranelift/codegen/src/opts/bitops.isle#L44-L50
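(Editorial note: the rewrite itself is semantically sound -- the problem discussed below is cost, not correctness. A standalone Rust check of the DeMorgan identities the rule is based on; this is only an illustration, not the ISLE rule.)
```rust
// Sanity-check the DeMorgan identities behind the bnot/band/bor rewrite above:
//   !(x & y) == !x | !y   and   !(x | y) == !x & !y
fn main() {
    let samples: [u64; 5] = [
        0,
        u64::MAX,
        0xffff_ffff,           // 32-bit mask seen in the surrounding CLIF
        0xffff_ffff_0000_0000, // -4294967296, the bor constant in the rewritten form
        0x0123_4567_89ab_cdef,
    ];
    for &x in &samples {
        for &y in &samples {
            assert_eq!(!(x & y), !x | !y);
            assert_eq!(!(x | y), !x & !y);
        }
    }
    println!("DeMorgan identities hold for all sampled values");
}
```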
bongjunj commented on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486.30 | 153696081.5 | -9.20% |
| shootout-ratelimit | 39,173,257.16 | 39952950.56 | -1.95% |
| shootout-xchacha20 | 4,380,029.16 | 4444075.78 | -1.44% |
| regex | 210,231,630.92 | 212953957.4 | -1.28% |
| shootout-gimli | 5,345,271.14 | 5404313.96 | -1.09% |
| shootout-fib2 | 3,010,356,555.06 | 3010366675 | 0.00% |
| shootout-matrix | 539,106,883.22 | 538754055.2 | 0.07% |
| shootout-heapsort | 2,378,812,047.98 | 2375114650 | 0.16% |
| spidermonkey | 636,143,609.88 | 633256160.3 | 0.46% |
| bz2 | 87,099,312.82 | 86442047.12 | 0.76% |
| shootout-memmove | 36,569,713.06 | 36246023.48 | 0.89% |
| blake3-scalar | 319,175.94 | 316023.92 | 1.00% |
| pulldown-cmark | 6,741,922.18 | 6632833.42 | 1.64% |
| shootout-ed25519 | 9,569,702,876.44 | 9402544055 | 1.78% |
| shootout-nestedloop | 504.38 | 494.38 | 2.02% |
| blake3-simd | 313,143.94 | 305664.92 | 2.45% |
| shootout-xblabla20 | 3,011,040.94 | 2926801.38 | 2.88% |
| shootout-seqhash | 8,868,767,466.20 | 8580074421 | 3.36% |
| shootout-ctype | 830,871,263.52 | 796659320 | 4.29% |
| shootout-sieve | 906,201,119.02 | 841387362.9 | 7.70% |
| shootout-base64 | 381,928,675.18 | 352966325.4 | 8.21% |
| shootout-ackermann | 8,657,031.60 | 7770921.4 | 11.40% |
| shootout-minicsv | 1,478,013,885.20 | 1292314209 | 14.37% |
| shootout-keccak | 25,147,664.38 | 21106966.72 | 19.14% |
| shootout-random | 630,552,185.72 | 439772361.1 | 43.38% |
bongjunj edited a comment on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486 | 153,696,081 | -9.20% |
| shootout-ratelimit | 39,173,257 | 39,952,951 | -1.95% |
| shootout-xchacha20 | 4,380,029 | 4,444,076 | -1.44% |
| regex | 210,231,631 | 212,953,957 | -1.28% |
| shootout-gimli | 5,345,271 | 5,404,314 | -1.09% |
| shootout-fib2 | 3,010,356,555 | 3,010,366,675 | 0.00% |
| shootout-matrix | 539,106,883 | 538,754,055 | 0.07% |
| shootout-heapsort | 2,378,812,048 | 2,375,114,650 | 0.16% |
| spidermonkey | 636,143,610 | 633,256,160 | 0.46% |
| bz2 | 87,099,313 | 86,442,047 | 0.76% |
| shootout-memmove | 36,569,713 | 36,246,023 | 0.89% |
| blake3-scalar | 319,176 | 316,024 | 1.00% |
| pulldown-cmark | 6,741,922 | 6,632,833 | 1.64% |
| shootout-ed25519 | 9,569,702,876 | 9,402,544,055 | 1.78% |
| shootout-nestedloop | 504 | 494 | 2.02% |
| blake3-simd | 313,144 | 305,665 | 2.45% |
| shootout-xblabla20 | 3,011,041 | 2,926,801 | 2.88% |
| shootout-seqhash | 8,868,767,466 | 8,580,074,421 | 3.36% |
| shootout-ctype | 830,871,264 | 796,659,320 | 4.29% |
| shootout-sieve | 906,201,119 | 841,387,363 | 7.70% |
| shootout-base64 | 381,928,675 | 352,966,325 | 8.21% |
| shootout-ackermann | 8,657,032 | 7,770,921 | 11.40% |
| shootout-minicsv | 1,478,013,885 | 1,292,314,209 | 14.37% |
| shootout-keccak | 25,147,664 | 21,106,967 | 19.14% |
| shootout-random | 630,552,186 | 439,772,361 | 43.38% |
bongjunj edited a comment on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
keccak enjoys a 19.14% execution performance improvement.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486 | 153,696,081 | -9.20% |
| shootout-ratelimit | 39,173,257 | 39,952,951 | -1.95% |
| shootout-xchacha20 | 4,380,029 | 4,444,076 | -1.44% |
| regex | 210,231,631 | 212,953,957 | -1.28% |
| shootout-gimli | 5,345,271 | 5,404,314 | -1.09% |
| shootout-fib2 | 3,010,356,555 | 3,010,366,675 | 0.00% |
| shootout-matrix | 539,106,883 | 538,754,055 | 0.07% |
| shootout-heapsort | 2,378,812,048 | 2,375,114,650 | 0.16% |
| spidermonkey | 636,143,610 | 633,256,160 | 0.46% |
| bz2 | 87,099,313 | 86,442,047 | 0.76% |
| shootout-memmove | 36,569,713 | 36,246,023 | 0.89% |
| blake3-scalar | 319,176 | 316,024 | 1.00% |
| pulldown-cmark | 6,741,922 | 6,632,833 | 1.64% |
| shootout-ed25519 | 9,569,702,876 | 9,402,544,055 | 1.78% |
| shootout-nestedloop | 504 | 494 | 2.02% |
| blake3-simd | 313,144 | 305,665 | 2.45% |
| shootout-xblabla20 | 3,011,041 | 2,926,801 | 2.88% |
| shootout-seqhash | 8,868,767,466 | 8,580,074,421 | 3.36% |
| shootout-ctype | 830,871,264 | 796,659,320 | 4.29% |
| shootout-sieve | 906,201,119 | 841,387,363 | 7.70% |
| shootout-base64 | 381,928,675 | 352,966,325 | 8.21% |
| shootout-ackermann | 8,657,032 | 7,770,921 | 11.40% |
| shootout-minicsv | 1,478,013,885 | 1,292,314,209 | 14.37% |
| shootout-keccak | 25,147,664 | 21,106,967 | 19.14% |
| shootout-random | 630,552,186 | 439,772,361 | 43.38% |
bongjunj commented on issue #12106:
Unlike the addition-reassoc rule, the DeMorgan rule does not use `subsume`, which destructively rewrites instructions.
How, then, does the DeMorgan form win over the non-optimized form? I thought the most cost-efficient instruction sequence (or at least one chosen by a good performance heuristic) would be selected.
cfallin commented on issue #12106:
> I thought the most cost-efficient instruction sequence (or at least one chosen by a good performance heuristic) would be selected.
That's a good next question to investigate! You could get the egraph with multiple value representations from the log and look at the cost of each form. Our cost function is still pretty basic right now, and there hasn't been, e.g., an effort to try to correlate to real machines; only a relative ranking of opcode classes. Remember also that we don't do any sort of reasoning about amortized/shared cost of computing values used multiple times, nor do we integrate with regalloc to reason about register pressure (both are, as noted above, NP-hard problems).
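(Editorial sketch of what "a relative ranking of opcode classes" looks like in spirit -- the groupings and numbers below are invented for illustration and are not the real cranelift-codegen cost table.)
```rust
// Invented opcode-class costs; only the relative ordering is meant to matter.
fn opcode_class_cost(opcode: &str) -> u32 {
    match opcode {
        "iconst" => 0,                                   // constants: essentially free
        "band" | "bor" | "bxor" | "bnot" | "iadd" => 1,  // simple ALU ops
        "imul" => 4,                                     // multiplies are pricier
        "udiv" | "sdiv" => 32,                           // divides are much pricier
        _ => 1,
    }
}

fn main() {
    // Summing per-op costs (with no notion of sharing) for the two shapes in the
    // keccak diff: the original bxor-with-(-1) versus the bnot/bnot/band/bor chain.
    let original: u32 = ["bxor"].iter().copied().map(opcode_class_cost).sum();
    let rewritten: u32 = ["bnot", "bnot", "band", "bor"].iter().copied().map(opcode_class_cost).sum();
    println!("original form: {original}, rewritten form: {rewritten}");
}
```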
bongjunj commented on issue #12106:
There is strong evidence that the cost model has a bug when the input function is large.
```diff
--- keccak-no-opts.clif 2025-12-04 08:22:28.715675257 +0000
+++ keccak-main.clif 2025-12-04 08:18:25.654976355 +0000
@@ -116,10 +116,10 @@
 @0244 v153 = bxor v152, v67
 @0245 v154 = iconst.i64 19
 @0247 v155 = ushr v153, v154 ; v154 = 19
-@024e v157 = band v155, v40 ; v40 = 0xffff_ffff
-@0251 v158 = iconst.i64 -1
-@0253 v159 = bxor v157, v158 ; v158 = -1
-@0254 v160 = band v87, v159
+ v6989 = bnot v155
+ v6991 = iconst.i64 -4294967296
+ v6992 = bor v6989, v6991 ; v6991 = -4294967296
+@0254 v160 = band v87, v6992
```
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
First, the deoptimizing code pattern only appears once the instruction count exceeds a certain point. Such deoptimization does not occur near the beginning of the function.
Second, the cost model selects the optimized pattern as intended near the beginning of the function. For example, it works well for v160. However, as cost "accumulates", the costs eventually all become "infinite" (see v1562 below). Thus, the cost model cannot choose the optimized pattern and ends up generating the undesired instruction sequence.
```
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v160 = Cost::Finite { op_cost: 100, depth: 12 }
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> best for eclass v1562: BestEntry(Cost::Infinite, v1562)
```
bongjunj commented on issue #12106:
Seems like this is the case.
bongjunj edited a comment on issue #12106:
Seems like this is the case. The README.md mentioned in the comment says "Rules should not rewrite to worse code"; however, the DeMorgan rules do generate worse code.
Now, we have the following options:
- update the cost model to take this possibility into account, though this can make a large diff in the generated code we are currently testing, or
- simply delete the rules to uphold the contract mentioned in the README.md (+ enforce the contract by having the ISLE meta-compiler check that the cost "strictly" decreases).
bongjunj edited a comment on issue #12106:
There is strong evidence that the cost model has a bug when the input function is large.
```diff
--- keccak-no-opts.clif 2025-12-04 08:22:28.715675257 +0000
+++ keccak-main.clif 2025-12-04 08:18:25.654976355 +0000
@@ -116,10 +116,10 @@
 @0244 v153 = bxor v152, v67
 @0245 v154 = iconst.i64 19
 @0247 v155 = ushr v153, v154 ; v154 = 19
-@024e v157 = band v155, v40 ; v40 = 0xffff_ffff
-@0251 v158 = iconst.i64 -1
-@0253 v159 = bxor v157, v158 ; v158 = -1
-@0254 v160 = band v87, v159
+ v6989 = bnot v155
+ v6991 = iconst.i64 -4294967296
+ v6992 = bor v6989, v6991 ; v6991 = -4294967296
+@0254 v160 = band v87, v6992
```
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
First, the deoptimizing code pattern only appears once the instruction count exceeds a certain point. For example, at code point `@0254` the optimized code is okay, but at `@11de` deoptimization happens.
Second, the cost model selects the optimized code as intended near the beginning of the function. For example, it works well for v160. However, as cost "accumulates", the costs eventually all become "infinite" (see v1562 below). Thus, the cost model cannot choose the optimized pattern and ends up generating the undesired instruction sequence.
```
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v160 = Cost::Finite { op_cost: 100, depth: 12 }
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> best for eclass v1562: BestEntry(Cost::Infinite, v1562)
```
bongjunj edited a comment on issue #12106:
It seems like the cost model is causing a problem with the canonicalizing rules when selecting instructions in large functions. The README.md mentioned in the comment says "Rules should not rewrite to worse code"; however, the DeMorgan rules do generate worse code.
Now, we have the following options:
- update the cost model to take this possibility into account, though this can make a large diff in the generated code we are currently testing, or
- simply delete the rules to uphold the contract mentioned in the README.md (+ enforce the contract by having the ISLE meta-compiler check that the cost "strictly" decreases).
cfallin commented on issue #12106:
Fascinating. In general cost will only accumulate if there are very long chains, unbroken by any blockparams. I guess we're seeing exactly that in some benchmarks.
If that's the case, I agree we should delete rules that don't have a clear directionality.
I don't see a pairwise comparison above of all-opts versus all-opts-minus-DeMorgan. I want to make sure that they don't show improvements on other benchmarks that we'd be discarding. Could you do that run as well? Thanks!
fitzgen commented on issue #12106:
I wouldn't have expected that we would actually hit `Cost::Infinity` except in very rare cases. It makes me wonder if we should instead investigate widening `Cost`.
fitzgen commented on issue #12106:
Wait, we have 24 bits for the op cost, so we shouldn't be hitting infinity until we have on the order of millions deep chains of ops.
It seems like something else funky might be going on here.
fitzgen commented on issue #12106:
In `keccak`, the largest value is `v43210`. Our largest individual op cost is 64.
> Wait, we have 24 bits for the op cost, so we shouldn't be hitting infinity until we have on the order of millions deep chains of ops.
But I guess actually it isn't chains but the whole tree, which makes hitting infinite cost more understandable, but still not ideal because I suspect it does not accurately reflect the real costs here.
This is because `Cost` doesn't take into account shared structure. For example, `iadd v0, v0` should be `cost_of_opcode(iadd) + cost_of_operand(v0)`, but the current code will compute `cost_of_opcode(iadd) + cost_of_operand(v0) + cost_of_operand(v0)`. It shouldn't count the cost of `v0` twice, but it will. Now, fixing this particular case, where operands are repeated in the same operation, would be relatively straightforward. But fixing the general case, where `v0` is never repeated within a single operation but is repeated at multiple levels of operations in the same expression, is a bit more difficult.
We could try to maintain, in the cost, the transitive set of all values that contributed to it, and only add new cost for new values (although this maybe doesn't accurately reflect register pressure and spills for very large expressions). That could be done in a perhaps-not-horrendous manner via immutable persistent sets so that we aren't completely blowing up memory usage. But I suspect it would still be a large slowdown to our best-cost computation.
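(Editorial sketch of the blow-up -- self-contained Rust, not cranelift-codegen's actual `Cost` code; the 24-bit op-cost budget is the figure quoted above.)
```rust
// Cost of value i is 1 + cost(operand a) + cost(operand b), computed bottom-up
// by summing operand costs, so a shared operand is re-counted through every
// user's sum -- there is no deduplication of shared structure.
fn main() {
    const OP_COST_BITS: u32 = 24;              // { op: u24, depth: u8 }
    let budget: u64 = (1 << OP_COST_BITS) - 1; // beyond this, effectively infinite

    // A "ladder" DAG with heavy sharing, roughly like a long hash round:
    // v[i] = op(v[i-1], v[i-2]); v0 and v1 are leaves of cost 0.
    let n = 64;
    let mut cost = vec![0u64; n];
    for i in 2..n {
        cost[i] = 1 + cost[i - 1] + cost[i - 2]; // summed, never deduplicated
    }

    println!("summed cost of v{}: {}", n - 1, cost[n - 1]);
    match cost.iter().position(|&c| c > budget) {
        Some(i) => println!("summed cost first exceeds the 24-bit budget at v{i}"),
        None => println!("summed cost stays within the 24-bit budget"),
    }
    // The deduplicated cost (each of the n - 2 = 62 ops counted once) is only 62,
    // yet the summed cost blows past the 24-bit budget after a few dozen values.
}
```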
fitzgen commented on issue #12106:
So yeah, this leads me back towards:
> It makes me wonder if we should instead investigate widening `Cost`.
We could probably do this without growing the size of `Cost` by just using fewer bits for depth and more for the total opcode cost.
fitzgen commented on issue #12106:
(Maybe we should even stop considering depth as part of `Cost`?)
Anyways, I'm doing a quick experiment to see the difference when we switch `Cost` from `{ op: u24, depth: u8 }` to `{ op: u28, depth: u4 }`, to see if this frees up enough bits such that we don't hit infinite cost here.
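(Editorial arithmetic on those two layouts -- the field widths are the ones named above and the max per-op cost of 64 is the figure mentioned earlier in the thread; this is not the real `Cost` type.)
```rust
// How much summed op cost each layout can hold before saturating to "infinite".
fn main() {
    for (layout, op_bits) in [("{ op: u24, depth: u8 }", 24u32), ("{ op: u28, depth: u4 }", 28)] {
        let budget: u64 = (1u64 << op_bits) - 1;
        let chain_at_max_op_cost = budget / 64;
        println!(
            "{layout}: op cost saturates at {budget} \
             (a pure chain of max-cost-64 ops saturates after ~{chain_at_max_op_cost} ops)"
        );
    }
}
```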
fitzgen commented on issue #12106:
It seems like `keccak` has enough shared structure that moving more bits from depth to op actually makes more eclasses have infinite cost (that is, they saturate op cost before saturating depth, so lowering the depth bits just makes more things saturate both op and depth, i.e. be infinite).

| `Cost` layout | total # eclasses in keccak | # infinite-cost eclasses in keccak | % infinite-cost eclasses in keccak |
|---|---|---|---|
| `{ op: u32, depth: u0 }` | 43211 | 31474 | 1.37% |
| `{ op: u28, depth: u4 }` | 43211 | 33057 | 1.31% |
| `{ op: u24, depth: u8 }` (today's layout) | 43211 | 34704 | 1.25% |

Increasing the number of `depth` bits would probably make fewer eclasses have infinite cost, but we would also be extracting a larger percentage of them purely based on depth (since more of them would saturate the op portion of cost), which doesn't seem desirable.
The other thing we could do is move `Cost` from bitpacking into a `u32` to bitpacking into a `u64` or even a `u128`, but this would only add support for on the order of 32 and 96 (respectively) deeper expressions before hitting infinite cost again, which isn't very much. (Assuming we don't attempt to start representing shared structure in `Cost`.)
So, at least in the short term, removing the DeMorgan rule (https://github.com/bytecodealliance/wasmtime/pull/12127) is probably the best incremental/practical step forward, even though I was initially skeptical.
Perhaps we should also be breaking cost ties with a bias towards the original input, on the assumption that the producer knows better than us if we can't tell whether our rewrites are actually improvements or not. But looking at the code, it does seem like we take the minimum `BestEntry`, which, if `Cost` is tied, means we will in fact break ties with lower-numbered values (i.e. biasing towards the originals). So we are already doing this.
Fundamentally improving things here will require a cost function that understands shared structure, which, unless someone has an idea I haven't thought up yet, is almost definitely too expensive for the niche that Cranelift sits in.
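(Tiny editorial illustration of that tie-break, modelling `BestEntry` as a (cost, value-number) pair; the numbers are made up.)
```rust
// Taking the minimum of (cost, value number) pairs breaks cost ties toward the
// lower value number, i.e. toward the earlier-created/original value.
fn main() {
    let entries = [(100u32, 1562u32), (100, 7630), (120, 42)];
    let best = entries.iter().min().unwrap();
    assert_eq!(*best, (100, 1562)); // tie at cost 100 resolved in favour of v1562
    println!("best entry: cost {}, v{}", best.0, best.1);
}
```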
cfallin commented on issue #12106:
Yes, in particular because there seem to be no major regressions without the rules, I think the choice here is easy. It is unfortunate that we can't lean more heavily on cost optimization but (to write it down explicitly) doing cost optimization with shared structure is as hard as the weighted set-cover problem (that is, the latter can be reduced to a graph with shared structure on which we want to pick the lowest-cost implementations of terminals), so it's NP-hard. That doesn't mean there isn't a good approximation algorithm of course, but we won't solve that in this PR I don't think!
fitzgen commented on issue #12106:
> That doesn't mean there isn't a good approximation algorithm of course,
Yeah, I suspect the trouble really begins when you want to make one expression in the current eclass have cheaper cost because we've already committed to computing some of its sub-expressions via choosing them as part of the best version of a different eclass.
If we limit ourselves to just the shared structure within a single expression, in isolation from the best cost/expression we computed for different eclasses (i.e. only adding the cost of operand `v123` once when computing the cost of a particular expression, ignoring what was chosen as best for other eclasses), then things seem pretty tractable, although still too expensive for Cranelift.
> but we won't solve that in this PR I don't think!
Right, which is why I'm commenting on the issue, not the PR :-p
cfallin closed issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin` is used.
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
cfallin commented on issue #12106:
(D'oh, sorry, too many tabs open -- yes, this is indeed the issue and not the PR :-) )
It seems there are at least two possible directions for approximations to true cost minimization:
- We could divide/amortize the cost by the number of users of any given value. This would require a little care with the union nodes -- we would want to "see through" the union-spine to the whole eclass and consider it as one. But e.g. if 100 users are asking for a value of cost 100, or alternately each uses its own input of cost 2, this would do the right thing.
- This maintains the phased aspect (all costs computable before we pick winners) but feels "more approximate"
- We could dynamically update costs as min-cost values are picked for classes -- so e.g. if I've already decided to use a value, then all subsequent uses get to use it for (additional/marginal) cost 0. This one is tricky because it requires mixing together the cost computation and min-cost choices.
@bongjunj if you're looking for interesting concrete research problems, this one might be a good one to play with!
cfallin edited a comment on issue #12106:
(D'oh, sorry, too many tabs open -- yes, this is indeed the issue and not the PR :-) )
It seems there are at least two possible directions for approximations to true cost minimization:
- We could divide/amortize the cost by the number of users of any given value. This would require a little care with the union nodes -- we would want to "see through" the union-spine to the whole eclass and consider it as one. But e.g. if 100 users are asking for a value of cost 100, or alternately each uses its own input of cost 2, this would do the right thing.
- This maintains the phased aspect (all costs computable before we pick winners) but feels "more approximate"
- We could dynamically update costs as min-cost values are picked for classes -- so e.g. if I've already decided to use a value, then all subsequent uses get to use it for (additional/marginal) cost 0. This one is tricky because it requires mixing together the cost computation and min-cost choices.
@bongjunj if you're looking for interesting concrete research problems, this one might be a good one to play with!
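(Editorial sketch of the first, amortizing direction above, using the 100-users / cost-100 / cost-2 numbers from that comment; the helper is hypothetical, not Cranelift code.)
```rust
// Divide a candidate's cost by its number of users before comparing.
fn amortized_cost(total_cost: f64, num_users: usize) -> f64 {
    total_cost / num_users.max(1) as f64
}

fn main() {
    // One shared value of cost 100 with 100 users, versus each of those users
    // computing its own private input of cost 2: amortization prefers sharing.
    let shared_per_use = amortized_cost(100.0, 100); // 1.0 per use
    let private_per_use = amortized_cost(2.0, 1);    // 2.0 per use
    assert!(shared_per_use < private_per_use);
    println!("shared: {shared_per_use} per use, private: {private_per_use} per use");
}
```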
fitzgen commented on issue #12106:
> Now, fixing this particular case, where operands are repeated in the same operation, would be relatively straightforward.
FWIW, I hacked this trivial-case fix together to get numbers, and it did not change the infinite-cost-eclass count in `keccak` at all.
fitzgen commented on issue #12106:
> - We could divide/amortize the cost by the number of users of any given value.
The tricky bit here is that if these are "uncommitted" uses (i.e. incoming edges for each enode) rather than "committed" uses (i.e. uses from enodes that we chose as the best in their eclass) then the use count is kind of useless. There might be a bunch of uses inside the egraph, making the enode's cost cheap, but then if we don't actually extract those uses then it was all just a lie. How bad this will be in practice depends on the size of the saturated vs initial egraph.
On the other hand, if these are "committed" uses, then we wouldn't maintain the phasing, which leads to things getting really tricky, as you noted.
If we aren't maintaining phasing, then we could, for example, do something like
```
for inst in skeleton {
    compute_best_costs();
    elaborate_inst(inst);
    // Update costs for values we committed to computing so that the next iteration
    // of the loop can recompute best costs to bias towards reusing already-computed
    // values.
    for value in inst.operands().flat_map(|op| transitive_values(op)) {
        set_cost(value, 0);
    }
}
```
which is "only" `O(n^3)` (because `compute_best_costs` is an `O(n^2)` fixed-point loop).
I guess we wouldn't need to complete the fixed point every time though. In practice I bet you would reach it within one or two iterations, but it also shouldn't ever be incorrect to stop and not choose the actual min cost but an approximation, so maybe doing just one more iteration of the fixed-point loop is Good Enough after elaborating each instruction. This would keep things `O(n^2)` like they are today.
Anyways, yeah, this is definitely an interesting research problem from my point of view, but I have no intention of diving into this further, so I'll stop commenting and try not to let myself be nerd-sniped any worse than I already have been this morning.