bongjunj opened issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`
....................

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin`
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
bongjunj edited issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin` is used.
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
cfallin commented on issue #12106:
Hi @bongjunj -- thanks for reporting this! The place I would start is profiling to find the hottest inner loops, then comparing the disassembly of these loops before-and-after. That should help us triage the overall kind of problem -- whether it is code scheduling, or bad regalloc, or poor choice of instruction, or something else...
bongjunj commented on issue #12106:
@cfallin thanks! working on the triage.
alexcrichton commented on issue #12106:
I can pretty easily reproduce this myself cherry-picking the commit @bongjunj mentioned. Sightglass is producing surprising results for me though. One run yields
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [165008 555680.24 1110872] less-isle.so
  [371539 548436.46 1682729] main.so
```
a few runs later is
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 147722.45 ± 92286.49 (confidence = 99%)
  main.so is 1.13x to 1.55x faster than less-isle.so!
  [167456 580352.29 1502815] less-isle.so
  [363903 432629.84 590110] main.so
```
passing `--pin` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 311980.36 ± 2354.27 (confidence = 99%)
  less-isle.so is 2.18x to 2.19x faster than main.so!
  [259525 263320.66 327692] less-isle.so
  [569582 575301.02 595577] main.so
```
running with `taskset 0x1` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 200594.18 ± 2696.25 (confidence = 99%)
  less-isle.so is 2.18x to 2.21x faster than main.so!
  [162200 167555.71 209294] less-isle.so
  [363482 368149.89 400345] main.so
```
I don't really know what to make of this. My machine has P and E cores which seems like it might be factoring in.
I made custom builds with jitdump turned on and ran with `perf record -k mono` plus `--iterations-per-process` cranked way up. All I got though was that `wasm[0]::function[8]` was a hot function with ~45% in the less-isle.so profile and ~61% in the main.so profile. It's a massive function with no apparent loops.
I don't know what's going on here myself.
alexcrichton edited a comment on issue #12106:
I can pretty easily reproduce this myself cherry-picking the commit @bongjunj mentioned. Sightglass is producing surprising results for me though. One run yields
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [165008 555680.24 1110872] less-isle.so
  [371539 548436.46 1682729] main.so
```
a few runs later is
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 147722.45 ± 92286.49 (confidence = 99%)
  main.so is 1.13x to 1.55x faster than less-isle.so!
  [167456 580352.29 1502815] less-isle.so
  [363903 432629.84 590110] main.so
```
passing `--pin` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 311980.36 ± 2354.27 (confidence = 99%)
  less-isle.so is 2.18x to 2.19x faster than main.so!
  [259525 263320.66 327692] less-isle.so
  [569582 575301.02 595577] main.so
```
running with `taskset 0x1` shows
```
execution :: cycles :: ./benchmarks/blake3-simd/benchmark.wasm
  Δ = 200594.18 ± 2696.25 (confidence = 99%)
  less-isle.so is 2.18x to 2.21x faster than main.so!
  [162200 167555.71 209294] less-isle.so
  [363482 368149.89 400345] main.so
```
I don't really know what to make of this. My machine has P and E cores which seems like it might be factoring in.
I made custom builds with jitdump turned on and ran with `perf record -k mono` plus `--iterations-per-process` cranked way up. All I got though was that `wasm[0]::function[8]` was a hot function with ~41% in the less-isle.so profile and ~61% in the main.so profile. It's a massive function with no apparent loops.
I don't know what's going on here myself.
alexcrichton commented on issue #12106:
Hm, well, the hot function is this one, which is a massive function with zero control flow. That being said, I could see egraphs deoptimizing this function in terms of instruction scheduling, since we've seen pessimistic results in the past, so this might be an example of overly-pessimistic instruction scheduling.
bongjunj commented on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
I think I got this, thanks!
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
bongjunj edited a comment on issue #12106:
@alexcrichton Can you share your workflow for using jitdump and perf? Any existing docs would be helpful.
Just running `perf` doesn't seem to work.
```
perf record -k mono target/release/sightglass-cli benchmark --engine=engines/wasmtime/libengine.so --engine-flags=" --profile=jitdump" --iterations-per-process=1 --measure=perf-counters benchmarks/blake3-scalar/benchmark.wasm
```
bongjunj commented on issue #12106:
I removed the sightglass_api dependency of the blake3-scalar benchmark,
and measured jitdump/perf following https://docs.wasmtime.dev/examples-profiling-perf.html runningwasmtime runcommand for each.perf record -k mono wasmtime run --dir=. --profile=jitdump rust-benchmark/target/wasm32-wasip1/release/benchmark.wasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.dataperf diff /data/blake3-scalar-main.perf.jit.data /data/blake3-scalar-noopts.perf.jit.data # Event 'cycles:P' # # Baseline Delta Abs Shared Object Symbol # ........ ......... .................... ............................................................................................ # 12.87% +7.98% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Context::run 6.50% -4.99% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath.part.0 3.91% +4.13% [kernel.kallsyms] [k] smp_call_function_many_cond 1.13% +2.82% ld-linux-x86-64.so.2 [.] 0x0000000000009f0c 5.75% -2.52% libc.so.6 [.] 0x000000000009c724 0.57% +1.74% [kernel.kallsyms] [k] next_uptodate_page 3.66% +1.73% wasmtime [.] std::sys::sync::mutex::futex::Mutex::lock_contended 0.59% +1.62% [kernel.kallsyms] [k] __handle_mm_fault 0.44% +1.52% [kernel.kallsyms] [k] native_write_msr 1.84% -1.46% [kernel.kallsyms] [k] _find_next_bit +1.44% [kernel.kallsyms] [k] __mod_lruvec_page_state 1.47% -1.39% [kernel.kallsyms] [k] zap_pte_range +1.21% [kernel.kallsyms] [k] update_curr +1.20% [kernel.kallsyms] [k] __remove_hrtimer +1.17% [kernel.kallsyms] [k] osq_unlock 1.27% -1.16% [kernel.kallsyms] [k] futex_wait +1.13% [kernel.kallsyms] [k] unmap_page_range 0.97% +1.10% [kernel.kallsyms] [k] restore_fpregs_from_fpstate 1.21% -1.09% [kernel.kallsyms] [k] llist_reverse_order +1.08% [kernel.kallsyms] [k] do_user_addr_fault 2.85% -1.05% [kernel.kallsyms] [k] osq_lock 0.03% +0.96% [kernel.kallsyms] [k] __x64_sys_futex +0.89% [kernel.kallsyms] [k] newidle_balance +0.89% [kernel.kallsyms] [k] rwsem_mark_wake 0.40% +0.89% [kernel.kallsyms] [k] delayacct_add_tsk +0.84% [kernel.kallsyms] [k] rb_first +0.74% [kernel.kallsyms] [k] memcg_slab_post_alloc_hook +0.73% [kernel.kallsyms] [k] sync_regs +0.72% [kernel.kallsyms] [k] change_protection_range +0.72% [kernel.kallsyms] [k] __mem_cgroup_charge 1.24% -0.71% [kernel.kallsyms] [k] update_sg_wakeup_stats +0.69% [kernel.kallsyms] [k] psi_group_change +0.69% [kernel.kallsyms] [k] try_charge_memcg +0.69% [kernel.kallsyms] [k] __memcg_kmem_charge_page +0.62% [kernel.kallsyms] [k] x64_sys_call 0.07% +0.60% [kernel.kallsyms] [k] find_vma +0.60% [kernel.kallsyms] [k] kfree +0.54% [kernel.kallsyms] [k] vm_area_dup 0.19% +0.53% [kernel.kallsyms] [k] __vma_adjust 0.59% -0.53% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.52% [kernel.kallsyms] [k] __srcu_read_unlock 0.62% -0.51% [kernel.kallsyms] [k] rwsem_down_write_slowpath 0.95% -0.50% libc.so.6 [.] malloc +0.49% wasmtime [.] tokio::runtime::blocking::pool::Spawner::spawn_task 0.83% +0.48% [kernel.kallsyms] [k] perf_event_alloc +0.47% [kernel.kallsyms] [k] __alloc_pages +0.46% wasmtime [.] HUF_decompress4X2_usingDTable_internal_fast +0.46% [kernel.kallsyms] [k] update_blocked_averages +0.46% wasmtime [.] <&[u8] as object::read::read_ref::ReadRef>::read_bytes_at_until +0.46% libc.so.6 [.] cfree +0.46% [kernel.kallsyms] [k] refill_obj_stock +0.46% [kernel.kallsyms] [k] ext4_do_update_inode.isra.0 +0.46% [kernel.kallsyms] [k] __ext4_get_inode_loc +0.45% [kernel.kallsyms] [k] kmem_cache_free +0.45% [kernel.kallsyms] [k] syscall_exit_to_user_mode +0.45% jitted-438731-20.so [.] 
blake3[e80d2cadb56cf0e0]::portable::compress_in_place +0.45% [kernel.kallsyms] [k] insert_work +0.43% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0 1.33% -0.41% [kernel.kallsyms] [k] llist_add_batch 0.92% +0.35% [kernel.kallsyms] [k] native_irq_return_iret 0.62% +0.31% wasmtime [.] ZSTD_decompressSequences_bmi2.constprop.0 0.39% +0.29% [kernel.kallsyms] [k] native_sched_clock 1.45% -0.29% [kernel.kallsyms] [k] clear_page_erms 0.95% -0.24% [kernel.kallsyms] [k] handle_mm_fault 0.47% -0.22% wasmtime [.] tokio::runtime::blocking::pool::Inner::run +0.21% [kernel.kallsyms] [k] update_load_avg 0.22% -0.20% [kernel.kallsyms] [k] exit_to_user_mode_loop 3.70% -0.19% [kernel.kallsyms] [k] update_sg_lb_stats +0.19% [kernel.kallsyms] [k] find_vma_prev 0.98% -0.19% [kernel.kallsyms] [k] its_return_thunk 0.53% -0.18% [kernel.kallsyms] [k] nmi_restore +0.17% [kernel.kallsyms] [k] __rseq_handle_notify_resume +0.16% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait 0.20% +0.15% [kernel.kallsyms] [k] __get_user_8 0.08% +0.15% [kernel.kallsyms] [k] __irqentry_text_end 0.14% -0.12% [kernel.kallsyms] [k] ghes_notify_nmi +0.12% [vdso] [.] 0x00000000000006e5 +0.12% [kernel.kallsyms] [k] flush_signal_handlers +0.11% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Core::maintenance +0.10% [kernel.kallsyms] [k] _raw_spin_lock +0.10% [kernel.kallsyms] [k] do_madvise.part.0 0.02% +0.10% [kernel.kallsyms] [k] futex_wait_queue_me +0.09% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait_timeout +0.09% [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.32% +0.08% [kernel.kallsyms] [k] finish_task_switch.isra.0 0.19% +0.08% [kernel.kallsyms] [k] exit_to_user_mode_prepare 0.03% +0.07% [kernel.kallsyms] [k] default_do_nmi 0.08% +0.07% [kernel.kallsyms] [k] switch_fpu_return 0.08% +0.07% libc.so.6 [.] syscall +0.07% [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack 0.64% -0.06% [kernel.kallsyms] [k] mutex_unlock +0.06% [kernel.kallsyms] [k] hrtimer_try_to_cancel.part.0 +0.06% [kernel.kallsyms] [k] sysvec_call_function +0.05% [kernel.kallsyms] [k] set_pte_vaddr_p4d +0.05% [kernel.kallsyms] [k] blkcg_maybe_throttle_current +0.05% [kernel.kallsyms] [k] native_set_fixmap 0.02% +0.04% [kernel.kallsyms] [k] nmi_handle.part.0 0.30% -0.04% [kernel.kallsyms] [k] native_flush_tlb_one_user 0.07% -0.04% [kernel.kallsyms] [k] native_flush_tlb_local 0.04% -0.04% [kernel.kallsyms] [k] native_apic_msr_write 0.11% +0.03% [kernel.kallsyms] [k] __perf_event_task_sched_in 0.01% +0.03% [kernel.kallsyms] [k] intel_pmu_handle_irq +0.02% [kernel.kallsyms] [k] do_futex 0.01% +0.01% [kernel.kallsyms] [k] sched_clock 0.88% +0.01% [kernel.kallsyms] [k] try_to_wake_up +0.01% [kernel.kallsyms] [k] perf_sample_event_took 0.04% -0.01% [kernel.kallsyms] [k] intel_bts_enable_local +0.01% [kernel.kallsyms] [k] end_repeat_nmi 0.01% +0.00% [kernel.kallsyms] [k] perf_event_nmi_handler 0.04% -0.00% [kernel.kallsyms] [k] acpi_os_read_memory 2.38% [kernel.kallsyms] [k] idle_cpu 1.40% [kernel.kallsyms] [k] kmem_cache_alloc 1.33% [kernel.kallsyms] [k] copy_page 1.23% wasmtime [.] core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2abf5669ee3b7a52 1.23% [kernel.kallsyms] [k] page_mapping 1.15% [kernel.kallsyms] [k] vma_link 1.13% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 1.10% [kernel.kallsyms] [k] native_apic_msr_eoi_write 1.09% [kernel.kallsyms] [k] clear_bhb_loop 1.06% wasmtime [.] 
<std::thread::spawnhook::SpawnHooks as core::ops::drop::Drop>::drop 1.00% [kernel.kallsyms] [k] load_balance 0.89% libc.so.6 [.] __madvise 0.88% [kernel.kallsyms] [k] alloc_thread_stack_node 0.85% wasmtime [.] std::sys::thread_local::guard::key::enable::run 0.82% [kernel.kallsyms] [k] task_h_load 0.79% [message truncated]
cfallin commented on issue #12106:
There is definitely something going on with function 8: the stack frame (grep for `sub $N, %rsp` in the disassembly) is 0x130 bytes if I disable all opt rewrites (remove the call to `constructor_simplify` in the egraph pass) and 0x6b0 bytes in the unmodified run. The result is that the function body has a whole bunch more spills/reloads. Something is causing a lot more register pressure; isel is a plausible hypothesis there. I'll dig a bit more...
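(Editorial aside: a hedged sketch of that frame-size check -- the helper and sample prologue below are made up for illustration; they are not code from this issue or from Cranelift.)
```rust
// Pull the frame-allocation size out of a disassembly dump by finding the
// prologue's `sub $N, %rsp`, the same thing the grep above looks for.
// This parser is deliberately simplistic and only handles AT&T-style `sub`.
fn frame_size(disasm: &str) -> Option<u64> {
    for line in disasm.lines() {
        let Some(rest) = line.trim().strip_prefix("sub") else { continue };
        let Some(imm) = rest.trim().strip_prefix('$') else { continue };
        let Some((imm, reg)) = imm.split_once(',') else { continue };
        if reg.trim() != "%rsp" {
            continue;
        }
        let imm = imm.trim();
        return match imm.strip_prefix("0x") {
            Some(hex) => u64::from_str_radix(hex, 16).ok(),
            None => imm.parse().ok(),
        };
    }
    None
}

fn main() {
    // Hypothetical prologue; with the opt rules disabled the frame was 0x130
    // bytes, versus 0x6b0 on unmodified main.
    let prologue = "push %rbp\nmov %rsp, %rbp\nsub $0x130, %rsp\n";
    assert_eq!(frame_size(prologue), Some(0x130));
    println!("frame size: {:#x} bytes", frame_size(prologue).unwrap());
}
```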
bongjunj deleted a comment on issue #12106:
I removed the sightglass_api dependency of the blake3-scalar benchmark,
and measured jitdump/perf following https://docs.wasmtime.dev/examples-profiling-perf.html runningwasmtime runcommand for each.perf record -k mono wasmtime run --dir=. --profile=jitdump rust-benchmark/target/wasm32-wasip1/release/benchmark.wasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.dataperf diff /data/blake3-scalar-main.perf.jit.data /data/blake3-scalar-noopts.perf.jit.data # Event 'cycles:P' # # Baseline Delta Abs Shared Object Symbol # ........ ......... .................... ............................................................................................ # 12.87% +7.98% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Context::run 6.50% -4.99% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath.part.0 3.91% +4.13% [kernel.kallsyms] [k] smp_call_function_many_cond 1.13% +2.82% ld-linux-x86-64.so.2 [.] 0x0000000000009f0c 5.75% -2.52% libc.so.6 [.] 0x000000000009c724 0.57% +1.74% [kernel.kallsyms] [k] next_uptodate_page 3.66% +1.73% wasmtime [.] std::sys::sync::mutex::futex::Mutex::lock_contended 0.59% +1.62% [kernel.kallsyms] [k] __handle_mm_fault 0.44% +1.52% [kernel.kallsyms] [k] native_write_msr 1.84% -1.46% [kernel.kallsyms] [k] _find_next_bit +1.44% [kernel.kallsyms] [k] __mod_lruvec_page_state 1.47% -1.39% [kernel.kallsyms] [k] zap_pte_range +1.21% [kernel.kallsyms] [k] update_curr +1.20% [kernel.kallsyms] [k] __remove_hrtimer +1.17% [kernel.kallsyms] [k] osq_unlock 1.27% -1.16% [kernel.kallsyms] [k] futex_wait +1.13% [kernel.kallsyms] [k] unmap_page_range 0.97% +1.10% [kernel.kallsyms] [k] restore_fpregs_from_fpstate 1.21% -1.09% [kernel.kallsyms] [k] llist_reverse_order +1.08% [kernel.kallsyms] [k] do_user_addr_fault 2.85% -1.05% [kernel.kallsyms] [k] osq_lock 0.03% +0.96% [kernel.kallsyms] [k] __x64_sys_futex +0.89% [kernel.kallsyms] [k] newidle_balance +0.89% [kernel.kallsyms] [k] rwsem_mark_wake 0.40% +0.89% [kernel.kallsyms] [k] delayacct_add_tsk +0.84% [kernel.kallsyms] [k] rb_first +0.74% [kernel.kallsyms] [k] memcg_slab_post_alloc_hook +0.73% [kernel.kallsyms] [k] sync_regs +0.72% [kernel.kallsyms] [k] change_protection_range +0.72% [kernel.kallsyms] [k] __mem_cgroup_charge 1.24% -0.71% [kernel.kallsyms] [k] update_sg_wakeup_stats +0.69% [kernel.kallsyms] [k] psi_group_change +0.69% [kernel.kallsyms] [k] try_charge_memcg +0.69% [kernel.kallsyms] [k] __memcg_kmem_charge_page +0.62% [kernel.kallsyms] [k] x64_sys_call 0.07% +0.60% [kernel.kallsyms] [k] find_vma +0.60% [kernel.kallsyms] [k] kfree +0.54% [kernel.kallsyms] [k] vm_area_dup 0.19% +0.53% [kernel.kallsyms] [k] __vma_adjust 0.59% -0.53% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.52% [kernel.kallsyms] [k] __srcu_read_unlock 0.62% -0.51% [kernel.kallsyms] [k] rwsem_down_write_slowpath 0.95% -0.50% libc.so.6 [.] malloc +0.49% wasmtime [.] tokio::runtime::blocking::pool::Spawner::spawn_task 0.83% +0.48% [kernel.kallsyms] [k] perf_event_alloc +0.47% [kernel.kallsyms] [k] __alloc_pages +0.46% wasmtime [.] HUF_decompress4X2_usingDTable_internal_fast +0.46% [kernel.kallsyms] [k] update_blocked_averages +0.46% wasmtime [.] <&[u8] as object::read::read_ref::ReadRef>::read_bytes_at_until +0.46% libc.so.6 [.] cfree +0.46% [kernel.kallsyms] [k] refill_obj_stock +0.46% [kernel.kallsyms] [k] ext4_do_update_inode.isra.0 +0.46% [kernel.kallsyms] [k] __ext4_get_inode_loc +0.45% [kernel.kallsyms] [k] kmem_cache_free +0.45% [kernel.kallsyms] [k] syscall_exit_to_user_mode +0.45% jitted-438731-20.so [.] 
blake3[e80d2cadb56cf0e0]::portable::compress_in_place +0.45% [kernel.kallsyms] [k] insert_work +0.43% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0 1.33% -0.41% [kernel.kallsyms] [k] llist_add_batch 0.92% +0.35% [kernel.kallsyms] [k] native_irq_return_iret 0.62% +0.31% wasmtime [.] ZSTD_decompressSequences_bmi2.constprop.0 0.39% +0.29% [kernel.kallsyms] [k] native_sched_clock 1.45% -0.29% [kernel.kallsyms] [k] clear_page_erms 0.95% -0.24% [kernel.kallsyms] [k] handle_mm_fault 0.47% -0.22% wasmtime [.] tokio::runtime::blocking::pool::Inner::run +0.21% [kernel.kallsyms] [k] update_load_avg 0.22% -0.20% [kernel.kallsyms] [k] exit_to_user_mode_loop 3.70% -0.19% [kernel.kallsyms] [k] update_sg_lb_stats +0.19% [kernel.kallsyms] [k] find_vma_prev 0.98% -0.19% [kernel.kallsyms] [k] its_return_thunk 0.53% -0.18% [kernel.kallsyms] [k] nmi_restore +0.17% [kernel.kallsyms] [k] __rseq_handle_notify_resume +0.16% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait 0.20% +0.15% [kernel.kallsyms] [k] __get_user_8 0.08% +0.15% [kernel.kallsyms] [k] __irqentry_text_end 0.14% -0.12% [kernel.kallsyms] [k] ghes_notify_nmi +0.12% [vdso] [.] 0x00000000000006e5 +0.12% [kernel.kallsyms] [k] flush_signal_handlers +0.11% wasmtime [.] tokio::runtime::scheduler::multi_thread::worker::Core::maintenance +0.10% [kernel.kallsyms] [k] _raw_spin_lock +0.10% [kernel.kallsyms] [k] do_madvise.part.0 0.02% +0.10% [kernel.kallsyms] [k] futex_wait_queue_me +0.09% wasmtime [.] std::sys::sync::condvar::futex::Condvar::wait_timeout +0.09% [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.32% +0.08% [kernel.kallsyms] [k] finish_task_switch.isra.0 0.19% +0.08% [kernel.kallsyms] [k] exit_to_user_mode_prepare 0.03% +0.07% [kernel.kallsyms] [k] default_do_nmi 0.08% +0.07% [kernel.kallsyms] [k] switch_fpu_return 0.08% +0.07% libc.so.6 [.] syscall +0.07% [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack 0.64% -0.06% [kernel.kallsyms] [k] mutex_unlock +0.06% [kernel.kallsyms] [k] hrtimer_try_to_cancel.part.0 +0.06% [kernel.kallsyms] [k] sysvec_call_function +0.05% [kernel.kallsyms] [k] set_pte_vaddr_p4d +0.05% [kernel.kallsyms] [k] blkcg_maybe_throttle_current +0.05% [kernel.kallsyms] [k] native_set_fixmap 0.02% +0.04% [kernel.kallsyms] [k] nmi_handle.part.0 0.30% -0.04% [kernel.kallsyms] [k] native_flush_tlb_one_user 0.07% -0.04% [kernel.kallsyms] [k] native_flush_tlb_local 0.04% -0.04% [kernel.kallsyms] [k] native_apic_msr_write 0.11% +0.03% [kernel.kallsyms] [k] __perf_event_task_sched_in 0.01% +0.03% [kernel.kallsyms] [k] intel_pmu_handle_irq +0.02% [kernel.kallsyms] [k] do_futex 0.01% +0.01% [kernel.kallsyms] [k] sched_clock 0.88% +0.01% [kernel.kallsyms] [k] try_to_wake_up +0.01% [kernel.kallsyms] [k] perf_sample_event_took 0.04% -0.01% [kernel.kallsyms] [k] intel_bts_enable_local +0.01% [kernel.kallsyms] [k] end_repeat_nmi 0.01% +0.00% [kernel.kallsyms] [k] perf_event_nmi_handler 0.04% -0.00% [kernel.kallsyms] [k] acpi_os_read_memory 2.38% [kernel.kallsyms] [k] idle_cpu 1.40% [kernel.kallsyms] [k] kmem_cache_alloc 1.33% [kernel.kallsyms] [k] copy_page 1.23% wasmtime [.] core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h2abf5669ee3b7a52 1.23% [kernel.kallsyms] [k] page_mapping 1.15% [kernel.kallsyms] [k] vma_link 1.13% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 1.10% [kernel.kallsyms] [k] native_apic_msr_eoi_write 1.09% [kernel.kallsyms] [k] clear_bhb_loop 1.06% wasmtime [.] 
<std::thread::spawnhook::SpawnHooks as core::ops::drop::Drop>::drop 1.00% [kernel.kallsyms] [k] load_balance 0.89% libc.so.6 [.] __madvise 0.88% [kernel.kallsyms] [k] alloc_thread_stack_node 0.85% wasmtime [.] std::sys::thread_local::guard::key::enable::run 0.82% [kernel.kallsyms] [k] task_h_load 0.7 [message truncated]
cfallin commented on issue #12106:
Here are the no-opt-rewrites (but still GVN/alias/LICM) and full `main`-branch opt CLIF dumps for function 8 (`wasmtime compile --emit-clif dir foo.wat`, then take `wasm[0]--function[8].clif` from the directory). The "optimized" variant is about twice as long (!).
There's a repeating rewrite pattern that seems to be a sort of reassociation of a 3-input add sequence -- in the baseline we have
```
v34 = iadd v27, v33
... [ lots of intermediate insts ] ...
v43 = iadd v34, v42
v44 = ... v43 ...
```
and in opt we have
```
... [ lots of intermediate insts ] ...
v1242 = iadd v33, v42
v1243 = iadd v27, v1242
v44 = ... v1243 ...
```
So we've (i) needlessly reassociated `(v27 + v33) + v42` into `v27 + (v33 + v42)` (none of these are iconsts) and (ii) sunk one of the adds. The former is weird but "shrug". The latter is painful for register pressure: it means we have three live values over "lots of intermediate insts" rather than two. Better inst scheduling integrated with regalloc would solve that (but that's an NP-hard problem about which at least one PhD thesis has been written); or alternately, we could "not fix what's not broken" and just avoid the rewrite.
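(Editorial illustration of the live-range point above -- a toy Rust model, not Cranelift code; it assumes v42 is produced somewhere inside the gap.)
```rust
use std::collections::BTreeSet;

/// Values live at the bottom of the gap: defined at or above that point and
/// still needed by the instructions below it.
fn live_at_gap_end<'a>(defined_above: &[&'a str], used_below: &[&'a str]) -> BTreeSet<&'a str> {
    let defined: BTreeSet<_> = defined_above.iter().copied().collect();
    used_below.iter().copied().filter(|v| defined.contains(v)).collect()
}

fn main() {
    // Baseline: v34 = iadd v27, v33 is computed *before* the gap, so only v34
    // (plus v42, produced inside the gap) survives to feed v43 = iadd v34, v42.
    let baseline = live_at_gap_end(&["v27", "v33", "v34", "v42"], &["v34", "v42"]);
    // Rewritten: both adds are sunk below the gap (v1242 = iadd v33, v42;
    // v1243 = iadd v27, v1242), so v27 and v33 must also survive it.
    let rewritten = live_at_gap_end(&["v27", "v33", "v42"], &["v27", "v33", "v42"]);
    println!("baseline:  {} live values across the gap {:?}", baseline.len(), baseline);   // 2
    println!("rewritten: {} live values across the gap {:?}", rewritten.len(), rewritten); // 3
}
```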
The reassociation seems to go more haywire further down; looking at a diff (`diff -u base.clif opt.clif`), things are more-or-less 1-for-1 trades early on, but lots of extra `iadd`s later.
I'll dig into the reassociation rules we have and see if I can figure out what's happening!
cfallin commented on issue #12106:
OK, it seems that the reassociation rules added in #7466 are the issue. If I comment those out, I get this CLIF, and the compiled function body has a stack frame size of 0x130 bytes again.
cc @fitzgen -- it looks like there wasn't any data posted on that PR; did you have benchmarking results at the time? Mind benchmarking with/without those rules (and I guess the similar followups in #8518 as well)?
alexcrichton commented on issue #12106:
> Can you share your workflow for using jitdump and perf?
I applied this diff:
```diff
diff --git a/crates/bench-api/src/lib.rs b/crates/bench-api/src/lib.rs
index 86c99c8c98..8a5180ca5e 100644
--- a/crates/bench-api/src/lib.rs
+++ b/crates/bench-api/src/lib.rs
@@ -423,6 +423,7 @@ impl BenchState {
         let mut config = options.config(None)?;
         // NB: always disable the compilation cache.
         config.cache(None);
+        config.profiler(wasmtime::ProfilingStrategy::JitDump);
         let engine = Engine::new(&config)?;
         let mut linker = Linker::<HostState>::new(&engine);
```
and then I ran:
```
$ taskset 0x1 perf record -k mono ./target/release/sightglass-cli benchmark -e ./less-isle.so --benchmark-phase execution ./benchmarks/blake3-simd/benchmark.wasm --iterations-per-process 100000
```
After a while I ctrl-c'd it and followed these instructions to process the profile.
fitzgen commented on issue #12106:
Reverted the offending rewrite in https://github.com/bytecodealliance/wasmtime/pull/12116 -- could someone re-run these benchmarks on `main` and report whether or not everything is now resolved? Thanks!
cfallin commented on issue #12106:
Here are the benchmarks that had the biggest drops from no-rules to rules above:
```
execution :: cycles :: benchmarks/shootout/shootout-keccak.wasm
  Δ = 4097364.22 ± 837090.89 (confidence = 99%)
  no-rules.so is 1.14x to 1.21x faster than main.so!
  [26507651 27605359.73 36644170] main.so
  [22595511 23507995.51 29631085] no-rules.so

execution :: cycles :: benchmarks/shootout/shootout-xchacha20.wasm
  Δ = 16237.66 ± 10867.13 (confidence = 99%)
  main.so is 1.00x to 1.01x faster than no-rules.so!
  [2634524 2662033.25 2766405] main.so
  [2645145 2678270.91 2835592] no-rules.so

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  No difference in performance.
  [207174 209371.73 231598] main.so
  [207346 210394.27 257140] no-rules.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm
  No difference in performance.
  [208507 212511.59 232974] main.so
  [208421 212331.42 290723] no-rules.so
```
So `keccak` is still weird, but "only" 14-21% slower rather than 48%. The giant drops in blake3-* went away entirely.
alexcrichton commented on issue #12106:
I also see (consistently) a 20% slowdown of shootout-keccak. That is, deleting all the ISLE bits makes shootout-keccak faster. Local profiling shows that in both profiles 98% of the time is spent in the `keccak_core` function, which looks similar to the blake3 one.
bongjunj commented on issue #12106:
Performance evaluation (Execution) after removing the reassoc rules:
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-keccak | 25,157,389 | 29,537,793 | -14.83% |
| shootout-switch | 139,546,912 | 153,616,699 | -9.16% |
| shootout-ratelimit | 39,101,368 | 39,894,279 | -1.99% |
| shootout-gimli | 5,360,983 | 5,418,553 | -1.06% |
| shootout-xchacha20 | 4,376,708 | 4,389,354 | -0.29% |
| regex | 209,934,736 | 210,346,186 | -0.20% |
| shootout-memmove | 36,313,845 | 36,325,481 | -0.03% |
| shootout-fib2 | 3,009,938,907 | 3,010,271,487 | -0.01% |
| shootout-matrix | 540,314,088 | 540,332,499 | 0.00% |
| shootout-heapsort | 2,378,752,173 | 2,374,493,320 | 0.18% |
| spidermonkey | 635,879,627 | 634,094,137 | 0.28% |
| blake3-scalar | 318,837 | 316,866 | 0.62% |
| bz2 | 87,275,956 | 86,405,779 | 1.01% |
| pulldown-cmark | 6,729,037 | 6,629,027 | 1.51% |
| shootout-ed25519 | 9,583,751,426 | 9,412,570,875 | 1.82% |
| shootout-xblabla20 | 2,974,201 | 2,902,069 | 2.49% |
| blake3-simd | 313,009 | 305,049 | 2.61% |
| shootout-seqhash | 8,971,675,191 | 8,663,650,423 | 3.56% |
| shootout-nestedloop | 468 | 451 | 3.89% |
| shootout-ctype | 830,650,139 | 796,881,603 | 4.24% |
| shootout-sieve | 905,646,691 | 841,504,133 | 7.62% |
| shootout-base64 | 381,752,475 | 352,928,420 | 8.17% |
| shootout-ackermann | 8,647,754 | 7,767,390 | 11.33% |
| shootout-minicsv | 1,475,692,343 | 1,306,403,978 | 12.96% |
| shootout-random | 630,367,818 | 439,901,877 | 43.30% |
bongjunj commented on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch):
```diff
-@4cfc v6912 = bxor v6792, v158 ; v158 = -1
-@4cfd v6913 = band v6775, v6912
+ v10066 = bnot v6787
+ v10067 = bnot v6789
+ v10068 = band v10066, v10067
+ v10070 = bor v10068, v6991 ; v6991 = -4294967296
+@4cfd v6913 = band v6775, v10070
 @4d00 v6914 = bxor v6913, v6803
 @4d01 store little heap v6914, v105
-@4d0c v6921 = bxor v6823, v158 ; v158 = -1
-@4d0d v6922 = band v6656, v6921
+ v10075 = bnot v6818
+ v10076 = bnot v6820
+ v10077 = band v10075, v10076
+ v10079 = bor v10077, v6991 ; v6991 = -4294967296
+@4d0d v6922 = band v6656, v10079
 @4d10 v6923 = bxor v6922, v6631
 @4d11 store little heap v6923, v124
-@4d1c v6930 = bxor v6841, v158 ; v158 = -1
-@4d1d v6931 = band v6830, v6930
+ v10084 = bnot v6836
+ v10085 = bnot v6838
+ v10086 = band v10084, v10085
+ v10088 = bor v10086, v6991 ; v6991 = -4294967296
+@4d1d v6931 = band v6830, v10088
```
bongjunj edited a comment on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch) in the keccak core function, once the instruction count exceeds 1,000.
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
bongjunj edited a comment on issue #12106:
In keccak, the following (de-)optimizing pattern is repeated going from the no-opt version to base (the main branch) in the keccak core function, once the instruction count exceeds 1,000.
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
This transformation is defined at https://github.com/bytecodealliance/wasmtime/blob/058a9bc16bc204fec05993ce6a0b013cc3db55e6/cranelift/codegen/src/opts/bitops.isle#L44-L50
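(Editorial note: the rewrite itself is semantically sound -- the problem discussed below is cost, not correctness. A standalone Rust check of the DeMorgan identities the rule is based on; this is only an illustration, not the ISLE rule.)
```rust
// Sanity-check the DeMorgan identities behind the bnot/band/bor rewrite above:
//   !(x & y) == !x | !y   and   !(x | y) == !x & !y
fn main() {
    let samples: [u64; 5] = [
        0,
        u64::MAX,
        0xffff_ffff,           // 32-bit mask seen in the surrounding CLIF
        0xffff_ffff_0000_0000, // -4294967296, the bor constant in the rewritten form
        0x0123_4567_89ab_cdef,
    ];
    for &x in &samples {
        for &y in &samples {
            assert_eq!(!(x & y), !x | !y);
            assert_eq!(!(x | y), !x & !y);
        }
    }
    println!("DeMorgan identities hold for all sampled values");
}
```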
bongjunj commented on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486.30 | 153696081.5 | -9.20% |
| shootout-ratelimit | 39,173,257.16 | 39952950.56 | -1.95% |
| shootout-xchacha20 | 4,380,029.16 | 4444075.78 | -1.44% |
| regex | 210,231,630.92 | 212953957.4 | -1.28% |
| shootout-gimli | 5,345,271.14 | 5404313.96 | -1.09% |
| shootout-fib2 | 3,010,356,555.06 | 3010366675 | 0.00% |
| shootout-matrix | 539,106,883.22 | 538754055.2 | 0.07% |
| shootout-heapsort | 2,378,812,047.98 | 2375114650 | 0.16% |
| spidermonkey | 636,143,609.88 | 633256160.3 | 0.46% |
| bz2 | 87,099,312.82 | 86442047.12 | 0.76% |
| shootout-memmove | 36,569,713.06 | 36246023.48 | 0.89% |
| blake3-scalar | 319,175.94 | 316023.92 | 1.00% |
| pulldown-cmark | 6,741,922.18 | 6632833.42 | 1.64% |
| shootout-ed25519 | 9,569,702,876.44 | 9402544055 | 1.78% |
| shootout-nestedloop | 504.38 | 494.38 | 2.02% |
| blake3-simd | 313,143.94 | 305664.92 | 2.45% |
| shootout-xblabla20 | 3,011,040.94 | 2926801.38 | 2.88% |
| shootout-seqhash | 8,868,767,466.20 | 8580074421 | 3.36% |
| shootout-ctype | 830,871,263.52 | 796659320 | 4.29% |
| shootout-sieve | 906,201,119.02 | 841387362.9 | 7.70% |
| shootout-base64 | 381,928,675.18 | 352966325.4 | 8.21% |
| shootout-ackermann | 8,657,031.60 | 7770921.4 | 11.40% |
| shootout-minicsv | 1,478,013,885.20 | 1292314209 | 14.37% |
| shootout-keccak | 25,147,664.38 | 21106966.72 | 19.14% |
| shootout-random | 630,552,185.72 | 439772361.1 | 43.38% |
bongjunj edited a comment on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486 | 153,696,081 | -9.20% |
| shootout-ratelimit | 39,173,257 | 39,952,951 | -1.95% |
| shootout-xchacha20 | 4,380,029 | 4,444,076 | -1.44% |
| regex | 210,231,631 | 212,953,957 | -1.28% |
| shootout-gimli | 5,345,271 | 5,404,314 | -1.09% |
| shootout-fib2 | 3,010,356,555 | 3,010,366,675 | 0.00% |
| shootout-matrix | 539,106,883 | 538,754,055 | 0.07% |
| shootout-heapsort | 2,378,812,048 | 2,375,114,650 | 0.16% |
| spidermonkey | 636,143,610 | 633,256,160 | 0.46% |
| bz2 | 87,099,313 | 86,442,047 | 0.76% |
| shootout-memmove | 36,569,713 | 36,246,023 | 0.89% |
| blake3-scalar | 319,176 | 316,024 | 1.00% |
| pulldown-cmark | 6,741,922 | 6,632,833 | 1.64% |
| shootout-ed25519 | 9,569,702,876 | 9,402,544,055 | 1.78% |
| shootout-nestedloop | 504 | 494 | 2.02% |
| blake3-simd | 313,144 | 305,665 | 2.45% |
| shootout-xblabla20 | 3,011,041 | 2,926,801 | 2.88% |
| shootout-seqhash | 8,868,767,466 | 8,580,074,421 | 3.36% |
| shootout-ctype | 830,871,264 | 796,659,320 | 4.29% |
| shootout-sieve | 906,201,119 | 841,387,363 | 7.70% |
| shootout-base64 | 381,928,675 | 352,966,325 | 8.21% |
| shootout-ackermann | 8,657,032 | 7,770,921 | 11.40% |
| shootout-minicsv | 1,478,013,885 | 1,292,314,209 | 14.37% |
| shootout-keccak | 25,147,664 | 21,106,967 | 19.14% |
| shootout-random | 630,552,186 | 439,772,361 | 43.38% |
bongjunj edited a comment on issue #12106:
With the DeMorgan rules removed, the performance degradations are almost gone, except for the "switch" case.
keccak enjoys a 19.14% execution performance improvement.
| Benchmark | No Opts | Base | Speedup |
|---|---|---|---|
| shootout-switch | 139,551,486 | 153,696,081 | -9.20% |
| shootout-ratelimit | 39,173,257 | 39,952,951 | -1.95% |
| shootout-xchacha20 | 4,380,029 | 4,444,076 | -1.44% |
| regex | 210,231,631 | 212,953,957 | -1.28% |
| shootout-gimli | 5,345,271 | 5,404,314 | -1.09% |
| shootout-fib2 | 3,010,356,555 | 3,010,366,675 | 0.00% |
| shootout-matrix | 539,106,883 | 538,754,055 | 0.07% |
| shootout-heapsort | 2,378,812,048 | 2,375,114,650 | 0.16% |
| spidermonkey | 636,143,610 | 633,256,160 | 0.46% |
| bz2 | 87,099,313 | 86,442,047 | 0.76% |
| shootout-memmove | 36,569,713 | 36,246,023 | 0.89% |
| blake3-scalar | 319,176 | 316,024 | 1.00% |
| pulldown-cmark | 6,741,922 | 6,632,833 | 1.64% |
| shootout-ed25519 | 9,569,702,876 | 9,402,544,055 | 1.78% |
| shootout-nestedloop | 504 | 494 | 2.02% |
| blake3-simd | 313,144 | 305,665 | 2.45% |
| shootout-xblabla20 | 3,011,041 | 2,926,801 | 2.88% |
| shootout-seqhash | 8,868,767,466 | 8,580,074,421 | 3.36% |
| shootout-ctype | 830,871,264 | 796,659,320 | 4.29% |
| shootout-sieve | 906,201,119 | 841,387,363 | 7.70% |
| shootout-base64 | 381,928,675 | 352,966,325 | 8.21% |
| shootout-ackermann | 8,657,032 | 7,770,921 | 11.40% |
| shootout-minicsv | 1,478,013,885 | 1,292,314,209 | 14.37% |
| shootout-keccak | 25,147,664 | 21,106,967 | 19.14% |
| shootout-random | 630,552,186 | 439,772,361 | 43.38% |
bongjunj commented on issue #12106:
Unlike the addition-reassoc rule, the DeMorgan rule does not use `subsume`, which destructively rewrites instructions.
How, then, does the DeMorgan form win over the non-optimized form? I thought the most cost-efficient instruction sequence (or at least one chosen by a good performance heuristic) would be selected.
cfallin commented on issue #12106:
> I thought the most cost-efficient instruction sequence (or at least one chosen by a good performance heuristic) would be selected.
That's a good next question to investigate! You could get the egraph with multiple value representations from the log and look at the cost of each form. Our cost function is still pretty basic right now, and there hasn't been, e.g., an effort to try to correlate to real machines; only a relative ranking of opcode classes. Remember also that we don't do any sort of reasoning about amortized/shared cost of computing values used multiple times, nor do we integrate with regalloc to reason about register pressure (both are, as noted above, NP-hard problems).
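(Editorial sketch of what "a relative ranking of opcode classes" looks like in spirit -- the groupings and numbers below are invented for illustration and are not the real cranelift-codegen cost table.)
```rust
// Invented opcode-class costs; only the relative ordering is meant to matter.
fn opcode_class_cost(opcode: &str) -> u32 {
    match opcode {
        "iconst" => 0,                                   // constants: essentially free
        "band" | "bor" | "bxor" | "bnot" | "iadd" => 1,  // simple ALU ops
        "imul" => 4,                                     // multiplies are pricier
        "udiv" | "sdiv" => 32,                           // divides are much pricier
        _ => 1,
    }
}

fn main() {
    // Summing per-op costs (with no notion of sharing) for the two shapes in the
    // keccak diff: the original bxor-with-(-1) versus the bnot/bnot/band/bor chain.
    let original: u32 = ["bxor"].iter().copied().map(opcode_class_cost).sum();
    let rewritten: u32 = ["bnot", "bnot", "band", "bor"].iter().copied().map(opcode_class_cost).sum();
    println!("original form: {original}, rewritten form: {rewritten}");
}
```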
bongjunj commented on issue #12106:
There is strong evidence that the cost model has a bug when the input function is large.
```diff
--- keccak-no-opts.clif 2025-12-04 08:22:28.715675257 +0000
+++ keccak-main.clif 2025-12-04 08:18:25.654976355 +0000
@@ -116,10 +116,10 @@
 @0244 v153 = bxor v152, v67
 @0245 v154 = iconst.i64 19
 @0247 v155 = ushr v153, v154 ; v154 = 19
-@024e v157 = band v155, v40 ; v40 = 0xffff_ffff
-@0251 v158 = iconst.i64 -1
-@0253 v159 = bxor v157, v158 ; v158 = -1
-@0254 v160 = band v87, v159
+ v6989 = bnot v155
+ v6991 = iconst.i64 -4294967296
+ v6992 = bor v6989, v6991 ; v6991 = -4294967296
+@0254 v160 = band v87, v6992
```
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
First, the deoptimizing code pattern only appears once the instruction count exceeds a certain point. Such deoptimization does not occur near the beginning of the function.
Second, the cost model selects the optimized pattern as intended near the beginning of the function. For example, it works well for v160. However, as cost "accumulates", the costs eventually all become "infinite" (see v1562 below). Thus, the cost model cannot choose the optimized pattern and ends up generating the undesired instruction sequence.
```
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v160 = Cost::Finite { op_cost: 100, depth: 12 }
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> best for eclass v1562: BestEntry(Cost::Infinite, v1562)
```
bongjunj commented on issue #12106:
Seems like this is the case.
bongjunj edited a comment on issue #12106:
Seems like this is the case. The README.md mentioned in the comment says "Rules should not rewrite to worse code"; however, the DeMorgan rules do generate worse code.
Now, we have the following options:
- update the cost model to take this possibility into account, though this can make a large diff in the generated code we are currently testing, or
- simply delete the rules to uphold the contract mentioned in the README.md (+ enforce the contract by having the ISLE meta-compiler check that the cost "strictly" decreases).
bongjunj edited a comment on issue #12106:
There is strong evidence that the cost model has a bug when the input function is large.
```diff
--- keccak-no-opts.clif 2025-12-04 08:22:28.715675257 +0000
+++ keccak-main.clif 2025-12-04 08:18:25.654976355 +0000
@@ -116,10 +116,10 @@
 @0244 v153 = bxor v152, v67
 @0245 v154 = iconst.i64 19
 @0247 v155 = ushr v153, v154 ; v154 = 19
-@024e v157 = band v155, v40 ; v40 = 0xffff_ffff
-@0251 v158 = iconst.i64 -1
-@0253 v159 = bxor v157, v158 ; v158 = -1
-@0254 v160 = band v87, v159
+ v6989 = bnot v155
+ v6991 = iconst.i64 -4294967296
+ v6992 = bor v6989, v6991 ; v6991 = -4294967296
+@0254 v160 = band v87, v6992
```
```diff
-@11dd v1561 = bxor v1220, v158 ; v158 = -1
-@11de v1562 = band v1430, v1561
+ v7626 = bnot v1215
+ v7627 = bnot v1217
+ v7628 = band v7626, v7627
+ v7630 = bor v7628, v6991 ; v6991 = -4294967296
+@11de v1562 = band v1430, v7630
```
First, the deoptimizing code pattern only appears once the instruction count exceeds a certain point. For example, at code point `@0254` the optimized code is okay, but at `@11de` deoptimization happens.
Second, the cost model selects the optimized code as intended near the beginning of the function. For example, it works well for v160. However, as cost "accumulates", the costs eventually all become "infinite" (see v1562 below). Thus, the cost model cannot choose the optimized pattern and ends up generating the undesired instruction sequence.
```
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v160 = Cost::Finite { op_cost: 100, depth: 12 }
...
[2025-12-05T02:12:28Z TRACE cranelift_codegen::egraph::elaborate] -> best for eclass v1562: BestEntry(Cost::Infinite, v1562)
```
bongjunj edited a comment on issue #12106:
It seems like the cost model is causing a problem with the canonicalizing rules when selecting instructions in large functions. The README.md mentioned in the comment says "Rules should not rewrite to worse code"; however, the DeMorgan rules do generate worse code.
Now, we have the following options:
- update the cost model to take this possibility into account, though this can make a large diff in the generated code we are currently testing, or
- simply delete the rules to uphold the contract mentioned in the README.md (+ enforce the contract by having the ISLE meta-compiler check that the cost "strictly" decreases).
cfallin commented on issue #12106:
Fascinating. In general cost will only accumulate if there are very long chains, unbroken by any blockparams. I guess we're seeing exactly that in some benchmarks.
If that's the case, I agree we should delete rules that don't have a clear directionality.
I don't see a pairwise comparison above of all-opts versus all-opts-minus-DeMorgan. I want to make sure that they don't show improvements on other benchmarks that we'd be discarding. Could you do that run as well? Thanks!
fitzgen commented on issue #12106:
I wouldn't have expected that we would actually hit `Cost::Infinity` except in very rare cases. It makes me wonder if we should instead investigate widening `Cost`.
fitzgen commented on issue #12106:
Wait, we have 24 bits for the op cost, so we shouldn't be hitting infinity until we have on the order of millions deep chains of ops.
It seems like something else funky might be going on here.
fitzgen commented on issue #12106:
In `keccak`, the largest value is `v43210`. Our largest individual op cost is 64.
> Wait, we have 24 bits for the op cost, so we shouldn't be hitting infinity until we have on the order of millions deep chains of ops.
But I guess actually it isn't chains but the whole tree, which makes hitting infinite cost more understandable, but still not ideal because I suspect it does not accurately reflect the real costs here.
This is because `Cost` doesn't take into account shared structure. For example, `iadd v0, v0` should be `cost_of_opcode(iadd) + cost_of_operand(v0)`, but the current code will compute `cost_of_opcode(iadd) + cost_of_operand(v0) + cost_of_operand(v0)`. It shouldn't count the cost of `v0` twice, but it will. Now, fixing this particular case, where operands are repeated in the same operation, would be relatively straightforward. But fixing the general case, where `v0` is never repeated within a single operation but is repeated at multiple levels of operations in the same expression, is a bit more difficult.
We could try to maintain, in the cost, the transitive set of all values that contributed to it, and only add new cost for new values (although this maybe doesn't accurately reflect register pressure and spills for very large expressions). That could be done in a perhaps-not-horrendous manner via immutable persistent sets so that we aren't completely blowing up memory usage. But I suspect it would still be a large slowdown to our best-cost computation.
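(Editorial sketch of the blow-up -- self-contained Rust, not cranelift-codegen's actual `Cost` code; the 24-bit op-cost budget is the figure quoted above.)
```rust
// Cost of value i is 1 + cost(operand a) + cost(operand b), computed bottom-up
// by summing operand costs, so a shared operand is re-counted through every
// user's sum -- there is no deduplication of shared structure.
fn main() {
    const OP_COST_BITS: u32 = 24;              // { op: u24, depth: u8 }
    let budget: u64 = (1 << OP_COST_BITS) - 1; // beyond this, effectively infinite

    // A "ladder" DAG with heavy sharing, roughly like a long hash round:
    // v[i] = op(v[i-1], v[i-2]); v0 and v1 are leaves of cost 0.
    let n = 64;
    let mut cost = vec![0u64; n];
    for i in 2..n {
        cost[i] = 1 + cost[i - 1] + cost[i - 2]; // summed, never deduplicated
    }

    println!("summed cost of v{}: {}", n - 1, cost[n - 1]);
    match cost.iter().position(|&c| c > budget) {
        Some(i) => println!("summed cost first exceeds the 24-bit budget at v{i}"),
        None => println!("summed cost stays within the 24-bit budget"),
    }
    // The deduplicated cost (each of the n - 2 = 62 ops counted once) is only 62,
    // yet the summed cost blows past the 24-bit budget after a few dozen values.
}
```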
fitzgen commented on issue #12106:
So yeah, this leads me back towards:
> It makes me wonder if we should instead investigate widening `Cost`.
We could probably do this without growing the size of `Cost` by just using fewer bits for depth and more for the total opcode cost.
fitzgen commented on issue #12106:
(Maybe we should even stop considering depth as part of `Cost`?)
Anyways, I'm doing a quick experiment to see the difference when we switch `Cost` from `{ op: u24, depth: u8 }` to `{ op: u28, depth: u4 }`, to see if this frees up enough bits such that we don't hit infinite cost here.
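(Editorial arithmetic on those two layouts -- the field widths are the ones named above and the max per-op cost of 64 is the figure mentioned earlier in the thread; this is not the real `Cost` type.)
```rust
// How much summed op cost each layout can hold before saturating to "infinite".
fn main() {
    for (layout, op_bits) in [("{ op: u24, depth: u8 }", 24u32), ("{ op: u28, depth: u4 }", 28)] {
        let budget: u64 = (1u64 << op_bits) - 1;
        let chain_at_max_op_cost = budget / 64;
        println!(
            "{layout}: op cost saturates at {budget} \
             (a pure chain of max-cost-64 ops saturates after ~{chain_at_max_op_cost} ops)"
        );
    }
}
```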
fitzgen commented on issue #12106:
It seems like `keccak` has enough shared structure that moving more bits from depth to op actually makes more eclasses have infinite cost (that is, they saturate op cost before saturating depth, so lowering the depth bits just makes more things saturate both op and depth, i.e. be infinite).

| `Cost` layout | total # eclasses in keccak | # infinite-cost eclasses in keccak | % infinite-cost eclasses in keccak |
|---|---|---|---|
| `{ op: u32, depth: u0 }` | 43211 | 31474 | 1.37% |
| `{ op: u28, depth: u4 }` | 43211 | 33057 | 1.31% |
| `{ op: u24, depth: u8 }` (today's layout) | 43211 | 34704 | 1.25% |

Increasing the number of `depth` bits would probably make fewer eclasses have infinite cost, but we would also be extracting a larger percentage of them purely based on depth (since more of them would saturate the op portion of cost), which doesn't seem desirable.
The other thing we could do is move `Cost` from bitpacking into a `u32` to bitpacking into a `u64` or even a `u128`, but this would only add support for on the order of 32 and 96 (respectively) deeper expressions before hitting infinite cost again, which isn't very much. (Assuming we don't attempt to start representing shared structure in `Cost`.)
So, at least in the short term, removing the DeMorgan rule (https://github.com/bytecodealliance/wasmtime/pull/12127) is probably the best incremental/practical step forward, even though I was initially skeptical.
Perhaps we should also be breaking cost ties with a bias towards the original input, on the assumption that the producer knows better than us if we can't tell whether our rewrites are actually improvements or not. But looking at the code, it does seem like we take the minimum `BestEntry`, which, if `Cost` is tied, means we will in fact break ties with lower-numbered values (i.e. biasing towards the originals). So we are already doing this.
Fundamentally improving things here will require a cost function that understands shared structure, which, unless someone has an idea I haven't thought up yet, is almost definitely too expensive for the niche that Cranelift sits in.
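(Tiny editorial illustration of that tie-break, modelling `BestEntry` as a (cost, value-number) pair; the numbers are made up.)
```rust
// Taking the minimum of (cost, value number) pairs breaks cost ties toward the
// lower value number, i.e. toward the earlier-created/original value.
fn main() {
    let entries = [(100u32, 1562u32), (100, 7630), (120, 42)];
    let best = entries.iter().min().unwrap();
    assert_eq!(*best, (100, 1562)); // tie at cost 100 resolved in favour of v1562
    println!("best entry: cost {}, v{}", best.0, best.1);
}
```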
cfallin commented on issue #12106:
Yes, in particular because there seem to be no major regressions without the rules, I think the choice here is easy. It is unfortunate that we can't lean more heavily on cost optimization but (to write it down explicitly) doing cost optimization with shared structure is as hard as the weighted set-cover problem (that is, the latter can be reduced to a graph with shared structure on which we want to pick the lowest-cost implementations of terminals), so it's NP-hard. That doesn't mean there isn't a good approximation algorithm of course, but we won't solve that in this PR I don't think!
fitzgen commented on issue #12106:
> That doesn't mean there isn't a good approximation algorithm of course,
Yeah, I suspect the trouble really begins when you want to make one expression in the current eclass have cheaper cost because we've already committed to computing some of its sub-expressions via choosing them as part of the best version of a different eclass.
If we limit ourselves to just the shared structure within a single expression, in isolation from the best cost/expression we computed for different eclasses (i.e. only adding the cost of operand `v123` once when computing the cost of a particular expression, ignoring what was chosen as best for other eclasses), then things seem pretty tractable, although still too expensive for Cranelift.
> but we won't solve that in this PR I don't think!
Right, which is why I'm commenting on the issue, not the PR :-p
cfallin closed issue #12106:
Hi,
I performance-tested a no-opt version of Cranelift without ISLE mid-end optimizations. (You can confirm this at https://github.com/prosyslab/wasmtime/commit/c0df585025897098e83703cca343e139ed30a119)
Compared to the latest upstream Cranelift, the no-opt version surprisingly produced significantly faster x86_64 code for blake3, keccak, and xchacha20. This experiment was conducted using sightglass-cli.
Given that only mid-end rules are removed, some codegen backend might be interacting badly with the mid-end. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
```
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 502562.18 ± 4271.59 (confidence = 99%)
  no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
  [816778 823363.48 880808] main/libengine.so
  [315506 320801.30 460046] no-opts/libengine.so

compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
  no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
  [310860540 313559414.86 330384818] main/libengine.so
  [260245364 262186933.88 278160088] no-opts/libengine.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
  Δ = 17444.42 ± 10445.70 (confidence = 99%)
  no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
  [90902 137381.98 258206] main/libengine.so
  [87766 119937.56 198140] no-opts/libengine.so
```
Plus, here are some data for other benchmarks.
`--iterations-per-process 10 --benchmark-phase execution --pin` is used.
| bench | v-no-opts | base | speedup |
|---|---|---|---|
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
cfallin commented on issue #12106:
(D'oh, sorry, too many tabs open -- yes, this is indeed the issue and not the PR :-) )
It seems there are at least two possible directions for approximations to true cost minimization:
- We could divide/amortize the cost by the number of users of any given value. This would require a little care with the union nodes -- we would want to "see through" the union-spine to the whole eclass and consider it as one. But e.g. if 100 users are asking for a value of cost 100, or alternately each uses its own input of cost 2, this would do the right thing.
- This maintains the phased aspect (all costs computable before we pick winners) but feels "more approximate"
- We could dynamically update costs as min-cost values are picked for classes -- so e.g. if I've already decided to use a value, then all subsequent uses get to use it for (additional/marginal) cost 0. This one is tricky because it requires mixing together the cost computation and min-cost choices.
@bongjunj if you're looking for interesting concrete research problems, this one might be a good one to play with!
cfallin edited a comment on issue #12106:
(D'oh, sorry, too many tabs open -- yes, this is indeed the issue and not the PR :-) )
It seems there are at least two possible directions for approximations to true cost minimization:
- We could divide/amortize the cost by the number of users of any given value. This would require a little care with the union nodes -- we would want to "see through" the union-spine to the whole eclass and consider it as one. But e.g. if 100 users are asking for a value of cost 100, or alternately each uses its own input of cost 2, this would do the right thing.
- This maintains the phased aspect (all costs computable before we pick winners) but feels "more approximate"
- We could dynamically update costs as min-cost values are picked for classes -- so e.g. if I've already decided to use a value, then all subsequent uses get to use it for (additional/marginal) cost 0. This one is tricky because it requires mixing together the cost computation and min-cost choices.
@bongjunj if you're looking for interesting concrete research problems, this one might be a good one to play with!
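(Editorial sketch of the first, amortizing direction above, using the 100-users / cost-100 / cost-2 numbers from that comment; the helper is hypothetical, not Cranelift code.)
```rust
// Divide a candidate's cost by its number of users before comparing.
fn amortized_cost(total_cost: f64, num_users: usize) -> f64 {
    total_cost / num_users.max(1) as f64
}

fn main() {
    // One shared value of cost 100 with 100 users, versus each of those users
    // computing its own private input of cost 2: amortization prefers sharing.
    let shared_per_use = amortized_cost(100.0, 100); // 1.0 per use
    let private_per_use = amortized_cost(2.0, 1);    // 2.0 per use
    assert!(shared_per_use < private_per_use);
    println!("shared: {shared_per_use} per use, private: {private_per_use} per use");
}
```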
fitzgen commented on issue #12106:
> Now, fixing this particular case, where operands are repeated in the same operation, would be relatively straightforward.
FWIW, I hacked this trivial-case fix together to get numbers, and it did not change the infinite-cost-eclass count in `keccak` at all.
fitzgen commented on issue #12106:
> - We could divide/amortize the cost by the number of users of any given value.
The tricky bit here is that if these are "uncommitted" uses (i.e. incoming edges for each enode) rather than "committed" uses (i.e. uses from enodes that we chose as the best in their eclass) then the use count is kind of useless. There might be a bunch of uses inside the egraph, making the enode's cost cheap, but then if we don't actually extract those uses then it was all just a lie. How bad this will be in practice depends on the size of the saturated vs initial egraph.
On the other hand, if these are "committed" uses, then we wouldn't maintain the phasing, which leads to things getting really tricky, as you noted.
If we aren't maintaining phasing, then we could, for example, do something like
```
for inst in skeleton {
    compute_best_costs();
    elaborate_inst(inst);
    // Update costs for values we committed to computing so that the next iteration
    // of the loop can recompute best costs to bias towards reusing already-computed
    // values.
    for value in inst.operands().flat_map(|op| transitive_values(op)) {
        set_cost(value, 0);
    }
}
```
which is "only" `O(n^3)` (because `compute_best_costs` is an `O(n^2)` fixed-point loop).
I guess we wouldn't need to complete the fixed point every time though. In practice I bet you would reach it within one or two iterations, but it also shouldn't ever be incorrect to stop and not choose the actual min cost but an approximation, so maybe doing just one more iteration of the fixed-point loop is Good Enough after elaborating each instruction. This would keep things `O(n^2)` like they are today.
Anyways, yeah, this is definitely an interesting research problem from my point of view, but I have no intention of diving into this further, so I'll stop commenting and try not to let myself be nerd-sniped any worse than I already have been this morning.