unicornt edited issue #7246:
Test Case
Steps to Reproduce
time wasmtime case.wasm
Actual Results
Wasmtime
takes an abnormal long time to execute this wasm file and no exception is thrown.Expected Results
Other runtime use much less time or report an
out of memory access
exception<img width="1230" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/626efd7d-16aa-49d7-ae8d-4e66c5f857ce">
Versions and Environment
Wasmtime version or commit: 15.0.0
Operating system: Ubuntu 20.04
Architecture: x86_64
afonso360 commented on issue #7246:
It looks like wasmtime wasn't compiled in release mode, would you be able to retest using the
--release
flag when compiling wasmtime?
afonso360 edited a comment on issue #7246:
It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the
--release
flag when compiling wasmtime?
unicornt commented on issue #7246:
It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the
--release
flag when compiling wasmtime?Thank you for the reply! I tried to build wasmtime with
--release
flag, but it turned out to be useless.<img width="1199" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/c4e02349-3151-4ca7-9f53-0d165c875c46">
alexcrichton commented on issue #7246:
Would you be able to help diagnose what's happening locally? I'm unable to reproduce this and the runtime for this module isn't too bad locally. If you could run
perf record -k mono wasmtime run --profile jitdump -C case=no case.wasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
and then copy/paste the top 10-20 functions into this issue that'll help explain where all the time is going.
One thing I've noticed is that this module is trying to write to file descriptor 0 inside the wasm itself. Wasmtime is returning
EBADF
because of that (0 is stdin, not stdout). Other runtimes likely aren't performing such a check in their WASI implementations. This provides a point of divergence between runtimes which may explain why runtimes are behaving differently.Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.
afonso360 commented on issue #7246:
Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.
I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on
wasm[0]::function[3]
.
unicornt commented on issue #7246:
Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.
I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on
wasm[0]::function[3]
.I got similar output that 99.7% of the time is spent on function[3]. However, the total time is still 20+ seconds.
<img width="1389" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/6bff0da0-a3ef-4d78-aa51-5d889a141491">
Here is
perf stat
output of wasmtime and wasmedge(AOT mod). Hope it can help to find out why it happened!<img width="1224" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/b639c272-7297-4d8d-b8ca-f6eea6ccda9c">
bjorn3 commented on issue #7246:
An IPC of 0.08 is abnornally low. Could you show the annotated assembly of
wasm[0]::function[3]
as reported byperf report
?
unicornt commented on issue #7246:
An IPC of 0.08 is abnornally low. Could you show the annotated assembly of
wasm[0]::function[3]
as reported byperf report
?Is this what you want? Sorry I'm not familiar with
perf
tools.<img width="1407" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/ff6e3c6e-a0da-4fe7-8068-f48d9a67d8ea">
bjorn3 commented on issue #7246:
Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.
unicornt commented on issue #7246:
Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.
Do you have any idea about why it happened? Anything I can do to figure it out?
alexcrichton commented on issue #7246:
Could you share the output of
lscpu
? This might be specific to your particular CPU and some edge case related to cache sizes or something like that perhaps.
unicornt commented on issue #7246:
Here is my output of
lspcpu
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 165 Model name: Intel(R) Core(TM) i7-10700F CPU @ 2.90GHz Stepping: 5 CPU MHz: 2900.000 CPU max MHz: 4800.0000 CPU min MHz: 800.0000 BogoMIPS: 5799.77 Virtualization: VT-x L1d cache: 256 KiB L1i cache: 256 KiB L2 cache: 2 MiB L3 cache: 16 MiB NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Mitigation; Enhanced IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer a es xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vn mi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt x saveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_ capabilities
unicornt commented on issue #7246:
Hi, I tried the same code on another machine. The result is almost
the same
, including the weird time ofmov
andxor
.The OS of this machine is
ubuntu 20.04
and version of wasmtime is15.0.0
.Here is the output of
lscpu
on this machine.Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_de adline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_sin gle pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjus t bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec x getbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l 1d arch_capabilities
alexcrichton commented on issue #7246:
Ok thanks again for both reporting this and helping us follow-up! @elliottt was able to reproduce on hardware he has which resulted in us bottoming out in https://github.com/bytecodealliance/wasmtime/issues/7283 as the cause of this issue. I'm going to close this in favor of that issue to keep the issue discussion a bit shorter, but if you'd like to confirm locally @unicornt you can comment out these lines in a custom Wasmtime build and the performance of this program should be better. Note though that it's just a proof-of-concept of what the slowdown is and is not a full complete fix, that'll get tracked by https://github.com/bytecodealliance/wasmtime/issues/7283
alexcrichton closed issue #7246:
Test Case
Steps to Reproduce
time wasmtime case.wasm
Actual Results
Wasmtime
takes an abnormal long time to execute this wasm file and no exception is thrown.Expected Results
Other runtime use much less time or report an
out of memory access
exception<img width="1230" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/626efd7d-16aa-49d7-ae8d-4e66c5f857ce">
Versions and Environment
Wasmtime version or commit: 15.0.0
Operating system: Ubuntu 20.04
Architecture: x86_64
yfaming commented on issue #7246:
I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @alexcrichton @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.
Following @alexcrichton @alexcrichton's advice, I commented out the last 4 lines in
cranelift/codegen/src/opts/remat.isle
, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!
yfaming edited a comment on issue #7246:
I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @alexcrichton @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.
Following @alexcrichton's advice, I commented out the last 4 lines in
cranelift/codegen/src/opts/remat.isle
, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!
yfaming edited a comment on issue #7246:
I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.
Following @alexcrichton's advice, I commented out the last 4 lines in
cranelift/codegen/src/opts/remat.isle
, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!
hungryzzz commented on issue #7246:
@alexcrichton Hi, I try to reduce the above
wasm
bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time inWasmer
andWasmtime
still have a big difference:
Wasmer
: 0.67sWasmtime
: 12.24sBut I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in
f64.const 5.43231e-312
tof64.const 5.4
, the execution time will be back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated byWasmtime
):
Wasmer
: 0.70sWasmtime
: 0.34sBy the way, I also check the binary generated by
LLVM
, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).I think it is also a strange case I cannot understand and explain...
;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
Environment Version
Wasmer
: wasmer 4.1.2 (98188df 2023-09-19)
binary: wasmer-cli
commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
commit-date: 2023-09-19
host: x86_64-unknown-linux-gnu
compiler: singlepass,cranelift,llvmWasmtime
: build from 8e00cc202f82445a42adbad6e8187e901f91833b (release version)- CPU
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, I try to reduce the above
wasm
bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time inWasmer
andWasmtime
still have a big difference:
Wasmer
: 0.67sWasmtime
: 12.24sBut I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in
f64.const 5.43231e-312
tof64.const 5.4
, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated byWasmtime
):
Wasmer
: 0.70sWasmtime
: 0.34sBy the way, I also check the binary generated by
LLVM
, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).I think it is also a strange case I cannot understand and explain...
;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
Environment Version
Wasmer
: wasmer 4.1.2 (98188df 2023-09-19)
binary: wasmer-cli
commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
commit-date: 2023-09-19
host: x86_64-unknown-linux-gnu
compiler: singlepass,cranelift,llvmWasmtime
: build from 8e00cc202f82445a42adbad6e8187e901f91833b (release version)- CPU
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, I try to reduce the above
wasm
bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time inWasmer
andWasmtime
still have a big difference:
Wasmer
: 0.67sWasmtime
: 12.24sBut I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in
f64.const 5.43231e-312
tof64.const 5.4
, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated byWasmtime
):
Wasmer
: 0.70sWasmtime
: 0.34sBy the way, I also check the binary generated by
LLVM
, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).I think it is also a strange case I cannot understand and explain...
;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
Environment Version
Wasmer
: wasmer 4.1.2 (98188df 2023-09-19)
binary: wasmer-cli
commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
commit-date: 2023-09-19
host: x86_64-unknown-linux-gnu
compiler: singlepass,cranelift,llvmWasmtime
: build from 8e00cc202f82445a42adbad6e8187e901f91833b (release version)- CPU
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Also, I use the inline assembly to compare the execution time (but get the almost same execution time) and maybe it is helpful to your debugging so I paste it here.
#include <stdint.h> double X = 1.0; uint64_t test_wasmtime() { __asm__ __volatile__( "vxorpd %%xmm5,%%xmm5,%%xmm6\n\t" "xor %%r11d,%%r11d\n\t" "START1%=:\n\t" "add $0x1,%%r11d\n\t" "movabs $0x1cd1a94e7ac14ff,%%rsi\n\t" "vmovq %%rsi,%%xmm0\n\t" "vmulsd %0,%%xmm0,%%xmm7\n\t" "vaddsd %%xmm6,%%xmm7,%%xmm6\n\t" "cmp $0x12a05f20,%%r11d\n\t" "jne START1%=\n\t":"=m"(X)::"rsi","r11"); return 0; } uint64_t test_wasmer() { __asm__ __volatile__( "movabs $0x1cd1a94e7ac14ff,%%r11\n\t" "movq %%r11,%%xmm9\n\t" "mulsd %0,%%xmm9\n\t" "xorpd %%xmm7,%%xmm7\n\t" "xor %%r8d,%%r8d\n\t" "START2%=:\n\t" "movdqa %%xmm9,%%xmm15\n\t" "addsd %%xmm7,%%xmm15\n\t" "add $0x1,%%r8d\n\t" "cmp $0x12a05f20,%%r8d\n\t" "setne %%dil\n\t" "movzbl %%dil,%%edi\n\t" "test %%edi,%%edi\n\t" "je EXIT2%=\n\t" "movdqa %%xmm15,%%xmm7\n\t" "jmp START2%=\n\t" "EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi"); return 0; } int main() { test_wasmtime(); // test_wasmer(); return 0; }
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, I try to reduce the above
wasm
bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time inWasmer
andWasmtime
still have a big difference:
Wasmer
: 0.67sWasmtime
: 12.24sBut I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in
f64.const 5.43231e-312
tof64.const 5.4
, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated byWasmtime
):
Wasmer
: 0.70sWasmtime
: 0.34sBy the way, I also check the binary generated by
LLVM
, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).I think it is also a strange case I cannot understand and explain...
;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
Environment Version
Wasmer
: wasmer 4.1.2 (98188df 2023-09-19)
binary: wasmer-cli
commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
commit-date: 2023-09-19
host: x86_64-unknown-linux-gnu
compiler: singlepass,cranelift,llvmWasmtime
: build from 8e00cc202f82445a42adbad6e8187e901f91833b (release version)- CPU
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Also, I use the inline assembly to run the binary generated by
Wasmer
andWasmtime
and compare the execution time (but get the almost same execution time) and maybe it is helpful to your debugging so I paste it here.#include <stdint.h> double X = 1.0; uint64_t test_wasmtime() { __asm__ __volatile__( "vxorpd %%xmm5,%%xmm5,%%xmm6\n\t" "xor %%r11d,%%r11d\n\t" "START1%=:\n\t" "add $0x1,%%r11d\n\t" "movabs $0x1cd1a94e7ac14ff,%%rsi\n\t" "vmovq %%rsi,%%xmm0\n\t" "vmulsd %0,%%xmm0,%%xmm7\n\t" "vaddsd %%xmm6,%%xmm7,%%xmm6\n\t" "cmp $0x12a05f20,%%r11d\n\t" "jne START1%=\n\t":"=m"(X)::"rsi","r11"); return 0; } uint64_t test_wasmer() { __asm__ __volatile__( "movabs $0x1cd1a94e7ac14ff,%%r11\n\t" "movq %%r11,%%xmm9\n\t" "mulsd %0,%%xmm9\n\t" "xorpd %%xmm7,%%xmm7\n\t" "xor %%r8d,%%r8d\n\t" "START2%=:\n\t" "movdqa %%xmm9,%%xmm15\n\t" "addsd %%xmm7,%%xmm15\n\t" "add $0x1,%%r8d\n\t" "cmp $0x12a05f20,%%r8d\n\t" "setne %%dil\n\t" "movzbl %%dil,%%edi\n\t" "test %%edi,%%edi\n\t" "je EXIT2%=\n\t" "movdqa %%xmm15,%%xmm7\n\t" "jmp START2%=\n\t" "EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi"); return 0; } int main() { test_wasmtime(); // test_wasmer(); return 0; }
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, I try to reduce the above
wasm
bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time inWasmer
andWasmtime
still have a big difference:
Wasmer
: 0.67sWasmtime
: 12.24sBut I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in
f64.const 5.43231e-312
tof64.const 5.4
, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated byWasmtime
):
Wasmer
: 0.70sWasmtime
: 0.34sBy the way, I also check the binary generated by
LLVM
, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).I think it is also a strange case I cannot understand and explain...
;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
Environment Version
Wasmer
: wasmer 4.1.2 (98188df 2023-09-19)
binary: wasmer-cli
commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
commit-date: 2023-09-19
host: x86_64-unknown-linux-gnu
compiler: singlepass,cranelift,llvmWasmtime
: build from 8e00cc202f82445a42adbad6e8187e901f91833b (release version)- CPU
➜ ~ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 6 On-line CPU(s) list: 0-5 Thread(s) per core: 1 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz Stepping: 10 CPU MHz: 2200.000 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4399.99 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 9 MiB NUMA node0 CPU(s): 0-5 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Mitigation; Microcode Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Also, I use the inline assembly to run the binary generated by
Wasmer
andWasmtime
and compare the execution time, I also find the difference in execution time. Maybe it is helpful to your debugging so I paste it here.
-test_wasmtime
: 12.59s
-test_wasmer
: 0.35s#include <stdint.h> double X = 1.0; uint64_t test_wasmtime() { __asm__ __volatile__( "vxorpd %%xmm5,%%xmm5,%%xmm6\n\t" "xor %%r11d,%%r11d\n\t" "START1%=:\n\t" "add $0x1,%%r11d\n\t" "movabs $0x100000264d8,%%rsi\n\t" "vmovq %%rsi,%%xmm0\n\t" "vmulsd %0,%%xmm0,%%xmm7\n\t" "vaddsd %%xmm6,%%xmm7,%%xmm6\n\t" "cmp $0x12a05f20,%%r11d\n\t" "jne START1%=\n\t":"=m"(X)::"rsi","r11"); return 0; } uint64_t test_wasmer() { __asm__ __volatile__( "movabs $0x100000264d8,%%r11\n\t" "movq %%r11,%%xmm9\n\t" "mulsd %0,%%xmm9\n\t" "xorpd %%xmm7,%%xmm7\n\t" "xor %%r8d,%%r8d\n\t" "START2%=:\n\t" "movdqa %%xmm9,%%xmm15\n\t" "addsd %%xmm7,%%xmm15\n\t" "add $0x1,%%r8d\n\t" "cmp $0x12a05f20,%%r8d\n\t" "setne %%dil\n\t" "movzbl %%dil,%%edi\n\t" "test %%edi,%%edi\n\t" "je EXIT2%=\n\t" "movdqa %%xmm15,%%xmm7\n\t" "jmp START2%=\n\t" "EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi"); return 0; } int main() { test_wasmtime(); // test_wasmer(); return 0; }
alexcrichton commented on issue #7246:
Thanks for the follow-up @hungryzzz, although I've got a few questions for you. It's known that Wasmtime is slow with respect to this computation on its
main
branch (as of right now and the commit you said you were using). https://github.com/bytecodealliance/wasmtime/pull/7306 fixes the problem though by keeping one of the computations here out of the loop. In yourtest_wasmtime
function, for example, the loop has both avmulsd
and avaddsd
, but thevmulsd
computation is loop-invariant and shouldn't be in the loop.Generating the x64 assembly with #7306 I get:
0000000000000040 <wasm[0]::function[2]>: 40: 55 pushq %rbp 41: 48 89 e5 movq %rsp, %rbp 44: 4c 8b 57 08 movq 8(%rdi), %r10 48: 4d 8b 12 movq (%r10), %r10 4b: 49 39 e2 cmpq %rsp, %r10 4e: 0f 87 38 00 00 00 ja 0x8c <wasm[0]::function[2]+0x4c> 54: c5 d1 57 e5 vxorpd %xmm5, %xmm5, %xmm4 58: 45 31 c9 xorl %r9d, %r9d 5b: 48 ba d8 64 02 00 00 01 00 00 movabsq $1099511784664, %rdx # imm = 0x100000264D8 65: c4 e1 f9 6e fa vmovq %rdx, %xmm7 6a: c5 c3 59 35 1e 00 00 00 vmulsd 30(%rip), %xmm7, %xmm6 # 0x90 <wasm[0]::function[2]+0x50> 72: 41 83 c1 01 addl $1, %r9d 76: c5 cb 58 e4 vaddsd %xmm4, %xmm6, %xmm4 7a: 41 81 f9 20 5f a0 12 cmpl $312500000, %r9d # imm = 0x12A05F20 81: 0f 85 eb ff ff ff jne 0x72 <wasm[0]::function[2]+0x32> 87: 48 89 ec movq %rbp, %rsp 8a: 5d popq %rbp 8b: c3 retq 8c: 0f 0b ud2
where here the conditional jump at 0x81 goes to 0x72 meaning that the
vmulsd
is lifted outside of the loop as expected. I unfortunately do not have an x64 machine to test the relative performance before and after #7306, but I wanted to ask whether you tested this? When testing the performance of this program I'd recommend using #7306 as the version of Wasmtime to use. If you're using that though and still seeing performance problems let us know!
hungryzzz commented on issue #7246:
@alexcrichton Hi, the performance regression disappeared in the latest
Wasmtime
.But I think the performance regression in the old version
Wasmtime
is not only related to the wrong code sinking, because if I change the magic number5.43231e-312
to5.4
(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number5.43231e-312
.After searching, I find the reason from this paper On Subnormal Floating Point and Abnormal Timing.
We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.
This paper finds that floating point operands ranging from
4.9e-324 ~ 2.23e-308
will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, the performance regression disappeared in the latest
Wasmtime
.But I think the performance regression in the old version
Wasmtime
is not only related to the wrong code sinking, because if I change the magic number5.43231e-312
to5.4
(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number5.43231e-312
.After searching, We find the reason from this paper On Subnormal Floating Point and Abnormal Timing.
We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.
This paper finds that floating point operands ranging from
4.9e-324 ~ 2.23e-308
will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
hungryzzz edited a comment on issue #7246:
@alexcrichton Hi, the performance regression disappeared in the latest
Wasmtime
.But I think the performance regression in the old version
Wasmtime
is not only related to the wrong code sinking, because if I change the magic number5.43231e-312
to5.4
(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number5.43231e-312
.After searching, we find the reason from this paper On Subnormal Floating Point and Abnormal Timing.
We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.
This paper finds that floating point operands ranging from
4.9e-324 ~ 2.23e-308
will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.;; reduce.wat (module (type (;0;) (func (param i32))) (type (;1;) (func)) (type (;2;) (func (result i32))) (type (;3;) (func (param i32) (result i32))) (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0))) (func (;2;) (type 1) call 2 i32.const 0 call 0 unreachable) (func (;2;) (type 1) (local f64 f64 i32) f64.const 5.43231e-312 ;; Magic Number!!! f64.const 1.0 f64.mul local.set 1 loop ;; label = @1 local.get 1 local.get 0 f64.add local.set 0 local.get 2 ;; condition variable i i32.const 1 i32.add local.tee 2 ;; i++ i32.const 312500000 i32.ne br_if 0 (;@1;) end ) (table (;0;) 2 2 funcref) (memory (;0;) 8192 8192) (global (;0;) (mut i32) (i32.const 66576)) (export "memory" (memory 0)) (export "__indirect_function_table" (table 0)) (export "_start" (func 1)))
alexcrichton commented on issue #7246:
Fascinating! Not what I would have expected but helps explain the results for sure!
Last updated: Jan 24 2025 at 00:11 UTC