Stream: git-wasmtime

Topic: wasmtime / issue #7246 Abnormal long execute time running...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 04:59):

unicornt edited issue #7246:

Test Case

report.zip

Steps to Reproduce

time wasmtime case.wasm

Actual Results

Wasmtime takes an abnormal long time to execute this wasm file and no exception is thrown.

Expected Results

Other runtime use much less time or report an out of memory access exception

<img width="1230" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/626efd7d-16aa-49d7-ae8d-4e66c5f857ce">

Versions and Environment

Wasmtime version or commit: 15.0.0

Operating system: Ubuntu 20.04

Architecture: x86_64

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 10:09):

afonso360 commented on issue #7246:

It looks like wasmtime wasn't compiled in release mode, would you be able to retest using the --release flag when compiling wasmtime?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 10:09):

afonso360 edited a comment on issue #7246:

It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the --release flag when compiling wasmtime?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 13:04):

unicornt commented on issue #7246:

It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the --release flag when compiling wasmtime?

Thank you for the reply! I tried to build wasmtime with --release flag, but it turned out to be useless.

<img width="1199" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/c4e02349-3151-4ca7-9f53-0d165c875c46">

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 14:26):

alexcrichton commented on issue #7246:

Would you be able to help diagnose what's happening locally? I'm unable to reproduce this and the runtime for this module isn't too bad locally. If you could run

perf record -k mono wasmtime run --profile jitdump -C case=no case.wasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

and then copy/paste the top 10-20 functions into this issue that'll help explain where all the time is going.

One thing I've noticed is that this module is trying to write to file descriptor 0 inside the wasm itself. Wasmtime is returning EBADF because of that (0 is stdin, not stdout). Other runtimes likely aren't performing such a check in their WASI implementations. This provides a point of divergence between runtimes which may explain why runtimes are behaving differently.

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2023 at 14:40):

afonso360 commented on issue #7246:

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on wasm[0]::function[3].

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 05:53):

unicornt commented on issue #7246:

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on wasm[0]::function[3].

I got similar output that 99.7% of the time is spent on function[3]. However, the total time is still 20+ seconds.

<img width="1389" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/6bff0da0-a3ef-4d78-aa51-5d889a141491">

Here is perf stat output of wasmtime and wasmedge(AOT mod). Hope it can help to find out why it happened!

<img width="1224" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/b639c272-7297-4d8d-b8ca-f6eea6ccda9c">

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 06:52):

bjorn3 commented on issue #7246:

An IPC of 0.08 is abnornally low. Could you show the annotated assembly of wasm[0]::function[3] as reported by perf report?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 07:17):

unicornt commented on issue #7246:

An IPC of 0.08 is abnornally low. Could you show the annotated assembly of wasm[0]::function[3] as reported by perf report?

Is this what you want? Sorry I'm not familiar with perf tools.

<img width="1407" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/ff6e3c6e-a0da-4fe7-8068-f48d9a67d8ea">

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 08:23):

bjorn3 commented on issue #7246:

Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 13:49):

unicornt commented on issue #7246:

Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.

Do you have any idea about why it happened? Anything I can do to figure it out?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 14:28):

alexcrichton commented on issue #7246:

Could you share the output of lscpu? This might be specific to your particular CPU and some edge case related to cache sizes or something like that perhaps.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2023 at 14:48):

unicornt commented on issue #7246:

Here is my output of lspcpu

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              165
Model name:                         Intel(R) Core(TM) i7-10700F CPU @ 2.90GHz
Stepping:                           5
CPU MHz:                            2900.000
CPU max MHz:                        4800.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5799.77
Virtualization:                     VT-x
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           2 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer a
                                    es xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vn
                                    mi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt x
                                    saveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_
                                    capabilities

view this post on Zulip Wasmtime GitHub notifications bot (Oct 18 2023 at 05:23):

unicornt commented on issue #7246:

Hi, I tried the same code on another machine. The result is almost the same, including the weird time of mov and xor.

The OS of this machine is ubuntu 20.04 and version of wasmtime is 15.0.0.

Here is the output of lscpu on this machine.

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
                                     fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs
                                    bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
                                     vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_de
                                    adline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_sin
                                    gle pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjus
                                    t bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec x
                                    getbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l
                                    1d arch_capabilities

view this post on Zulip Wasmtime GitHub notifications bot (Oct 18 2023 at 17:44):

alexcrichton commented on issue #7246:

Ok thanks again for both reporting this and helping us follow-up! @elliottt was able to reproduce on hardware he has which resulted in us bottoming out in https://github.com/bytecodealliance/wasmtime/issues/7283 as the cause of this issue. I'm going to close this in favor of that issue to keep the issue discussion a bit shorter, but if you'd like to confirm locally @unicornt you can comment out these lines in a custom Wasmtime build and the performance of this program should be better. Note though that it's just a proof-of-concept of what the slowdown is and is not a full complete fix, that'll get tracked by https://github.com/bytecodealliance/wasmtime/issues/7283

view this post on Zulip Wasmtime GitHub notifications bot (Oct 18 2023 at 17:44):

alexcrichton closed issue #7246:

Test Case

report.zip

Steps to Reproduce

time wasmtime case.wasm

Actual Results

Wasmtime takes an abnormal long time to execute this wasm file and no exception is thrown.

Expected Results

Other runtime use much less time or report an out of memory access exception

<img width="1230" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/18662046/626efd7d-16aa-49d7-ae8d-4e66c5f857ce">

Versions and Environment

Wasmtime version or commit: 15.0.0

Operating system: Ubuntu 20.04

Architecture: x86_64

view this post on Zulip Wasmtime GitHub notifications bot (Oct 19 2023 at 02:38):

yfaming commented on issue #7246:

I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @alexcrichton @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.

Following @alexcrichton @alexcrichton's advice, I commented out the last 4 lines in cranelift/codegen/src/opts/remat.isle, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!

view this post on Zulip Wasmtime GitHub notifications bot (Oct 19 2023 at 02:38):

yfaming edited a comment on issue #7246:

I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @alexcrichton @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.

Following @alexcrichton's advice, I commented out the last 4 lines in cranelift/codegen/src/opts/remat.isle, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!

view this post on Zulip Wasmtime GitHub notifications bot (Oct 19 2023 at 02:38):

yfaming edited a comment on issue #7246:

I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.

Following @alexcrichton's advice, I commented out the last 4 lines in cranelift/codegen/src/opts/remat.isle, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 07:42):

hungryzzz commented on issue #7246:

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

But I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will be back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

reduce.zip

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 07:50):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

But I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

reduce.zip

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 08:16):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

But I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Also, I use the inline assembly to compare the execution time (but get the almost same execution time) and maybe it is helpful to your debugging so I paste it here.

#include <stdint.h>


double X = 1.0;

uint64_t test_wasmtime() {
  __asm__ __volatile__(
"vxorpd %%xmm5,%%xmm5,%%xmm6\n\t"
"xor    %%r11d,%%r11d\n\t"
"START1%=:\n\t"
"add    $0x1,%%r11d\n\t"
"movabs $0x1cd1a94e7ac14ff,%%rsi\n\t"
"vmovq  %%rsi,%%xmm0\n\t"
"vmulsd %0,%%xmm0,%%xmm7\n\t"
"vaddsd %%xmm6,%%xmm7,%%xmm6\n\t"
"cmp    $0x12a05f20,%%r11d\n\t"
"jne    START1%=\n\t":"=m"(X)::"rsi","r11");
  return 0;
}

uint64_t test_wasmer() {
  __asm__ __volatile__(
"movabs $0x1cd1a94e7ac14ff,%%r11\n\t"
"movq   %%r11,%%xmm9\n\t"
"mulsd  %0,%%xmm9\n\t"
"xorpd  %%xmm7,%%xmm7\n\t"
"xor    %%r8d,%%r8d\n\t"
"START2%=:\n\t"
"movdqa %%xmm9,%%xmm15\n\t"
"addsd  %%xmm7,%%xmm15\n\t"
"add    $0x1,%%r8d\n\t"
"cmp    $0x12a05f20,%%r8d\n\t"
"setne  %%dil\n\t"
"movzbl %%dil,%%edi\n\t"
"test   %%edi,%%edi\n\t"
"je     EXIT2%=\n\t"
"movdqa %%xmm15,%%xmm7\n\t"
"jmp   START2%=\n\t"
"EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi");
  return 0;
}

int main() {
  test_wasmtime();
  // test_wasmer();
  return 0;
}

reduce.zip

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 08:17):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

But I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Also, I use the inline assembly to run the binary generated by Wasmer and Wasmtime and compare the execution time (but get the almost same execution time) and maybe it is helpful to your debugging so I paste it here.

#include <stdint.h>


double X = 1.0;

uint64_t test_wasmtime() {
  __asm__ __volatile__(
"vxorpd %%xmm5,%%xmm5,%%xmm6\n\t"
"xor    %%r11d,%%r11d\n\t"
"START1%=:\n\t"
"add    $0x1,%%r11d\n\t"
"movabs $0x1cd1a94e7ac14ff,%%rsi\n\t"
"vmovq  %%rsi,%%xmm0\n\t"
"vmulsd %0,%%xmm0,%%xmm7\n\t"
"vaddsd %%xmm6,%%xmm7,%%xmm6\n\t"
"cmp    $0x12a05f20,%%r11d\n\t"
"jne    START1%=\n\t":"=m"(X)::"rsi","r11");
  return 0;
}

uint64_t test_wasmer() {
  __asm__ __volatile__(
"movabs $0x1cd1a94e7ac14ff,%%r11\n\t"
"movq   %%r11,%%xmm9\n\t"
"mulsd  %0,%%xmm9\n\t"
"xorpd  %%xmm7,%%xmm7\n\t"
"xor    %%r8d,%%r8d\n\t"
"START2%=:\n\t"
"movdqa %%xmm9,%%xmm15\n\t"
"addsd  %%xmm7,%%xmm15\n\t"
"add    $0x1,%%r8d\n\t"
"cmp    $0x12a05f20,%%r8d\n\t"
"setne  %%dil\n\t"
"movzbl %%dil,%%edi\n\t"
"test   %%edi,%%edi\n\t"
"je     EXIT2%=\n\t"
"movdqa %%xmm15,%%xmm7\n\t"
"jmp   START2%=\n\t"
"EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi");
  return 0;
}

int main() {
  test_wasmtime();
  // test_wasmer();
  return 0;
}

reduce.zip

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 12:24):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

But I find that the performance regression is not only related to the https://github.com/bytecodealliance/wasmtime/issues/7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Also, I use the inline assembly to run the binary generated by Wasmer and Wasmtime and compare the execution time, I also find the difference in execution time. Maybe it is helpful to your debugging so I paste it here.
- test_wasmtime: 12.59s
- test_wasmer: 0.35s

#include <stdint.h>


double X = 1.0;

uint64_t test_wasmtime() {
  __asm__ __volatile__(
"vxorpd %%xmm5,%%xmm5,%%xmm6\n\t"
"xor    %%r11d,%%r11d\n\t"
"START1%=:\n\t"
"add    $0x1,%%r11d\n\t"
"movabs $0x100000264d8,%%rsi\n\t"
"vmovq  %%rsi,%%xmm0\n\t"
"vmulsd %0,%%xmm0,%%xmm7\n\t"
"vaddsd %%xmm6,%%xmm7,%%xmm6\n\t"
"cmp    $0x12a05f20,%%r11d\n\t"
"jne    START1%=\n\t":"=m"(X)::"rsi","r11");
  return 0;
}

uint64_t test_wasmer() {
  __asm__ __volatile__(
"movabs $0x100000264d8,%%r11\n\t"
"movq   %%r11,%%xmm9\n\t"
"mulsd  %0,%%xmm9\n\t"
"xorpd  %%xmm7,%%xmm7\n\t"
"xor    %%r8d,%%r8d\n\t"
"START2%=:\n\t"
"movdqa %%xmm9,%%xmm15\n\t"
"addsd  %%xmm7,%%xmm15\n\t"
"add    $0x1,%%r8d\n\t"
"cmp    $0x12a05f20,%%r8d\n\t"
"setne  %%dil\n\t"
"movzbl %%dil,%%edi\n\t"
"test   %%edi,%%edi\n\t"
"je     EXIT2%=\n\t"
"movdqa %%xmm15,%%xmm7\n\t"
"jmp   START2%=\n\t"
"EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi");
  return 0;
}

int main() {
  test_wasmtime();
  // test_wasmer();
  return 0;
}

reduce.zip

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2023 at 15:11):

alexcrichton commented on issue #7246:

Thanks for the follow-up @hungryzzz, although I've got a few questions for you. It's known that Wasmtime is slow with respect to this computation on its main branch (as of right now and the commit you said you were using). https://github.com/bytecodealliance/wasmtime/pull/7306 fixes the problem though by keeping one of the computations here out of the loop. In your test_wasmtime function, for example, the loop has both a vmulsd and a vaddsd, but the vmulsd computation is loop-invariant and shouldn't be in the loop.

Generating the x64 assembly with #7306 I get:

0000000000000040 <wasm[0]::function[2]>:
      40: 55                            pushq   %rbp
      41: 48 89 e5                      movq    %rsp, %rbp
      44: 4c 8b 57 08                   movq    8(%rdi), %r10
      48: 4d 8b 12                      movq    (%r10), %r10
      4b: 49 39 e2                      cmpq    %rsp, %r10
      4e: 0f 87 38 00 00 00             ja      0x8c <wasm[0]::function[2]+0x4c>
      54: c5 d1 57 e5                   vxorpd  %xmm5, %xmm5, %xmm4
      58: 45 31 c9                      xorl    %r9d, %r9d
      5b: 48 ba d8 64 02 00 00 01 00 00 movabsq $1099511784664, %rdx    # imm = 0x100000264D8
      65: c4 e1 f9 6e fa                vmovq   %rdx, %xmm7
      6a: c5 c3 59 35 1e 00 00 00       vmulsd  30(%rip), %xmm7, %xmm6  # 0x90 <wasm[0]::function[2]+0x50>
      72: 41 83 c1 01                   addl    $1, %r9d
      76: c5 cb 58 e4                   vaddsd  %xmm4, %xmm6, %xmm4
      7a: 41 81 f9 20 5f a0 12          cmpl    $312500000, %r9d        # imm = 0x12A05F20
      81: 0f 85 eb ff ff ff             jne     0x72 <wasm[0]::function[2]+0x32>
      87: 48 89 ec                      movq    %rbp, %rsp
      8a: 5d                            popq    %rbp
      8b: c3                            retq
      8c: 0f 0b                         ud2

where here the conditional jump at 0x81 goes to 0x72 meaning that the vmulsd is lifted outside of the loop as expected. I unfortunately do not have an x64 machine to test the relative performance before and after #7306, but I wanted to ask whether you tested this? When testing the performance of this program I'd recommend using #7306 as the version of Wasmtime to use. If you're using that though and still seeing performance problems let us know!

view this post on Zulip Wasmtime GitHub notifications bot (Oct 22 2023 at 07:17):

hungryzzz commented on issue #7246:

@alexcrichton Hi, the performance regression disappeared in the latest Wasmtime.

But I think the performance regression in the old version Wasmtime is not only related to the wrong code sinking, because if I change the magic number 5.43231e-312 to 5.4(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number 5.43231e-312.

After searching, I find the reason from this paper On Subnormal Floating Point and Abnormal Timing.

We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.

This paper finds that floating point operands ranging from 4.9e-324 ~ 2.23e-308 will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

view this post on Zulip Wasmtime GitHub notifications bot (Oct 22 2023 at 07:26):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, the performance regression disappeared in the latest Wasmtime.

But I think the performance regression in the old version Wasmtime is not only related to the wrong code sinking, because if I change the magic number 5.43231e-312 to 5.4(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number 5.43231e-312.

After searching, We find the reason from this paper On Subnormal Floating Point and Abnormal Timing.

We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.

This paper finds that floating point operands ranging from 4.9e-324 ~ 2.23e-308 will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

view this post on Zulip Wasmtime GitHub notifications bot (Oct 22 2023 at 07:27):

hungryzzz edited a comment on issue #7246:

@alexcrichton Hi, the performance regression disappeared in the latest Wasmtime.

But I think the performance regression in the old version Wasmtime is not only related to the wrong code sinking, because if I change the magic number 5.43231e-312 to 5.4(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number 5.43231e-312.

After searching, we find the reason from this paper On Subnormal Floating Point and Abnormal Timing.

We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.

This paper finds that floating point operands ranging from 4.9e-324 ~ 2.23e-308 will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, https://github.com/bytecodealliance/wasmtime/pull/7306 will fix this performance regression.

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1
      i32.add
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

view this post on Zulip Wasmtime GitHub notifications bot (Oct 23 2023 at 14:34):

alexcrichton commented on issue #7246:

Fascinating! Not what I would have expected but helps explain the results for sure!


Last updated: Dec 23 2024 at 13:07 UTC