hungryzzz opened issue #8573:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
,WasmEdge(AOT)
andWamr(AOT)
, and collect their execution time respectively (measured bytime
tool).Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 2.81sWasmEdge
: 4.67sWamr
: 3.29sFor
bad.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 2.35sWasmEdge
: 9.34sWamr
: 3.29sThe difference between the attached two cases is as follow, i.e., changing the address of
i32.load
fromlocal.get 0
toi32.const 0
, the case withi32.const 0
has the bad performance inWasmtime
, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, thei32.load
is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)
Versions and Environment
Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831
Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
hungryzzz edited issue #8573:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
,WasmEdge(AOT)
andWamr(AOT)
, and collect their execution time respectively (measured bytime
tool).Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 4.67sWasmEdge
: 2.81sWamr
: 3.29sFor
bad.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 9.34sWasmEdge
: 2.35sWamr
: 3.29sThe difference between the attached two cases is as follow, i.e., changing the address of
i32.load
fromlocal.get 0
toi32.const 0
, the case withi32.const 0
has the bad performance inWasmtime
, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, thei32.load
is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)
Versions and Environment
Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831
Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
cfallin commented on issue #8573:
cc @fitzgen , perhaps this is something to do with our bounds-checking cases?
fitzgen commented on issue #8573:
@hungryzzz can you share how you are configuring and executing the Wasm in Wasmtime? I.e. the exact flags you are passing to the CLI or the exact Rust code that you are using when embedding Wasmtime if you aren't using the CLI.
cc @fitzgen , perhaps this is something to do with our bounds-checking cases?
With the default configuration of Wasmtime we should be eliding basically all bounds checks, I'd be surprised if this was bounds-check related unless guard pages are being disabled or something.
hungryzzz commented on issue #8573:
@fitzgen I use the CLI to run the case, the command I use is
wasmtime good.wasm
without extra flags.
fitzgen commented on issue #8573:
@hungryzzz ah okay, that means that you are measuring compilation time as well, not just execution time.
If you want to measure execution performance specifically, you should instead do something like
$ wasmtime compile test.wasm -o test.cwasm $ time wasmtime run --allow-precompiled test.cwasm
Wasmtime is technically a JIT in the sense that it can compile code on demand and map it executable into the same process that did the compiling right before executing the code. But it is not a JIT in the same sense that engines like V8 and SpiderMonkey are, where there are speculative optimizations, lazy tiering up based on which functions are hot, dynamic deoptimization when speculative assumptions fail, etc... Given all that, I'd highly recommend the two-phase process I sketched above, where the Wasm is first compiled and then second the compiled code is executed (and this second part is what you are timing).
hungryzzz commented on issue #8573:
@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows.
<img width="1385" alt="截屏2024-05-08 13 40 58" src="https://github.com/bytecodealliance/wasmtime/assets/32137313/88a176a6-ea67-475d-8474-90d61f0fbed6">
cfallin commented on issue #8573:
@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?
There are several reasons for this:
- Plain text shows up in email clients for folks who have GitHub email notifications, whereas images may not load at all, or require an extra action to load (personally the case for me: I have to click through to see your content)
- Plain text follows the configuration of the user's browser, including font-sizes, color and contrast settings, etc., some of which are not just cosmetic but necessary for folks with vision issues (which will be most of us eventually!)
- Plain text can be searched (local Ctrl-F, and also indexed by search engines at GitHub's search functionality), copied and pasted, and otherwise handled as needed, whereas a PNG is just pixels
- Plain text is visible to screen-readers and other tools that inspect the semantic page contents, whereas an image is not
It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!
cfallin edited a comment on issue #8573:
@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?
There are several reasons for this:
- Plain text shows up in email clients for folks who have GitHub email notifications, whereas images may not load at all, or require an extra action to load (personally the case for me: I have to click through to see your content)
- Plain text follows the configuration of the user's browser, including font-sizes, color and contrast settings, etc., some of which are not just cosmetic but necessary for folks with vision issues (which will be most of us eventually!)
- Plain text can be searched (local Ctrl-F, and also indexed by search engines and GitHub's search functionality), copied and pasted, and otherwise handled as needed, whereas a PNG is just pixels
- Plain text is visible to screen-readers and other tools that inspect the semantic page contents, whereas an image is not
It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!
hungryzzz edited a comment on issue #8573:
@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows. I guess the JITed code would be cached after the first compilation, so the execution results are nearly the same as the above one.
➜ case1 git:(wasm-mutate) ✗ wasmtime compile good.wasm -o good.cwasm ➜ case1 git:(wasm-mutate) ✗ wasmtime compile bad.wasm -o bad.cwasm ➜ case1 git:(wasm-mutate) ✗ time wasmtime run --allow-precompiled good.cwasm /home/ringzzz/wasm_runtime/wasmtime/target/release/wasmtime run good.cwasm 4.67s user 0.00s system 100% cpu 4.674 total ➜ case1 git:(wasm-mutate) ✗ time wasmtime run --allow-precompiled bad.cwasm /home/ringzzz/wasm_runtime/wasmtime/target/release/wasmtime run bad.cwasm 9.36s user 0.01s system 100% cpu 9.365 total
cfallin edited a comment on issue #8573:
@hungryzzz a request for your posts: would you mind posting your results as plain text (or formatted as appropriate), rather than PNG screenshots of a terminal?
There are several reasons for this:
- Plain text shows up in email clients for folks who have GitHub email notifications, whereas images may not load at all, or require an extra action to load (personally the case for me: I have to click through to see your content)
- Plain text follows the configuration of the user's browser, including font-sizes, color and contrast settings, etc., some of which are not just cosmetic but necessary for folks with vision issues (which will be most of us eventually!)
- Plain text can be searched (local Ctrl-F, and also indexed by search engines and GitHub's search functionality), copied and pasted, and otherwise handled as needed, whereas a PNG is just pixels
- Plain text is visible to screen-readers and other tools that inspect the semantic page contents, whereas an image is not
It may take ever so slightly longer (copy & paste, use markdown triple-backtick quotes) but it's a great kindness to readers here for all the reasons above, and we want to set a standard here for maximally accessible and usable discussion content. Thanks!
(EDIT: thanks for your edit! I'll leave this here for posterity)
hungryzzz edited a comment on issue #8573:
@fitzgen Thank you for your advice, I recollect the execution time with the above commands and the results are as follows. I guess the JITed code would be cached after the first compilation, so the execution results are nearly the same as the above one.
➜ case ✗ wasmtime compile good.wasm -o good.cwasm ➜ case ✗ wasmtime compile bad.wasm -o bad.cwasm ➜ case ✗ time wasmtime run --allow-precompiled good.cwasm ~/wasmtime/target/release/wasmtime run good.cwasm 4.67s user 0.00s system 100% cpu 4.674 total ➜ case ✗ time wasmtime run --allow-precompiled bad.cwasm ~/wasmtime/target/release/wasmtime run bad.cwasm 9.36s user 0.01s system 100% cpu 9.365 total
primoly commented on issue #8573:
Probably irrelevant, but could code alignment play a role? I noticed that both modules compile to basically the same instructions, just assigned different registers. Yet for
func1
, good gets more operations aligned to four bytes than bad.bad.cwasm
func1
x86-64: 60 instructions, 17 start at addresses divisible by four (28%)
good.cwasmfunc1
x86-64: 62 instructions, 25 start at addresses divisible by four (40%)Unsurprisingly no difference for
func2
since it generates identical instructions for better or worse... I mean good and bad.
hungryzzz commented on issue #8573:
I use
Perf
tool to profile the execution time and surprisingly find that the hotspot is not infunc1
where the small difference happens, but in thefunc2
. I continue to check the binary compiled byCranelift
and find that the binary offunc2
is the same before and after the small difference. It is really confused to me how the difference infunc1
could affect the execution offunc2
.Samples: 38K of event 'cycles', Event count (approx.): 34572923013 Overhead Command Shared Object Symbol 74.91% wasmtime jitted-4044282-1.so [.] 000000000000004f ◆ 24.63% wasmtime jitted-4044282-1.so [.] 0000000000000071 ▒ 0.32% wasmtime jitted-4044282-1.so [.] 0x0000000000000067 ▒ 0.01% wasmtime [kernel.kallsyms] [k] __handle_mm_fault ▒ 0.01% wasmtime [kernel.kallsyms] [k] sync_regs ▒ 0.01% wasmtime [kernel.kallsyms] [k] native_irq_return_iret ▒ 0.01% wasmtime [kernel.kallsyms] [k] vm_unmapped_area ▒
Samples: 38K of event 'cycles', 4000 Hz, Event count (approx.): 34572923013 000000000000004f /home/ringzzz/performance_debug/case1-wasmtime/wasmtime-perf/jitted-4044282-1.so [Percent: local period] Percent│ │ │ │ Disassembly of section .text: │ │ 0000000000000040 <wasm[0]::function[2]>: │ wasm[0]::function[2](): │ push %rbp │ mov %rsp,%rbp │ xor %eax,%eax │ mov %rax,%r8 │ 3a: cmp $0xffff,%eax │ ↓ jne 58 │ mov 0x80(%rdi),%rdi │ movl $0x1,(%rdi) │ mov %rbp,%rsp │ pop %rbp │ ← retq │ 58: lea 0x1(%r8),%edx │ mov %r8,%rax │ mov %rdx,%r8 │ ↑ jmpq 3a
# the commands to profile perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm perf inject --jit --input perf.data --output perf.jit.data perf report --input perf.jit.data --no-children
fitzgen commented on issue #8573:
It is really confused to me how the difference in
func1
could affect the execution offunc2
.Is it changing the alignment of
func2
? We have both preferred and required alignments inside Cranelift, and it is possible that something is getting bumped from its preferred down to just its required alignment.
alexcrichton commented on issue #8573:
Testing locally I've got the "good" function 2 starting at 0x140 and "bad" starts at 0x130. That makes sense to me because we align functions to 16-bytes and the "bad" version has more constant propagation so function is a bit smaller in the "bad" version, hence starting a bit earlier.
This program appears to be effectively a benchmark of the second function, which you've pasted the machine code above. In
perf
locally thejmpq
andjne
instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.That being said in general the "function 2" here is a pretty bad layout choice for cranelift. The loop is spread across the epilogue instead of tightly put together which might omit a jump. Cranelift can probably do better here in terms of block layout on the second function and that might help branch predictor heuristics on your CPU as well
cfallin commented on issue #8573:
FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?
alexcrichton commented on issue #8573:
Looks like the 16-byte alignment was added in #4826 as the minimum require for functions at the time to properly align constants in the constant pool. I don't think 32-bytes was specifically rejected in favor of a smaller size. Looks. like aarch64 is aligned to 32-bytes.
cfallin commented on issue #8573:
Cool -- @hungryzzz, I think we'd be happy to take a PR to change x86-64's function alignment to 32 bytes as well, if you'd like to create one!
hungryzzz commented on issue #8573:
Is it changing the alignment of
func2
?Yes, the machine codes of
func2
before and after changes are the same excpet the start address of thefunc2
.# machine code of `func2` in `bad.wasm` generated by Cranelift 0000000000000130 <wasm[0]::function[2]>: 130: 55 push %rbp 131: 48 89 e5 mov %rsp,%rbp 134: 31 c0 xor %eax,%eax 136: 49 89 c0 mov %rax,%r8 139: 81 f8 ff ff 00 00 cmp $0xffff,%eax 13f: 0f 85 12 00 00 00 jne 157 <wasm[0]::function[2]+0x27> 145: 48 8b bf 80 00 00 00 mov 0x80(%rdi),%rdi 14c: c7 07 01 00 00 00 movl $0x1,(%rdi) 152: 48 89 ec mov %rbp,%rsp 155: 5d pop %rbp 156: c3 retq 157: 41 8d 50 01 lea 0x1(%r8),%edx 15b: 4c 89 c0 mov %r8,%rax 15e: 49 89 d0 mov %rdx,%r8 161: e9 d3 ff ff ff jmpq 139 <wasm[0]::function[2]+0x9> # machine code of `func2` in `good.wasm` generated by Cranelift 0000000000000120 <wasm[0]::function[2]>: 120: 55 push %rbp 121: 48 89 e5 mov %rsp,%rbp 124: 31 c0 xor %eax,%eax 126: 49 89 c0 mov %rax,%r8 129: 81 f8 ff ff 00 00 cmp $0xffff,%eax 12f: 0f 85 12 00 00 00 jne 147 <wasm[0]::function[2]+0x27> 135: 48 8b bf 80 00 00 00 mov 0x80(%rdi),%rdi 13c: c7 07 01 00 00 00 movl $0x1,(%rdi) 142: 48 89 ec mov %rbp,%rsp 145: 5d pop %rbp 146: c3 retq 147: 41 8d 50 01 lea 0x1(%r8),%edx 14b: 4c 89 c0 mov %r8,%rax 14e: 49 89 d0 mov %rdx,%r8 151: e9 d3 ff ff ff jmpq 129 <wasm[0]::function[2]+0x9>
In
perf
locally thejmpq
andjne
instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.I am confused that how the function alignment could affect the CPU branch prediction.
hungryzzz edited a comment on issue #8573:
Is it changing the alignment of
func2
?Yes, the machine codes of
func2
before and after changes are the same excpet the start address of thefunc2
.# machine code of `func2` in `bad.wasm` generated by Cranelift 0000000000000130 <wasm[0]::function[2]>: 130: 55 push %rbp 131: 48 89 e5 mov %rsp,%rbp 134: 31 c0 xor %eax,%eax 136: 49 89 c0 mov %rax,%r8 139: 81 f8 ff ff 00 00 cmp $0xffff,%eax 13f: 0f 85 12 00 00 00 jne 157 <wasm[0]::function[2]+0x27> 145: 48 8b bf 80 00 00 00 mov 0x80(%rdi),%rdi 14c: c7 07 01 00 00 00 movl $0x1,(%rdi) 152: 48 89 ec mov %rbp,%rsp 155: 5d pop %rbp 156: c3 retq 157: 41 8d 50 01 lea 0x1(%r8),%edx 15b: 4c 89 c0 mov %r8,%rax 15e: 49 89 d0 mov %rdx,%r8 161: e9 d3 ff ff ff jmpq 139 <wasm[0]::function[2]+0x9> # machine code of `func2` in `good.wasm` generated by Cranelift 0000000000000120 <wasm[0]::function[2]>: 120: 55 push %rbp 121: 48 89 e5 mov %rsp,%rbp 124: 31 c0 xor %eax,%eax 126: 49 89 c0 mov %rax,%r8 129: 81 f8 ff ff 00 00 cmp $0xffff,%eax 12f: 0f 85 12 00 00 00 jne 147 <wasm[0]::function[2]+0x27> 135: 48 8b bf 80 00 00 00 mov 0x80(%rdi),%rdi 13c: c7 07 01 00 00 00 movl $0x1,(%rdi) 142: 48 89 ec mov %rbp,%rsp 145: 5d pop %rbp 146: c3 retq 147: 41 8d 50 01 lea 0x1(%r8),%edx 14b: 4c 89 c0 mov %r8,%rax 14e: 49 89 d0 mov %rdx,%r8 151: e9 d3 ff ff ff jmpq 129 <wasm[0]::function[2]+0x9>
In
perf
locally thejmpq
andjne
instructions are the hottest instructions. My suspicion is that this is CPU-specific for you (I can't reproduce the slowdown myself) where it's falling over some branch predictor heuristics.Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!
cfallin commented on issue #8573:
Hi, could I ask why the function alignment could affect the CPU branch prediction? Thanks!
I addressed this a few comments up:
FWIW, the usual "fetch bandwidth sin" to avoid is forcing partial fetch lines -- the CPU frontend grabs an aligned 32B or 64B chunk at a time so if a function starts halfway through a fetch line, we're wasting frontend bandwidth. I suspect a 32B function alignment would be a pretty reasonable default in general -- is there a reason we only require 16B now? Did we find it mattered a lot for object file bloat or somesuch?
The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines. (In more detail, fetches are big aligned chunks because it's fewer wires and transistors to grab only an aligned chunk from an icache line rather than any arbitrary alignment with arbitrary shifting.)
Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!
hungryzzz commented on issue #8573:
Would you be willing to send a PR to update the alignment on x86-64? Someone else can do it if not, but hopefully it's an easy-ish change!
Hi, I am really willing to do it, but I am not familiar with
Rust
and code generation implementation, could you please give me some references or advices to help me start it?
cfallin commented on issue #8573:
I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?
hungryzzz commented on issue #8573:
The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.
The
func2
would be called many times, in my opinion, once the instructions offunc2
have been fetched into the icache, wouldn't it be necessary to fetch them again when callingfunc2
again?
hungryzzz commented on issue #8573:
I think it's just this constant that needs to be changed; probably with a comment update noting that we're aligning for better formance and a link to this issue?
Thank you very much! I'll try it!
cfallin commented on issue #8573:
The problem isn't actually with branch prediction, but with the way the instruction fetch works on most modern superscalar machines.
The
func2
would be called many times, in my opinion, once the instructions offunc2
have been fetched into the icache, wouldn't it be necessary to fetch them again when callingfunc2
again?Fetch is part of the CPU pipeline, and happens every time an instruction executes: the instructions are fetched from the icache into the pipeline (and then decoded, renamed, scheduled, executed, ...). So fetch bottlenecks affect even inner-loop performance. The step that puts instructions into the icache on a miss is usually called "cache fill" and happens in a separate pipeline.
(This is slightly less true on machines with uop caches and/or loop stream buffers, but those have their own limitations as well)
hungryzzz closed issue #8573:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
,WasmEdge(AOT)
andWamr(AOT)
, and collect their execution time respectively (measured bytime
tool).Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 4.67sWasmEdge
: 2.81sWamr
: 3.29sFor
bad.wasm
, the execution time in different runtimes are as follows:
Wasmtime
: 9.34sWasmEdge
: 2.35sWamr
: 3.29sThe difference between the attached two cases is as follow, i.e., changing the address of
i32.load
fromlocal.get 0
toi32.const 0
, the case withi32.const 0
has the bad performance inWasmtime
, whose execution time is 2x slower than the former one. I think the decreasing is confused and suspicious, why is an immediate operation slower than a local variable operation? Moreover, thei32.load
is out of the loop, I cannot understand why the operation could affect the overall performance so vastly.![截屏2024-05-07 23 19 36](https://github.com/bytecodealliance/wasmtime/assets/32137313/be606e8b-7da9-4f99-9697-e23b3b53e376)
Versions and Environment
Wasmtime version or commit: d0cf46a098d97bab9c4cf54a2f3aeaddea7d3831
Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Last updated: Dec 23 2024 at 12:05 UTC