hungryzzz opened issue #8706:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.42s
- WasmEdge: 1.05s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.29s
- WasmEdge: 0.91s
The difference between the attached two cases is as follow, i.e., changing one of the operand of
i32.and
from0
to1
. The difference and bring 1.8s performance decreasing onWasmtime
but has no negative effect onWasmEdge
.➜ cases diff good.wat bad.wat 35c35 < i32.const 0 --- > i32.const 1
;; part of good.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 0 ;; here i32.and i32.add local.set 6 br 1 ;; part of bad.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 1 ;; here i32.and i32.add local.set 6 br 1
I check the machine code generated by
Wasmtime
andWasmEdge
respectively, the instruction numbers offunc2
(the difference is in it) are as follows:
WasmEdge
good.wasm
: 89bad.wasm
: 92- diff: 3
Wasmtime
good.wasm
: 132bad.wasm
: 157- diff: 25
I think maybe the difference may affect the data analysis during optimization, but I really confuse why the instructions number of generated machine code could change so much, and I think maybe the extra instructions in
bad.wasm
make it slower than the good one.
I also use Perf tool to profile the hotspots inWasmtime
but I cannot find something useful.Versions and Environment
- Wasmtime version or commit: 7f7064c74
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
hungryzzz edited issue #8706:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.42s
- WasmEdge: 1.05s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.29s
- WasmEdge: 0.91s
The difference between the attached two cases is as follow, i.e., changing one of the operand of
i32.and
from0
to1
. The difference and bring 1.8s performance decreasing onWasmtime
but has no negative effect onWasmEdge
.➜ cases diff good.wat bad.wat 35c35 < i32.const 0 --- > i32.const 1
;; part of good.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 0 ;; here i32.and i32.add local.set 6 br 1 ;; part of bad.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 1 ;; here i32.and i32.add local.set 6 br 1
I check the machine code generated by
Wasmtime
andWasmEdge
respectively, the instruction numbers offunc2
(the difference is in it) are as follows:
WasmEdge
good.wasm
: 89bad.wasm
: 92- diff: 3
Wasmtime
good.wasm
: 132bad.wasm
: 157- diff: 25
I think maybe the difference may affect the data flow analysis during optimization so maybe some optimization decisions are different , but I really confuse why the instructions number of generated machine code could change so much, and I think maybe the extra instructions in
bad.wasm
make it slower than the good one.
I also use Perf tool to profile the hotspots inWasmtime
but I cannot find something useful.Versions and Environment
- Wasmtime version or commit: 7f7064c74
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
hungryzzz edited issue #8706:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.42s
- WasmEdge: 1.05s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.29s
- WasmEdge: 0.91s
The difference between the attached two cases is as follow, i.e., changing one of the operand of
i32.and
from0
to1
. The difference and bring 1.8s performance decreasing onWasmtime
but has no negative effect onWasmEdge
.➜ cases diff good.wat bad.wat 35c35 < i32.const 0 --- > i32.const 1
;; part of good.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 0 ;; here i32.and i32.add local.set 6 br 1 ;; part of bad.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 1 ;; here i32.and i32.add local.set 6 br 1
I check the machine code generated by
Wasmtime
andWasmEdge
respectively, the instruction numbers offunc2
(the difference is in it) are as follows:
WasmEdge
good.wasm
: 89bad.wasm
: 92- diff: 3
Wasmtime
good.wasm
: 132bad.wasm
: 157- diff: 25
I think maybe the difference affect the data flow analysis during optimization so maybe some optimization decisions are different , but I really confuse why the instructions number of generated machine code could change so much, and I think maybe the extra instructions in
bad.wasm
make it slower than the good one.
I also use Perf tool to profile the hotspots inWasmtime
but I cannot find something useful.Versions and Environment
- Wasmtime version or commit: 7f7064c74
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
cfallin commented on issue #8706:
Hi @hungryzzz -- while we appreciate these reports, they seem to be generated by some infrastructure and "mass-produced", so to speak, so to help us out, it may be good to do a little bit more work. In particular: for each performance-delta report (including this one), would you be able to reduce the report to assembly, show us the difference in generated assembly, and write a microbenchmark in assembly that isolates the performance effect and shows it to be valid (as I mentioned in a previous comment on a very similar issue)?
hungryzzz commented on issue #8706:
Hi @cfallin, I write a microbenchmark in assembly based on the generated machine code of
good.wasm
andbad.wasm
. And I also provide a comparison report show the differences of the machine code.
The microbenchmark and report are attached in reports.zip. Thank you!# The execution times of `asm_good` and `asm_bad` are as follows: ➜ binary-asm time ./asm_good ./asm_good 1.52s user 0.00s system 99% cpu 1.524 total ➜ binary-asm time ./asm_bad ./asm_bad 3.00s user 0.00s system 99% cpu 3.004 total
cfallin commented on issue #8706:
Thanks @hungryzzz. Unfortunately there's still a nontrivial diff, I guess I had hoped for a little more investigation to point to exactly what the microarchitectural difference was; take the two inner loops, transform one into the other one step at a time, see where the cliff happens, that sort of thing. To be more direct: most folks here (at least those working on Cranelift full-time) don't have time to dig into a performance anomaly of several hundred lines of assembly with unclear prioritization (i.e., found via fuzzing rather than via real-use-case report) so for this effort to be successful you'll have to do a little more of the investigation yourself and point to what we can change. Sorry and thanks!
iximeow commented on issue #8706:
@hungryzzz another point to keep in mind, the diff represented by
report.html
understates the true difference betweenasm_good
andasm_bad
: in addition to the additional instructions inasm_bad
, many of the instructions that aremov
in both are actuallymov %reg, offset(%rsp)
inasm_bad
, where they were register-register movs on the right. i skimmedasm_good.c
andasm_bad.c
and the additional saves and loads from the stack seem to be the vast majority of the difference between these two.incidentally, these run in about the same time on a Ryzen 9 3950X and i'm downright shocked at the IPC it managed. here's
asm_bad.c
with almost 5.5 IPC (!):[11:44:28] # iximeow:~/reports> sudo perf stat -e cycles -e instructions -e L1-dcache-loads \ -e L1-dcache-misses -e stalled-cycles-frontend -e stalled-cycles-backend \ ./bad Performance counter stats for './bad': 17,220,141,631 cycles (83.31%) 94,551,730,488 instructions # 5.49 insn per cycle # 0.05 stalled cycles per insn (83.35%) 12,890,324,509 L1-dcache-loads (83.36%) 270,501 L1-dcache-misses # 0.00% of all L1-dcache hits (83.36%) 756,707 stalled-cycles-frontend # 0.00% frontend cycles idle (83.35%) 4,303,679,448 stalled-cycles-backend # 24.99% backend cycles idle (83.27%) 3.869444889 seconds time elapsed 3.867886000 seconds user 0.000000000 seconds sys
vs
asm_good.c
which runs half as many instructions in almost the same total time:[11:45:00] # iximeow:~/reports> sudo perf stat -e cycles -e instructions -e L1-dcache-loads -e L1-dcache-misses \ -e stalled-cycles-frontend -e stalled-cycles-backend \ ./good Performance counter stats for './good': 17,212,385,161 cycles (83.29%) 47,256,102,329 instructions # 2.75 insn per cycle # 0.00 stalled cycles per insn (83.29%) 3,791,613 L1-dcache-loads (83.30%) 134,600 L1-dcache-misses # 3.55% of all L1-dcache hits (83.40%) 645,951 stalled-cycles-frontend # 0.00% frontend cycles idle (83.39%) 8,859,964 stalled-cycles-backend # 0.05% backend cycles idle (83.32%) 3.975132228 seconds time elapsed 3.969715000 seconds user 0.003997000 seconds sys
i suspect an iterative transformation from one loop to the other probably won't be super illuminating: something like "movs to and from the stack are slow on some machines, and as you add more it gets more slow". but @hungryzzz another approach that would certainly help here is if you can reduce the input wasm program down - if the program were 30 instructions and switching the constant from 0 to 1 still produces significantly different codegen, that's much easier to investigate.
hungryzzz commented on issue #8706:
Hi, when I try to delete part of the code in the above bad case, I find many small deletion cannot replay this bug, so I think I cannot reduce the bad case to a shorter one.
But I find another interesting thing during the reduction. I comment out the following part of the code as this branch must be triggered, and execution times are quite different after this deletion.
bad.wasm
: 3.33s- after deletion (called
good.wasm
in the attached file): 1.80sAnd still I find the generated machine codes are quite difference which are attached. So I guess maybe some analysis during compilation are wrong, resulting the suboptimal code generation. But actually I don't know how to inspect further for this. If you need more investigation, I am willing to do that under your suggestions.
block ;; i32.const 1 ;; if local.get 6 local.set 5 ;; br 1 ;; end end
hungryzzz edited a comment on issue #8706:
Hi, when I try to delete part of the code in the above bad case, I find many small deletion cannot replay this bug, so I think I cannot reduce the bad case to a shorter one.
But I find another interesting thing during the reduction. I comment out the following part of the code as this branch must be triggered. However, execution times are quite different after this deletion.
bad.wasm
: 3.33s- after deletion (called
good.wasm
in the attached file): 1.80sAnd still I find the generated machine codes are quite difference which are attached. So I guess maybe some analysis during compilation are wrong, resulting the suboptimal code generation. But actually I don't know how to inspect further for this. If you need more investigation, I am willing to do that under your suggestions.
block ;; i32.const 1 ;; if local.get 6 local.set 5 ;; br 1 ;; end end
cfallin commented on issue #8706:
Ah, if making that assignment from local 6 to local 5 unconditional causes the performance delta to disappear, likely what is happening here is that you're seeing the lack of constant-branch folding in Cranelift. That's not a "wrong analysis", that's just an optimization we haven't implemented yet (and it's somewhat nontrivial to do in our egraph mid-end, though we have ideas). It's entirely possible that that could cascade into other optimizations (GVN, ...?) applying or not applying, causing more register pressure and the spills you're seeing.
hungryzzz commented on issue #8706:
So the only way to validate the guessing is to implement the constant-branch folding in Cranelift, right? But how to explain the performance difference shown at first?
cfallin commented on issue #8706:
The direct cause of the perf delta seems to be the additional spills, as @iximeow pointed out.
hungryzzz commented on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange finding.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
- bad.wasm: 3.26s
- after change (called good.wasm in the attached file): 1.80s
➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (but they use old version), but the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz edited a comment on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange case.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
- bad.wasm: 3.26s
- after change (called good.wasm in the attached file): 1.80s
➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (but they use old version), but the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz edited a comment on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange case.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
- bad.wasm: 3.26s
- after change (called good.wasm in the attached file): 1.80s
➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
;; part of bad.wat i32.const -65537 global.set 1 ;; part of good.wat i32.const 65536 global.set 1
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (but they use old version), but the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz edited a comment on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange case.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
bad.wasm
: 3.26s- after change (called
good.wasm
in the attached file): 1.80s➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
;; part of bad.wat i32.const -65537 global.set 1 ;; part of good.wat i32.const 65536 global.set 1
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (but they use old version), but the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz edited a comment on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange case.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
bad.wasm
: 3.26s- after change (called
good.wasm
in the attached file): 1.80s➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
;; part of bad.wat i32.const -65537 global.set 1 ;; part of good.wat i32.const 65536 global.set 1
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (they use old version one), the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz edited a comment on issue #8706:
I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the
bad.wasm
and I get another strange case.This time I change a value which will be set to the global variable [1] from
-65537
to65536
, and their executiom times are also quite different.
bad.wasm
: 3.26s- after change (called
good.wasm
in the attached file): 1.80s➜ cases diff bad.wat good.wat 124c124 < i32.const -65537 --- > i32.const 65536
;; part of bad.wat i32.const -65537 global.set 1 ;; part of good.wat i32.const 65536 global.set 1
I have some confusions and findings about these cases:
1. Originally, I thought maybe the changed control flows bring additional spills inbad.wasm
. However, the global variable[1] will not be used in program (I don't findglobal.get 1
), so I think they should have nearly the same generated code.
2. I try to change-65537
to another negtive numbers, such as-1
,-2
, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as65536
,100
, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
3. I run these two cases inwasmer
which also usesCranelift
as their compiler (they use old version one,0.91.1
), the executime times ofwasmer
before and after the changes are the same.
-bad.wasm
(wasmer
): 2.70s
- after change(wasmer
): 2.69sTherefore, I still think there is something wrong during the compilation of
bad.wasm
. But still, I don't know how to do further investigation.
The attached files are the newgood.wasm
and thereport.html
.
hungryzzz commented on issue #8706:
Hi, I go though the generated machine codes and I think I know something about the performance difference.
I find that if the value which will be set to global variable[1] is a negtive number, then the compiler will use two instructions to represent the
global.set
, i.e., first moving$0xfffeffff
to a register%r10d
then moving the register to the specified memory address. And if the value is a positive number, then the compiler will use only one instruction to do it, i.e., directly moving the$0x10000
to the specified address. Because the first one needs one more register, additional spills are needed.
By the way, I check the machine code generated byWasmEdge
, it usemovl $0xfffeffff,0xa0(%rdi)
to execute thebad.wasm
.# part of the machine code of bad.wasm 104 | mov $0xfffeffff,%r10d 105 | mov %r10d,0xa0(%rdi) # part of the machine code of good.wasm 95 | movl $0x10000,0xa0(%rdi)
After the above all investigation, I guess the root cause of all the performance differences is the different register allocation algorithm used by
Wasmtime
and others. I think maybe slightly changes on thebad.wasm
will cause more requirements for register, then resutling in the additional spills to the stack.
hungryzzz edited a comment on issue #8706:
Hi, I go though the generated machine codes and I think I know something about the performance difference.
I find that if the value which will be set to global variable[1] is a negtive number, then the compiler will use two instructions to represent the
global.set
, i.e., first moving$0xfffeffff
to a register%r10d
then moving the register to the specified memory address. And if the value is a positive number, then the compiler will use only one instruction to do it, i.e., directly moving the$0x10000
to the specified address. Because the first one needs one more register, additional spills are needed.
By the way, I check the machine code generated byWasmEdge
, it use likemovl $0xfffeffff,0xa0(%rdi)
to execute thebad.wasm
.# part of the machine code of bad.wasm 104 | mov $0xfffeffff,%r10d 105 | mov %r10d,0xa0(%rdi) # part of the machine code of good.wasm 95 | movl $0x10000,0xa0(%rdi)
After the above all investigation, I guess the root cause of all the performance differences is the different register allocation algorithm used by
Wasmtime
and others. I think maybe slightly changes on thebad.wasm
will cause more requirements for register, then resutling in the additional spills to the stack.
hungryzzz edited a comment on issue #8706:
@cfallin Hi, I go though the generated machine codes and I think I know something about the performance difference.
I find that if the value which will be set to global variable[1] is a negtive number, then the compiler will use two instructions to represent the
global.set
, i.e., first moving$0xfffeffff
to a register%r10d
then moving the register to the specified memory address. And if the value is a positive number, then the compiler will use only one instruction to do it, i.e., directly moving the$0x10000
to the specified address. Because the first one needs one more register, additional spills are needed.
By the way, I check the machine code generated byWasmEdge
, it use likemovl $0xfffeffff,0xa0(%rdi)
to execute thebad.wasm
.# part of the machine code of bad.wasm 104 | mov $0xfffeffff,%r10d 105 | mov %r10d,0xa0(%rdi) # part of the machine code of good.wasm 95 | movl $0x10000,0xa0(%rdi)
After the above all investigation, I guess the root cause of all the performance differences is the different register allocation algorithm used by
Wasmtime
and others. I think maybe slightly changes on thebad.wasm
will cause more requirements for register, then resutling in the additional spills to the stack.
cfallin commented on issue #8706:
Ah, interesting, there may be something odd about the way we lower stores-of-immediates in the x64 backend that precludes this case -- @abrown do you think you'd be willing to take a look at this? (If true it should be reproducible with a small CLIF test that stores a constant
0x10000
or0xfffeffff
to some given address)
abrown commented on issue #8706:
Sure, I can put this on the TODO list but no promises about when I get to it!
hungryzzz commented on issue #8706:
@alexcrichton Hi, I pulled your PR and found it can fix the above case, thank you! But could I ask why positive and negative numbers are stored in different ways in memory before this PR? And how do you fix it? (I tried to understand these when I went through the PR but failed...) Thanks again!
cfallin commented on issue #8706:
@hungryzzz the issue was a missed case in the instruction lowering. x86-64 has a store instruction (a form of
mov
) that can store arbitrary 32-bit values to the destination; however, we weren't matching signed-negative 32-bit values in our existing lowering rules, so we fell back to the slower path of "put the value in a register then store it". Alex's PR fixed the rule match (left-hand side) to catch that case as well.
hungryzzz commented on issue #8706:
Got it! Thanks a lot!
hungryzzz commented on issue #8706:
Hi, I am also wondering why only the one register requirement leads to so many additional spills? If requiring more registers, would it get worse? Thank you very much! (I really hope that my questions would not spend too much of your time...)
cfallin commented on issue #8706:
@hungryzzz we're happy to help with questions like these, but the Bytecode Alliance Zulip is probably easier for such discussions (that are really about how compilers work rather than resolving this particular issue).
With Alex's fix in #8842 are we ok to close this issue or do you see any more performance deltas?
hungryzzz closed issue #8706:
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(
good.wasm
&bad.wasm
) inWasmtime
andWasmEdge
(AOT), and collect their execution time respectively (measured bytime
tool).# command to collect execution time of wasmtime wasmtime compile bad.wasm -o bad.cwasm time wasmtime run --allow-precompiled bad.cwasm # command to collect execution time of wasmedge wasmedgec bad.wasm bad-wasmedge-aot.wasm time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For
good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 1.42s
- WasmEdge: 1.05s
For
bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.29s
- WasmEdge: 0.91s
The difference between the attached two cases is as follow, i.e., changing one of the operand of
i32.and
from0
to1
. The difference and bring 1.8s performance decreasing onWasmtime
but has no negative effect onWasmEdge
.➜ cases diff good.wat bad.wat 35c35 < i32.const 0 --- > i32.const 1
;; part of good.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 0 ;; here i32.and i32.add local.set 6 br 1 ;; part of bad.wat if i32.const 1 local.set 2 i32.const 1 local.get 0 i32.const 1 ;; here i32.and i32.add local.set 6 br 1
I check the machine code generated by
Wasmtime
andWasmEdge
respectively, the instruction numbers offunc2
(the difference is in it) are as follows:
WasmEdge
good.wasm
: 89bad.wasm
: 92- diff: 3
Wasmtime
good.wasm
: 132bad.wasm
: 157- diff: 25
I think maybe the difference affect the data flow analysis during optimization so maybe some optimization decisions are different , but I really confuse why the instructions number of generated machine code could change so much, and I think maybe the extra instructions in
bad.wasm
make it slower than the good one.
I also use Perf tool to profile the hotspots inWasmtime
but I cannot find something useful.Versions and Environment
- Wasmtime version or commit: 7f7064c74
- Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
- Architecture: Intel(R) Core(TM) i5-13500
hungryzzz commented on issue #8706:
The performance anomaly has been fixed, thanks a lot!
Last updated: Jan 24 2025 at 00:11 UTC