TimonPost opened issue #6798:
We have noticed a recent regression in module compilation times on the MBP M1/M2. Sometimes it takes over 100 seconds when it used to take less than 15.
<img width="1325" alt="image" src="https://github.com/bytecodealliance/wasmtime/assets/19969910/d6751bb1-5254-4f28-ae66-0cd913370fc2">
Several non-scientific numbers have been reported on various platforms. At first three individual Mac users started reporting extreme numbers, often exceeding 90 seconds, while Windows and Linux users report an average of around 50 seconds, with some reporting more normal numbers below 10 seconds. We have attempted to isolate the problem by manually clearing all caches and measuring how long it takes to recompile from scratch.
Upon further profiling, it appears that the performance bottleneck is related to the "cranelift" library (see image).
Any tips on how we can help find out the root cause here are appreciated. Are there perhaps benchmarks to run to see if regressions happened?
Some non-scientific data:
- 47s, 11th Gen Intel® Core™ i7-11800H, Linux
- 15s, i9-12900H, Linux (22s without cache)
- 6s, i9-13900K, Windows (14s without cache)
- 8.2s, AMD 5950X, Linux
- 45s, 9th gen Intel i7, Windows
- 56s, AMD Ryzen 9 6900HS, Windows (36s without cache)
- 90s, Mac Pro M1/M2, macOS
Perhaps introduced in v10.0.0 or v10.0.1: we have not upgraded to 11 yet, and in 9.0.4 we fixed another module compilation issue and had not experienced any problems until quite recently.
TimonPost closed issue #6798:
TimonPost commented on issue #6798:
Going to gather more data to make this issue more useful
TimonPost edited a comment on issue #6798:
Going to gather some more data and reopen if I have more.
cfallin commented on issue #6798:
> Upon further profiling, it appears that the performance bottleneck is related to the "cranelift" library (see image).
Indeed, Cranelift is the whole compiler, so it makes sense that "slow compilation" involves Cranelift!
Thanks very much for the report; we definitely weren't aware of this. If you're able to bisect to a particular commit that would be really useful; otherwise, if you have an example wasm module that gets much slower to compile, that would be the next best thing.
jameysharp commented on issue #6798:
In addition to Chris' comments, here are a couple other thoughts:
The profile shows a third of time is spent in emitting islands, which makes me think this module has some very large functions in it. That could explain some of the difference between targets: the threshold where a function is large enough to need islands is different on aarch64 than on x86.
That said, I'm suspicious about the fact that different x86 systems have wildly different compile times too. It might be useful to run `wasmtime settings` on each x86 test system and see if there are differences in the "Settings inferred for the current host:" section. We could be hitting slow code paths that are only enabled when certain instruction-set features are present, or perhaps only when they're absent.

To normalize timing measurements you can do all your tests on the same computer, so you aren't comparing the performance of different CPUs. You can use `cargo build --features all-arch` to enable Wasmtime support for all architectures, or use the release artifacts from GitHub which are already built with that turned on. Then you can use `wasmtime compile --target x86_64` or `--target aarch64`. You may need to pass the various settings printed by `wasmtime settings` to the `--cranelift-enable` flag in order to get a fair comparison. I think there's also some magic when running on macOS but I can't find the details, so it might be easiest to run all the tests from your Mac Pro test system and cross-compile for x86.
TimonPost commented on issue #6798:
I wrote a small program to compile a 115 MB `.wasm` file for both `x86_64` and `aarch64`, and it's considerably slower on `aarch64`.

**Use `wasmtime` to compile different targets**

I built a script to test compiling our module and ran it three times on Windows 11 with an AMD Ryzen 9 5900X 12-Core:
```
settings enforced (`wasmtime settings`):
has_avx has_avx2 has_bmi1 has_bmi2 has_fma has_lzcnt has_popcnt has_sse3 has_sse41 has_sse42 has_ssse3

x86_64:  13.9465882s, 11.2770526s, 11.0808334s
aarch64: 157.5406149s, 144.1816011s, 144.1495896s
```
Someone else ran it on Windows with an Intel Core i7 9750H:
```
settings enforced (`wasmtime settings`):
has_avx has_avx2 has_bmi1 has_bmi2 has_fma has_lzcnt has_popcnt has_sse3 has_sse41 has_sse42 has_ssse3

x86_64:  23.5362889s, 24.3254295s
aarch64: 211.1731269s, 207.9991321s
```
MacBook Pro M1 (from the 2 devs who initially reported the problem):
```
Settings inferred for the current host:
has_lse has_pauth sign_return_address sign_return_address_with_bkey
```
<details>
```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::process::Command;
use std::time::Instant;

fn main() {
    let mut times: HashMap<String, std::time::Duration, RandomState> = HashMap::default();
    for target in &["x86_64", "aarch64"] {
        let start = Instant::now();
        let mut command = Command::new("wasmtime");
        command
            .arg("compile")
            .arg(format!("--target={target}"))
            // Path elided; points at our 115 MB module.
            .arg(r"path\xxxxx.wasm");
        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{target}"), start.elapsed());
        println!(
            "[{:?}]: {target} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }
    println!("{times:?}");
}
```
</details>
I also ran some queries on how often a module takes over 50 seconds to load (not just compile) in production environments. Over the 2 months since my previous patch #6548, where it was almost gone, it came back. It is a bit harder to visualize in relation to the total amount of module loads to get a percentile, but it shows something is going on.
![image](https://github.com/bytecodealliance/wasmtime/assets/19969910/0fab3df8-5507-4fdf-9328-8f7e6b9ad9b2)
And also a query on the median and mean times per day it took for a module to load (not only compile).
![image](https://github.com/bytecodealliance/wasmtime/assets/19969910/6e7f13e8-2183-4324-a64c-96f48985b9ba)
Unfortunately, I cannot share the wasm module.
TimonPost reopened issue #6798:
TimonPost edited a comment on issue #6798:
To the above points: I wrote a small program to compile a 115 MB `.wasm` file for both `x86_64` and `aarch64`, and it's considerably slower on `aarch64`.

**Use `wasmtime` to compile different targets**

I built a script to test compiling our module and ran it three times on Windows 11 with an AMD Ryzen 9 5900X 12-Core:
```
settings inferred (`wasmtime settings`):
has_avx has_avx2 has_bmi1 has_bmi2 has_fma has_lzcnt has_popcnt has_sse3 has_sse41 has_sse42 has_ssse3

wasmtime 11.0.1:
  x86_64:  13.9465882s, 11.2770526s, 11.0808334s
  aarch64: 157.5406149s, 144.1816011s, 144.1495896s

wasmtime 9.0.4:
  x86_64:  11.2984875s
  aarch64: 147.1249103s

wasmtime 9.0:
  x86_64:  11.0617372s
  aarch64: 159.7219714s
```
There doesn't seem to be much of a difference between wasmtime 9 and 11 :thinking:
Another dev ran it twice on Windows with an Intel Core i7 9750H:
```
settings inferred (`wasmtime settings`):
has_avx has_avx2 has_bmi1 has_bmi2 has_fma has_lzcnt has_popcnt has_sse3 has_sse41 has_sse42 has_ssse3

x86_64:  23.5362889s, 24.3254295s
aarch64: 211.1731269s, 207.9991321s
```
MacBook Pro M1 (from the 2 devs who initially reported the problem):
```
Settings inferred for the current host:
has_lse has_pauth sign_return_address sign_return_address_with_bkey
```
<details>
```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::process::Command;
use std::time::Instant;

fn main() {
    let mut times: HashMap<String, std::time::Duration, RandomState> = HashMap::default();
    for target in &["x86_64", "aarch64"] {
        let start = Instant::now();
        let mut command = Command::new("wasmtime");
        command
            .arg("compile")
            .arg(format!("--target={target}"))
            // Path elided; points at our 115 MB module.
            .arg(r"path\xxxxx.wasm");
        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{target}"), start.elapsed());
        println!(
            "[{:?}]: {target} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }
    println!("{times:?}");
}
```
</details>
I also ran some queries on how often a module takes over 50 seconds to load (not just compile) in production environments. Over the 2 months since my previous patch #6548, where it was almost gone, it came back. It is a bit harder to visualize in relation to the total amount of module loads to get a percentile, but it shows something is going on.
![image](https://github.com/bytecodealliance/wasmtime/assets/19969910/0fab3df8-5507-4fdf-9328-8f7e6b9ad9b2)
And also a query on the median and max time per day it took for a module to load (not only compile).
![image](https://github.com/bytecodealliance/wasmtime/assets/19969910/6e7f13e8-2183-4324-a64c-96f48985b9ba)
Unfortunately, I cannot share the wasm module.
alexcrichton commented on issue #6798:
Given this program:
```rust
use std::fmt::{self, Write};

const F: usize = 2000; // number of functions
const N: usize = 1000; // number of calls per function

fn main() -> fmt::Result {
    let mut s = String::new();
    writeln!(s, "(module")?;
    for _ in 0..F {
        writeln!(s, "(func")?;
        for _ in 0..N {
            writeln!(s, "call $f")?;
        }
        writeln!(s, ")")?;
    }
    writeln!(s, "(func $f)")?;
    writeln!(s, ")")?;
    println!("{s}");
    Ok(())
}
```
which generates a wasm via:
```
rustc -O foo.rs && ./foo | wasm-tools parse -o foo.wasm
```
I then can reproduce this issue I believe with:
```
$ time wasmtime compile foo.wasm --cranelift-set wasmtime_linkopt_padding_between_functions=$((1<<16)) --target x86_64
wasmtime compile foo.wasm --cranelift-set --target x86_64  5.43s user 0.31s system 736% cpu 0.780 total
$ time wasmtime compile foo.wasm --cranelift-set wasmtime_linkopt_padding_between_functions=$((1<<16)) --target aarch64
wasmtime compile foo.wasm --cranelift-set --target aarch64  15.14s user 2.60s system 180% cpu 9.809 total
```
I believe the issue here is that there's quadratic behavior in the way that fixups are handled in the `MachBuffer` right now. Specifically, when an island is emitted, any fixups which don't need to be present in the island are re-inserted back into the `fixup_records` list. This means that, in what I suspect is happening in the OP and what's happening in my test case, O(N) list entries are processed but O(N) are pushed back onto the list to get processed later. That gives O(N) passes over O(N) items, leading to quadratic behavior.

I stressed this above by setting the `wasmtime_linkopt_padding_between_functions` option to force space between functions to simulate big functions. Each function has a lot of calls to the final function in the module, which means that there is always a lot of pending fixup records. After ~1300 functions are emitted, each function has an island after it for prior functions, and the next 1k functions each process ~1M fixup records each time.

One "easy" fix is to change this line to a `VecDeque` instead of a `SmallVec`. That doesn't actually fix the quadratic behavior though, it just removes a large constant factor from it. For me that brought the runtime to ~5s.

I was otherwise testing out locally an entirely different strategy where the `fixup_records` are stored as a `BinaryHeap` and only those necessary to be emitted into the island are emitted, leaving everything else in place. This is not a viable solution as-is because I had to disable the `MachBuffer` branching optimizations, but it did bring the compile time down to 2s from the original ~10s for aarch64. Profiling shows the 2 seconds is almost entirely spent in b-tree operations as part of regalloc, which I'm interpreting as meaning that this is indeed the issue and would bring aarch64 and x86_64 back into parity with each other.
Do others have a better idea about how to handle the `fixup_records` list? The problem with the `BinaryHeap` approach is that the branch optimizations want to delete last-added records and additionally mutate records arbitrarily within the heap, both of which aren't operations that `BinaryHeap` supports.

Alternatively, one perhaps more radical idea would be to remove the branch optimizations entirely. My naive understanding of them is that they're better suited for restructuring the CFG (e.g. basic-block style mid-end optimizations) rather than emission-time optimizations.
One unfortunate part here is that the branch optimizations, which are what make the `BinaryHeap` approach problematic, aren't used at all when assembling functions into one large image. Only intra-function assembly uses branch optimization, which is less of an issue here because functions are typically not large enough to hit this quadratic behavior. I'm not sure that this can be extrapolated into something useful, though, and inevitably a big function will hit this quadratic behavior too (in addition to the big module in the OP).
cfallin commented on issue #6798:
@alexcrichton I'm taking a look at this now. My most basic unanswered question is actually: what changed recently? The original report above is that this is a regression between v9 and v10; the core `MachBuffer` algorithm has not changed in three years (since around June 2020).

> Alternatively, one perhaps more radical idea would be to remove the branch optimizations entirely. My naive understanding of them is that they're better suited for restructuring the CFG (e.g. basic-block style mid-end optimizations) rather than emission-time optimizations.

Do you mean avoiding use of the `MachBuffer` for the inter-function linking? (My alternative reading of this suggestion -- not doing peephole opts for each function body emission -- seems unlikely, as they're really important for perf: something like 25% when I introduced them, at least.)

That's possible for sure; I guess the question would then be whether we switch back to full Abs8 relocations for all calls (and reintroduce support for relocs into Wasmtime, as code won't be fully PIC anymore) or get island handling some other way. Fundamentally it seems to me that:

- We do need islands (because of the above);
- Islands require some sort of label resolution;
- Label resolution needs some way of handling both forward and reverse refs;
- Forward refs (labels not yet defined) are the core of the issue here, causing the re-insertion (here).

It seems to me that the most elegant approach may be to kick some label-fixup records off to an "at max range, don't reconsider until final fixup" list, then not consider them in `emit_island` again. We'd need a notion in `LabelUse` of "maximum range"; for aarch64 that's one step or two (19-bit -> 26-bit -> full Abs8, or 26-bit -> Abs8), and in Wasmtime in practice just one step for conditionals and no escalation at all for jumps. `emit_island` clears the most-recent-branch list, so we don't have to worry about invariants wrt its connection to `fixup_records`. Thoughts?

I can try to prototype this today (but I have a pretty meeting-heavy day, sadly, so if you want to get to it first, please feel free!).
cfallin commented on issue #6798:
Actually it turns out that a simple implementation of the above is a Caltrain-ride-sized chunk of work: #6804. It still doesn't resolve the issue because there is actually a 32-bit PCrel label kind that we want to be able to use, so the Branch26's stick around and participate in the quadratic dance. I think we want some sort of tiered deadline approach; continuing to think!
cfallin commented on issue #6798:
The latest commit in #6804 now solves the issue, I think. I'll do more thorough testing and cleanup later in the day when I have a chance. One interesting consequence of the new change (every forward-ref that crosses an island gets a veneer) is that I think we no longer need a notion of worst-case size; but I'll verify that. It does also pessimize the case where a bunch of 26-bit forward refs cross a 19-bit-caused island in a very large function body, but "calls jump through an island in the middle of a function" is not technically wrong, so... (I'll measure with some benchmarks to be sure!)
TimonPost commented on issue #6798:
Thank you for confirming and swiftly addressing the optimization approach. I have a question that hasn't been answered yet: why is there such a significant difference in performance between the aarch64 and x86_64 architectures?
Also, I'd like to mention @cfallin and express a bit of uncertainty regarding whether this could be related to a version regression. I cannot confirm it is (also see benchmarks https://github.com/bytecodealliance/wasmtime/issues/6798#issuecomment-1665189801). However, various people on our team have noticed that aarch64 performance has recently been noticeably slower than before.
Our code base is growing fast, so it is not odd to experience slower compile times; however, I wonder if there are perhaps certain things we do that are very heavy on wasm-generated code (.wasm), and why there is such a big mismatch between targets?
cfallin commented on issue #6798:
Why is there such a significant difference in performance between the aarch64 and x86_64 architectures?
This is a consequence of the kinds of label references (relocations, kind of) that the two architectures have, and more broadly a difference between RISC-y architectures and CISC-y ones. AArch64 has constant-size (32-bit) instructions, so its branches have either a 19-bit or a 26-bit offset field (the former for conditionals, which need to encode more information) -- but code can be larger than that. Likewise, references to the "constant pool" have a 19-bit offset. To make larger code work, we create "veneers" in an "island" embedded in the code, where a branch first jumps to a longer-range branch (the veneer). The island also contains constants whose references are about to go out of range. This allows us to do single-pass emission, rather than going back and re-emitting with a longer-form branch (which may be a sequence of multiple insts and thus shift other offsets, causing a cascading fixup process). On the other hand, x86-64 has variable-length instructions and so can encode a 32-bit offset almost everywhere; we unconditionally use 32-bit offsets for conditional and unconditional jumps, and for data references, so we never need an island of veneers/constants.
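Concretely, since aarch64 branch offsets are counted in 4-byte instructions, the reach in bytes is four times the signed offset range. A minimal sketch of that arithmetic (the helper function is illustrative, not a Cranelift API):

```rust
/// Reach in bytes of an aarch64 pc-relative branch with an
/// `offset_bits`-wide signed offset field. The field counts 4-byte
/// instructions, so reach = 2^(offset_bits - 1) * 4 bytes in each
/// direction.
fn reach_bytes(offset_bits: u32) -> i64 {
    (1i64 << (offset_bits - 1)) * 4
}
```

So a 19-bit conditional branch reaches +/- 1 MiB and a 26-bit unconditional branch reaches +/- 128 MiB; any function layout that pushes a target beyond those limits forces a veneer.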
recently the aarch64 performance has been noticeably slower compared to before.
If you have an example module we can benchmark and a bisection range ("faster with vX, slower since vY") that'd be very helpful!
jameysharp commented on issue #6798:
If I understand Chris and Alex correctly, the relevant difference between x86_64 and aarch64 is that x86 branch instructions take 32-bit signed offsets, while aarch64 unconditional branches take effectively 28-bit signed offsets.
Your 115MB wasm module is almost 2^27 bytes, so if the compiled aarch64 binary is bigger than the wasm input, then a branch from one end of it to the other is just past the threshold. If you had a 2GB wasm module I'd expect to start seeing similar problems on x86, assuming you didn't run into other bugs first.
I would guess that things have gotten slower not because of Wasmtime changes but because as your wasm modules get bigger, the number of branches that overflow the signed offset limit increases.
Oddly, I think this means that topo-sorting the functions in the wasm module would tend to avoid hitting this behavior in Wasmtime, and also let Wasmtime generate slightly better code. I'm not sure that's worth pursuing though.
alexcrichton commented on issue #6798:
what changed recently?
I think the answer here is @TimonPost's module got bigger. If I take my example module and compile it with Wasmtime 4.0.0 (to pick a random version) which is circa Dec 2022 I get x86_64 being 0.9s and aarch64 being 16s (different computer than my original measurements).
In that sense I don't think that this is a new issue, I think this is a scaling issue that @TimonPost's module has run into now.
Do you mean avoiding use of the MachBuffer for the inter-function linking?
Oh sorry, no, definitely not; that's too load-bearing to replace with something else! What I was saying is that for the inter-function use case the branch optimizations are not necessary (as there are no branches). I'll also clarify that I'm not saying the branch optimizations aren't worth it; I'm saying that, if it works out perf-wise, it might be better to perform these optimizations at the mid-end instead of just before emission (e.g. basic-block jump threading and other basic-block optimizations on the clif layer). Note though that I only think this is a possible good idea if it retains all of the benefits that the current branch optimization brings; if something must be done at the mach buffer layer and can't be done at the clif layer then it's good to keep.
The latest commit in https://github.com/bytecodealliance/wasmtime/pull/6804 now solves the issue
Nice! I'll also see your train ride and raise you a plane ride :)
I'll take a look probably early next week :+1: (unless someone beats me to it)
cfallin commented on issue #6798:
I'm saying that, if it works out perf-wise, it might be better to perform these optimizations at the mid-end instead of just before emission (e.g. basic-block jump threading and other basic-block optimizations on the clif layer). Note though that I only think this is a possible good idea if it retains all of the benefits that the current branch optimization brings; if something must be done at the mach buffer layer and can't be done at the clif layer then it's good to keep.
Ah, I see. I think it's probably best to keep `MachBuffer`'s optimizations where they are: for historical context, when it replaced a separate pass, it resulted in 10% better compile time as well as 24% better perf overall. The reason is that the peephole pass is designed to be as unintrusive as possible: no editing of data structures, just "chomping" of branches where we see they're unnecessary. The two-target branch form is also deeply embedded in the design of the VCode and the register allocator; it's a much more natural representation of a CFG than the "one-sided" form (branch or fallthrough) that true machine branches have.
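The two-target vs. one-sided distinction can be illustrated with a toy model. This is not Cranelift's actual representation, just a hypothetical sketch of the idea: VCode conditional branches name both successors, and layout-time "chomping" drops the edge that becomes a fallthrough.

```rust
/// Toy branch forms: the symmetric CFG edge used pre-layout, and the
/// asymmetric machine forms that exist after layout.
#[derive(Debug, PartialEq)]
enum Branch {
    /// Two-target form: both successors explicit, no implicit fallthrough.
    CondTwoTarget { taken: u32, not_taken: u32 },
    /// One-sided machine conditional: branch if taken, else fall through.
    Cond { taken: u32 },
    /// Unconditional jump.
    Jump { target: u32 },
}

/// Lower a two-target branch given the block laid out next: if the
/// not-taken edge is the fallthrough, it is "chomped"; otherwise an
/// explicit jump is emitted for it.
fn lower(b: Branch, next_block: u32) -> Vec<Branch> {
    match b {
        Branch::CondTwoTarget { taken, not_taken } if not_taken == next_block => {
            vec![Branch::Cond { taken }]
        }
        Branch::CondTwoTarget { taken, not_taken } => vec![
            Branch::Cond { taken },
            Branch::Jump { target: not_taken },
        ],
        other => vec![other],
    }
}
```

The symmetric form is what makes the register allocator's view of the CFG uniform; the asymmetry only appears once block order is fixed.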
TimonPost commented on issue #6798:
I can confirm our `.wasm` file has been growing by an average of 8 MB/month over the last year, and that this might play a role in why the slowdown has become more noticeable over time.
cfallin closed issue #6798:
Last updated: Nov 22 2024 at 17:03 UTC