I just noticed that LTO wasn't enabled for wasmtime, would it make sense to enable it at some point, for release builds? (Or is it set automatically under a specific Rust profile?)
ah, good question! @Alex Crichton, do you see any reason not to enable thin LTO? AFAICT it's probably indeed not enabled through some indirect means?
The main reason is the ratio of build time to benefit gained; even thin LTO for the full crate graph can sometimes take a while
mostly because we produce a "full release" on every commit
but other than that should be fine to enable
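For reference, a minimal sketch of what enabling thin LTO for release builds could look like, assuming the setting goes in the workspace root Cargo.toml and the release profile is otherwise left at its defaults:

```toml
# Workspace root Cargo.toml (assumed location for this sketch).
[profile.release]
# "thin" selects ThinLTO; "fat" (or `true`) selects full LTO across the crate graph.
lto = "thin"
```

With this in place, any `cargo build --release` or `cargo test --release` invocation would pick up the setting, which is why the build-time cost shows up on every release-mode CI job.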
ah, cool. @Benjamin Bouvier would you be up for doing an experiment with this as a PR, so we can see the impact on build times?
I'm happy to try this, yes!
Do we already have a small, blessed set of benchmarks to get an idea of the performance impact of enabling LTO?
I've looked at the benchmarking rfc, and it seems this is still in flux
I've got some initial measurements on my machine for compile time, at least. My machine is quite beefy (32 cores at 4 GHz), so it might not be representative and we should measure the effect on CI, in particular. All measurements were done after clearing the build cache (cargo clean).
In terms of throughput: I only have local benchmarks which measure the throughput of generated code, since I've always been working on mostly Cranelift and codegen :-) So the measurement might not be very telling, as there's only very little time spent in initializing the VM and calling into the codegen'd code. On the 4 synthetic benchmarks I've tried, the speedup is in the noise range, for both thin and fat LTO modes.
With valgrind, I measured the retired instruction count for 2 very small benchmarks:
so 1.5 to 2% retired instructions decrease for fat LTO, when compared to the baseline.
At this point, I think that:
And here's some summary of the 3 PRs:
So, for LTO=fat, given the huge increase in compile times and the relatively low benefit in run time (with the caveat that the benchmarks are synthetic codegen benchmarks), it seems it would be counterproductive to enable it right now. More realistic benchmarks should be used to determine if this is worth it.
For LTO=thin, the mingw failures would need to be investigated and solved first. Locally the increase in compile times has been unnoticeable, the effect on run time (same caveat applies) has been very low as well. So it may be fine to postpone it as well.
Overall, it seems that more VM-heavy benchmarks (e.g. WASI benchmarks) would be required to make interesting measurements here, and that enabling LTO would be a perfect testbed for the benchmarking infrastructure in general.
thank you for looking into this! :heart:
I agree that fat LTO is clearly not viable, which doesn't seem surprising. I also agree that we should have more useful benchmarks to evaluate the rest, including whether it's worth it to sort out the issues with thin-LTO on Windows. Though perhaps @Alex Crichton knows what's going on with those, and how to easily address them?
sure yeah happy to look into mingw issues, but we should be careful with evaluation numbers. LTO shouldn't be 2x slower but with Cargo it's easy to accidentally build things 2x more than before. I think when just the release profile is changed then it means cargo build --release will probably only share build dependencies with cargo test --release, so there's a huge duplication of artifacts built. Similarly ThinLTO will probably hit duplicate build issues.
What we probably want to do is to drill down into what we want LTO'd and perhaps do that on a separate builder? That way builders can ideally share a cache.
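One way this idea could be sketched, assuming a Cargo version that supports custom profiles, is a dedicated profile that only the LTO builder uses. The profile name `release-lto` is hypothetical and not something from this discussion:

```toml
# Hypothetical profile name; only builds that opt in via --profile pay for LTO.
[profile.release-lto]
inherits = "release"   # start from the ordinary release settings
lto = "thin"           # assumption: thin LTO is the variant being evaluated
```

The separate builder would then run `cargo build --profile release-lto`, while the other builders keep using plain `--release` and its existing cached artifacts. Cargo writes custom-profile output to its own directory under `target/`, so the two sets of artifacts do not overwrite each other.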
ah, that makes sense!
and makes me think even more that we should tackle this with useful benchmarks in hand that exercise the runtime itself in a meaningful way. Once we do, I guess we could even look into using PGO on published builds, which seems like it could make a meaningful difference
I gave PGO a spin the other day for testing the compile time of a few modules but unfortunately it didn't give really any meaningful difference for me locally
although I was pretty un-scientific in my measurements
yeah, that all seems like it really wants benchmarks to run that exercise enough of the runtime. I'd bet that a lot of stuff one would choose somewhat arbitrarily ends up spending most time in Cranelift-compiled code, and thus not really benefit from PGO
but perhaps you accounted for that?
oh yeah what I was testing was exclusively compile time
no runtime at all
yeah, I guess compile time is something we should in theory already be well positioned to evaluate, and where naively I would've expected to see PGO make a difference. Oh well ...
also reading just the error message of MinGW, one of the issues with lto is that it builds all the examples with LTO as well, so we're doing the full LTO passes maybe 10-ish times, which as you can imagine increases compile times a lot
lol
oh well
Last updated: Dec 23 2024 at 12:05 UTC