enable lto for production builds of wasmtime? · general

Stream: general

Topic: enable lto for production builds of wasmtime?

Benjamin Bouvier (Nov 19 2020 at 09:35):

I just noticed that LTO wasn't enabled for wasmtime, would it make sense to enable it at some point, for release builds? (Or is it set automatically under a specific Rust profile?)

Till Schneidereit (Nov 19 2020 at 12:38):

ah, good question! @Alex Crichton, do you see any reason not to enable thin LTO? AFAICT it's probably indeed not enabled through some indirect means?

Alex Crichton (Nov 19 2020 at 16:02):

The main reason is the ratio of build time to benefit gained; even thin LTO for the full crate graph can sometimes take a while

Alex Crichton (Nov 19 2020 at 16:03):

mostly because we produce a "full release" on every commit

Alex Crichton (Nov 19 2020 at 16:03):

but other than that should be fine to enable

Till Schneidereit (Nov 19 2020 at 16:24):

ah, cool. @Benjamin Bouvier would you be up for doing an experiment with this as a PR, so we can see the impact on build times?

Benjamin Bouvier (Nov 19 2020 at 17:10):

I'm happy to try this, yes!

Benjamin Bouvier (Nov 23 2020 at 11:11):

Do we already have a small, blessed set of benchmarks to get an idea of the performance impact of enabling LTO?

Benjamin Bouvier (Nov 23 2020 at 11:13):

I've looked at the benchmarking rfc, and it seems this is still in flux

Benjamin Bouvier (Nov 23 2020 at 11:39):

I've got some initial measurements on my machine for compile time, at least. My machine is quite beefy (32 cores at 4 GHz), so it might not be representative and we should measure the effect on CI, in particular. All measurements were done after clearing the build cache (cargo clean).

lto off: total build time of 41 seconds
lto thin: total build time of 42 seconds
lto fat: total build time of 1 minute 57 seconds

In terms of throughput: I only have local benchmarks which measure the throughput of generated code, since I've always been working on mostly Cranelift and codegen :-) So the measurement might not be very telling, as there's only very little time spent in initializing the VM and calling into the codegen'd code. On the 4 synthetic benchmarks I've tried, the speedup is in the noise range, for both thin and fat LTO modes.

With valgrind, I measured the retired instruction count for 2 very small benchmarks:

fbench: no LTO: 1660M / thin LTO: 1648M / fat LTO: 1632M
ffbench: no LTO: 1465M / thin LTO: 1452M / fat LTO: 1436M

so a 1.5 to 2% decrease in retired instructions for fat LTO, compared to the baseline.
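
As a rough check of the percentages quoted above, the relative decrease can be recomputed from the fbench and ffbench counts. This is only a minimal sketch: the helper name `percent_decrease` is hypothetical, and the numbers are copied from the message (in millions of retired instructions).

```rust
/// Percentage decrease from `baseline` to `measured` (hypothetical helper).
fn percent_decrease(baseline: f64, measured: f64) -> f64 {
    (baseline - measured) / baseline * 100.0
}

fn main() {
    // (benchmark, no LTO, thin LTO, fat LTO), in millions of retired
    // instructions, as reported in the message above.
    let runs = [
        ("fbench", 1660.0, 1648.0, 1632.0),
        ("ffbench", 1465.0, 1452.0, 1436.0),
    ];
    for (name, off, thin, fat) in runs.iter() {
        println!(
            "{}: thin LTO -{:.2}%, fat LTO -{:.2}%",
            name,
            percent_decrease(*off, *thin),
            percent_decrease(*off, *fat)
        );
    }
}
```

With these inputs, thin LTO comes out at roughly 0.7 to 0.9% fewer retired instructions and fat LTO at roughly 1.7 to 2.0%.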

At this point, I think that:

- we should find benchmarks that are more representative of this use case, that is, that spend more time in the VM itself.
- thin LTO seems to be the nice midpoint, with almost no impact on compile times, and a small decrease in retired instruction counts.
- we should measure the impact on CI to get a better understanding of the total effect on runtime. I'll start this right now; sorry in advance for the noise this will generate for new PRs. (Also, I don't know how predictable the CI runners are...)

Benjamin Bouvier (Nov 23 2020 at 13:12):

And here's some summary of the 3 PRs:

test time isn't affected, because tests run in a different profile. Kind of obvious after the fact, but I didn't realize this before, so worth mentioning!
thin LTO resulted in a build failure on Windows with mingw, making it impossible to get the actual build times.
fat LTO worked, though, so we have an idea of the overhead:
- linux: from 18 minutes (lto=off) to 51 minutes (lto=fat)
- mac: from 16 to 40 minutes
- win: from 31 to 60 minutes
- aarch64: from 23 to 48 minutes
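
To put the CI numbers above on a common scale, the slowdown factor per builder can be computed as the fat-LTO build time divided by the baseline. This is a sketch under the assumption that the listed times are all in minutes; the helper name `slowdown` is hypothetical.

```rust
/// Ratio of the LTO build time to the baseline build time (hypothetical helper).
fn slowdown(baseline_min: f64, lto_min: f64) -> f64 {
    lto_min / baseline_min
}

fn main() {
    // (builder, lto=off, lto=fat), in minutes, as reported in the message above.
    let builders = [
        ("linux", 18.0, 51.0),
        ("mac", 16.0, 40.0),
        ("win", 31.0, 60.0),
        ("aarch64", 23.0, 48.0),
    ];
    for (name, off, fat) in builders.iter() {
        println!("{}: {:.2}x slower with fat LTO", name, slowdown(*off, *fat));
    }
}
```

On these figures the fat-LTO builds take roughly 1.9x (win) to 2.8x (linux) as long as the baseline.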

Benjamin Bouvier (Nov 23 2020 at 13:14):

So, for LTO=fat, given the huge increase in compile times and the relatively low runtime benefits (with the caveat that the benchmarks are synthetic codegen benchmarks), it seems it would be counterproductive to enable it right now. More realistic benchmarks should be used to determine if this is worth it.

For LTO=thin, the mingw failures would need to be investigated and solved first. Locally, the increase in compile times has been unnoticeable, and the effect on run time (same caveat applies) has been very low as well. So it may be fine to postpone it as well.

Benjamin Bouvier (Nov 23 2020 at 13:15):

Overall, it seems that more VM-heavy benchmarks (e.g. WASI benchmarks) would be required to make interesting measurements here, and that enabling LTO would be a perfect testbed for the benchmarking infrastructure in general.

Till Schneidereit (Nov 23 2020 at 13:19):

thank you for looking into this! :heart:

I agree that fat LTO is clearly not viable, which doesn't seem surprising. I also agree that we should have more useful benchmarks to evaluate the rest, including whether it's worth it to sort out the issues with thin-LTO on Windows. Though perhaps @Alex Crichton knows what's going on with those, and how to easily address them?

Alex Crichton (Nov 23 2020 at 15:02):

sure yeah happy to look into mingw issues, but we should be careful with evaluation numbers. LTO shouldn't be 2x slower, but with Cargo it's easy to accidentally build things 2x more than before. I think when just the release profile is changed, it means cargo build --release will probably only share build dependencies with cargo test --release, so there's a huge duplication of artifacts built. Similarly, ThinLTO will probably hit duplicate build issues.

What we probably want to do is to drill down into what we want LTO'd and perhaps do that on a separate builder? That way builders can ideally share a cache.

Till Schneidereit (Nov 23 2020 at 15:39):

ah, that makes sense!

Till Schneidereit (Nov 23 2020 at 15:41):

and makes me think even more that we should tackle this with useful benchmarks in hand that exercise the runtime itself in a meaningful way. Once we do, I guess we could even look into using PGO on published builds, which seems like it could make a meaningful difference

Alex Crichton (Nov 23 2020 at 15:45):

I gave PGO a spin the other day for testing the compile time of a few modules but unfortunately it didn't give really any meaningful difference for me locally

Alex Crichton (Nov 23 2020 at 15:45):

although I was pretty unscientific in my measurements

Till Schneidereit (Nov 23 2020 at 16:08):

yeah, that all seems like it really wants benchmarks to run that exercise enough of the runtime. I'd bet that a lot of stuff one would choose somewhat arbitrarily ends up spending most of its time in Cranelift-compiled code, and thus doesn't really benefit from PGO

Till Schneidereit (Nov 23 2020 at 16:08):

but perhaps you accounted for that?

Alex Crichton (Nov 23 2020 at 16:20):

oh yeah what I was testing was exclusively compile time

Alex Crichton (Nov 23 2020 at 16:20):

no runtime at all

Till Schneidereit (Nov 23 2020 at 16:22):

yeah, I guess compile time is something we should in theory already be well positioned to evaluate, and where naively I would've expected to see PGO make a difference. Oh well ...

Alex Crichton (Nov 23 2020 at 17:41):

also reading just the error message of MinGW, one of the issues with lto is that it builds all the examples with LTO as well, so we're doing the full LTO passes maybe 10-ish times, which as you can imagine increases compile times a lot

Till Schneidereit (Nov 23 2020 at 19:36):

lol

Benjamin Bouvier (Nov 24 2020 at 10:13):

oh well

Last updated: Apr 09 2025 at 11:03 UTC