Stream: general

Topic: enable lto for production builds of wasmtime?


Benjamin Bouvier (Nov 19 2020 at 09:35):

I just noticed that LTO wasn't enabled for wasmtime, would it make sense to enable it at some point, for release builds? (Or is it set automatically under a specific Rust profile?)

Till Schneidereit (Nov 19 2020 at 12:38):

ah, good question! @Alex Crichton, do you see any reason not to enable thin LTO? AFAICT it's probably indeed not enabled through some indirect means?
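
For reference, a minimal sketch of what enabling this could look like in a workspace's root Cargo.toml. The value "thin" is illustrative and is not copied from the wasmtime repository. Per Cargo's documented defaults, the release profile uses lto = false, which only performs thin-local LTO across the codegen units of each individual crate, so cross-crate LTO has to be requested explicitly.

```toml
# Hypothetical addition to the root Cargo.toml; not taken from wasmtime.
[profile.release]
# "thin" requests ThinLTO across the whole crate graph.
# "fat" (or true) requests full LTO; false (the default) keeps thin-local LTO only.
lto = "thin"
```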

Alex Crichton (Nov 19 2020 at 16:02):

The main reason is the ratio of build time to benefit gained; even ThinLTO for the full crate graph can sometimes take a while

Alex Crichton (Nov 19 2020 at 16:03):

mostly because we produce a "full release" on every commit

Alex Crichton (Nov 19 2020 at 16:03):

but other than that should be fine to enable

Till Schneidereit (Nov 19 2020 at 16:24):

ah, cool. @Benjamin Bouvier would you be up for doing an experiment with this as a PR, so we can see the impact on build times?

Benjamin Bouvier (Nov 19 2020 at 17:10):

I'm happy to try this, yes!

Benjamin Bouvier (Nov 23 2020 at 11:11):

Do we already have a small, blessed set of benchmarks to get an idea of the performance impact of enabling LTO?

Benjamin Bouvier (Nov 23 2020 at 11:13):

I've looked at the benchmarking RFC, and it seems this is still in flux

Benjamin Bouvier (Nov 23 2020 at 11:39):

I've got some initial measurements on my machine for compile time, at least. My machine is quite beefy (32 cores at 4 GHz), so it might not be representative and we should measure the effect on CI, in particular. All measurements were done after clearing the build cache (cargo clean).

In terms of throughput: I only have local benchmarks which measure the throughput of generated code, since I've always been working on mostly Cranelift and codegen :-) So the measurement might not be very telling, as there's only very little time spent in initializing the VM and calling into the codegen'd code. On the 4 synthetic benchmarks I've tried, the speedup is in the noise range, for both thin and fat LTO modes.

With valgrind, I measured the retired-instruction count for 2 very small benchmarks:

so a 1.5 to 2% decrease in retired instructions for fat LTO, compared to the baseline.

At this point, I think that:

Benjamin Bouvier (Nov 23 2020 at 13:12):

And here's a summary of the 3 PRs:

Benjamin Bouvier (Nov 23 2020 at 13:14):

So, for LTO=fat, given the huge increase in compile times and the relatively low runtime benefits (with the caveat that the benchmarks are synthetic codegen benchmarks), it seems it would be counterproductive to enable it right now. More realistic benchmarks should be used to determine if this is worth it.

For LTO=thin, the mingw failures would need to be investigated and solved first. Locally the increase in compile times has been unnoticeable, the effect on run time (same caveat applies) has been very low as well. So it may be fine to postpone it as well.

Benjamin Bouvier (Nov 23 2020 at 13:15):

Overall, it seems that more VM-heavy benchmarks (e.g. WASI benchmarks) would be required to make interesting measurements here, and that enabling LTO would be a perfect testbed for the benchmarking infrastructure in general.

Till Schneidereit (Nov 23 2020 at 13:19):

thank you for looking into this! :heart:

I agree that fat LTO is clearly not viable, which doesn't seem surprising. I also agree that we should have more useful benchmarks to evaluate the rest, including whether it's worth it to sort out the issues with thin-LTO on Windows. Though perhaps @Alex Crichton knows what's going on with those, and how to easily address them?

Alex Crichton (Nov 23 2020 at 15:02):

sure yeah happy to look into mingw issues, but we should be careful with evaluation numbers. LTO shouldn't be 2x slower but with Cargo it's easy to accidentally build things 2x more than before. I think when just the release profile is changed then it means cargo build --release will probably only share build dependencies with cargo test --release, so there's a huge duplication of artifacts built. Similarly ThinLTO will probably hit duplicate build issues.

What we probably want to do is to drill down into what we want LTO'd and perhaps do that on a separate builder? That way builders can ideally share a cache.
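
One way this idea could be sketched, assuming a Cargo version with custom named profiles (stabilized in Cargo 1.57, which postdates this thread): define a separate profile that inherits from release and turns on LTO, so that only a dedicated builder pays the cost. The profile name release-lto is hypothetical and does not come from the wasmtime repository.

```toml
# Hypothetical custom profile in the root Cargo.toml.
[profile.release-lto]
# Start from the regular release settings...
inherits = "release"
# ...and enable cross-crate ThinLTO only for this profile.
lto = "thin"
```

A dedicated builder could then run cargo build --profile release-lto, which writes its artifacts to target/release-lto, while ordinary cargo build --release and cargo test --release runs keep using the unchanged release profile.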

Till Schneidereit (Nov 23 2020 at 15:39):

ah, that makes sense!

Till Schneidereit (Nov 23 2020 at 15:41):

and makes me think even more that we should tackle this with useful benchmarks in hand that exercise the runtime itself in a meaningful way. Once we do, I guess we could even look into using PGO on published builds, which seems like it could make a meaningful difference

Alex Crichton (Nov 23 2020 at 15:45):

I gave PGO a spin the other day for testing the compile time of a few modules but unfortunately it didn't really make any meaningful difference for me locally

Alex Crichton (Nov 23 2020 at 15:45):

although I was pretty un-scientific in my measurements

Till Schneidereit (Nov 23 2020 at 16:08):

yeah, that all seems like it really wants benchmarks to run that exercise enough of the runtime. I'd bet that a lot of stuff one would choose somewhat arbitrarily ends up spending most time in Cranelift-compiled code, and thus doesn't really benefit from PGO

Till Schneidereit (Nov 23 2020 at 16:08):

but perhaps you accounted for that?

Alex Crichton (Nov 23 2020 at 16:20):

oh yeah what I was testing was exclusively compile time

Alex Crichton (Nov 23 2020 at 16:20):

no runtime at all

Till Schneidereit (Nov 23 2020 at 16:22):

yeah, I guess compile time is something we should in theory already be well positioned to evaluate, and where naively I would've expected to see PGO make a difference. Oh well ...

Alex Crichton (Nov 23 2020 at 17:41):

also reading just the error message of MinGW, one of the issues with LTO is that it builds all the examples with LTO as well, so we're doing the full LTO passes maybe 10-ish times, which as you can imagine increases compile times a lot

Till Schneidereit (Nov 23 2020 at 19:36):

lol

Benjamin Bouvier (Nov 24 2020 at 10:13):

oh well


Last updated: Jan 24 2025 at 00:11 UTC