Stream: cranelift

Topic: Sightglass benchmarks


view this post on Zulip Till Schneidereit (Jun 29 2020 at 16:55):

Follow-up question to the conversation about Sightglass in the Cranelift meeting today: how long does it take to run the benchmarks, and what are the hardware requirements, apart from performance counters? @Andy Wortman @Johnnie Birch, can you give input on this?

view this post on Zulip Till Schneidereit (Jun 29 2020 at 16:56):

Asking because I'm wondering if we could run these as part of the normal CI, given that we want to measure instruction counts, not wall clock time, and thus don't need to worry about noise.

view this post on Zulip Johnnie Birch (Jun 29 2020 at 17:33):

Hi @Till Schneidereit, I believe just a few minutes ... around 5 minutes to run the tests. What ran was the test plus a baseline using native code. Note, however, that I made some updates that let you download and rebuild a new version of wasmtime for each experiment, and all of that adds to the time. I will take a look later this afternoon to see what state it's in and attempt a run.

view this post on Zulip iximeow (Jun 29 2020 at 18:05):

yeah, running just now on my laptop it was about a minute for release builds of the benchmark programs, and another two or so for the benchmarks to run. another minute or two on top for a fresh release build of wasmtime sounds about right?

view this post on Zulip Till Schneidereit (Jun 29 2020 at 19:28):

Ok, that sounds like stuff we could easily do in CI, in particular since we already have the builds. What we don't have are perf counters in GitHub Actions, so either we'd need to run this somewhere else, or change the setup accordingly. @Julian Seward IIUC you suggested we should change how we measure things anyway?

view this post on Zulip Julian Seward (Jun 30 2020 at 08:59):

@Till Schneidereit what I proposed to measure was: for compiler run time, just the instruction count. For the generated code, the instruction count, the data read count and the data write count. Measuring wallclock time is so noisy in practice as to be useless. Measuring cache misses etc is also pointless because these depend on the processor implementation (cache sizes, prefetch strategies, other stuff running at the same time) and so will tell us nothing useful.
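
A minimal sketch of how such counts could be collected on Linux, assuming `perf` is available and shelling out to `perf stat` in CSV mode: `instructions:u` is the standard userspace instruction counter, while `L1-dcache-loads`/`L1-dcache-stores` are assumed here as a common proxy for data reads and writes (event availability and exact semantics vary by CPU), and `./bench` is a placeholder for the benchmark binary.

```rust
use std::process::Command;

/// Run a benchmark under `perf stat -x,` (CSV report on stderr) and
/// collect (event name, count) pairs for the requested events.
fn measure(bench: &str) -> Vec<(String, u64)> {
    let events = "instructions:u,L1-dcache-loads:u,L1-dcache-stores:u";
    let out = Command::new("perf")
        .args(["stat", "-x,", "-e", events, "--", bench])
        .output()
        .expect("failed to run perf");
    // Each CSV line looks like: <count>,<unit>,<event>,<runtime>,...
    String::from_utf8_lossy(&out.stderr)
        .lines()
        .filter_map(|line| {
            let mut fields = line.split(',');
            let count: u64 = fields.next()?.trim().parse().ok()?;
            let event = fields.nth(1)?.to_string(); // skip the unit field
            Some((event, count))
        })
        .collect()
}

fn main() {
    for (event, count) in measure("./bench") {
        println!("{event}: {count}");
    }
}
```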

view this post on Zulip Julian Seward (Jun 30 2020 at 09:00):

For the generated code I've suggested including the data read/write counts because those relate directly to decisions the register allocator makes (spilling) and so any regressions in that area should be obvious from those numbers.

view this post on Zulip Benjamin Bouvier (Jun 30 2020 at 09:25):

Wouldn't measuring only instruction count hide the pipelining and other architectural effects of CPUs, though? And if so, would measuring both values yield better trends?

view this post on Zulip Julian Seward (Jun 30 2020 at 09:34):

(umm) when you say "both", what is the second value that you mention?

view this post on Zulip Benjamin Bouvier (Jun 30 2020 at 09:40):

Time, in addition to instruction count.

view this post on Zulip Julian Seward (Jun 30 2020 at 09:43):

Ah, ok. Well, we could measure time too; but my experience with doing that has mostly been bad. Eg, for the 1%-sized icount changes that we were dealing with during the RA tuning phase, we could not have done that work if we'd used only run times.

view this post on Zulip Till Schneidereit (Jun 30 2020 at 09:53):

perf.rust-lang.org allows people to select a number of different data sets: cpu-clock, cycles:u, faults, instructions:u, max-rss, task-clock, wall-time. It'd be good to learn more about which of those people find most useful
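
Most of that list maps directly onto perf events. A sketch of gathering the perf-countable subset in one run, with wall-time measured around the child process (max-rss, not shown, would come from getrusage; `./bench` is a placeholder):

```rust
use std::process::Command;
use std::time::Instant;

// cpu-clock, faults and task-clock are perf software events;
// cycles:u and instructions:u are hardware events.
fn main() {
    let start = Instant::now();
    let status = Command::new("perf")
        .args([
            "stat",
            "-e",
            "cpu-clock,cycles:u,faults,instructions:u,task-clock",
            "--",
            "./bench",
        ])
        .status()
        .expect("failed to run perf");
    println!("wall-time: {:?} (exit: {status})", start.elapsed());
}
```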

view this post on Zulip Alex Crichton (Jun 30 2020 at 13:35):

In my experience, even with a dedicated setup like rust's, wall clock time is almost always not useful

view this post on Zulip Alex Crichton (Jun 30 2020 at 13:35):

It's only useful when adding parallelism, to show that increased instructions come with decreased wall time

view this post on Zulip Alex Crichton (Jun 30 2020 at 13:35):

It's worth collecting because it's easy, but it's rarely the main statistic when measuring changes

view this post on Zulip Alex Crichton (Jun 30 2020 at 13:36):

But it's also important to recognize that instruction counting is just a proxy for time, and it isn't always a faithful one. In practice, though, I think it's been really consistent with rust

view this post on Zulip Benjamin Bouvier (Jun 30 2020 at 13:39):

As an additional data point, I've also seen inconsistencies between instruction count and wallclock time with regalloc.rs (e.g. between two runs, one has a lower icount but a higher wallclock time, in a consistent manner), so having both gives a slightly better idea of what's going on.

view this post on Zulip Benjamin Bouvier (Jun 30 2020 at 13:40):

(would surely be nice to note the total number of calls to malloc, if that's not too hard to get from e.g. callgrind)
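
One low-effort way to get that number, assuming Valgrind is installed: even the default tool (memcheck) prints a total allocation count in its heap summary, so no callgrind output parsing is strictly needed. A sketch, with `./bench` again a placeholder:

```rust
use std::process::Command;

// Run the benchmark under Valgrind's default tool (memcheck) and pick
// out the heap summary from stderr, which includes the total number of
// allocations, e.g.:
//   ==1234== total heap usage: 1,024 allocs, 1,024 frees, 87,040 bytes allocated
fn main() {
    let out = Command::new("valgrind")
        .arg("./bench")
        .output()
        .expect("failed to run valgrind");
    for line in String::from_utf8_lossy(&out.stderr).lines() {
        if let Some(rest) = line.split("total heap usage:").nth(1) {
            println!("heap summary:{rest}");
        }
    }
}
```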

view this post on Zulip Julian Seward (Jun 30 2020 at 13:42):

I can believe that (about icounts vs wallclock times).

view this post on Zulip Julian Seward (Jun 30 2020 at 13:43):

One thing we could also maybe collect is cache misses; maybe those can partly account for situations where the insn count goes down but the wallclock time goes up.
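
If cache misses were added to the set, they would be one more event in the same kind of run; `cache-references` and `cache-misses` are generic perf aliases whose exact hardware mapping varies by CPU (again a sketch, `./bench` a placeholder):

```rust
use std::process::Command;

// Collect cache statistics alongside a run; per the caveat above,
// these numbers are implementation-dependent.
fn main() {
    Command::new("perf")
        .args(["stat", "-e", "cache-references,cache-misses", "--", "./bench"])
        .status()
        .expect("failed to run perf");
}
```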

view this post on Zulip Alex Crichton (Jun 30 2020 at 14:09):

One thing I'd recommend is to collect everything that's reasonable for profiling-over-time

view this post on Zulip Alex Crichton (Jun 30 2020 at 14:09):

it's rare that one metric fits all situations/improvements/regressions

view this post on Zulip Alex Crichton (Jun 30 2020 at 14:09):

but having multiple allows you to correlate various things if necessary

view this post on Zulip Lars Hansen (Jul 02 2020 at 09:13):

FWIW, sometimes the instruction count does not change even though runtime increases by 100%: https://bugzilla.mozilla.org/show_bug.cgi?id=1649109. That bug is about a Spectre mitigation, but it could have been about a change in code generation strategy that was thought to be innocuous because branch prediction would paper over it. Insn count is a good starting point for assessing the quality of the compiler's output (and has the virtue of having a stable meaning over time and being independent of the microarchitecture), but microarchitectural effects mean that insn count alone can be highly misleading in at least some cases.


Last updated: Nov 22 2024 at 16:03 UTC