Follow-up question to the conversation about Sightglass in the Cranelift meeting today: how long does it take to run the benchmarks, and what are the hardware requirements, apart from performance counters? @Andy Wortman @Johnnie Birch, can you give input on this?
Asking because I'm wondering if we could run these as part of the normal CI, given that we want to measure instruction counts, not wall clock time, and thus don't need to worry about noise.
Hi @Till Schneidereit, I believe just a few minutes ... 5 minutes to run the tests. What ran was the test and a baseline using native. Note, however, that I made some updates that allow you to download and rebuild a new version of wasmtime for each experiment, and all of that adds to the time. I will take a look later this afternoon to see the state and attempt to run.
yeah running just now on my laptop there was about a minute for release builds of the benchmark programs, another two or so for the benchmarks to run. another minute or two on top for a fresh release build of wasmtime sounds about right?
Ok, that sounds like stuff we could easily do in CI. In particular since we already have the builds. What we don't have are perf counters in GitHub Actions, so either we'd need to run this somewhere else, or change the setup accordingly. @Julian Seward IIUC you suggested we should change how we measure things anyway?
@Till Schneidereit what I proposed to measure was: for compiler run time, just the instruction count. For the generated code, the instruction count, the data read count and the data write count. Measuring wallclock time is so noisy in practice as to be useless. Measuring cache misses etc is also pointless because these depend on the processor implementation (cache sizes, prefetch strategies, other stuff running at the same time) and so will tell us nothing useful.
For the generated code I've suggested including the data read/write counts because those relate directly to decisions the register allocator makes (spilling) and so any regressions in that area should be obvious from those numbers.
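A minimal sketch of how the three proposed counters could be compared between two builds. The `Counters` shape, the field names and the sample numbers are made up for illustration; they are not Sightglass output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counters:
    """Deterministic counters for one benchmark run (hypothetical shape)."""
    instructions: int
    data_reads: int
    data_writes: int


def relative_changes(baseline: Counters, candidate: Counters) -> dict:
    """Return the percent change of each counter relative to the baseline."""
    changes = {}
    for name in ("instructions", "data_reads", "data_writes"):
        old = getattr(baseline, name)
        new = getattr(candidate, name)
        changes[name] = 100.0 * (new - old) / old
    return changes


# Made-up numbers: the instruction count barely moves, but extra spills
# would show up as more data reads and writes.
baseline = Counters(instructions=1_000_000, data_reads=200_000, data_writes=100_000)
candidate = Counters(instructions=1_005_000, data_reads=220_000, data_writes=115_000)

for name, pct in relative_changes(baseline, candidate).items():
    print(f"{name}: {pct:+.1f}%")
```

With numbers like these, a spilling regression stands out in the read/write counts even though the instruction count changes by well under 1%.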
Wouldn't measuring only instruction count hide the pipelining and other architectural effects of CPUs, though? And if so, would measuring both values yield better trends?
(umm) when you say "both", what is the second value that you mention?
Time, in addition to instruction count.
Ah, ok. Well, we could measure time too; but my experience with doing that has mostly been bad. Eg, for the 1%-sized icount changes that we were dealing with during the RA tuning phase, we could not have done that work if we'd used only run times.
perf.rust-lang.org allows people to select a number of different data sets: cpu-clock, cycles:u, faults, instructions:u, max-rss, task-clock, wall-time. It'd be good to learn more about what people find most useful of those.
In my experience, even with a dedicated setup like Rust's, wall-clock time is almost always not useful
It's only useful when adding parallelism to show that increased instructions have decreased wall time
It's worth collecting because it's easy, but it's rarely the main statistic when measuring changes
But it's also important to recognize that instruction count is only a proxy for time, and it isn't always a good one. In practice, though, I think it's been really consistent with Rust
as an additional data point, i've also seen inconsistencies between instruction count and wallclock time with regalloc.rs (e.g. on two runs, one has lower icount but higher wallclock, in a consistent manner), so having both gives a slightly better idea of what's going on.
(would surely be nice to note the total number of calls to malloc, if that's not too hard to get from e.g. callgrind)
I can believe that (about icounts vs wallclock times).
One thing we could also maybe collect is cache misses; maybe those can partly account for situations where the insn count goes down but the wallclock time goes up.
One thing I'd recommend is to collect everything that's reasonable for profiling-over-time
it's rare that one metric fits all situations/improvements/regressions
but having multiple allows you to correlate various things if necessary
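A rough sketch of the kind of correlation that collecting several metrics makes possible: flagging runs where instruction count and wall-clock time move in different directions. The function name and the 2% wall-clock noise threshold are assumptions for illustration, not values anyone in this thread measured.

```python
def classify(icount_pct: float, wall_pct: float, wall_noise_pct: float = 2.0) -> str:
    """Compare percent changes in instruction count and wall-clock time.

    wall_noise_pct is an assumed noise floor: wall-clock changes smaller
    than this are treated as indistinguishable from run-to-run noise.
    """
    if abs(wall_pct) <= wall_noise_pct:
        return "wall-clock within noise"
    # Wall-clock moved beyond noise: check whether icount moved the same way.
    if icount_pct == 0 or (icount_pct > 0) != (wall_pct > 0):
        return "metrics disagree"
    return "metrics agree"


# Hypothetical (icount %, wall-clock %) changes between two runs.
for icount_pct, wall_pct in [(-1.0, 0.5), (-1.0, 5.0), (3.0, 4.0)]:
    print(icount_pct, wall_pct, "->", classify(icount_pct, wall_pct))
```

Runs classified as "metrics disagree" are the ones where a second metric (cache misses, malloc calls) would be worth looking at.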
FWIW, sometimes the instruction count does not change even if runtime increases by 100%: https://bugzilla.mozilla.org/show_bug.cgi?id=1649109. That bug is about a Spectre mitigation, but it could have been about a change in code generation strategy that was thought to be innocuous b/c branch prediction would paper over it. Insn count is a good starting point for assessing the quality of the compiler's output (and has the virtue of having a stable meaning over time and being independent of the microarchitecture), but microarchitectural effects mean insn count alone can be highly misleading in at least some cases.
Last updated: Jan 24 2025 at 00:11 UTC