CI for sightglass appears to be broken due to using out-of-date actions, so I'm working on a basic update to fix that. I'm wondering if other contributors have thoughts on other critical work or workflow improvements to sightglass that we could prioritize.
I'm testing a fork now that fixes CI but it has been running for over 3 hours :/ https://github.com/posborne/sightglass/pull/1
Some ideas that come to mind to encourage contributions in the near term are probably to:
… the all suite. Ultimately, just using the stock GHA runners is probably not ideal for benchmarking, but near term I think we can probably improve a lot without tackling that.
Something along the lines of https://github.com/bytecodealliance/sightglass/issues/93 looks great to me, but it may be useful to target something even easier to deploy, or something that enables local experimentation and comparison of results. I'm thinking of something that would take the structured output of runs from a few different configurations (for nightly we could just do main of wasmtime plus a few tagged versions) and output results showing percentage differences.
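To make that concrete, here's a minimal sketch of the kind of comparison I have in mind, assuming each configuration's run gets dumped to a JSON file shaped like `[{"benchmark": ..., "cycles": ...}, ...]` (those field names are placeholders for illustration, not sightglass's actual output schema):

```python
#!/usr/bin/env python3
"""Sketch: compare per-benchmark results across two configurations.

Assumes each input file is JSON shaped like
[{"benchmark": "blake3-scalar", "cycles": 131072}, ...]; the field names
are placeholders, not sightglass's actual schema.
"""
import json
import statistics
import sys
from collections import defaultdict


def load(path):
    """Return {benchmark: median cycles} for one configuration's run."""
    per_bench = defaultdict(list)
    with open(path) as f:
        for row in json.load(f):
            per_bench[row["benchmark"]].append(row["cycles"])
    return {name: statistics.median(vals) for name, vals in per_bench.items()}


def main(baseline_path, candidate_path):
    baseline = load(baseline_path)
    candidate = load(candidate_path)
    for name in sorted(baseline.keys() & candidate.keys()):
        delta_pct = 100.0 * (candidate[name] - baseline[name]) / baseline[name]
        print(f"{name:40s} {delta_pct:+7.2f}%")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Running that against a main build and a tagged release would give the percentage-difference table I'm describing; nightly would just repeat it for a handful of configurations.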
Having a nightly runner of some sort that provides trends over time would be great, and we've talked about that / wanted that for five years or so now...
At one point we (BA) were renting a dedicated x86-64 machine on Hetzner because this seemed "coming soon" but we gave that up a year or so ago. I imagine once we have a workable runner it wouldn't be a huge deal to get this going again. The main concern at the time was security -- we didn't want to allow PRs from arbitrary people to spawn execution, and there was a bit of indirection around that IIRC (private repo, ...). A nightly run of main solves that problem at least.
I can't comment much on the topic of CI for the repo itself, but on the topic of continually running benchmarks it's true that security was historically a concern, but bjorn3 showed a way that we can solve this nowadays. That means that it's now possible to actually hook up a custom runner to the public wasmtime repo in a way that doesn't let contributors run arbitrary code on it.
In that sense I agree it'd be awesome to get this up and running, but one of the historic blockers has also been finding someone to help set up and manage the infrastructure. Even if it's "mostly just GitHub Actions", it's still a fair amount to orchestrate.
Yeah, I think there are a lot of improvements we can make without that. Near term, I think that with purely gh-runners we can get useful-enough results by ensuring that, for a given benchmark, the different configurations (probably just different wasmtime versions) are run on the same runner.
This is still far from controlled, but I would take mildly noisy results over no results -- that impact can probably be reduced some if we can incorporate perf counter info in the nightly results, etc.
I think near term wins would be:
https://github.com/bytecodealliance/sightglass/pull/284
@Andrew Brown Copying you on this for additional context on the PR in case you didn't see this thread; hoping to at least have a rough cut of nightly with some kind of easily consumed output later this week.
Ah, cool! Glad to see you're working on this.
Some more context:
Excellent, thanks for the context. I share the reluctance to add a 3rd-party tool or service that requires upkeep. I think what I'll aim for to start is something static that targets gh-pages.
Something like https://github.com/marketplace/actions/continuous-benchmark seems like it could be promising; it supports ingesting a "custom" JSON format, which I should be able to target by doing a transform on the benchmark outputs.
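For illustration, the transform could be quite small; this sketch assumes a placeholder input shape and targets what I believe the action's "customSmallerIsBetter" format looks like (an array of name/unit/value objects -- worth double-checking against its docs):

```python
import json
import statistics
import sys
from collections import defaultdict

# Input shape is a placeholder ([{"benchmark": ..., "cycles": ...}, ...]);
# the output targets my understanding of the action's "customSmallerIsBetter"
# format: [{"name": ..., "unit": ..., "value": ...}, ...].
per_bench = defaultdict(list)
with open(sys.argv[1]) as f:
    for row in json.load(f):
        per_bench[row["benchmark"]].append(row["cycles"])

out = [
    {"name": name, "unit": "cycles", "value": statistics.median(vals)}
    for name, vals in sorted(per_bench.items())
]
json.dump(out, sys.stdout, indent=2)
```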
If set up on a scheduled run, this should allow us to:
… main (depending on signal-to-noise between runs). I'll do that work in my fork for a bit and might request an early review there before upstreaming, since permissions for CI are easier to manage that way until things are ready to go.
At least in my early runs here, I'm seeing quite a range of cycle counts across iterations of benchmark runs for several of the tests. Does this look normal? For example, https://github.com/posborne/sightglass/actions/runs/14499832348/job/40677009588 (look at the "Output Results" JSON).
This is after a transform I'm playing with, but I've verified the "raw" JSON results show similar discrepancies. I just kicked off another run that removes parallel processes to see if that reduces the variability, but something seems suspect. Doing more iterations and/or taking a median over a mean and rejecting outliers might help. What seems suspicious is that the variance is much more pronounced for some benchmarks (though possibly just the ones that are shorter).
I'm going to see if perf counters can work on runners, though I'm not terribly optimistic.
Do you have any platform information (/proc/cpuinfo, etc)? It seems likely to me that the GitHub Runner fleet is heterogeneous and there's no reason to expect that cycle count would be consistent across separate runs because of that
I can probably get that info, but in this case the samples are iterations from the same run on the same node in the same process back-to-back (assuming my understanding of the tool is correct). If the results in that scenario aren't even (reasonably) consistent, it casts doubt on a lot of measurements.
With that in mind, I was seeing similar large deltas when testing locally and looking at the raw results as well (in that case on an M4 MacBook).
Some of the tests will run in parallel on different nodes but I'm not comparing results (at this point) across runners at all.
ah sorry, I missed that. High variation even when running locally is kind of surprising, though we've done work (or specifically @Jamey Sharp did) to document how to get a very low-noise measurement. It involved pinning one core of his laptop at its minimum frequency, disabling sleep/power-saving features I think, and modifying systemd config to keep all other processes off the core. Maybe resteering interrupts too?
All this to say, instruction counts may be good to have as well!
Yeah, that sounds promising. This isn't all tests, but it is the raw data from a gh-runner for an eighth of the tests. It does seem like a consistent theme is that the first iteration is measured as slower. This seems like it could be expected to a small degree (and I would be fine with just discarding that measurement), but I'm seeing a lot that are 30-40% slower, which seems quite big.
The data here only had 4 iterations per benchmark (will be increasing once stable) but gives a little picture.
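For the first-iteration question, I'm experimenting with a filter along these lines (sketch; the input shape is just whatever the transform produces, and the numbers in the example are made up to mirror the ~35% effect I'm seeing):

```python
import statistics


def summarize(cycles, drop_first=True):
    """Summarize one benchmark's per-iteration cycle counts.

    `cycles` is the list of measurements in iteration order; dropping the
    first iteration is the warm-up heuristic discussed above.
    """
    kept = cycles[1:] if drop_first and len(cycles) > 1 else cycles
    median = statistics.median(kept)
    return {
        "n": len(kept),
        "min": min(kept),
        "median": median,
        "mean": statistics.fmean(kept),
        "first_vs_median_pct": 100.0 * (cycles[0] - median) / median,
    }


# Four iterations where the first is ~35% slower than the rest (made-up data).
print(summarize([185_000, 138_000, 135_000, 137_000]))
```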
Wow, 30-40% is a huge swing. I guess it seems likely this could be a noisy-neighbor problem too -- runners have no reason to be tuned for performance consistency, rather they would want to pack jobs as tightly as possible and provide opportunistic performance when available
I've got to look at more runs to gain confidence on a few pieces. In particular:
:sad: No perf events in gh runners, it would appear (unsurprisingly): `Unable to create event group; try setting /proc/sys/kernel/perf_event_paranoid to 2 or below?: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }`
yeah, that frequently comes down to the hypervisor not giving access to the CPU's performance counters, in addition to the Linux kernel running in the guest not supporting it as a knock-on effect
back in the early days of cretonne/lucet I expensed an Intel NUC and we just hid it away in the server room somewhere so we had a place to somewhat reliably run benchmarks
maybe it's still there! lol. but usually if there's been a power outage it requires someone to go into the server room in the Fastly SF office and hit a button, and that stopped being possible in early 2020 for some reason
I was writing Cranelift's aarch64 backend in Feb 2020 and left my RPi4 on my desk in Mozilla's Mountain View office, plugged into power and ethernet, and went home one Friday, then unexpectedly found myself ssh'ing in from Monday onward... that little guy kept going without a single complaint until around June at which point it stopped responding to pings. That sure was a weird period of time. (Thank goodness for actual datacenter control plane + out-of-band stuff.)
I may have to hunt that down; I've got some reqs out to get something like that going for my regular development use, but a standalone node would be useful for getting consistent benchmarks (especially if tuned for reproducible results: frequency scaling pinned, etc.).
Reviewing the precision crate used for cycle measurement, I don't see a huge issue with the number it is grabbing (it uses the rdtscp instruction on x64), and it should help with differences in frequency scaling, but it isn't a silver bullet, obviously.
We did rent a machine on Hetzner for a few years, and I suspect it wouldn't be hard to go back to doing that, though we'd need to arrange budget for it via the TSC etc
(for the purposes of benchmarking, though it was only sporadically used, hence the "used to")
Yeah, for running benchmarks as a one-off I can definitely reserve lab nodes within Fastly, etc. I think I was hoping gh runners would end up being consistent enough to be used for continuous benchmarking, as managing hardware for use by the open source project seems like a bit of a pain. I guess that would also bring us back around to considering something like https://codspeed.io as Andrew mentioned.
I think I'll deliver something on gh as a POC regardless and we can decide how meaningful or meaningless the results are; we can then experiment with hosted runners and go from there if it makes sense. Going that direction may be desired anyhow if/when we want coverage of more ISAs.
if there's a GH-related issue you'd like to pursue, I can facilitate that, as in actually get that done and not some "other version of facilitate".
@Ralph , at least for the moment I don't think there's an issue with GH per se, but I'll keep that in mind or come up with a set of questions to pass on once I have results from a more complete evaluation.
Could I get a squash-and-merge on https://github.com/bytecodealliance/sightglass/pull/284? I've got another change (unrelated to nightly) that I want to queue up for review, adding the ability to request multiple measures on a benchmark run.
Here are some boxplots showing the variability between benchmarks on a single host (in this case an ARM Mac) with 12 iterations per pass and 26 benchmark passes. Some benchmarks appear to vary wildly (e.g. libsodium-box_easy2) in ways that make them unsuitable for comparison usage. Others are short-lived and probably only useful with changes to do more iterations for certain tests (possibly removing instrumentation overhead in some way).
sightglass-benchmark-timings.html
This is really interesting; some of these look somewhat bimodal, e.g. the first one (blake3-scalar) has a bunch of runs around 130-140k cycles but regular outliers. I wonder if that could be due to heterogeneous SoC architectures, i.e., performance+efficiency cores in modern Macs?
I wonder if it might be worth measuring variability with a benchmark that should be absolutely deterministic, too, to isolate system noise from any Sightglass or Wasmtime effects: e.g., a for-loop over 100M or 1B iterations or whatever, with `asm volatile("");` in the middle to avoid optimizing away; if that has more than, say, 0.1% variability in runtime then the system is noisy
(a for-loop in a native program, to be clear)
Hmm, interesting theory on ecore/pcore scheduling. I'll try to get some results on an x86 linux machine with core pinning to see how different things work. On this same machine, I'm collecting results to compare for a somewhat known-overhead case of enabling epoch interrupts.
The noop testing idea could be of use; I'm not sure if it is significantly better than just comparing standard deviation between historical runs to detect the same thing (though, of course, assuming execution variability wasn't somehow introduced by engine behavior, which seems fairly unlikely).
In some environments, we can get the kernel-level preemptions, but I'm guessing for something like github actions runners we could be bumping into hypervisor-level contention.
I've also wondered whether we should be looking more at minimums rather than mean or median. I'm not familiar enough with benchmarking literature to be sure but I feel like I've seen arguments somewhere that it's a reasonable choice for the kind of noise that occurs in benchmarking: most everything that interferes with the thing you're trying to measure can only make it take longer, not make it faster. A while ago I tried to find research answering this question one way or the other without much luck, so I didn't bring it up, but I don't know, maybe it would help here.
Oh, interesting, I really like that; and it makes a lot of intuitive sense at least
as long as there are reliably some runs that hit zero of the bottlenecks, I suppose -- a very noisy machine could have interruptions to every single benchmark run, but more runs reduces the probability of that
I suspect there are some cases where the minimum isn't representative -- e.g. for specific optimizations where we might be more interested in reducing tail performance impacts -- but it doesn't feel like there are too many things that could result in unrepresentative outliers on the minimum side of things (accounting for frequency scaling, as cycle counts should).
I've used the noop.wasm benchmark in the suite for this kind of baselining in the past.
A more principled approach might be fitting to an ex-gaussian distribution instead of a pure gaussian (https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution), which accounts for skew. (I'd encountered that distribution when previously working in neuropsych but never understood before why it was useful…)
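For what it's worth, scipy ships that distribution as `exponnorm`, so fitting it is only a couple of lines. A quick sketch with synthetic data (all numbers made up): the `loc` it recovers is the mean of the gaussian component, i.e. an estimate of the "clean" cost with the exponential noise tail factored out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "benchmark" timings: a tight gaussian core plus an exponential
# right tail, the kind of skew an ex-gaussian model captures.
samples = (rng.normal(loc=135_000, scale=1_500, size=500)
           + rng.exponential(scale=8_000, size=500))

# exponnorm is scipy's exponentially modified gaussian; in its
# parameterization, loc/scale describe the gaussian component and K the
# relative size of the exponential tail.
K, loc, scale = stats.exponnorm.fit(samples)
print(f"K={K:.2f} loc={loc:.0f} scale={scale:.0f}")
```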
Here's a run with fewer passes but doing a comparison of the baseline (epoch interrupts disabled) with epoch interrupts enabled. On some benchmarks the overhead is clear, and on others it is a wash (not totally unexpected, depending on the workload).
For this smaller test run (again, on a laptop I was multitasking on) we do see cases where the epoch-interrupt runs have lower minimums, which probably doesn't reflect reality (though it's probably close enough that we would say these represent no difference).
sightglass-benchmark-timings-compare.html
https://github.com/bytecodealliance/sightglass/pull/286 adds a tool for visualizing and comparing results; I'm using the p25 quantile as the value to compare, but if we reach consensus on something else it should be pretty easy to switch.
Of note, several of the benchmarks in the "all" suite today have a huge standard deviation. In the output, I'm using CV (coefficient of variation) as a way to flag results that likely shouldn't be trusted, as well as for marking "significant" speedups/slowdowns. The PR includes some data from tests I ran against work that @Dan Gohman did to pick up where @Jamey Sharp left off on stack probe changes.
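Roughly, the comparison logic amounts to something like this (a simplified sketch of the idea, not the PR's actual code; the thresholds are placeholders):

```python
import statistics


def p25(samples):
    # statistics.quantiles with n=4 returns the three quartiles; [0] is p25.
    return statistics.quantiles(samples, n=4)[0]


def compare(baseline, candidate, cv_threshold=0.10, delta_threshold=0.02):
    """Compare two sets of cycle counts for one benchmark.

    Results whose coefficient of variation exceeds `cv_threshold` get
    flagged as untrusted; otherwise a p25-vs-p25 delta beyond
    `delta_threshold` is reported as a speedup/slowdown.
    """
    cv = statistics.stdev(baseline) / statistics.fmean(baseline)
    delta = (p25(candidate) - p25(baseline)) / p25(baseline)
    if cv > cv_threshold:
        verdict = "high variance (untrusted)"
    elif abs(delta) < delta_threshold:
        verdict = "no significant change"
    else:
        verdict = "slowdown" if delta > 0 else "speedup"
    return delta, cv, verdict
```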
I'll take a look tomorrow!
@Paul Osborne, I was thinking about that PR after writing some thoughts there. I wanted to expand on the idea that we have (and need) a lot of flexibility when it comes to displaying the benchmark data: since it sounds like you've already been working in the Jupyter ecosystem, perhaps that could be another option for the actual display of the results?
What I mean is, instead of engineering a whole set of charts in gh-pages, what if we simply dump the benchmark results in a branch and use an external tool to examine them, like Jupyter? I see Google hosts Jupyter notebooks -- can they be made publicly accessible? If so, our benchmarking UI story can be "pull the results data from the branch" into a Jupyter notebook. What I like about that approach is that (a) it's "easy" to create new visualizations by cloning a Jupyter notebook, (b) we have all the existing Python/Jupyter ecosystem available to us, and (c) we avoid maintaining/hosting that infrastructure. I believe the upload-elastic command is a start in that direction in that it should be relatively easy to modify it to dump all the relevant result and fingerprint files into a directory instead of uploading them to a server (though that is an option, too).
This is just brainstorming, so feel free to tell me why it's not a good idea. But what I'm getting at here is not so much a Jupyter-specific thing but rather the idea that we use external hosting infrastructure instead (Sheets? In-browser DuckDB? Etc.).
I think with the approach proposed in the PR, we'd get both. The raw nightly benchmark data in csv/json would be available on the branch as well as the generated report. So, anyone wanting to do analysis should be able to just pull the branch and run whatever tooling they want (jupyter/r/etc.).
Pandas can even do read_csv from a URL, which would be compatible with publishing to a branch as well. We can document that workflow and provide some example notebooks if we like.
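e.g. something like this, where the raw URL, branch name, and column names are hypothetical and just show the shape of the workflow:

```python
import pandas as pd

# Hypothetical raw URL for results published to a results branch; the
# repo/branch/path naming here is made up for illustration.
URL = ("https://raw.githubusercontent.com/bytecodealliance/sightglass/"
       "benchmark-results/nightly/latest.csv")

df = pd.read_csv(URL)

# From here it's ordinary pandas, e.g. per-benchmark p25 cycles per engine
# configuration (column names assumed).
summary = (df.groupby(["benchmark", "engine"])["cycles"]
             .quantile(0.25)
             .unstack("engine"))
print(summary.head())
```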