Stream: wasmtime

Topic: Sightglass Improvements Discussion


view this post on Zulip Paul Osborne (Apr 10 2025 at 20:26):

CI for sightglass appears to be broken due to using out of date actions, so I'm working on a basic update to improve that. I'm wondering if other contributors have thoughts on other critical work or workflow improvements to sightglass that we could prioritize.

I'm testing a fork now that fixes CI but it has been running for over 3 hours :/ https://github.com/posborne/sightglass/pull/1

Some ideas that come to mind to encourage contributions in the near term are probably to:

  1. Reduce CI times by setting up parallel jobs to run the "all" suite.
  2. Possibly remove prohibitively long-running benchmarks from PR requirements, or otherwise tune this (libsodium seems to be the culprit here).
  3. Introduce nightly CI to run the full benchmark suite if we remove some benchmarks from the normal PR flow.

Ultimately, just using the stock gha runners is probably not ideal for benchmarking, but near term I think we can improve a lot without tackling that.

Something along the lines of https://github.com/bytecodealliance/sightglass/issues/93 looks great to me, but it may be useful to target something even easier to deploy, or to enable local experimentation and comparison of results. I'm thinking of something that would take the structured output of runs from a few different configurations (for nightly we could just do main of wasmtime plus a few tagged versions) and output results showing percentage differences (rough sketch below).
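
A minimal sketch of the kind of comparison I mean, assuming each configuration's run has been dumped to a JSON file that is a flat list of records with `name` and `cycles` fields (placeholders; the real sightglass output schema differs in its details):

```python
import json
import statistics
import sys

def load(path):
    """Aggregate per-iteration samples by benchmark name (median per benchmark)."""
    with open(path) as f:
        records = json.load(f)          # assumed shape: [{"name": ..., "cycles": ...}, ...]
    per_bench = {}
    for rec in records:
        per_bench.setdefault(rec["name"], []).append(rec["cycles"])
    return {name: statistics.median(vals) for name, vals in per_bench.items()}

def main(baseline_path, candidate_path):
    baseline, candidate = load(baseline_path), load(candidate_path)
    for name in sorted(baseline.keys() & candidate.keys()):
        pct = (candidate[name] - baseline[name]) / baseline[name] * 100
        print(f"{name:40s} {pct:+6.1f}%")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```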

I'd also like to make changes to just use stable rust in more cases where nightly features don't seem to be required, and to trial CI against the fork first to get things working.
In order to report performance results based on PRs, we talked about implementing an HTTP server (e.g. in crates/server) that would: listen for incoming POST requests that contain JSON with the PR ...

view this post on Zulip Chris Fallin (Apr 10 2025 at 20:30):

Having a nightly runner of some sort that provides trends over time would be great, and we've talked about that / wanted that for five years or so now...

At one point we (BA) were renting a dedicated x86-64 machine on Hetzner because this seemed "coming soon" but we gave that up a year or so ago. I imagine once we have a workable runner it wouldn't be a huge deal to get this going again. The main concern at the time was security -- we didn't want to allow PRs from arbitrary people to spawn execution and there was a bit of indirection around that IIRC (private repo, ...) A nightly run of main solves that problem at least.

view this post on Zulip Alex Crichton (Apr 10 2025 at 20:40):

I can't comment much on the topic of CI for the repo itself, but on the topic of continually running benchmarks it's true that security was historically a concern, but bjorn3 showed a way that we can solve this nowadays. That means that it's now possible to actually hook up a custom runner to the public wasmtime repo in a way that doesn't let contributors run arbitrary code on it.

In that sense I agree it'd be awesome to get this up and running, but one of the historic blockers as well has been someone to help set up and manage the infrastructure. Even if it's "mostly just github actions" it's still a fair amount to orchestrate

view this post on Zulip Paul Osborne (Apr 10 2025 at 20:48):

Yeah, I think there's a lot of improvements we can make without that. Near term, I think with purely gh-runners we can get useful enough results by ensuring that a single benchmark for different configurations (probably just different wasmtime versions) is run on the same runner.

This is still far from controlled, but I would take mildly noisy results over no results -- that impact can probably be reduced some if we can incorporate perf counter info in the nightly results, etc.

I think near term wins would be:

  1. Make it easier to run a comparative benchmark of a wasmtime branch vs. baseline on a dev/maintainer's machine.
  2. Provide something relatively reliable that we can look at to see if we have a major perf improvement/regression on main compared to some previous release.

view this post on Zulip Paul Osborne (Apr 14 2025 at 22:59):

https://github.com/bytecodealliance/sightglass/pull/284

This first set of changes includes the following: Removes outdated github actions recipes that are no longer functioning. Updates the benchmark flow to complete much more quickly by: Don't tr...

view this post on Zulip Paul Osborne (Apr 15 2025 at 16:19):

@Andrew Brown Copying you on this for additional context on the PR in case you didn't see this thread; hoping to at least have a rough cut of nightly with some kind of easily consumed output later this week.

view this post on Zulip Andrew Brown (Apr 15 2025 at 16:54):

Ah, cool! Glad to see you're working on this.

view this post on Zulip Andrew Brown (Apr 15 2025 at 17:08):

Some more context:

A benchmark suite and tool to compare different implementations of the same primitives. - abrown/sightglass
CodSpeed integrates into dev and CI workflows to measure performance, detect regressions, and enable actionable optimizations.

view this post on Zulip Paul Osborne (Apr 15 2025 at 18:07):

Excellent, thanks for the context. I share the reluctance to add a 3rd party tool or service that requires upkeep. I think what I'll aim for to start is something static that targets gh-pages.

Something like https://github.com/marketplace/actions/continuous-benchmark seems like it could be promising; it supports ingesting a "custom" json format, which I should be able to target by doing a transform on the benchmark outputs (sketch below).
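
For reference, the transform could be as small as something like this: the input record fields are assumptions about sightglass's JSON output, and the output targets the action's "custom" entries of name/unit/value.

```python
import json
import statistics
import sys

def to_custom_format(results):
    """Collapse per-iteration cycle counts into one {name, unit, value} entry
    per benchmark, which the continuous-benchmark action can ingest."""
    per_bench = {}
    for rec in results:                  # assumed fields: "name", "cycles"
        per_bench.setdefault(rec["name"], []).append(rec["cycles"])
    return [
        {"name": name, "unit": "cycles", "value": statistics.median(cycles)}
        for name, cycles in sorted(per_bench.items())
    ]

if __name__ == "__main__":
    # usage: python transform.py < raw-results.json > bench.json
    json.dump(to_custom_format(json.load(sys.stdin)), sys.stdout, indent=2)
```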

If set up on a scheduled run, this should allow us to:

I'll do that work in my fork for a bit and might request an early review there before upstreaming, as permissions for CI are easier to manage that way until things are ready to go.

Continuous Benchmark using GitHub pages as dash board for keeping performance

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:21):

At least in my early runs here, I'm seeing quite a range of cycle counts across iterations of benchmark runs for several of the tests. Does this look normal? For example, https://github.com/posborne/sightglass/actions/runs/14499832348/job/40677009588 (look at the "Output Results" JSON).

This is after a transform I'm playing with, but I've verified the "raw" JSON results show similar discrepancies. I just kicked off another run that removes parallel processes to see if that reduces the variability, but something seems suspect. Doing more iterations and/or taking a median over a mean while rejecting outliers might help (sketched below). What seems suspicious is that the variance is much more pronounced for some benchmarks (though possibly just the ones that are shorter).
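
For example, a simple robust aggregation over one benchmark's per-iteration cycle counts could look like this (the numbers are purely illustrative):

```python
import statistics

def robust_summary(cycles):
    """Median after dropping IQR outliers; illustrative, not tuned."""
    q1, _, q3 = statistics.quantiles(cycles, n=4)
    iqr = q3 - q1
    kept = [c for c in cycles if q1 - 1.5 * iqr <= c <= q3 + 1.5 * iqr]
    return statistics.median(kept)

# the outlying first iteration (~40% slower) gets dropped before summarizing
samples = [182_000, 131_000, 129_500, 130_200, 131_400, 130_800, 129_900, 131_100]
print(robust_summary(samples))   # -> 130800
```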

A benchmark suite and tool to compare different implementations of the same primitives. - Periodic Benchmark Run · posborne/sightglass@bc4f149

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:24):

I'm going to see if perf counters can work on runners, though I'm not terribly optimistic.

view this post on Zulip Chris Fallin (Apr 16 2025 at 20:25):

Do you have any platform information (/proc/cpuinfo, etc)? It seems likely to me that the GitHub Runner fleet is heterogeneous and there's no reason to expect that cycle count would be consistent across separate runs because of that

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:32):

I can probably get that info, but in this case the samples are iterations from the same run on the same node in the same process, back-to-back (assuming my understanding of the tool is correct). If the results in that scenario aren't even (reasonably) consistent, it casts doubt on a lot of measurements.

With that in mind, I was seeing similar large deltas when testing locally and looking at the raw results as well (in that case on an m4 macbook).

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:33):

Some of the tests will run in parallel on different nodes but I'm not comparing results (at this point) across runners at all.

view this post on Zulip Chris Fallin (Apr 16 2025 at 20:34):

ah sorry, I missed that. High variation even when running locally is kind of surprising, though we've done work (or specifically @Jamey Sharp did) to document how to get a very low-noise measurement. It involved pinning one core of his laptop at its minimum frequency, disabling sleep/power-saving features I think, and modifying systemd config to keep all other processes off the core. Maybe resteering interrupts too?

view this post on Zulip Chris Fallin (Apr 16 2025 at 20:35):

All this to say, instruction counts may be good to have as well!

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:40):

Yeah, that sounds promising. This isn't all tests, but here is the raw data from a gh-runner for an eighth of the tests. It does seem like a consistent theme is that the first iteration is measured as slower. This seems like it could be expected to a small degree (and I would be fine with just discarding this measurement), but I'm seeing a lot that are 30-40% slower, which seems quite big.

The data here only had 4 iterations per benchmark (I'll increase that once things are stable) but it gives a little picture.

benchmark-results 2.json

view this post on Zulip Chris Fallin (Apr 16 2025 at 20:42):

Wow, 30-40% is a huge swing. I guess it seems likely this could be a noisy-neighbor problem too -- runners have no reason to be tuned for performance consistency, rather they would want to pack jobs as tightly as possible and provide opportunistic performance when available

view this post on Zulip Paul Osborne (Apr 16 2025 at 20:54):

I've got to look at more runs to gain confidence on a few pieces. In particular:

  1. Are there certain benchmarks that are more prone to seeing varied results (beyond just that shorter-lived ones can expect to see larger percentage variance), or is it across all of them?
  2. Do we see the same variance looking at wall time, or could there be some issue with the "cycles" counting (I haven't looked at the impl)?

:sad: No perf-events in gh runners, it would appear (unsurprisingly): Unable to create event group; try setting /proc/sys/kernel/perf_event_paranoid to 2 or below?: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }

view this post on Zulip Pat Hickey (Apr 16 2025 at 21:01):

yeah that frequently comes down to the hypervisor not giving access to the CPU's performance counters, in addition to the linux kernel running in the guest not supporting it as a knock-on effect

view this post on Zulip Pat Hickey (Apr 16 2025 at 21:02):

back in the early days of cretonne/lucet I expensed an Intel NUC and we just hid it away in the server room somewhere so we had a place to somewhat reliably run benchmarks

view this post on Zulip Pat Hickey (Apr 16 2025 at 21:03):

maybe it's still there! lol. but usually if there's been a power outage it requires someone to go into the server room in the Fastly SF office and hit a button, and that stopped being possible in early 2020 for some reason

view this post on Zulip Chris Fallin (Apr 16 2025 at 21:15):

> stopped being possible in early 2020 for some reason

I was writing Cranelift's aarch64 backend in Feb 2020 and left my RPi4 on my desk in Mozilla's Mountain View office, plugged into power and ethernet, and went home one Friday, then unexpectedly found myself ssh'ing in from Monday onward... that little guy kept going without a single complaint until around June at which point it stopped responding to pings. That sure was a weird period of time. (Thank goodness for actual datacenter control plane + out-of-band stuff.)

view this post on Zulip Paul Osborne (Apr 16 2025 at 21:16):

I may have to hunt that down; I've got some reqs out to get something like that going for my regular development use, but a standalone node would be useful for getting consistent benchmarks (especially if configured for reproducible results: pinning CPU frequency, disabling power-saving features, etc.).

Reviewing the precision crate used for cycle measurement, I don't see a huge issue with the number it is grabbing (uses the rdtscp instruction on x64) and it should help with differences in frequency scaling but isn't a silver bullet, obviously.

view this post on Zulip Chris Fallin (Apr 16 2025 at 21:16):

We did rent a machine on Hetzner for a few years, and I suspect it wouldn't be hard to go back to doing that, though we'd need to arrange budget for it via the TSC etc

view this post on Zulip Chris Fallin (Apr 16 2025 at 21:17):

(for the purposes of benchmarking, though it was only sporadically used, hence the "used to")

view this post on Zulip Paul Osborne (Apr 16 2025 at 21:23):

Yeah, for running benchmarks as a one-off I can definitely reserve lab nodes within fastly, etc. I think I was hoping gh runners would end up being consistent enough to be used for continuous benchmarking, as managing hardware for use by the open source project seems like a bit of a pain. I guess that would also bring us back around to considering something like https://codspeed.io as Andrew mentioned.

I think I'll deliver something on gh as a POC regardless and we can decide how meaningful or meaningless the results are; we can then experiment with hosted runners and go from there if it makes sense. Going that direction may be desired anyhow if/when we want coverage of more ISAs.


view this post on Zulip Ralph (Apr 17 2025 at 12:58):

if there's a GH-related issue you'd like to pursue, I can facilitate that, as in actually get that done and not some "other version of facilitate".

view this post on Zulip Paul Osborne (Apr 17 2025 at 19:48):

@Ralph, at least for the moment I don't think there's an issue with GH per se, but I'll keep that in mind or come up with a set of questions to pass on once I have results from a more complete evaluation.

view this post on Zulip Paul Osborne (Apr 17 2025 at 19:50):

Could I get a squash-and-merge on https://github.com/bytecodealliance/sightglass/pull/284? I've got another change (unrelated to nightly) that I want to queue up for review, adding the ability to request multiple measures on a benchmark run.


view this post on Zulip Paul Osborne (Apr 22 2025 at 18:18):

Here are some boxplots showing the variability between benchmarks on a single host (in this case an arm mac) with 12 iterations per pass and 26 benchmark passes. Some benchmarks appear to vary wildly (e.g. libsodium-box_easy2) in ways that make them unsuitable for comparison purposes. Others are short-lived and probably only useful with changes to do more iterations for certain tests (possibly removing instrumentation overhead in some way).

sightglass-benchmark-timings.html

view this post on Zulip Chris Fallin (Apr 22 2025 at 18:41):

This is really interesting; some of these look somewhat bimodal, e.g. the first one (blake3-scalar) has a bunch of runs around 130-140k cycles but regular outliers. I wonder if that could be due to heterogeneous SoC architectures, i.e., performance+efficiency cores in modern Macs?

view this post on Zulip Chris Fallin (Apr 22 2025 at 18:43):

I wonder if it might be worth measuring variability with a benchmark that should be absolutely deterministic, too, to isolate system noise from any Sightglass or Wasmtime effects: e.g., a for-loop over 100M or 1B iterations or whatever, with asm volatile(""); in the middle to avoid optimizing away; if that has more than say 0.1% variability in runtime then the system is noisy

view this post on Zulip Chris Fallin (Apr 22 2025 at 18:43):

(a for-loop in a native program, to be clear)

view this post on Zulip Paul Osborne (Apr 22 2025 at 18:49):

Hmm, interesting theory on ecore/pcore scheduling. I'll try to get some results on an x86 linux machine with core pinning to see how different things work. On this same machine, I'm collecting results to compare for a somewhat known-overhead case of enabling epoch interrupts.

view this post on Zulip Paul Osborne (Apr 22 2025 at 18:57):

The noop testing idea could be of use; I'm not sure if it is significantly better than just comparing standard deviation between historical runs to detect the same thing (though, of course, assuming execution variability wasn't somehow introduced by engine behavior, which seems fairly unlikely).

In some environments we can measure kernel-level preemptions, but I'm guessing that for something like github actions runners we could be bumping into hypervisor-level contention.

view this post on Zulip Jamey Sharp (Apr 22 2025 at 19:55):

I've also wondered whether we should be looking more at minimums rather than mean or median. I'm not familiar enough with benchmarking literature to be sure but I feel like I've seen arguments somewhere that it's a reasonable choice for the kind of noise that occurs in benchmarking: most everything that interferes with the thing you're trying to measure can only make it take longer, not make it faster. A while ago I tried to find research answering this question one way or the other without much luck, so I didn't bring it up, but I don't know, maybe it would help here.
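
A quick toy simulation of that intuition (the "true" cost and the noise scale are made up): if interference can only add time, the minimum of repeated runs converges toward the true cost while the mean stays biased upward.

```python
import random
import statistics

random.seed(0)
TRUE_COST = 100_000                      # hypothetical noise-free cycle count

def one_run():
    # interference only ever adds time; exponential noise with mean 5_000
    return TRUE_COST + random.expovariate(1 / 5_000)

for n in (5, 20, 100):
    runs = [one_run() for _ in range(n)]
    print(f"n={n:3d}  min={min(runs):9.0f}  mean={statistics.mean(runs):9.0f}")
```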

view this post on Zulip Chris Fallin (Apr 22 2025 at 19:57):

Oh, interesting, I really like that; and it makes a lot of intuitive sense at least

view this post on Zulip Chris Fallin (Apr 22 2025 at 19:58):

as long as there are reliably some runs that hit zero of the bottlenecks, I suppose -- a very noisy machine could have interruptions to every single benchmark run, but more runs reduces the probability of that

view this post on Zulip Paul Osborne (Apr 22 2025 at 20:03):

I suspect there are some cases where the minimum isn't representative, e.g. for specific optimizations where we're more interested in reducing tail performance impacts, but it doesn't feel like there are too many things (accounting for frequency scaling, as cycle counts should) that could result in unrepresentative outliers on the minimum side of things.

view this post on Zulip Andrew Brown (Apr 22 2025 at 20:10):

I've used the noop.wasm benchmark in the suite for this kind of baselining in the past.

view this post on Zulip Jamey Sharp (Apr 22 2025 at 20:10):

A more principled approach might be fitting to an ex-gaussian distribution instead of a pure gaussian (https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution), which accounts for skew. (I'd encountered that distribution when previously working in neuropsych but never understood before why it was useful…)

The CDF is

$$
\Phi(x, \mu, \sigma) - \frac{1}{2}\exp\!\left[\frac{\lambda}{2}\left(2\mu + \lambda\sigma^{2} - 2x\right)\right]\operatorname{erfc}\!\left(\frac{\mu + \lambda\sigma^{2} - x}{\sqrt{2}\,\sigma}\right)
$$

where $\Phi(x, \mu, \sigma)$ is the Gaussian CDF and $\operatorname{erfc}$ is the complementary error function.
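
If we wanted to try that, scipy already ships this distribution as scipy.stats.exponnorm (parameterized by K = 1/(σλ)), so fitting it to a set of per-iteration timings is only a few lines; the sample data below is synthetic.

```python
import numpy as np
from scipy.stats import exponnorm

rng = np.random.default_rng(0)
# synthetic samples: a gaussian "true cost" core plus an exponential noise tail
samples = rng.normal(130_000, 1_500, size=200) + rng.exponential(4_000, size=200)

K, loc, scale = exponnorm.fit(samples)
lam = 1 / (K * scale)              # recover lambda from scipy's K = 1/(sigma*lambda)
print(f"mu~{loc:.0f}  sigma~{scale:.0f}  lambda~{lam:.2e}")
# loc estimates the gaussian component's center; the raw mean is pulled up by the tail
print(f"fitted mu={loc:.0f} vs raw mean={samples.mean():.0f}")
```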

view this post on Zulip Paul Osborne (Apr 22 2025 at 20:29):

Here's a run with fewer passes but doing a comparison of baseline (epoch interrupts disabled) with epoch interrupts enabled. On some benchmarks the overhead is clear, and on others it is a wash (not totally unexpected depending on the workload).

For this smaller test run (again, on a laptop I was multitasking on) we do see cases where the epoch-interrupt runs have minimums that are lower than the baseline, which probably doesn't represent reality (though it's probably close enough that we would say these represent no difference).

sightglass-benchmark-timings-compare.html

view this post on Zulip Paul Osborne (Apr 30 2025 at 22:52):

https://github.com/bytecodealliance/sightglass/pull/286 adds a tool for visualizing and comparing results; I'm using the p25 quantile as the value to compare, but if we have consensus on something else it should be pretty easy to switch.

Of note, several of the benchmarks in the "all" suite today have a huge standard deviation. In the output, I'm using the CV (coefficient of variation) both to flag results that likely shouldn't be trusted and for marking "significant" speedups/slowdowns (illustrated below). The PR includes some data from tests I did against work that @Dan Gohman did to pick up where @Jamey Sharp left off on stack probe changes.
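
For concreteness, the two statistics reduce to something like this; the 5% CV cutoff here is just an illustrative threshold, not what the PR hardcodes.

```python
import numpy as np

def summarize(cycles):
    cycles = np.asarray(cycles, dtype=float)
    p25 = np.percentile(cycles, 25)            # value used for comparisons
    cv = cycles.std(ddof=1) / cycles.mean()    # coefficient of variation
    return p25, cv

p25, cv = summarize([131_000, 129_500, 130_200, 182_000, 131_400, 130_700])
flag = " (high variance, treat with suspicion)" if cv > 0.05 else ""
print(f"p25={p25:.0f}  CV={cv:.1%}{flag}")
```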

This tool is extracted from some early work I did in a jupyter notebook to analyze and compare results relative to a baseline benchmark run. The generated artifact is a single html file with both...

view this post on Zulip Andrew Brown (May 01 2025 at 02:38):

I'll take a look tomorrow!

view this post on Zulip Andrew Brown (May 02 2025 at 00:02):

@Paul Osborne, I was thinking about that PR after writing some thoughts there. I wanted to expand on the idea that we have (and need) a lot of flexibility when it comes to displaying the benchmark data: since it sounds like you've already been working in the Jupyter ecosystem, perhaps that could be another option for the actual display of the results?

What I mean is, instead of engineering a whole set of charts in gh-pages, what if we simply dump the benchmark results in a branch and use an external tool to examine them, like Jupyter? I see Google hosts Jupyter notebooks--can they be made publicly accessible? If so, our benchmarking UI story can be "pull the results data from the branch" into a Jupyter notebook. What I like about that approach is that (a) it's "easy" to create new visualizations by cloning a Jupyter notebook, (b) we have all the existing Python/Jupyter ecosystem available to us, and (c) we avoid maintaining/hosting that infrastructure. I believe the upload-elastic command is a start in that direction, in that it should be relatively easy to modify it to dump all the relevant result and fingerprint files into a directory instead of uploading them to a server (though that is an option, too).

This is just brainstorming, so feel free to tell me why it's not a good idea. But what I'm getting at here is not so much a Jupyter-specific thing but rather the idea that we use external hosting infrastructure instead (Sheets? In-browser DuckDB? Etc.).

view this post on Zulip Paul Osborne (May 02 2025 at 16:23):

I think with the approach proposed in the PR, we'd get both. The raw nightly benchmark data in csv/json would be available on the branch as well as the generated report. So, anyone wanting to do analysis should be able to just pull the branch and run whatever tooling they want (jupyter/r/etc.).

Pandas can even do read_csv from a URL, which would be compatible with publishing to a branch as well. We can document this workflow and provide some example notebooks if we like.
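
e.g., something along these lines; the branch name, file path, and column names are hypothetical, just to show the shape of the workflow:

```python
import pandas as pd

# hypothetical location of nightly results published to a results branch
URL = ("https://raw.githubusercontent.com/bytecodealliance/sightglass/"
       "benchmark-results/nightly.csv")

df = pd.read_csv(URL)
# hypothetical columns: benchmark, wasmtime_version, cycles
summary = (
    df.groupby(["benchmark", "wasmtime_version"])["cycles"]
      .quantile(0.25)
      .unstack()
)
print(summary.head())
```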

