Stream: git-wasmtime

Topic: wasmtime / PR #12709 Cranelift: robustify timing infrastr...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin opened PR #12709 from cfallin:cranelift-timing-asserts-and-instant-backstop to bytecodealliance:main:

In #12692, it was observed that the computation of time spent in nested timing spans, minus child time, was underflowing.

Correct operation of the handling of nested spans depends on the invariant that the accumulated time for child spans is less than or equal to a parent span once timing is completed. This property should hold as long as the system clock is monotonic, and as long as timing tokens are dropped in-order, so that the elapsed time of a parent truly is computed after the elapsed time of a child ends.

The timing state may also temporarily violate this invariant whenever a token is pending (still on stack and timing): the child time of any completed child spans will be counted, but the parent has not yet been. Hence, taking a snapshot of the state and computing "parent minus children" on that snapshot may observe cases that yield underflow.

This PR makes the infrastructure more robust along a few different dimensions:

This should address any theoretically possible sources of #12692, as far as I can tell.

<!--
Please make sure you include the following information:

Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.html

Please ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin requested alexcrichton for a review on PR #12709.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin requested wasmtime-compiler-reviewers for a review on PR #12709.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

alexcrichton created PR review comment:

I'm not sure that this is the best approach to working around this myself -- there's a bunch more discussion on https://github.com/rust-lang/rust/issues/56612 which led to me making https://github.com/rust-lang/rust/pull/56988 for the Rust standard library which ended up getting morally reverted in https://github.com/rust-lang/rust/pull/89926. It seems like the Rust standard library coalesced around "saturating arithmetic is the least bad solution".

In the end I feel like we definitely need to do something here to avoid downstream panics because that's the worst of all worlds -- a slightly inaccurate and/or buggy timer taking down the rest of the system is a bit surprising. I'm tempted to go the least-amount-of-extra-maintenance route of the saturating arithmetic in retrospect because trying to reproduce/test anything else is more-or-less impossible because it's such a spurious failure in such niche configurations.

Given that WDYT of basically removing invocations of panicking methods (e.g. a - b) and going with that?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

cfallin created PR review comment:

It looks like the main reason for the rollback in rust-lang/rust#89926 was "high worst-case latency and jitter of Instant::now() due to explicit synchronization; see https://github.com/rust-lang/rust/pull/83093 for benchmarks, the worst-case overhead is > 100x", whereas this PR keeps a thread-local with last-now (possible because all timing spans are thread-local) so should avoid the main reason for the rollback, right?

In general, philosophically (maybe this is my weirdness as a compiler-brained person), I like to get the low-level primitives right so we can rely on them, rather than building bandaids or "loose code" above it that can tolerate weirdness. The timing code here is built around the invariant of properly nested spans and I'd be concerned that even if we avoid panics by being permissive in the "self minus child" computations, we may still see weird/inconsistent/nonsensical data in anything above this layer that also processes the timing data. I'd rather have a coherent view of the world wrt timing spans and then propagate that upward.

Given that that was your initial instinct with rust-lang/rust#56988 as well, hopefully that is preferable as long as we know we're avoiding the downside?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton created PR review comment:

I agree with the philosophy yeah, but my experience so far with clocks have been "monotonicity is a lie at scale" where the fix of trying harder to assume monotonicity often ends up being worse than the alternative, assuming clocks aren't actually monotonic.

That being said this is such a minor part of Cranelift I don't really think it's worth all that much debate over. Timing data isn't really critical for correctness at all and it'll be weird if clocks run backwards, but if clocks run backwards even weirder things happen so I'm less worried about that myself.

Overall if you're comfortable with this I think it's fine to land, we can deal with the fallout if/when it ever comes, which it very well may not.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

cfallin created PR review comment:

Yep, fair, may very well be a lie at scale, but I guess it still seems to me that since the duration measurement is fully local to this module, at this point in time, in this thread, and the timestamps don't escape, then we actually can enforce the invariant (with this logic around this thread-local)? In any case, yep, I want to see if this resolves the issue that downstream is having -- happy to revisit if it's a problem later.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

cfallin added PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. to the merge queue

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 22:14):

github-merge-queue[bot] removed PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. from the merge queue

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 22:26):

cfallin commented on PR #12709:

CI failure on merge queue in the OpenVino-on-Windows test:

    Permission denied (os error 2)
> using cached artifact: D:\a\wasmtime\wasmtime\target\debug\build\wasmtime-wasi-nn-23e0bb9516bdddbc\out\fixtures\model.onnx

I'll try landing again to see if it's spurious. cc @jlb6740 if you have seen this before or have any other ideas (if it becomes a repeating issue we'll have to disable the test)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 22:26):

cfallin added PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. to the merge queue

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):

cfallin merged PR #12709.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):

cfallin removed PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. from the merge queue


Last updated: Mar 23 2026 at 16:19 UTC