wasmtime / PR #12709 Cranelift: robustify timing infrastr... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #12709 Cranelift: robustify timing infrastr...

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin opened PR #12709 from cfallin:cranelift-timing-asserts-and-instant-backstop to bytecodealliance:main:

In #12692, it was observed that the computation of time spent in nested timing spans, minus child time, was underflowing.

Correct operation of the handling of nested spans depends on the invariant that the accumulated time for child spans is less than or equal to a parent span once timing is completed. This property should hold as long as the system clock is monotonic, and as long as timing tokens are dropped in-order, so that the elapsed time of a parent truly is computed after the elapsed time of a child ends.

The timing state may also temporarily violate this invariant whenever a token is pending (still on stack and timing): the child time of any completed child spans will be counted, but the parent has not yet been. Hence, taking a snapshot of the state and computing "parent minus children" on that snapshot may observe cases that yield underflow.

This PR makes the infrastructure more robust along a few different dimensions:

It hardens the clock source we use to have a locally-ensured guarantee of monotonicity, since we rely on this for logical correctness. In particular, for each thread (since timing spans never move between threads), we track the last Instant that was used by the timing infrastructure, and use that value (zero time passed) if the system clock moves backward.

It hardens the assert about pass-timing token drop order from a debug_assert to an assert. If this invariant is being violated, we want to know about it noisily, rather than observing a subtraction underflow or other inconsistency later.

It adds an assert in take_current() to ensure that a snapshot is never taken when any pass timing is in progress.

This should address any theoretically possible sources of #12692, as far as I can tell.

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin requested alexcrichton for a review on PR #12709.

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

cfallin requested wasmtime-compiler-reviewers for a review on PR #12709.

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

alexcrichton created PR review comment:

I'm not sure that this is the best approach to working around this myself -- there's a bunch more discussion on https://github.com/rust-lang/rust/issues/56612 which led to me making https://github.com/rust-lang/rust/pull/56988 for the Rust standard library which ended up getting morally reverted in https://github.com/rust-lang/rust/pull/89926. It seems like the Rust standard library coalesced around "saturating arithmetic is the least bad solution".

In the end I feel like we definitely need to do something here to avoid downstream panics because that's the worst of all worlds -- a slightly inaccurate and/or buggy timer taking down the rest of the system is a bit surprising. I'm tempted to go the least-amount-of-extra-maintenance route of the saturating arithmetic in retrospect because trying to reproduce/test anything else is more-or-less impossible because it's such a spurious failure in such niche configurations.

Given that WDYT of basically removing invocations of panicking methods (e.g. a - b) and going with that?

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

cfallin created PR review comment:

It looks like the main reason for the rollback in rust-lang/rust#89926 was "high worst-case latency and jitter of Instant::now() due to explicit synchronization; see https://github.com/rust-lang/rust/pull/83093 for benchmarks, the worst-case overhead is > 100x", whereas this PR keeps a thread-local with last-now (possible because all timing spans are thread-local) so should avoid the main reason for the rollback, right?

In general, philosophically (maybe this is my weirdness as a compiler-brained person), I like to get the low-level primitives right so we can rely on them, rather than building bandaids or "loose code" above it that can tolerate weirdness. The timing code here is built around the invariant of properly nested spans and I'd be concerned that even if we avoid panics by being permissive in the "self minus child" computations, we may still see weird/inconsistent/nonsensical data in anything above this layer that also processes the timing data. I'd rather have a coherent view of the world wrt timing spans and then propagate that upward.

Given that that was your initial instinct with rust-lang/rust#56988 as well, hopefully that is preferable as long as we know we're avoiding the downside?

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton created PR review comment:

I agree with the philosophy yeah, but my experience so far with clocks have been "monotonicity is a lie at scale" where the fix of trying harder to assume monotonicity often ends up being worse than the alternative, assuming clocks aren't actually monotonic.

That being said this is such a minor part of Cranelift I don't really think it's worth all that much debate over. Timing data isn't really critical for correctness at all and it'll be weird if clocks run backwards, but if clocks run backwards even weirder things happen so I'm less worried about that myself.

Overall if you're comfortable with this I think it's fine to land, we can deal with the fallout if/when it ever comes, which it very well may not.

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

cfallin created PR review comment:

Yep, fair, may very well be a lie at scale, but I guess it still seems to me that since the duration measurement is fully local to this module, at this point in time, in this thread, and the timestamps don't escape, then we actually can enforce the invariant (with this logic around this thread-local)? In any case, yep, I want to see if this resolves the issue that downstream is having -- happy to revisit if it's a problem later.

CI failure on merge queue in the OpenVino-on-Windows test:
    Permission denied (os error 2)
> using cached artifact: D:\a\wasmtime\wasmtime\target\debug\build\wasmtime-wasi-nn-23e0bb9516bdddbc\out\fixtures\model.onnx
I'll try landing again to see if it's spurious. cc @jlb6740 if you have seen this before or have any other ideas (if it becomes a repeating issue we'll have to disable the test)

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:26):

cfallin added PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. to the merge queue

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):

cfallin merged PR #12709.

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):

cfallin removed PR #12709 Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs. from the merge queue

Last updated: Jun 01 2026 at 09:49 UTC

Stream: git-wasmtime

Topic: wasmtime / PR #12709 Cranelift: robustify timing infrastr...

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

Wasmtime GitHub notifications bot (Mar 03 2026 at 18:44):

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:08):

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

Wasmtime GitHub notifications bot (Mar 03 2026 at 20:18):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:42):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

Wasmtime GitHub notifications bot (Mar 03 2026 at 21:50):

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:14):

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:26):

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:26):

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):

Wasmtime GitHub notifications bot (Mar 03 2026 at 22:49):