Stream: git-wasmtime

Topic: wasmtime / PR #12693 Don't panic in duration underflow


view this post on Zulip Wasmtime GitHub notifications bot (Feb 27 2026 at 23:32):

ConradIrwin opened PR #12693 from ConradIrwin:fix-interval-duration to bytecodealliance:main:

In debug mode we assert that timing tokens are dropped in order, but in a
release build an out-of-order drop can make the duration subtraction
underflow and panic.

Closes #12692

view this post on Zulip Wasmtime GitHub notifications bot (Feb 27 2026 at 23:32):

ConradIrwin requested cfallin for a review on PR #12693.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 27 2026 at 23:32):

ConradIrwin requested wasmtime-compiler-reviewers for a review on PR #12693.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 28 2026 at 00:00):

cfallin commented on PR #12693:

Thanks for your report and this PR!

I don't think any of the current contributors have touched this code in a while, so we may not have a full understanding of its invariants, but I'm surprised that "tokens are dropped out of order" -- do you have a reproduction of that? (I see the companion issue you filed said you had no repro, but maybe you do with this PR?)

In any case, I'm happy to merge it as a conservative fix but I'd like to understand if there's an underlying invariant violation that it would mask first, too.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 28 2026 at 00:00):

cfallin commented on PR #12693:

(As an aside, I'm also surprised you're seeing this in (presumably production) telemetry -- is the timing infra enabled in your configuration?)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 28 2026 at 02:32):

github-actions[bot] added the label cranelift on PR #12693.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 28 2026 at 05:33):

ConradIrwin commented on PR #12693:

@cfallin no reproduction, just a somewhat surprising crash showed up in our telemetry.

I think timings are logged if RUST_LOG=debug is exported (regardless of build settings), which seems plausible for a Zed user working on Rust.

Claude's (very hand-wavy) explanation for the out of order timing was that as these things happen on different threads, the drop ordering was not guaranteed; but not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

The other potential options are something weird with instants (e.g. this purported issue https://github.com/zed-industries/zed/issues/46636); or given this has only happened once, a particularly targeted cosmic ray.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 28 2026 at 06:21):

cfallin commented on PR #12693:

Claude's (very hand-wavy) explanation for the out of order timing was that as these things happen on different threads, the drop ordering was not guaranteed; but not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

OK, that's fortunately not possible -- Claude must not have seen the thread_local! on the relevant data structure here, such that compilations in different threads never interact. Drop order should be purely a function of the (hopefully deterministic!) pass driver structure in one compilation in one thread.

I'll take a closer look at this code and try to understand its invariants next week -- in the meantime, any further info you can gather about repros or conditions this is seen under are of course appreciated. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 02 2026 at 16:13):

alexcrichton commented on PR #12693:

This is perhaps one way to reproduce this:

    #[test]
    fn test() {
        let _a = process_file();
        drop(canonicalize_nans());
        take_current().total();
    }

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible to happen within Wasmtime/Cranelift though I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset timing data when starting to compile a function and restoring that once compilation is finished.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 02 2026 at 17:13):

ConradIrwin commented on PR #12693:

https://github.com/zed-industries/zed/issues/46636 seems to be hitting this too (but on a linux realtime kernel). They claim there are known bugs with Instant subtraction on a realtime linux kernel (but I can't find any other references to this).

The other thing I notice is that Cranelift makes heavy use of catch_unwind; but unfortunately Zed's panic handler acts more like abort on panic. I assume from the comments that the catch_unwind is intended as a backup mechanism (not the primary mechanism) to prevent problems.

In terms of concurrency; we run a bunch of rayon stuff ourselves, and I'm pretty sure we can call wasmtime compile multiple times in parallel. I'm not familiar with how rayon manages thread pools but if it has a global thread pool then interleaving between different compiles seems possible in the way we use it.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 02 2026 at 17:22):

cfallin commented on PR #12693:

The other thing I notice is that Cranelift makes heavy use of catch_unwind

Not really? The only two calls to that function in cranelift/ are in the filetest runner and bugpoint, two separate CLI utilities not involved with the main compilation flow.

I think that Alex is onto the right hunch here with take_current -- I will try to get to this today.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 15:30):

cfallin commented on PR #12693:

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible to happen within Wasmtime/Cranelift though I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset timing data when starting to compile a function and restoring that once compilation is finished.

I've read through the logic in wasmtime-internal-cranelift now and I'm not seeing how we could invoke take_current() in the middle of a compilation (or nest two compilations) -- there are no async functions or await points here, there's no recursion, and take_current is only called once at the toplevel after invoking Cranelift.

I strongly believe that this PR's approach of using a saturating subtraction is wrong: it is hiding an actual bug deeper in the logic by just ignoring a mismatch.

We could tighten up some assertions, e.g. have a generation number in PassTimes so that if it is prematurely reset and then a drop-token later increments, we catch that properly. If something in Zed or Wasmtime is "holding it wrong" that wouldn't resolve the actual bug but it would help to find it. @ConradIrwin any progress on getting a reproducer?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 16:16):

ConradIrwin commented on PR #12693:

@cfallin None from my side. That said, the reporter in https://github.com/zed-industries/zed/issues/46636 seems to be able to reproduce on demand (albeit under different circumstances). He thinks this is an expected bug in Rust's time calculations as described in https://github.com/rust-lang/rust/issues/56612, but feel free to reach out directly.

Looking closer at telemetry, in the last few weeks we've seen this overflow 14 times on Linux, and 3 times on Windows (no macOS). (Prior to that it was also happening definitely on Linux and probably elsewhere, but we made some changes to crash grouping so it's hard to know for sure); this is across roughly O(10,000,000) app opens in the same time period, so it's not exactly common.

(also my mistake on the catch_unwind; I was seeing it in the stack trace inside Rayon, which propagates it by default anyway)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 16:20):

ConradIrwin commented on PR #12693:

If you wanted to update the assertions to try and catch things earlier, I'm willing to ship it to zed to help out.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 18:45):

cfallin commented on PR #12693:

OK, I've put up #12709 for this. I'll go ahead and close this PR if you don't mind since I don't think the "hide the issue by ignoring underflow" approach is what we want.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2026 at 18:45):

cfallin closed without merge PR #12693.


Last updated: Mar 23 2026 at 16:19 UTC