Stream: git-wasmtime

Topic: wasmtime / issue #7623 Likely concurrency bug/deadlock in...


Wasmtime GitHub notifications bot (Dec 01 2023 at 15:12):

alexcrichton opened issue #7623:

This GitHub Actions failure happened on the 15.0.0 release branch yesterday. The bug there appears to be a timeout in the builder where the log ends with:

...
test parking_spot::tests::parking_lot::unpark_one_one_fast ... ok
test parking_spot::tests::parking_lot::unpark_all_one_fast ... ok
test parking_spot::tests::atomic_wait_notify has been running for over 60 seconds
Error: The operation was canceled.

In the absence of any other information, it appears that the atomic_wait_notify test deadlocked and the job was eventually canceled after GitHub Actions let it run for 6 hours.

I've been staring at the test and the implementation of ParkingSpot and I unfortunately haven't been able to come up with anything.

One thing I have noticed is that blocking is done with Rust's standard Condvar, which is documented as allowing spurious wakeups. I think this means that a thread can "steal" a wakeup notification meant for another. However, I haven't been able to construct a theoretical trace that leads to deadlock. I also can't say with any certainty that this is an actual issue, since I'm not sure exactly which concurrent behaviors are allowed here.
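
For reference, the usual guard against spurious wakeups with std::sync::Condvar is to re-check a predicate under the mutex in a loop, so that a wakeup without a matching notification just goes back to sleep. The following is a minimal sketch of that pattern only; wait_for_flag and notify_flag are hypothetical names for illustration and are not taken from Wasmtime's ParkingSpot implementation.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical helper (not Wasmtime code): block until the flag is set,
/// then consume it. The `while` loop re-checks the predicate after every
/// wakeup, so a spurious wakeup simply waits again.
fn wait_for_flag(pair: &(Mutex<bool>, Condvar)) {
    let (lock, cvar) = pair;
    let mut notified = lock.lock().unwrap();
    while !*notified {
        // `wait` releases the mutex while blocked and re-acquires it
        // before returning, whether the wakeup was real or spurious.
        notified = cvar.wait(notified).unwrap();
    }
    // Consume the notification so it cannot be observed twice.
    *notified = false;
}

/// Hypothetical helper (not Wasmtime code): set the flag under the mutex,
/// then wake one waiter.
fn notify_flag(pair: &(Mutex<bool>, Condvar)) {
    let (lock, cvar) = pair;
    *lock.lock().unwrap() = true;
    cvar.notify_one();
}

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));

    let waiter = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || wait_for_flag(&pair))
    };

    // Works whether this runs before or after the waiter starts blocking,
    // because the waiter checks the flag before it ever calls `wait`.
    notify_flag(&pair);
    waiter.join().unwrap();
    println!("waiter observed a real notification");
}
```

Because the state lives in the mutex-protected flag and not in the wakeup itself, this pattern tolerates both spurious wakeups and a notification that arrives before the waiter blocks. Whether ParkingSpot's bookkeeping has an equivalent guarantee is exactly the open question in this issue.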

Wasmtime GitHub notifications bot (Dec 01 2023 at 15:26):

bjorn3 commented on issue #7623:

Does this deadlock reproduce using loom?

Wasmtime GitHub notifications bot (Dec 01 2023 at 15:34):

alexcrichton commented on issue #7623:

I was curious myself! I rewrote the test with loom (had to remove the usage of scoped threads), and it's been running for 12+ hours and so far hasn't found an issue. Loom docs said I should run with LOOM_MAX_PREEMPTIONS={2,3} to get "most bugs out of the way" and the LOOM_MAX_PREEMPTIONS=3 run took ~12 hours and found no issues.

So to answer your question: so far no, but it's still running. I'm also a bit suspicious about spurious wakeups here, and my guess is that loom probably doesn't model spurious wakeups from Condvar, so it may not reproduce in loom after all.

Wasmtime GitHub notifications bot (Dec 05 2023 at 17:07):

alexcrichton closed issue #7623.


Last updated: Nov 22 2024 at 16:03 UTC