alexcrichton opened issue #7623:
This GitHub Actions failure happened on the 15.0.0 release branch yesterday. The bug there appears to be a timeout in the builder where the log ends with:
...
test parking_spot::tests::parking_lot::unpark_one_one_fast ... ok
test parking_spot::tests::parking_lot::unpark_all_one_fast ... ok
test parking_spot::tests::atomic_wait_notify has been running for over 60 seconds
Error: The operation was canceled.
In the absence of any other information, it appears that the atomic_wait_notify test deadlocked and then eventually timed out when GitHub Actions let it run for 6 hours.
I've been staring at the test and the implementation of ParkingSpot and unfortunately I haven't been able to come up with anything.
One thing I have noticed is that blocking is done with Rust's standard Condvar, which is documented as allowing spurious wakeups. I think this means that a thread can "steal" a wakeup notification meant for another. However, I haven't been able to construct a theoretical trace that leads to deadlock. Additionally, I can't say with any certainty that this is an actual issue, since I'm not certain of the precise concurrent behaviors allowed here.
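For readers unfamiliar with the concern, here is a minimal sketch of the usual way to tolerate spurious wakeups from std::sync::Condvar: re-check a predicate in a loop after every wakeup. The Permits type and the run_waiters helper are hypothetical names made up for illustration. This is not the ParkingSpot implementation and does not reproduce the suspected deadlock.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical counter of wakeup "permits" (illustration only,
/// not Wasmtime's ParkingSpot).
struct Permits {
    available: Mutex<usize>,
    cond: Condvar,
}

impl Permits {
    fn new() -> Self {
        Permits {
            available: Mutex::new(0),
            cond: Condvar::new(),
        }
    }

    /// Block until a permit is available, then consume it.
    fn wait(&self) {
        let mut available = self.available.lock().unwrap();
        // Re-check the predicate after every wakeup: a spurious wakeup
        // (or one that lost the race for the last permit) sees zero
        // permits and simply goes back to sleep.
        while *available == 0 {
            available = self.cond.wait(available).unwrap();
        }
        *available -= 1;
    }

    /// Publish `n` permits and wake all waiters.
    fn notify(&self, n: usize) {
        let mut available = self.available.lock().unwrap();
        *available += n;
        drop(available);
        self.cond.notify_all();
    }
}

/// Spawn `n` waiting threads, release `n` permits, and return how many
/// threads completed their wait.
fn run_waiters(n: usize) -> usize {
    let permits = Arc::new(Permits::new());
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let p = Arc::clone(&permits);
            thread::spawn(move || {
                p.wait();
                1usize
            })
        })
        .collect();
    permits.notify(n);
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("woken: {}", run_waiters(4));
}
```

Because each wakeup is tied to state held under the mutex rather than to the notification itself, a waiter that wakes without a permit cannot consume a wakeup intended for another thread.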
bjorn3 commented on issue #7623:
Does this deadlock reproduce using loom?
alexcrichton commented on issue #7623:
I was curious myself! I rewrote the test with loom (had to remove the usage of scoped threads), and it's been running for 12+ hours and so far hasn't found an issue. Loom docs said I should run with LOOM_MAX_PREEMPTIONS={2,3} to get "most bugs out of the way" and the LOOM_MAX_PREEMPTIONS=3 run took ~12 hours and found no issues.
So to answer your question, so far no, but it's still running. Also I'm a bit suspicious about spurious wakeups here and my guess is that loom probably doesn't model spurious wakeups from Condvar, so it may not reproduce in loom after all.
alexcrichton closed issue #7623.
Last updated: Nov 22 2024 at 16:03 UTC