rylev opened issue #6811:
Moving streams to async has introduced a race condition. The `AsyncWriteStream` often gets dropped before the receiver inside the spawned task has time to receive a message. This makes sense, as there's nothing to tie the spawned task's completion to the `Store` that contains the `AsyncWriteStream`. It's very easy to drop the `Store` before any handlers are allowed to run to completion.

Test Case
Take the following "hello world" case:
```rust
use wasmtime::{
    component::{Component, Linker},
    Config, Engine, Store,
};
use wasmtime_wasi::preview2::{Table, WasiCtx, WasiCtxBuilder, WasiView};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    config.wasm_component_model(true).async_support(true);
    let engine = Engine::new(&config)?;
    let component = Component::from_file(&engine, "example.wasm")?;

    let mut table = Table::new();
    let ctx = WasiCtxBuilder::new().inherit_stdio().build(&mut table)?;
    let mut store = Store::new(&engine, Data { table, ctx });

    let mut linker = Linker::new(&engine);
    wasmtime_wasi::preview2::command::add_to_linker(&mut linker)?;

    let instance = linker.instantiate_async(&mut store, &component).await?;
    let run = instance.get_typed_func::<(), ()>(&mut store, "hello")?;
    run.call_async(&mut store, ()).await?;
    Ok(())
}

struct Data {
    table: Table,
    ctx: WasiCtx,
}

impl WasiView for Data {
    fn table(&self) -> &wasmtime_wasi::preview2::Table {
        &self.table
    }
    fn table_mut(&mut self) -> &mut wasmtime_wasi::preview2::Table {
        &mut self.table
    }
    fn ctx(&self) -> &wasmtime_wasi::preview2::WasiCtx {
        &self.ctx
    }
    fn ctx_mut(&mut self) -> &mut wasmtime_wasi::preview2::WasiCtx {
        &mut self.ctx
    }
}
```
The `example.wasm` file is just a simple wasm component with one export, `hello`, which uses Rust's `println!` to print "Hello, World!" to stdout. This technically works, but only by coincidence: if you look at when "Hello, World!" gets flushed to stdout, it happens after the spawned task handling the IO is aborted in the `Drop` handler.
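To make the failure mode concrete, here is a minimal standalone sketch of the same abort-on-drop pattern (the `AbortOnDropWriter` type and its channel layout are invented for illustration; they are not wasmtime's actual implementation):

```rust
// A writer that forwards bytes to a spawned task over a channel and aborts
// that task on drop can lose whatever is still queued in the channel.
use tokio::io::AsyncWriteExt;
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

struct AbortOnDropWriter {
    sender: mpsc::Sender<Vec<u8>>,
    join_handle: JoinHandle<()>,
}

impl AbortOnDropWriter {
    fn new() -> Self {
        let (sender, mut receiver) = mpsc::channel::<Vec<u8>>(8);
        let join_handle = tokio::spawn(async move {
            let mut stdout = tokio::io::stdout();
            while let Some(bytes) = receiver.recv().await {
                let _ = stdout.write_all(&bytes).await;
                let _ = stdout.flush().await;
            }
        });
        Self { sender, join_handle }
    }

    async fn write(&self, bytes: Vec<u8>) {
        let _ = self.sender.send(bytes).await;
    }
}

impl Drop for AbortOnDropWriter {
    fn drop(&mut self) {
        // The problematic step: the worker may not yet have received (let
        // alone flushed) everything that was sent before the abort.
        self.join_handle.abort();
    }
}

#[tokio::main]
async fn main() {
    let writer = AbortOnDropWriter::new();
    writer.write(b"line one\n".to_vec()).await;
    writer.write(b"line two\n".to_vec()).await;
    drop(writer); // races with the worker; "line two" is often lost
}
```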
Steps to Reproduce
You can reproduce this by running the above example with a local checkout of `wasmtime`. Simply add a logging statement to the drop handler of `AsyncWriteStream` and you'll see that the drop (and thus the cancellation of the spawned task) happens before the output reaches stdout.

I've found this to be reliably reproducible in a user-visible way if the wasm binary prints more than one line of text to stdout. The first line usually gets successfully flushed, but the second does not. Adding something to slow down the execution of the runtime eliminates the bug (e.g., adding `tokio::time::sleep(std::time::Duration::from_millis(1)).await;` after the function call gets rid of the race condition).

I first caught this bug when upgrading Spin to the latest version of `wasmtime`. The exported function would complete and the containing `Store` would be dropped before the underlying `AsyncWriteStream`'s spawned task had a chance to receive the bytes to be written to the inner `AsyncWrite` type. Only every ~10th invocation would see the spawned task actually process the bytes before the `AsyncWriteStream`'s drop handler canceled the task.

Expected Results
I would expect there to be some mechanism for ensuring that ongoing async writes complete before the `Store` is dropped.

Versions and Environment

This is the case on the latest version of `wasmtime` as of this writing.
rylev added the bug label to Issue #6811.
pchickey assigned issue #6811 to pchickey.
pchickey commented on issue #6811:
Thanks for the report! @alexcrichton and @elliottt and I had talked about this hypothetical but I hadn't realized how trivial it is to trip over. We'll work on a mechanism to hold onto the `JoinHandle`s created inside a `Store` beyond the lifetime of the `Store`.
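A minimal sketch of what holding onto those handles could look like, assuming a hypothetical `BackgroundTasks` collection owned by the host (this is not wasmtime's actual API):

```rust
use tokio::task::JoinHandle;

// Hypothetical host-side collection of tasks spawned on the store's behalf.
#[derive(Default)]
struct BackgroundTasks {
    handles: Vec<JoinHandle<()>>,
}

impl BackgroundTasks {
    /// Record a task spawned inside the store so it can outlive it.
    fn push(&mut self, handle: JoinHandle<()>) {
        self.handles.push(handle);
    }

    /// Wait for all recorded background work to finish.
    async fn finish(self) {
        for handle in self.handles {
            // A panicked or cancelled task surfaces as an Err; ignore it.
            let _ = handle.await;
        }
    }
}

#[tokio::main]
async fn main() {
    let mut tasks = BackgroundTasks::default();
    tasks.push(tokio::spawn(async {
        // Stand-in for an AsyncWriteStream worker flushing output.
        tokio::time::sleep(std::time::Duration::from_millis(1)).await;
        println!("flushed");
    }));
    // ... run the guest, drop the Store ...
    tasks.finish().await; // all recorded background work has completed
}
```

The embedder would pull this collection out of the store's data before dropping the `Store` and await it, so the drop no longer races with in-flight writes.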
rylev commented on issue #6811:
Thinking about this more, I think the right thing to do is to just remove the call to abort inside `AsyncWriteStream`'s drop impl. I don't see why an immediate abort would be necessary, and if there is such a use case, it seems like it should be opt-in.

If we remove the call to `abort`, then the spawned task will still cleanly exit: either the call to `recv` will return `None` because the channel is empty and the sender has been dropped, or a normal expected exit occurs here or here.

The only additional thing I think needs to change is what happens when the `result_sender` errors because the channel is closed. I believe this shouldn't break out of the outer loop (i.e., we should just ignore the return value). We should continue normally, trying to perform the current write, and exit in one of the normal spots listed above. Otherwise we still have a race condition where the write is trying to finish but the result channel has already been dropped. If there's no one on the other side of the `result_sender`, that's fine; the side effect of writing is still important and should be completed.
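A simplified sketch of the worker loop this describes, with a hypothetical channel layout standing in for `AsyncWriteStream`'s internals:

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio::sync::mpsc;

// With no abort-on-drop, the task ends on its own when the write side of
// the channel goes away, and a closed result channel is ignored rather
// than treated as a reason to stop writing.
async fn worker<W: AsyncWrite + Unpin>(
    mut writer: W,
    mut jobs: mpsc::Receiver<Vec<u8>>,
    results: mpsc::Sender<std::io::Result<()>>,
) {
    // recv() returning None is the normal exit: the sender (and thus the
    // stream handle) has been dropped and the channel has drained.
    while let Some(bytes) = jobs.recv().await {
        let result = writer.write_all(&bytes).await;
        // If nobody is listening for the result anymore, that's fine; the
        // side effect of the write is what matters, so ignore the send
        // error instead of breaking out of the loop.
        let _ = results.send(result).await;
    }
}

#[tokio::main]
async fn main() {
    let (job_tx, job_rx) = mpsc::channel(8);
    let (res_tx, res_rx) = mpsc::channel(8);
    let task = tokio::spawn(worker(tokio::io::stdout(), job_rx, res_tx));
    drop(res_rx); // nobody is listening for results; the worker writes anyway
    job_tx.send(b"Hello, World!\n".to_vec()).await.unwrap();
    drop(job_tx); // channel drains, recv() returns None, worker exits cleanly
    task.await.unwrap();
}
```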
rylev edited a comment on issue #6811:
Thinking about this more, I think the right thing to do is to just remove the call to abort inside `AsyncWriteStream`'s drop impl. I don't see why an immediate abort would be necessary, and if there is such a use case, it seems like it should be opt-in.

If we remove the call to `abort`, then the spawned task will still cleanly exit: either the call to `recv` will return `None` because the channel is empty and the sender has been dropped, or a normal expected exit occurs here or here.
alexcrichton commented on issue #6811:
The main consequence of letting things run in the background, though, was that each wasm guest could leave work sitting in the background for an arbitrarily long time, meaning that work could accumulate by accident before it's all flushed out and finished. The thinking was that we could add the ability for a host to explicitly opt in to daemon-ing the background work to keep it going, but additionally allow hosts to learn about the background work via some object too.

In light of this we may want to revisit that default, though, since I think it's true that almost all the time having background work won't harm anyone or anything. So while I think it'd still be good to package up the background work for an embedder to get access to, @pchickey what do you think about switching to not-abort-by-default?
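As a rough sketch of the embedder-facing shape this could take (the `BackgroundWork` type and its methods are illustrative, not a proposed wasmtime API): the default packages the work up for the embedder to await, with daemon-ing as the explicit opt-in.

```rust
use tokio::task::JoinHandle;

// Packaged background work handed to the embedder.
struct BackgroundWork(Vec<JoinHandle<()>>);

impl BackgroundWork {
    /// The not-abort default under discussion: wait for everything to flush.
    async fn join(self) {
        for handle in self.0 {
            let _ = handle.await;
        }
    }

    /// Explicit opt-in to daemon-ing: detach the tasks so they keep running
    /// on the runtime with nobody waiting on them. Dropping a tokio
    /// JoinHandle detaches the task rather than aborting it, so dropping
    /// the handles is enough.
    fn daemonize(self) {
        drop(self.0);
    }
}

#[tokio::main]
async fn main() {
    let work = BackgroundWork(vec![tokio::spawn(async {
        println!("background write flushed");
    })]);
    work.join().await; // or: work.daemonize();
}
```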
elliottt closed issue #6811.
Last updated: Nov 22 2024 at 16:03 UTC