hiddenbit opened PR #13256 from hiddenbit:pass-stdin-size-hint to bytecodealliance:main:
While working on a program that reads a large amount of data through stdin, I was surprised by slow stdin throughput under wasmtime. Piping 1 GiB into a simple WASI program that reads in 64 KiB chunks took about 38 seconds (~28 MiB/s). I expected something in the gigabytes-per-second order of magnitude.
To demonstrate the issue, below is a minimal WASI program that reads stdin in configurable chunks.
Piping 1 GiB of zeroes and attempting to read in 65536-byte chunks:

```
# Before the changes in this PR: takes ~38s -> ~28 MiB/s
time dd if=/dev/zero bs=65536 count=16384 2>/dev/null | /path/to/wasmtime stdio_bench.wasm 65536

# With the changes in this PR: takes ~1.2s -> ~900 MiB/s
time dd if=/dev/zero bs=65536 count=16384 2>/dev/null | /path/to/wasmtime stdio_bench.wasm 65536
```

Benchmark application code:

```rust
// build with: cargo build --release --target wasm32-wasip1
use std::env;
use std::io::{self, Read};
use std::process;

fn main() {
    let chunk_size: usize = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or_else(|| {
            eprintln!("Usage: stdin_bench <chunk-size-bytes>");
            process::exit(1);
        });

    let stdin = io::stdin();
    let mut handle = stdin.lock();
    let mut buf = vec![0u8; chunk_size];
    let mut total: u64 = 0;

    loop {
        match handle.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => total += n as u64,
            Err(e) => {
                eprintln!("read error: {e}");
                process::exit(1);
            }
        }
    }
    eprintln!("{total} bytes read");
}
```

Root cause
Regardless of how many bytes were requested (e.g. 65536), the worker thread read at most 1024 bytes (a hard-coded value) per round-trip, which made large reads slow.
See worker_thread_stdin.rs:108.

Fix
`StdinState::ReadRequested` now has a `usize` size hint from the caller. The worker thread uses this hint to size its read buffer, clamped to `[1024, MAX_READ_SIZE_ALLOC]` to avoid guest-controlled unbounded allocation while still enabling efficient bulk reads.

Callers now store a size hint when transitioning to `ReadRequested`; `Pollable::ready` uses `MAX_READ_SIZE_ALLOC` because it does not know the size of the following read.

When reading 1 GiB in 64 KiB chunks, I now get:

| Before | After | Speedup |
| --- | --- | --- |
| ~28 MiB/s | ~900 MiB/s | 32x |
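
The clamping described above, and why the old 1024-byte buffer was costly, can be sketched roughly as follows. This is an illustration only, not the code from the PR: `read_buffer_len` and `round_trips` are hypothetical helpers, and the value given to `MAX_READ_SIZE_ALLOC` here is an assumed placeholder, since the real constant's value is not stated in this thread. The floor is passed as a parameter because the thread discusses both 1024 and 1 as lower bounds.

```rust
// Assumed placeholder value; the real MAX_READ_SIZE_ALLOC is not given in this thread.
const MAX_READ_SIZE_ALLOC: usize = 64 * 1024;

/// Hypothetical helper: size the worker's read buffer from the caller's hint,
/// clamped to [floor, cap] so a guest cannot force an unbounded allocation.
fn read_buffer_len(size_hint: usize, floor: usize, cap: usize) -> usize {
    size_hint.clamp(floor, cap)
}

/// Hypothetical helper: worker round-trips needed to move `total` bytes,
/// assuming every read fills the whole buffer.
fn round_trips(total: u64, per_read: usize) -> u64 {
    let per_read = per_read as u64;
    (total + per_read - 1) / per_read
}

fn main() {
    let one_gib: u64 = 1 << 30;

    // Old behaviour: a fixed 1024-byte buffer regardless of the request size.
    let old_len = 1024;
    // Sketch of the new behaviour: honour the 64 KiB request, within the clamp.
    let new_len = read_buffer_len(65536, 1024, MAX_READ_SIZE_ALLOC);

    println!("old: {} round-trips", round_trips(one_gib, old_len));
    println!("new: {} round-trips", round_trips(one_gib, new_len));
}
```

Under these assumptions, a 64 KiB request needs 64 times fewer worker round-trips than the fixed 1024-byte buffer did, which is in line with the order of the measured speedup.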
hiddenbit requested wasmtime-wasi-reviewers for a review on PR #13256.
github-actions[bot] added the label wasi on PR #13256.
:thumbs_up: alexcrichton submitted PR review:
Thanks! One small comment but otherwise looks good to me :+1:
hiddenbit updated PR #13256.
hiddenbit commented on PR #13256:
Thanks for the suggestion! You're right that `.max(1024)` is a holdover from the old hardcoded buffer size and doesn't belong here conceptually.

However, removing `max(...)` entirely introduces a subtle issue: A WASI guest can call `read(0)`. If that propagates to the worker thread, it allocates `BytesMut::zeroed(0)` and calls `stdin().read(&mut [])`, which returns `Ok(0)`. The worker then interprets `Ok(0)` as EOF and transitions to `StdinState::Closed`, which permanently closes stdin. A subsequent `read(42)` would fail with `StreamError::Closed`. (If I didn't miss anything :sweat_smile:)

I've addressed it like this:
- Short-circuit at the call site: `InputStream::read` now returns `Ok(Bytes::new())` immediately when `size == 0`, avoiding an unnecessary worker wake-up entirely.
- Safety floor in the worker: Replaced `.max(1024)` with `.max(1)` so that even if a zero somehow reaches the worker (e.g. from a future code path), it can never be misinterpreted as EOF.

This removes the arbitrary 1024 floor while guarding against the edge case.
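
The edge case above can be reproduced with the standard library alone. The sketch below is a simplified stand-in, not wasmtime's worker: `State` and `worker_read` are hypothetical names, and a byte slice plays the role of stdin. It shows that a zero-length buffer makes `Read::read` return `Ok(0)` even though data is still available, so a worker that equates `Ok(0)` with EOF would close the stream too early, while a floor of 1 avoids the ambiguity.

```rust
use std::io::{self, Read};

/// Hypothetical, simplified stream state (the thread's real type is `StdinState`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Open,
    Closed,
}

/// Hypothetical stand-in for the worker's read step: it sizes the buffer from
/// the hint (with a floor) and treats `Ok(0)` as end of input.
fn worker_read<R: Read>(src: &mut R, size_hint: usize, floor: usize) -> io::Result<(State, usize)> {
    let mut buf = vec![0u8; size_hint.max(floor)];
    let n = src.read(&mut buf)?;
    let state = if n == 0 { State::Closed } else { State::Open };
    Ok((state, n))
}

fn main() -> io::Result<()> {
    // No floor: a zero hint yields an empty buffer, read returns Ok(0),
    // and the stream is wrongly marked closed although 5 bytes remain.
    let mut input: &[u8] = b"hello";
    println!("floor 0: {:?}", worker_read(&mut input, 0, 0)?);

    // Floor of 1: the same zero hint still reads real data, so Ok(0)
    // can only mean a genuine end of input.
    let mut input: &[u8] = b"hello";
    println!("floor 1: {:?}", worker_read(&mut input, 0, 1)?);
    Ok(())
}
```

The call-site short-circuit mentioned in the first bullet complements this: if a zero-size read never reaches the worker, the floor only acts as a safety net.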
hiddenbit edited PR #13256:
While working on a program that reads a large amount of data through stdin, I was surprised by slow stdin throughput under wasmtime. Piping 1 GiB into a simple WASI program that reads in 64 KiB chunks took about 38 seconds (~28 MiB/s). I expected something in the gigabytes-per-second order of magnitude.
To demonstrate the issue, below is a minimal WASI program that reads stdin in configurable chunks.
Piping 1 GiB of zeroes and attempting to read in 65536-byte chunks:

```
# Before the changes in this PR: takes ~38s -> ~28 MiB/s
time dd if=/dev/zero bs=65536 count=16384 2>/dev/null | /path/to/wasmtime stdio_bench.wasm 65536

# With the changes in this PR: takes ~1.2s -> ~900 MiB/s
time dd if=/dev/zero bs=65536 count=16384 2>/dev/null | /path/to/wasmtime stdio_bench.wasm 65536
```

Benchmark application code:

```rust
// build with: cargo build --release --target wasm32-wasip1
use std::env;
use std::io::{self, Read};
use std::process;

fn main() {
    let chunk_size: usize = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or_else(|| {
            eprintln!("Usage: stdin_bench <chunk-size-bytes>");
            process::exit(1);
        });

    let stdin = io::stdin();
    let mut handle = stdin.lock();
    let mut buf = vec![0u8; chunk_size];
    let mut total: u64 = 0;

    loop {
        match handle.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => total += n as u64,
            Err(e) => {
                eprintln!("read error: {e}");
                process::exit(1);
            }
        }
    }
    eprintln!("{total} bytes read");
}
```

Root cause
Regardless of how many bytes were requested (e.g. 65536), the worker thread read at most 1024 bytes (a hard-coded value) per round-trip, which made large reads slow.
See worker_thread_stdin.rs:108.

Fix
`StdinState::ReadRequested` now has a `usize` size hint from the caller. The worker thread uses this hint to size its read buffer, clamped to `[1, MAX_READ_SIZE_ALLOC]` to avoid guest-controlled unbounded allocation while still enabling efficient bulk reads.

Callers now store a size hint when transitioning to `ReadRequested`; `Pollable::ready` uses `MAX_READ_SIZE_ALLOC` because it does not know the size of the following read.

When reading 1 GiB in 64 KiB chunks, I now get:

| Before | After | Speedup |
| --- | --- | --- |
| ~28 MiB/s | ~900 MiB/s | 32x |
alexcrichton added PR #13256 Fix slow WASI stdin reads by passing size hint to worker thread to the merge queue.
alexcrichton commented on PR #13256:
Thanks!
:check: alexcrichton merged PR #13256.
alexcrichton removed PR #13256 Fix slow WASI stdin reads by passing size hint to worker thread from the merge queue.
Last updated: Jun 01 2026 at 09:49 UTC