Stream: git-wasmtime

Topic: wasmtime / PR #11807 support WASIp3 instance reuse in `wa...


view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2025 at 22:22):

dicej opened PR #11807 from dicej:p3-instance-reuse to bytecodealliance:main:

This is a draft implementation of WASIp3 instance reuse. Previously, we used one store and one instance per request for both p2 and p3 handlers, discarding the store and instance immediately after the request was handled. Now we attempt to reuse p3 instances and their stores instead, both serially and concurrently.

Advantages of reuse:

Disadvantages of reuse:

Given the tradeoffs, we'll definitely need to provide components a way to opt out of reuse if desired. See
https://github.com/WebAssembly/wasi-http/issues/190 for related discussion.

Implementation details:

I've moved ProxyHandler from wasmtime_cli::commands::serve to wasmtime_wasi_http::handler (so it can be reused by custom embedders), abstracted away the serve-specific parts, and added support for instance reuse via a new ProxyHandler::push function. The push function takes a work item representing an incoming request and dispatches it to a dynamically sized pool of worker tasks, each with its own store and wasi:http/handler@0.3.x instance. Worker tasks accept work items as capacity allows until they either reach a maximum total reuse count or hit an idle timeout, at which point they exit.
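
To make the worker shape concrete, here is a minimal sketch of such a task, assuming hypothetical names (worker, handle, max_reuse, idle_timeout) and a Tokio mpsc receiver; this is not the actual wasmtime_wasi_http API, only the accept-until-limit-or-idle loop described above:

use std::future::Future;
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical sketch of a worker task: accept work items until the
// reuse limit is reached or the idle timeout elapses, then exit (which
// drops this worker's store and instance).
async fn worker<T, Fut: Future<Output = ()>>(
    mut rx: mpsc::Receiver<T>,
    max_reuse: usize,
    idle_timeout: Duration,
    mut handle: impl FnMut(T) -> Fut,
) {
    let mut handled = 0;
    while handled < max_reuse {
        match tokio::time::timeout(idle_timeout, rx.recv()).await {
            Ok(Some(item)) => {
                handle(item).await; // run the request on this worker's instance
                handled += 1;
            }
            // Channel closed, or no work arrived within the idle timeout.
            Ok(None) | Err(_) => break,
        }
    }
}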

Performance:

I've run a few benchmarks using wasmtime-serve-rps.sh and hello-wasip3-http (a simple "hello, world" request handler). Highlights:

TODOs:


view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2025 at 22:39):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2025 at 22:48):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 07 2025 at 22:58):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 11:31):

tschneidereit commented on PR #11807:

A 2/3rds reduction is less than I'd expect, given the max concurrency is set to 16, and all of these instances should be able to easily absorb 16 concurrent requests, given that they spent 99+% of their time waiting for that timeout to resolve.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 13:11):

dicej commented on PR #11807:

  • 64% fewer concurrent instances required for an I/O-bound handler

A 2/3rds reduction is less than I'd expect, given the max concurrency is set to 16, and all of these instances should be able to easily absorb 16 concurrent requests, given that they spent 99+% of their time waiting for that timeout to resolve.

I think that's at least partially due to the algorithm I'm using to decide when a worker will accept concurrent requests (and indicate it's ready to the request dispatcher), which balances the goal of concurrent reuse against the goal of maximizing throughput for guests which do little or no I/O.

Specifically, as soon as a worker has accepted a work item (i.e. a request to be handled), it marks itself unavailable to the dispatcher and will not accept another task (nor mark itself available again) until the store's event loop returns Poll::Pending, meaning all work items it has accepted are blocked on I/O. In practice, this means that the number of available workers will go to zero more often than if they each accepted tasks unconditionally up to the max concurrency level, and when the dispatcher sees the available worker count at zero it starts another worker with its own instance.

The advantage of the above algorithm is that new work items are never (or at least very rarely) left waiting for a worker to free up from whatever CPU-bound work item(s) it is occupied with. The disadvantage is we use more instances. Perhaps there are ways to tweak the algorithm so it can reduce the instance count without affecting latency?
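
One way to picture that availability rule (a sketch under assumed names, not the PR's actual code) is a future wrapper that flips a flag based on the poll result of the store's event loop:

use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::task::{Context, Poll};

// Hypothetical sketch: `available` is set only when the wrapped event
// loop returns Pending, i.e. every accepted work item is blocked on I/O.
struct Availability<'a, F> {
    inner: Pin<&'a mut F>,
    available: &'a AtomicBool,
}

impl<F: Future> Future for Availability<'_, F> {
    type Output = F::Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // Availability is Unpin (it holds only a Pin and a reference).
        let this = self.get_mut();
        this.available.store(false, Ordering::Relaxed); // busy while actively polled
        match this.inner.as_mut().poll(cx) {
            Poll::Pending => {
                // Everything in flight is waiting on I/O: advertise capacity.
                this.available.store(true, Ordering::Relaxed);
                Poll::Pending
            }
            ready => ready,
        }
    }
}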

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 13:44):

tschneidereit commented on PR #11807:

I think that algorithm does make sense. My strong intuition would still be that it'd lead to better results than you're seeing. I wonder if part of the explanation could be the artificial nature of the load test, where a lot of requests come in at the same time with little variation? Maybe a load generation scheme that starts additional concurrent requests in a staggered manner would show different results?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 13:49):

tschneidereit commented on PR #11807:

It looks like Siege supports this, and in fact does it by default—see the --delay option. By default, Siege will apparently delay individual requests' start times by up to delay seconds, and ramp up to the defined number of concurrent clients.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 14:53):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 08 2025 at 14:53):

alexcrichton created PR review comment:

Perhaps not the most critical thing, but I'm getting more and more wary of the solution of "just slap FuturesUnordered to get concurrency". With the addition of this we now have these layers of FuturesUnordered:

This is getting to be a bit excessive IMO and I feel like we're going to start seeing perf cliffs as a result in various scaling situations. I don't have anything concrete in mind necessarily, but I feel like we can't just keep slapping a FuturesUnordered everywhere and assume it'll all work out and scale well...

Unfortunately though I don't have a great solution off the top of my head. I realize what this is doing is maintaining concurrency within a store and API-wise we haven't given many other options here.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 15:59):

dicej commented on PR #11807:

It looks like Siege supports this, and in fact does it by default—see the --delay option. By default, Siege will apparently delay individual requests' start times by up to delay seconds, and ramp up to the defined number of concurrent clients.

I just ran a test using siege -c 50 -t 30s -d 0.5 and got the exact same peak concurrent instance count as with hey. Siege doesn't appear to be applying any jitter to the delay, nor does it appear to vary when each user starts, and I'm not seeing any options to enable such things. I suspect we'll either need to keep looking or write our own load test tool :shrug:

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 16:00):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 16:16):

tschneidereit commented on PR #11807:

Interesting. I guess either the Siege docs are incorrect or my reading of them is. Either way, as mentioned in our call I think the results you already have are entirely sufficient performance motivation for going ahead, so there's no blocking need to get additional ones, given that I'd only ever expect them to be better.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 20:46):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 20:49):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej commented on PR #11807:

While implementing new CLI options to control reuse (i.e. max reuse count, max concurrent reuse count, and idle instance timeout), I noticed that setting the max reuse count to 1 (which effectively disables reuse) reduced wasmtime-serve-rps.sh performance by 86% vs. the main branch. I was able to fix that by cheating: falling back to the normal request handling path and not using the new reuse-enabled path when the max reuse count is 1.
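
A minimal sketch of that fallback split, with hypothetical names (fresh_path standing in for the original per-request path, reuse_path for the new pooled one):

use std::future::Future;

// Hypothetical sketch: when reuse is effectively disabled, skip the
// pooling machinery entirely and take the same per-request path as main.
async fn dispatch<Req, Resp, F1, F2>(
    req: Req,
    max_reuse_count: usize,
    fresh_path: impl FnOnce(Req) -> F1,
    reuse_path: impl FnOnce(Req) -> F2,
) -> Resp
where
    F1: Future<Output = Resp>,
    F2: Future<Output = Resp>,
{
    if max_reuse_count == 1 {
        fresh_path(req).await // no pool, no channel, no atomics
    } else {
        reuse_path(req).await // reuse-enabled worker pool
    }
}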

The upshot is that if you're setting max reuse count >1, you need to set it pretty high (e.g. >16) to actually get a performance benefit, since the overhead of the reuse-enabled code path will dominate otherwise. That suggests we might be leaving significant performance on the table even when the max reuse count is high. I don't know what the bottleneck is -- maybe tokio::spawn; maybe FuturesUnordered like Alex mentioned; maybe cache coherency traffic due to atomics; maybe something else :shrug: .

Anyway, I'm going to mark this ready-for-review now, since the major TODO items have been addressed, and both the default settings and the reuse-disabled mode provide performance that's at least as good as main.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej has marked PR #11807 as ready for review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej requested wasmtime-wasi-reviewers for a review on PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej requested alexcrichton for a review on PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej requested wasmtime-core-reviewers for a review on PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 10 2025 at 21:09):

dicej requested wasmtime-default-reviewers for a review on PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 11 2025 at 16:12):

dicej commented on PR #11807:

I don't know what the bottleneck is -- maybe tokio::spawn; maybe FuturesUnordered like Alex mentioned; maybe cache coherency traffic due to atomics; maybe something else :shrug: .

Thinking about this a bit more last night, I realized that at least part of the problem is how the code decides whether to create a new instance for a given request; the way that currently works when a bunch of requests come in at the same time is suboptimal from a latency perspective. I'll experiment with some variations on Monday.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 14:02):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 14:09):

dicej commented on PR #11807:

I don't know what the bottleneck is -- maybe tokio::spawn; maybe FuturesUnordered like Alex mentioned; maybe cache coherency traffic due to atomics; maybe something else :shrug: .

Thinking about this a bit more last night, I realized that at least part of the problem is how the code decides whether to create a new instance for a given request; the way that currently works when a bunch of requests come in at the same time is suboptimal from a latency perspective. I'll experiment with some variations on Monday.

This is mostly addressed; the performance penalty is now less than 20%. I've left the fast path in place for when reuse is completely disabled, so there's no penalty there at all. For small, but greater-than-one max reuse counts, performance is much better than before.

The tradeoff is now the code is more aggressive about creating new instances when a bunch of requests arrive simultaneously, so the peak instance count can be nearly as high as with reuse disabled. However, I think that's the correct tradeoff to maximize throughput in that scenario, and it should be pretty rare in practice anyway.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Since this already forwards all the arguments of the script itself, could these extra options be relegated to "$@" to avoid tampering with defaults too much?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

With https://github.com/bytecodealliance/wasmtime/pull/11822, I think this is no longer necessary.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

The synchronization and implementation here is subtle enough that I've got no idea, on first pass, why this is constructed the way it is. I see from Tokio's documentation, however, that this is a copy/paste of the examples they recommend for a mpmc channel. Could this function have a comment pointing to said docs indicating that this is the documented method of doing this?
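
For readers following along, the commonly cited pattern for getting multi-consumer behavior out of Tokio's mpsc is to share the receiver behind an async mutex; a sketch of that pattern (hypothetical names, and not necessarily byte-for-byte what the PR does):

use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

// Sketch: every worker holds a clone; only one at a time takes the lock
// and pulls the next work item, giving mpmc semantics over an mpsc channel.
struct WorkQueue<T> {
    rx: Arc<Mutex<mpsc::Receiver<T>>>,
}

impl<T> Clone for WorkQueue<T> {
    fn clone(&self) -> Self {
        Self { rx: Arc::clone(&self.rx) }
    }
}

impl<T> WorkQueue<T> {
    async fn pop(&self) -> Option<T> {
        self.rx.lock().await.recv().await
    }
}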

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Would it be possible to go ahead and resolve this TODO in this PR? I think it'd be nice to see this in action for both p2/p3 to avoid having to maintain multiple separate paths.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Could this be inlined above since it's only called once?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

In lieu of this TODO being resolved, I think that the entire store must be thrown away, right? Otherwise cancelling a task leaves a store in an effectively indeterminate state?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Naively I'd expect a signal like this would be updated on "major events" like a task being completed or a new task coming in. Is there a reason though that this needs to be updated on each turn of the event loop more-or-less?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

This is a fair bit of #[cfg] traffic, could that be relegated to a submodule perhaps to handle concurrency or something like that? (basically seeking out if there's a way to reduce the total amount of #[cfg])

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

How come this spawns a worker vs letting the handler at the top-level be in charge of that?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Instead of duplicating code, could this call Worker::run directly?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Personally I find nested/multiple uses of future::poll_fn quite confusing as I have to keep going back-and-forth trying to figure out what's polling what and what's doing what. For example here I need to jump back above to look at the loop, see where it's pending, etc. Would it be possible to simplify this by managing accept_concurrent within the loop itself? For example waiting for a lull between tasks or something like that.

I also find it pretty hard to reverse-engineer when this variable is set, unset, etc, and what are the state transitions for something like this.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Ideally I would like to not have the duplication this currently has, so if possible I'd like to move the p2 handling into the handler added in the wasi-http crate. Previously the difference between these two branches was localized to just here and was, in theory, pretty easy to audit and see that they're doing the same thing. Now that they're so spread out across crates it's much more difficult to verify that both end up doing the same thing. In theory it should be possible for the new handler to pretty easily handle p2 modules, right?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:28):

alexcrichton created PR review comment:

Is this right to propagate an error here? I would expect that a task timing out results in a log or tweaking of heuristics, as it does right now by not accepting more tasks. Otherwise, though, I wouldn't expect the entire worker to fail just because one task timed out.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:38):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:40):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:46):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:46):

dicej created PR review comment:

It _could_, but Worker::run does a lot more than this code does, including polling the mpmc channel, fiddling with atomic variables, using a FuturesUnordered, etc., none of which is needed here.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:50):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:50):

dicej created PR review comment:

Because that could lead to orphan requests such that maybe_start_worker sees a non-zero worker_count and therefore makes no attempt to start a new one, but then the last worker exits right after that. This code ensures that, when the last worker exits and there is work to do, then a new worker will be started.
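
A sketch of that hand-off (hypothetical names; fetch_sub returns the previous value, so seeing 1 means this worker was the last one out):

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical sketch: the exiting worker re-checks for queued work after
// decrementing the count, covering the window where the dispatcher saw
// worker_count > 0 and therefore skipped starting a new worker.
fn on_worker_exit(worker_count: &AtomicUsize, queued: usize, spawn_worker: impl FnOnce()) {
    let was_last = worker_count.fetch_sub(1, Ordering::SeqCst) == 1;
    if was_last && queued > 0 {
        spawn_worker(); // don't orphan the queued requests
    }
}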

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:52):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:52):

dicej created PR review comment:

The code in set_available short-circuits if the state hasn't actually changed, so this is a pretty light-weight operation. The reason we're doing it here is that accept_concurrent could change from outside the async block any time it's polled.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:55):

dicej created PR review comment:

Yes, and that's exactly what we do as soon as futures.len() has gone to zero; i.e. once every not-yet-timed-out task has finished or timed out (which will take at most one more timeout interval from the point of the first timeout), we drop the instance and store.
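
In code-shaped terms (a sketch, not the PR's actual function), that teardown amounts to draining the FuturesUnordered before the drop:

use futures::stream::{FuturesUnordered, StreamExt};

// Sketch: wait until every remaining task has finished or timed out, then
// let the caller drop the worker, taking the store and instance with it.
async fn drain<F: std::future::Future>(mut tasks: FuturesUnordered<F>) {
    while tasks.next().await.is_some() {}
    // tasks is empty here; no task is left in an indeterminate state
}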

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:55):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:57):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 16:57):

dicej created PR review comment:

I was trying to approximate the existing behavior in the non-reuse case where we return an error on timeout, but I don't have a strong opinion about doing that vs. just logging immediately and returning Ok(()).

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 17:04):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 17:04):

dicej created PR review comment:

Would it be possible to simplify this by managing accept_concurrent within the loop itself? For example waiting for a lull between tasks or something like that.

Currently, there's no way to determine whether the store's event loop has gone idle from within a Future passed to run_concurrent -- the only way to know is based on when the Future returned by run_concurrent itself returns Pending. In other words, just because FuturesUnordered::next returns Pending doesn't mean we're ready to accept another task -- it may only mean that it's waiting on deferred work which will be done during the next turn of the store's event loop.

I'll think some more about what sort of API(s) we could add to Wasmtime to expose store event loop lulls in a way that's accessible from within a Future passed to run_concurrent, but it's not obvious to me what that would look like.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 17:10):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 17:10):

dicej created PR review comment:

At a minimum, I can move this code into handler.rs, for sure. Actually supporting p2 reuse is less trivial, but I'll give it a shot.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:09):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:09):

alexcrichton created PR review comment:

While true, does that factor into the 20% improvement this duplication earns? I would expect that those constructs are small peanuts, but that's also just a guess

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:10):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:10):

alexcrichton created PR review comment:

Ah ok makes sense, mind leaving a comment to that effect?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:11):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:11):

alexcrichton created PR review comment:

Hm ok, I think my attempts to understand how this implementation works have basically all ended in failure. Given the comments below I've seriously misunderstood how this loop works, and I already stared at it for an hour...

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:17):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:17):

alexcrichton created PR review comment:

Given the comments above this one, I think I've basically reached the conclusion that I don't really have any idea how this is all piecing together. The combinations and subtleties of poll_fn, plus FuturesUnordered, plus the multiple layers here (e.g. two poll_fn), plus the variables used to control behavior ... I am basically just totally lost.

I do think this would be much, much more readable if this was all written without poll_fn. Once the transition is made from async/await to futures internals things get extremely tricky. In contrast it's basically trivial to read async/await code. Overall poll_fn is fine IMO only when the internal closure is a handful of lines long as opposed to taking up most of the function.

Is it possible to rewrite most of this in an async/await style? Maybe not preserving the exact same behavior, but something similar? For example the idle time could be replaced with "here's something which takes up to N messages from the incoming queue, ending early if M time elapses between messages taken". That's not an exact correlation to the idle time implemented here (I think), but I feel like it captures the intent well enough and would in theory be much easier to understand, since that entire computation could live in an async block and be forgotten about (or something like that).
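
Something like the following captures that suggestion in plain async/await (hypothetical names; idle plays the role of the time-between-messages cutoff):

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Sketch of "take up to N messages, ending early if M time elapses
// between messages": no poll_fn, no manually managed state flags.
async fn take_batch<T>(rx: &mut mpsc::Receiver<T>, n: usize, idle: Duration) -> Vec<T> {
    let mut items = Vec::new();
    while items.len() < n {
        match timeout(idle, rx.recv()).await {
            Ok(Some(item)) => items.push(item),
            Ok(None) | Err(_) => break, // channel closed or idle cutoff hit
        }
    }
    items
}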

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:19):

alexcrichton created PR review comment:

Overall I'm basically just wary of tacking on more and more features of "here's the p3 thing over there, we're going to just ignore the p2 thing over here". It spreads out the implementation burden and means that we end up having a bunch of untested code, because most preexisting things are p2.

I would naively expect integration to be pretty simple because p2 functions can be call_concurrent'd just the same as p3 functions. One simplification, perhaps, is that wasmtime serve should unconditionally depend on the component-model-async proposal. I don't think there's any point in supporting it with component-model but not component-model-async compiled in.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:19):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:25):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:25):

dicej created PR review comment:

Are you proposing that we generate the p2 bindings in wasmtime-wasi-http with exports: { default: async | store }? Or generate them with exports: { default: async } like we do today, but also generate another set with | store? Or something else?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:31):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 19:31):

alexcrichton created PR review comment:

I wasn't really thinking that far myself. My thinking was that we should make it work to have one implementation of wasmtime serve, and other goals should all be secondary. So for example if a second set of bindings is needed, that seems fine to me. Either that or a "get the function from bindings" accessor so a direct call is made to the TypedFunc, or something like that.

Put another way, I don't personally care how the bindings work out, but I do care much more that there's only one "meat" implementation that we have to test/verify/audit/etc.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 21:45):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 21:55):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 22:01):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 22:01):

dicej created PR review comment:

Done.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 22:03):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 22:03):

dicej created PR review comment:

Ok, I went ahead and generated a third set of p3 bindings (in addition to the existing async and sync ones) which enable call_concurrent. I considered using TypedFunc directly, but found myself duplicating the function index pre-instance lookup code; easier just to let wasmtime-wit-bindgen handle that.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 13 2025 at 22:12):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 14 2025 at 13:41):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 14 2025 at 13:43):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 14 2025 at 13:43):

dicej created PR review comment:

I just pushed an update which bypasses the task queue when there are no already-available workers, which narrowed the performance gap to just 2%; that seems close enough to justify removing the duplication. It also points to an opportunity to switch to a faster MPMC channel if we can find (or build) one.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 14 2025 at 13:56):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 14 2025 at 22:37):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 15 2025 at 19:34):

alexcrichton commented on PR #11807:

Unfortunately I do continue to be basically ever-yet-more-lost on the internals of the implementation here. I'm trying to poke at things locally to see if I can improve my understanding, though.

I've run across something that seems to me like a serious bug in the design of component-model-async APIs in Wasmtime. I started trying to explore the sort of "outer timeout" here and the whole BTreeMap that was recently added. That feels quite surprising to me, and naively I would not expect it to be necessary. I see now why it was done (sort of), but the inherent need for it, to me, indicates a serious bug in how we're exposing async things.

Currently run_concurrent takes an AsyncFnOnce, but in reality it's not actually behaving the way that I would expect an AsyncFnOnce to behave. Specifically if the guest calls std::thread::sleep then that blocks all other async operations in that AsyncFnOnce, including host timeouts (e.g. the tokio::time::timeout-per-request). That to me is a pretty big violation of what one would expect from async APIs in Rust.

Can you speak more to why this behavior is happening? I understand why all wasm would be blocked, but I don't understand why all other host interactions are also blocked.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 15 2025 at 19:50):

dicej commented on PR #11807:

Can you speak more to why this behavior is happening? I understand why all wasm would be blocked, but I don't understand why all other host interactions are also blocked.

If you want to trace through the code, run_concurrent calls poll_until calls handle_work_item calls resume_fiber calls fiber::resolve_or_release. In this case, the fiber ends up suspending _without_ releasing the store, so the resolve_or_release Future::poll returns Pending, which bubbles all the way back up to run_concurrent. At this point, nothing can be done with the Store -- the fiber has exclusive access, so we can't run the event loop, can't call tls::set, can't run the AsyncFnOnce, etc. It's like the Store is locked, and there's nothing we can do except either wait for the fiber to resume and release the store or give up and drop everything; we do the latter to enforce the timeout.

One way out of this would be to rewrite all of wasmtime-wasi's p2 support to use async | store when generating import bindings for any blocking operation, in which case the fiber would release access to the Store when the guest sleeps, meaning none of the above would be an issue. Not sure we have an appetite for that right now, though.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 15 2025 at 19:53):

dicej edited a comment on PR #11807:

Can you speak more to why this behavior is happening? I understand why all wasm would be blocked, but I don't understand why all other host interactions are also blocked.

If you want to trace through the code, run_concurrent calls poll_until calls handle_work_item calls resume_fiber calls fiber::resolve_or_release. In this case, the fiber ends up suspending _without_ releasing the store, so the resolve_or_release Future::poll returns Pending, which bubbles all the way back up to run_concurrent. At this point, nothing can be done with the Store -- the fiber has exclusive access, so we can't run the event loop, can't call tls::set, can't run the AsyncFnOnce, etc. It's like the Store is locked, and there's nothing we can do except either wait for the fiber to resume and release the store or give up and drop everything; we do the latter to enforce the timeout.

One way out of this would be to rewrite parts of wasmtime-wasi's p2 support to use async | store when generating import bindings for any blocking operation, in which case the fiber would release access to the Store when the guest sleeps, meaning none of the above would be an issue. Not sure we have an appetite for that right now, though.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2025 at 16:00):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 16 2025 at 16:31):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 18:53):

alexcrichton updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 18:56):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 18:56):

alexcrichton created PR review comment:

Ok, after thinking more about this, I think the main substantial change I'd like to see is the removal of this timeout. Given what we clarified in https://github.com/bytecodealliance/wasmtime/pull/11871, this timeout effectively cannot work, so I don't think we should be relying on it for anything here.

Otherwise while I'd still like to tweak things about this loop I don't think it's necessarily worth it at this time.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 18:56):

alexcrichton created PR review comment:

s/SeqCst/Relaxed/

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 18:56):

alexcrichton created PR review comment:

This can be private, right? (ideally it wouldn't be exposed)

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:08):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:08):

dicej created PR review comment:

Are you sure about that? The ordering of this operation relative to the Queue::push that happens before it must be preserved, whereas Relaxed is documented to have "No ordering constraints, only atomic operations". I imagine we could get away with something less strict than SeqCst here (maybe Release?), but Relaxed seems too weak in this case.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:09):

dicej edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:12):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:12):

dicej created PR review comment:

serve.rs (which is in a different crate) needs access to it; not sure if there's a way around that.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:16):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:16):

dicej created PR review comment:

(and not just serve.rs, but anyone who wants to use this module, presumably)

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:49):

alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:49):

alexcrichton created PR review comment:

In this case worker_count isn't protecting anything unsafe or anything like that, so AFAIK Relaxed is all that's needed. Anything stronger is typically only required when unsafe is involved or when it's specifically trying to synchronize with some other modification of worker_count to see updates made somewhere else, but in this case the state is in a mutex anyway, so I don't think that's required.
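
Concretely (an illustrative sketch, not the PR's code): a Relaxed read-modify-write is still atomic, so increments can't be lost or torn; Relaxed only gives up ordering guarantees relative to surrounding memory operations, which a purely heuristic counter doesn't need:

use std::sync::atomic::{AtomicUsize, Ordering};

static WORKER_COUNT: AtomicUsize = AtomicUsize::new(0);

// The counter stays consistent under concurrent updates even with
// Relaxed; only cross-thread ordering with other writes is unspecified.
fn worker_started() {
    WORKER_COUNT.fetch_add(1, Ordering::Relaxed);
}

fn worker_stopped() {
    WORKER_COUNT.fetch_sub(1, Ordering::Relaxed);
}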

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:50):

alexcrichton submitted PR review:

Talked some more with @dicej offline about this, and more-or-less it's not really worth spending more time bikeshedding things. This is working well enough as-is, and while I'm not perfectly satisfied, it's not worth continuing to refine things at this point. The API is solid enough that it's fine to iterate in-tree.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:53):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 20:53):

dicej created PR review comment:

Alex and I chatted about this synchronously. He actually meant to say he wanted to remove the _other_ timeout in this closure. I pointed out that the other timeout guarantees the invariant that we never kill an instance with at least one task that has not yet reached its timeout; it ensures that by setting reuse_count = max_instance_reuse_count as soon as any task reaches the timeout.

The cost of guaranteeing that invariant is that the code is more complicated than it otherwise would be, and more subtle, which is unfortunate, but abandoning the invariant would make it unusable for practical purposes, and it's not clear there's a better way to structure the code that still provides the same guarantee.

I'll add some more code comments to point out the subtleties here, which should help with future maintenance.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:00):

lann submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:00):

lann created PR review comment:

What is req_id?

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:30):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:34):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:34):

dicej created PR review comment:

wasmtime serve uses it to prefix lines emitted by the guest to stderr and stdout, so we plumb it through. It will only be non-None if max_instance_reuse_count == 1; otherwise, we can't reliably do that prefixing given that the store will be reused (although I suppose we could make it work for sequential-only reuse with more effort).

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:35):

dicej edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:36):

dicej edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:37):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:37):

dicej created PR review comment:

I'll expand the doc comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:48):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:51):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:51):

dicej created PR review comment:

Ok, taking another look at this, I've realized I glossed over that this is a load rather than a store, in which case what I wrote above doesn't apply. I'll study this a bit more.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 21:52):

dicej edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:07):

lann submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:07):

lann created PR review comment:

It might be nice if these allowed for more flexible logic, maybe even with default impls, e.g.

fn should_reuse_instance(&self, reuse_count: usize) -> bool {
    true
}

fn should_reuse_concurrently(&self, reuse_count: usize, current_tasks: usize) -> bool {
    false
}

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:13):

lann submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:13):

lann created PR review comment:

Actually the things I would have in mind for this would be more global and could be managed by whatever is calling Worker::spawn, but now it isn't clear to me why these limits should be managed by Worker at all... :thinking:

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:21):

lann edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:22):

dicej updated PR #11807.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:22):

lann edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:24):

dicej created PR review comment:

Okay, I just pushed an update with more comments and also changed a _different_ SeqCst to Relaxed. I'm guessing the remaining SeqCsts are stronger than we actually need, but the ordering requirements between the Queue operations involving a Mutex and the worker_count AtomicUsize operations are a bit subtle, and I'm erring on the side of paranoia here.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:24):

dicej submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Oct 17 2025 at 22:32):

dicej commented on PR #11807:

I'm going to hold off until Monday to merge this, just in case any other concerns come up. Thanks for the attentive review!

view this post on Zulip Wasmtime GitHub notifications bot (Oct 20 2025 at 15:18):

alexcrichton merged PR #11807.

