I just finished debugging a performance regression in Spin and wanted to write it down here in case anyone else is affected. The regression appeared when we updated from Wasmtime 34 to Wasmtime 35, and after various layers of bisection I found that https://github.com/bytecodealliance/wasmtime/pull/10388 was the culprit.
The stack-switching PR merged just before the Wasmtime 34 branch, but I manually reverted it for that release to de-risk it, so it only ships in Wasmtime 35, 36, and main today. After puzzling over how an off-by-default feature could affect performance so drastically, I discovered that the PR inadvertently changed the behavior of TablePool::reset_table_pages_to_zero: previously only the table's in-use size was zeroed, but afterwards the entire table slot was zeroed. A calculation of table.size() * mem::size_of::<*mut u8>() was changed to self.data_size(table.element_type()), where the latter is the size of the whole slot. A perfectly normal bug, so no one's at fault, of course.
This meant, though, that when combined with *_keep_resident options, tables could incur a much larger memset afterwards than before (even for same-size tables). This ended up being the source of our performance regression.
The reason I'm talking about this here instead of on GitHub is that this is inadvertently already fixed. I ended up fixing this behavior in https://github.com/bytecodealliance/wasmtime/pull/11341 mistakenly assuming that the table pool allocator had always reset the entire slot instead of just the table itself. Basically I didn't realize that the behavior I was changing had itself changed recently with the merging of stack-switching. That PR did not make its way into Wasmtime 35 but it has made its way into Wasmtime 36.
So, tl;dr: if you use *_keep_resident options and see a performance regression on Wasmtime 35 but not on 34 or 36, this may be why.
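For context, these are the knobs in question. A minimal sketch assuming Wasmtime's pooling-allocator API (PoolingAllocationConfig and its keep-resident setters); the byte limits are made-up values, not a recommendation:

```rust
// Config fragment, not a full embedding: requires the `wasmtime` crate.
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn main() {
    let mut pooling = PoolingAllocationConfig::default();
    // Keep up to 64 KiB of each linear-memory slot resident, so slot reuse
    // memsets those bytes instead of paying for madvise + page faults.
    pooling.linear_memory_keep_resident(64 * 1024);
    // Same idea for table slots; this is the path affected by the 35 bug.
    pooling.table_keep_resident(64 * 1024);

    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
    let _engine = Engine::new(&config).expect("engine creation");
}
```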
Interesting, thanks for finding this and for the writeup; that's definitely an error I made while working to address feedback on the stack-switching runtime changes.
That likely explains this profile I've seen on Wasmtime 35 with keep_resident options set: https://profiler.firefox.com/public/8zfwkxghrmkz4d4bzd6zb9famgrx76kdw617bw0/flame-graph/?globalTrackOrder=102&hiddenGlobalTracks=02&symbolServer=http%3A%2F%2F127.0.0.1%3A3333%2F3nc2077n82dr7b772r2b9l8vm41mhn9anasg47j&thread=2whswxpxrwzmzowAlAnwCiCkwDhDjwFeFgwGdGfwIb&transforms=f-combined-gwkx8wxbxdxeyuyvz3z4&v=11
I've removed these options to work around it, but maybe I should revisit them in 36
Your profile, Roman, shows most of the memset coming from deallocate_memories, which shouldn't have changed between 34/35/36, so that may be something else?
In that embedding the memory size is actually static across all modules, and it's pretty big, so I just assumed that it's simply too big for the feature. madvise performed way better in my testing.
I did see the table deallocation also incur significant cost, but I've since lost that profile - removing the keep_resident options fixed both issues for me that time.
for reference, here's the complete set of config I landed upon in the end: https://github.com/near/nearcore/blob/359902578a29d4542fb0d816c9cee2a45341d4a0/runtime/near-vm-runner/src/wasmtime_runner/mod.rs#L353-L414
at one point I updated to 36 and it gave some performance benefits
I would definitely expect that for certain workloads keep_resident can make things worse, with or without the bug (though worse with the 35 bug). This is true with the pagemap optimizations as well. With madvise the biggest penalty shifts to the page faults that come later, but if very few pages are dirtied then the madvise and those page faults can still be, in aggregate, pretty expensive.
I did some comparisons of those tradeoffs in comments I added later to https://github.com/bytecodealliance/wasmtime/pull/11372. What I don't show there is any comparison of a single huge madvise compared against a pagemap scan + madvise. I felt it reached the point where just trying to bench real workloads made more sense.
Last updated: Dec 06 2025 at 06:05 UTC