alexcrichton opened PR #11372 from alexcrichton:pagemap_scan to bytecodealliance:main:
This series of commits is the brainchild of @tschneidereit who, in his spare time, reads Linux kernel documentation and finds random ioctls. Specifically @tschneidereit discovered the `PAGEMAP_SCAN` ioctl, which has the basic description of:

> This ioctl(2) is used to get and optionally clear some specific flags from page table entries. The information is returned with PAGE_SIZE granularity.

As a bit of background, one of the main scaling bottlenecks for Wasmtime-on-the-server is the implementation of resetting WebAssembly instances back to their original state, notably linear memories and tables. Wasmtime employs two separate strategies on Linux for doing this today:
- The `PoolingAllocationConfig` has `*_keep_resident` options which indicate that this much memory will be memset back to the original contents. These options default to 0 bytes.
- Resetting memory beyond `*_keep_resident` is done with `madvise(MADV_DONTNEED)` to reset linear memories and tables back to their original state.

Both of these strategies have drawbacks. For `*_keep_resident` and memset this is a blind memset of the lower addresses in linear memory, where lots of memory is set that was neither read nor written by the guest. As a result this can't be too large lest Wasmtime become a `memset` benchmark. For `madvise(MADV_DONTNEED)` this requires modifying page tables (removing resident pages), which results in IPIs to synchronize other cores. Additionally each invocation of a WebAssembly instance will always incur a page fault on memory accesses (albeit a minor fault, not a major fault), which can add up.

By using the `PAGEMAP_SCAN` ioctl we can take the best of both worlds here and combine these into a more efficient way to reset linear memories and tables back to their original contents. At a high level the ioctl works by:
- A range of virtual memory is specified to "scan". Flags are also configured to indicate what are the interesting bits we care about in the page table.
- The ioctl writes to a user-supplied buffer with page-aligned regions of memory that match the flags specified.
- Wasmtime uses this to detect, within an entire linear memory, which pages are "dirty" (written to) and will memset these while `madvise`-ing any pages above the `*_keep_resident` threshold.

In essence this is almost a tailor-made syscall for Wasmtime and perfectly fits our use case. We can quickly find dirty pages, up to a certain maximum, which segments memory into "manually reset these regions" followed by "decommit these regions". This enables Wasmtime to `memset` only memory written by the guest; memory that's only read by the guest, for example, remains paged in and unmodified.

This is an improvement over `*_keep_resident` because only written pages are reset back to their original contents, not all pages in the low addresses of the linear memory address space. This is also an improvement over `madvise(MADV_DONTNEED)` because readonly pages are kept resident over time (no page faults on future invocations, all mapped to the same file-backed readonly page across multiple instances), written pages are kept resident over time (but reset back to original contents after instantiation), and IPIs/TLB shootdowns can be avoided (assuming the working set of written memory is less than `*_keep_resident` and memory isn't grown during execution). Overall this combines the best of both worlds and provides another tool in Wasmtime's toolbox for efficiently resetting linear memory.

In terms of practical impact this enables a 2.5x increase in RPS measured on a simple p2 hello-world server that uses `wasmtime serve`. The server in question only writes two host pages of memory, so resetting a linear memory requires a single syscall and 8192 bytes of memcpy. Tables are memset to zero and don't require an ioctl because their total size is so small (both are less than 1 page of memory).
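The memset-vs-madvise split described above can be sketched as pure logic. This is an illustrative model only, not Wasmtime's actual code: `Region` and `partition_dirty` are hypothetical names, and the real implementation works with the regions reported by `PAGEMAP_SCAN`.

```rust
/// A page-aligned dirty region, as a scan might report it (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Region {
    start: usize,
    len: usize,
}

/// Split dirty regions into a set to manually memset back to the original
/// image (up to `keep_resident` bytes) and a remainder to decommit with
/// madvise(MADV_DONTNEED). Hypothetical sketch, not the PR's actual code.
fn partition_dirty(dirty: &[Region], keep_resident: usize) -> (Vec<Region>, Vec<Region>) {
    let mut budget = keep_resident;
    let mut memset_regions = Vec::new();
    let mut madvise_regions = Vec::new();
    for r in dirty {
        if budget >= r.len {
            // The whole region fits in the keep-resident budget: reset it in place.
            budget -= r.len;
            memset_regions.push(*r);
        } else if budget > 0 {
            // Split: reset the first `budget` bytes, decommit the rest.
            memset_regions.push(Region { start: r.start, len: budget });
            madvise_regions.push(Region { start: r.start + budget, len: r.len - budget });
            budget = 0;
        } else {
            madvise_regions.push(*r);
        }
    }
    (memset_regions, madvise_regions)
}
```

The key property the sketch captures is that only pages the guest actually wrote are touched at all; read-only pages never appear in either output set.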
This series of commits was originally written by @tschneidereit who did all the initial exploration and effort. I've rebased the commits and cleaned things up towards the end for integration in Wasmtime. The main technical changes here are to hook into linear memory and table deallocation. The `MemoryImageSlot` has also been refactored to contain an `Arc` pointing to the original image itself in-memory, which is used to manually `memset` contents back to their original version. Resetting a `MemoryImageSlot` has been refactored to pretend that page maps are always available, which helps simplify the code and ensure thorough testing even when not on Linux.

Wasmtime only uses `PAGEMAP_SCAN` on Linux and this support is disabled on all other platforms. The `PAGEMAP_SCAN` ioctl is also relatively new, so Wasmtime additionally disables it at runtime on Linux if the kernel isn't new enough to support `PAGEMAP_SCAN`.

Support for the `PAGEMAP_SCAN` ioctl lives in a Linux-specific module and is effectively a copy of this repository where I did initial testing of this ioctl. That repository isn't published on crates.io, nor do I plan on publishing it there. The test suite is included here as well in the module for Linux pagemap bits.

A new test is additionally added to `cow.rs` which exercises some of the more interesting cases with `PAGEMAP_SCAN`.
alexcrichton requested fitzgen for a review on PR #11372.
alexcrichton requested wasmtime-core-reviewers for a review on PR #11372.
alexcrichton updated PR #11372.
alexcrichton updated PR #11372.
alexcrichton updated PR #11372.
alexcrichton updated PR #11372.
alexcrichton commented on PR #11372:
One thing I should also note here is that @tschneidereit has plans for revamping the `*_keep_resident` options to rationalize them a bit more in light of this new syscall, but we've agreed that it's best to split that out into a separate change. That means that this PR doesn't change the meaning of the preexisting `*_keep_resident` options; it just makes them much more effective.
fitzgen created PR review comment:
nitpick: newline between the closing brace and the `#[cfg(test)]`
fitzgen submitted PR review:
Very cool! Excited for this!
I remember we had some discussion in the Wasmtime meeting about the dual purpose of `keep_resident` and how it configures both an amount to `memset` and a limit on the pool's max RSS when idle. This PR would remove that second bit: if the Wasm program touches different pages on each instantiation, but always touches less than `keep_resident` on each instantiation, then all pages in the slot will eventually become and stay resident. Probably this is the correct thing to do long term, and we should have two different knobs for these two different purposes, but some embedders might be relying on that functionality today and it isn't being replaced with a new/different knob right now. It is not clear to me whether punting on a solution for that use case until some vague re-rationalization of options happens in the future is acceptable or not.
fitzgen created PR review comment:
"an indeed a" ?
fitzgen created PR review comment:
Doc for this variant?
fitzgen created PR review comment:
It would be nice if we could batch up and coalesce `PAGEMAP_SCAN`s similar to what we do with decommits. Specifically, while I don't think we will have an `ioctlv` of `PAGEMAP_SCANv` or whatever anytime soon, I can imagine that when a component uses multiple tables/memories, those pool slots will often be right next to each other, and we could fold them into one bigger scan region instead of performing multiple `ioctl`s with smaller scan regions for each pool slot.

Not something that needs to happen in this PR, but would be good to file a follow-up issue for.
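As a sketch of the coalescing idea (illustrative only, not a Wasmtime API), folding adjacent or overlapping virtual-memory ranges into larger scan regions is a standard interval merge:

```rust
/// Merge sorted-or-unsorted (start, end) ranges (end exclusive) so that
/// adjacent or overlapping ranges become one larger range. A hypothetical
/// helper for issuing fewer, bigger scan ioctls over neighboring pool slots.
fn coalesce(mut ranges: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    ranges.sort_by_key(|r| r.0);
    let mut out: Vec<(usize, usize)> = Vec::new();
    for (start, end) in ranges {
        match out.last_mut() {
            // Touching or overlapping the previous range: extend it.
            Some(last) if start <= last.1 => last.1 = last.1.max(end),
            // Disjoint: start a new range.
            _ => out.push((start, end)),
        }
    }
    out
}
```

As the follow-up comments note, guard regions between linear memories would keep real slots from ever being adjacent, which is part of why this wasn't pursued in the PR.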
fitzgen created PR review comment:
Should this consult `walk_end` to only return the regions that have actually been scanned?
fitzgen created PR review comment:
"files" -> "fails" ?
fitzgen created PR review comment:
Should this and `category_anyof_mask` be `expect(dead_code, reason = "...")` instead? That way we won't need to remember to remove this fake usage if we do end up actually using it elsewhere.
fitzgen created PR review comment:
This will only contain valid data for categories that were set in the `return_mask`, right? Good thing to note in the docs.
fitzgen created PR review comment:
So `category_mask` is "scan if all categories match (i.e. skip if any doesn't match)" and `category_anyof_mask` is "scan if any category matches (i.e. skip if none match)", right?

Can you specify both masks at the same time? If you are inverting any categories, do they have to be inverted in both masks? If I am understanding correctly, the answer is "yes". That is a little unfortunate, and makes crafting a higher-level API a little bit harder.

So yeah, I see that this very directly reflects the underlying API while exposing a safe interface, which is fantastic, but this all feels fairly funky still. I think that at minimum we should rename `category_mask` to `category_allof_mask` or something like that, to better disambiguate it from `category_anyof_mask`. In an ideal world, I'd like some kind of boolean term builder but that seems pretty difficult given the restrictions of the underlying API. I'll have to think on that a little more.
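For reference, the all-of/any-of reading described above can be modeled as plain bit logic. This is a sketch of the semantics as stated in this thread, not of the kernel's exact behavior (the real interface also involves category inversion, which this sketch omits, and the kernel may treat an empty any-of mask differently):

```rust
/// Sketch: does a page with the given category bits match the two masks?
/// `category_mask` requires ALL of its bits to be set on the page; a nonzero
/// `category_anyof_mask` requires AT LEAST ONE of its bits to be set.
fn page_matches(page_categories: u64, category_mask: u64, category_anyof_mask: u64) -> bool {
    let all_of = page_categories & category_mask == category_mask;
    // Treat a zero any-of mask as "no any-of constraint" for this sketch.
    let any_of = category_anyof_mask == 0 || page_categories & category_anyof_mask != 0;
    all_of && any_of
}
```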
fitzgen created PR review comment:
ditto
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I'll be honest I don't know what this is and while I found sources for docs for the other variants I am unable to find a copy/paste snippet for this.
alexcrichton updated PR #11372.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
One issue I think we'd run into for linear memories is guard regions, where the "reset this region" specification would never take guards into account, meaning that all regions would appear disjoint. For tables that could in theory work better, but even then the "region to reset" is just the active table upon deallocation, which isn't the entire table, so tables would also appear disjoint.
Personally I'd be wary of refactoring significantly to handle all those cases to try to get a larger `PAGEMAP_SCAN` ioctl region and coalesce them, because that would also run afoul of the `*_keep_resident` options where all lower-addressed pages would be the first to remain resident while everything afterwards would be paged out. Basically, while I agree it'd be nice to coalesce calls, I don't think it'd be practical even if things were structured to enable it.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Oh, this is part of the `ioctl` itself: the return value of the `ioctl` is the number of regions that were filled in by the kernel, so the length of this array is determined by that without needing to consult `walk_end` and such. The regions may not end at `walk_end` either, because the scan could have finished but only pages before the end were considered "interesting". If the regions fill up, though, I think the end address of the last region would be `walk_end`.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Personally I also found this confusing and I struggled with the precise behavior of these masks as well. In the end I gave up trying to understand the documentation and went to the source.

From what I've seen the ioctl docs for these fields are confusing and/or not precise enough. I tried to synthesize documentation from the function definition itself.

Overall though I'd prefer to stay as close as possible to the kernel API itself without changing the abstraction. Basically be a "pure mostly safe layer" to ensure that callers have a direct connection to what's happening for the `ioctl` rather than having to first wade through an abstraction in Wasmtime and then understand the `ioctl` itself. Given that, I'd push back on renames of fields, but I can do another pass at docs to fill them out more faithfully.
pchickey commented on PR #11372:
Did some private benchmarking and, in my server application, this took a single-core wasi-http hello world from 25000 to 40000 rps - 60% speedup in a benchmark that is mostly measuring instantiation and cleanup overhead. Awesome work!
fitzgen commented on PR #11372:
> I remember we had some discussion in the Wasmtime meeting about the dual purpose of `keep_resident` and how it configures both an amount to `memset` and a limit on the pool's max RSS when idle. This PR would remove that second bit: if the Wasm program touches different pages on each instantiation, but always touches less than `keep_resident` on each instantiation, then all pages in the slot will eventually become and stay resident. Probably this is the correct thing to do long term, and we should have two different knobs for these two different purposes, but some embedders might be relying on that functionality today and it isn't being replaced with a new/different knob right now. It is not clear to me whether punting on a solution for that use case until some vague re-rationalization of options happens in the future is acceptable or not.

This comment is based on a misunderstanding I had about the way this PR works and can be ignored. We will still only keep `keep_resident` bytes resident with this change.
fitzgen submitted PR review.
fitzgen created PR review comment:
That's fair.
fitzgen submitted PR review.
alexcrichton updated PR #11372.
pchickey submitted PR review:
I've found a correctness issue in this implementation that I'm working to make a reproducing test case for
alexcrichton updated PR #11372.
alexcrichton commented on PR #11372:
Finished debugging with Pat. The culprit is that the `Engine` was created in the parent process, but forked subprocesses created stores. That definitely doesn't work given how I've set things up, because the stores have a page map for the parent process, not the child process. Will work on a fix.
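One way to guard against this class of bug (a hypothetical sketch, not the actual fix in this PR) is to record which process a pagemap handle was opened in and treat it as unusable from any other process, falling back to the plain memset/madvise paths:

```rust
/// Hypothetical wrapper around a per-process pagemap resource.
/// `PagemapHandle` is an illustrative name, not a Wasmtime type.
struct PagemapHandle {
    opened_in_pid: u32,
}

impl PagemapHandle {
    /// Record the pid at open time.
    fn open() -> PagemapHandle {
        PagemapHandle { opened_in_pid: std::process::id() }
    }

    /// After fork(), std::process::id() differs in the child, so a handle
    /// inherited from the parent reports itself unusable there.
    fn usable(&self) -> bool {
        self.opened_in_pid == std::process::id()
    }
}
```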
alexcrichton updated PR #11372.
alexcrichton commented on PR #11372:
@pchickey mind reviewing the most recent commit to confirm it looks good?
alexcrichton updated PR #11372.
pchickey submitted PR review:
I can confirm that pagemap is now working properly when wasmtime is used in a forked child process! Thank you for tracking this down with me @alexcrichton .
pchickey submitted PR review:
Now that pagemap scan is actually returning a non-empty set of pages (because, before fixing the bug, it was scanning the parent process's pages) our throughput improvement is 50%, not 60% as before. 50% is still outstanding!
alexcrichton updated PR #11372.
alexcrichton has enabled auto merge for PR #11372.
alexcrichton updated PR #11372.
alexcrichton merged PR #11372.
posborne commented on PR #11372:
I missed this PR dropping but am reviewing now. If I am understanding correctly, one quirk of this approach is that with lazy page faulting of the `keep_resident_*` portion of regions, we might expect to see zeroing get slower over time: it would take just a single full use of the `keep_resident` pages for the scan to always report all those pages as dirty henceforth, making resetting this memory revert to a more expensive path from that point forward (as we're going to incur the full cost of memset plus we'll be doing the syscall on these pages). This behavior wouldn't show up in a microbenchmark but is something I would expect to see in the real world; please correct me if my understanding is off.

I also have a standalone test program / benchmarking suite where I've been trying to get some more concrete numbers on where different approaches perform worse/better (based off an earlier version of @tschneidereit's branch). This includes a half-hearted attempt to do runs in parallel, though my setup definitely doesn't get to the point of meaningfully capturing the IPI/TLB-shootdown behavior encountered at scale.

I'm actively tinkering with getting the test to be more representative of real-world use and costs: https://github.com/posborne/pagemap-scan-benchmark. So far, I'm seeing quite a few cases where the scan approach can be more expensive than other approaches. Hopefully I'll have that updated later today with a model I'm a bit more confident in, as well as my analysis tools for processing and plotting the comparisons.
alexcrichton commented on PR #11372:
Before responding, if you're concerned about this PR I'd be happy to add a `Config` option to forcibly disable (or enable) it, as that seems generally useful no matter what. Basically I want to make sure it's not a headache to upgrade and that you don't feel like you're forced to bottom all this out before the next upgrade.

Otherwise, though, one factor not necessarily captured in your benchmark is that it's not always the first N bytes of linear memory that are in the working set of the module. For example all Rust programs start with ~1MB of a linear-memory shadow stack, of which only a very small fraction at the top is used. After the shadow stack is read-only data, which can have variable size depending on the program at hand. After that is data and the heap, which will likely have an expected amount of churn. More or less I would predict that the old heuristic of "only memset the first N bytes" is much worse in practice than what pagemaps can achieve by figuring out a module's working set over time.

Another factor perhaps is that one big benefit of pagemaps is that it's possible to skip the `madvise` call altogether. If the working set is smaller than the keep-resident options then the pagemap scan is sufficient to reset linear memory.

In general though I'd love to be able to improve our set of benchmarks over time and how we're modeling all of this and measuring what to optimize, so I look forward to the results you get! I would definitely believe that further tuning of heuristics is needed here to, for example, completely madvise-away a memory if it's been reused too many times or the last known keep-resident size was too large or something like that.
posborne commented on PR #11372:
> Before responding, if you're concerned about this PR I'd be happy to add a `Config` option to forcibly disable (or enable) it, as that seems generally useful no matter what. Basically I want to make sure it's not a headache to upgrade and that you don't feel like you're forced to bottom all this out before the next upgrade.

I don't have any acute concerns with this ATM, though having reasonable controls may be a good thing to look at with any follow-on PRs to clean up `keep_resident` config clarity, etc.

> Another factor perhaps is that one big benefit of pagemaps is that it's possible to skip the madvise call altogether. If the working set is smaller than the keep-resident options then the pagemap scan is sufficient to reset linear memory.

Part of what I'm trying to piece together and potentially suss out with benchmarking is whether doing the scan in userspace to avoid the madvise call is actually cheaper than just doing the madvise. Both are going to have some awareness of the VMAs and be using much of the same code in the kernel, so it isn't clear that avoiding madvise for clean pages is a huge win in practice.

The savings in the case where we can avoid doing memsets for a good chunk of pages is the only area where I'm currently confident there are potentially significant savings. I'll try to break some of these cases down better and provide that data here. I'll also spend a bit more time reading through the kernel code paths taken by both approaches and whether there are any potential side-benefits (or risks) with the userspace pagemap scanning.
I haven't carefully considered the COW side of things as of yet.
> In general though I'd love to be able to improve our set of benchmarks over time and how we're modeling all of this and measuring what to optimize, so I look forward to the results you get! I would definitely believe that further tuning of heuristics is needed here to, for example, completely madvise-away a memory if it's been reused too many times or the last known keep-resident size was too large or something like that.

I agree that heuristics will probably be required at some point; right now I'm just trying to build a slightly better model to see what might make sense. There are probably some factors that may come into play with embeddings at scale in terms of the number of VMAs and the associated cost of the scan op, etc., but things don't look terrible here at first blush.
alexcrichton commented on PR #11372:
(FWIW I've also added this as a discussion topic for tomorrow for further, synchronous, discussion)
> Part of what I'm trying to piece together and potentially suss out with benchmarking is whether doing the scan in userspace to avoid the madvise call is actually cheaper than just doing the madvise.

Oh, this is surprising to me! I'll admit that I basically take it as an axiom that `madvise` is slow, but that's also been the result of almost all historical benchmarking. You're right that both syscalls will iterate/synchronize on VMAs, but the major costs of `madvise` are:
- IPIs and more synchronization are required to remove present pages from the page tables.
- Reuse of the same pages in the future will incur a minor page fault that the kernel services. This has historically had contention in the kernel show up high in profiles too.
These extra costs of `madvise` become more problematic the higher the concurrency, too. The `instantiation.rs` benchmark in this repository measures time-to-instantiate with N threads in the background also doing work, and while we ideally want that to be constant (unaffected by other threads), it steadily gets worse the more threads you add due to all of the contention (this isn't a perfect benchmark, but a proxy at least). By doing a memset we avoid all of the kernel synchronization at the cost of (ideally) a pagemap scan, which is the main goal here.

The main ways COW comes into the picture that I know of are:
- If a page is only read (e.g. rodata for the wasm) using PAGEMAP_SCAN vs memset-or-madvise enables us to avoid touching the page entirely. That means that after the first instance faults in the read-only page it never changes from then on and it's always present with no future page faults necessary for future instances. For memset we'd unnecessarily copy the contents and for madvise we'd unnecessarily page it out.
- If a page is read, then written, then PAGEMAP_SCAN and memset achieve the same goal of keeping the memory resident. The first instance will fault on the read, then fault on the write, and so that cost would be paid on all future instances with madvise but both PAGEMAP_SCAN and memset avoid the cost for future instances.
- Written-first pages are similar to the above, but slightly cheaper to fault in the future with madvise since it's just one fault instead of 2.
One thing I know as well is that @tschneidereit's original benchmarking found the implementation in the kernel relatively expensive with temporary data structures and such. My own benchmarking shows the ioctl is still highest in the profile and so it's probably not the fastest thing in the world, but it's so far consistently beaten out pure-madvise. (I haven't been benchmarking against madvise-plus-memset vs pagemaps since I've assumed madvise-plus-memset is way slower, but I should probably confirm that)
posborne commented on PR #11372:
Testing with thread parallelism, process parallelism (and no parallelism) yields interesting results. It definitely appears that the PAGEMAP_SCAN ioctl suffers from lock contention issues, especially in the multi-threaded case, but, at least as set up in this benchmark, doing a scan/memset may fare better than madvise for some multi-process cases.
Full report from a set of benchmark runs (select region size in dropdown) here: interactive_pagemap_benchmarks_by_size.html.zip
Here's the plot with a 1MB pool with 1 thread/process, 16 threads, and 16 processes running the benchmarks concurrently. My current suspicion is that this is capturing some IPIs and lock contention issues (there are some per-process mm locks which I think account for the thread/process perf differences).

posborne commented on PR #11372:
@alexcrichton pointed out that the current benchmark results in a bunch of concurrent memory mapping for test setup, which might be skewing things a bit; I'll try to reduce that noise as it is probably compromising the results here somewhat.
posborne commented on PR #11372:
As suspected, changing up the benchmark to minimize mmap/munmap for the parallel tests greatly reduces the noise on the syscalls; here's the updated results @ https://github.com/posborne/pagemap-scan-benchmark/commit/d347f388dfea805fb12cc2043cc89106b295d141
interactive_pagemap_benchmarks_by_size.html.zip
General Summary:
- Just zeroing the memory takes less time than pagemap below 512KB; there are probably indirect costs to always zeroing (memory bandwidth, overall CPU pressure) not shown here to consider, but this is some data.
- Many threads in a process seem to incur a penalty (relative to the same number of processes) doing the benchmark for the syscalls, but it also seems to impact memset in some way in this test (I don't have a full idea of why this is or whether it is an issue with my test setup).
- In most cases, it does look like doing the pagemap scan + zeroing outperforms madvising the full range. The end result of those ops is not identical in terms of resident memory, etc., so this is not an exact mapping to how we use them in tree.
1MB memory region size

128K memory region size

tschneidereit commented on PR #11372:
@posborne thank you for sharing this detailed analysis! I think the results are excellent validation of the viability of pagemap_scan as an approach to improve performance and, in case any of the `keep-resident` options are used, usually also reduce memory usage.

There's also one more benefit to the pagemap_scan strategy that isn't reflected in the benchmarks yet, relative to `madvise`: memory reset manually doesn't incur page faults the next time it's written to. Because of that, it might make sense to include the `region.make_dirty()` call in the measurements.

Otherwise, I guess it might make sense to add an option to set a minimum heap size, under which we'd always just do a memset. I doubt anything much above 64KB would be useful there, though: doing needless memsets not only increases memory usage for CoW-backed pages, but also causes cache churn.

I'd also expect most real-world modules/components not to be this small: anything using a shadow stack is probably going to reserve enough just for that to be too large. (Which also usually means that the overwhelming majority of the current `keep-resident` budgets will be wasted on the shadow stack, making the real-world impact of moving to `pagemap_scan` even better if `keep-resident` is already in use.)
posborne commented on PR #11372:
@tschneidereit Thank you for the feedback.
> There's also one more benefit to the pagemap_scan strategy that isn't reflected in the benchmarks yet, relative to madvise: memory reset manually doesn't incur pagefaults the next time it's written to. Because of that, it might make sense to include the region.make_dirty() call in the measurements.

Yeah, that's a good call; I've had those in and out at different periods of time, but the page fault is probably best to measure in all cases. The point at which you pay that penalty is different but more impactful to actual execution (vs. refreshing the pool).

> Otherwise, I guess it might make sense to add an option to set a minimum heap size, under which we'd always just do a memset. I doubt anything much above 64KB would be useful there, though: doing needless memsets not only increases memory usage for CoW-backed pages, but also causes cache churn.

I'm still building up my full intuition on what values are likely to be configured in practice; for regions on the smaller side, my initial thought was that tables might be more likely to fall in that range. From out-of-band experience, the extra memsets definitely do cause undesirable cache behavior and cut into memory bandwidth, which we've seen become a problem at a certain scale (not strictly driven by wasmtime but across the board).

I do think having the option could make sense in order to try to find a value that performs optimally. The other thing I've been considering, which would require some state tracking, is whether to mark a region as not worth scanning (regardless of size). If 100% of the pages are dirty (or maybe even 80% or some other number) then it may make more sense to just do the full memset and avoid the syscall (for regions where we're OK keeping resident).
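That kind of heuristic could be as simple as a dirty-page ratio check. This is a hypothetical sketch (names and the threshold are illustrative, not anything proposed concretely in this thread):

```rust
/// Sketch: decide whether a region is still worth scanning, based on how
/// much of it the last scan reported dirty. If `threshold_pct` percent or
/// more of the pages were dirty, a full memset is likely cheaper than
/// paying for the scan syscall plus a nearly-full memset anyway.
fn scan_still_worthwhile(dirty_pages: usize, total_pages: usize, threshold_pct: usize) -> bool {
    dirty_pages * 100 < total_pages * threshold_pct
}
```

A real implementation would presumably track this per pool slot across instantiations, since the point is to notice regions whose working set has saturated over time.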