cfallin commented on issue #3733:
I've implemented lazy table initialization now (mostly; there's one odd test failure with linking I need to resolve). It could probably be cleaned up somewhat, but here's one initial datapoint, with spidermonkey.wasm:
(before) sequential/default/spidermonkey.wasm time: [71.886 us 72.012 us 72.133 us] (after) sequential/default/spidermonkey.wasm time: [22.243 us 22.256 us 22.270 us] change: [-69.117% -69.060% -69.000%] (p = 0.00 < 0.05) Performance has improved.
So, 72µs to 22µs, or a 69% reduction.
cfallin commented on issue #3733:
@alexcrichton (and anyone else watching) argh, sorry about that, I pushed my changes prior to last night's measurements to a private save-the-work-in-progress fork that I keep, but not here; just pushed now. Regardless the review comments are useful and I'll keep refining this!
cfallin commented on issue #3733:
Should pass all tests now, and rebased on latest; I'll refactor as suggested above and get this ready for review tomorrow!
cfallin commented on issue #3733:
@alexcrichton I've refactored roughly along the lines of what you suggested -- a single
Arc
to encapsulate all of the per-Module
computed state that the runtime needs (memfds, lazy table backing, etc). I'm currently fighting some weird test failure (likely something stupid but I seemed to have broken trampolines), and haven't addressed a few of your more minor comments above; will continue to work on that, but this should give a preview of the basic direction.
cfallin edited a comment on issue #3733:
@alexcrichton I've refactored roughly along the lines of what you suggested -- a single
Arc
to encapsulate all of the per-Module
computed state that the runtime needs (memfds, lazy table backing, etc). <strike>I'm currently fighting some weird test failure (likely something stupid but I seemed to have broken trampolines), and</strike> haven't addressed a few of your more minor comments above; will continue to work on that, but this should give a preview of the basic direction.
cfallin commented on issue #3733:
One test failure resulted in discovering the issue in #3769; I split that off as a separate PR and can rebase this on top once it merges.
cfallin commented on issue #3733:
Here's a benchmark result of explicit-zeroing-of-VMContext (including anyfunc) vs using madvise -- using a tweaked version of the in-tree
instantiation
benchmark, withempty.wat
andspidermonkey.wasm
as two extreme endpoints of the module-size spectrum. (Noempty.wat
in the 16-thread version yet because the parallel part of the bench is written around one module but I can hack that in next if we need):explicit memset: sequential/pooling/empty.wat time: [1.3086 us 1.3187 us 1.3296 us] sequential/pooling/spidermonkey.wasm time: [15.703 us 15.824 us 15.950 us] parallel/pooling/1000 instances with 16 threads [spidermonkey.wasm] time: [17.817 ms 17.955 ms 18.094 ms] with `instance_decommit_pages()` (madvise): sequential/pooling/empty.wat time: [2.5290 us 2.5315 us 2.5349 us] change: [+85.334% +86.995% +88.607%] (p = 0.00 < 0.05) Performance has regressed. sequential/pooling/spidermonkey.wasm time: [9.7999 us 9.8139 us 9.8288 us] change: [-38.201% -37.882% -37.557%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/1000 instances with 16 threads [spidermonkey.wasm] time: [4.2498 ms 4.2551 ms 4.2606 ms] change: [-76.488% -76.302% -76.116%] (p = 0.00 < 0.05) Performance has improved.
So tl;dr is that empty instantiation gets a bit slower, as we'd expect -- extra madvise vs a tiny memset -- but for a "realistic large-class" module like SpiderMonkey, this is a ~4x speed improvement in total instantiation time. To me this suggests we should take this approach, and then perhaps refine with the single-madvise-for-whole-slot when we can for the small module case, but: thoughts?
alexcrichton commented on issue #3733:
For the benchmark numbers I posted https://github.com/bytecodealliance/wasmtime/pull/3775 which I think will tweak the benchmark to better serve measuring this. Running that locally for a large module I'm seeing the expected exponential slowdown for the pooling allocator, and also as expected the mmap allocator is behaving better here due to fewer ipis needed (as it's all contended on the vma lock)
Could you try benchmarking with that?
cfallin commented on issue #3733:
Sure; here's a benchmarking run using your updated
instantiation
benchmark. I include baselines fordefault
andpooling
, but my local changes don't affectdefault
(the mmap allocator) -- inpooling.rs
, I remove the call todecommit_instance_pages
and passfalse
toprezeroed
oninitialize_vmcontext
instead, and that's it.================================================================================ With madvise() to flash-zero Instance/VMContext: ================================================================================ sequential/default/empty.wat time: [1.1723 us 1.1752 us 1.1778 us] sequential/pooling/empty.wat time: [2.4567 us 2.4608 us 2.4652 us] parallel/default/empty.wat: with 1 background thread time: [1.1596 us 1.1628 us 1.1666 us] parallel/default/empty.wat: with 16 background threads time: [20.918 us 21.125 us 21.307 us] parallel/pooling/empty.wat: with 1 background thread time: [2.4203 us 2.4224 us 2.4249 us] parallel/pooling/empty.wat: with 16 background threads time: [25.422 us 25.860 us 26.295 us] sequential/default/small_memory.wat time: [5.5100 us 5.5230 us 5.5361 us] sequential/pooling/small_memory.wat time: [3.0336 us 3.0356 us 3.0376 us] parallel/default/small_memory.wat: with 1 background thread time: [5.3376 us 5.3488 us 5.3614 us] parallel/default/small_memory.wat: with 16 background threads time: [95.562 us 96.023 us 96.485 us] parallel/pooling/small_memory.wat: with 1 background thread time: [3.0222 us 3.0270 us 3.0316 us] parallel/pooling/small_memory.wat: with 16 background threads time: [172.62 us 175.12 us 177.54 us] sequential/default/data_segments.wat time: [6.3065 us 6.3094 us 6.3120 us] sequential/pooling/data_segments.wat time: [2.8083 us 2.8116 us 2.8147 us] parallel/default/data_segments.wat: with 1 background thread time: [6.0155 us 6.0290 us 6.0407 us] parallel/default/data_segments.wat: with 16 background threads time: [141.99 us 142.58 us 143.27 us] parallel/pooling/data_segments.wat: with 1 background thread time: [2.7353 us 2.7533 us 2.7687 us] parallel/pooling/data_segments.wat: with 16 background threads time: [40.435 us 41.035 us 41.644 us] sequential/default/wasi.wasm time: [6.6997 us 6.7118 us 6.7264 us] sequential/pooling/wasi.wasm time: [3.7074 us 3.7099 us 3.7129 us] parallel/default/wasi.wasm: with 1 background thread time: [6.7795 us 6.7964 us 6.8131 us] parallel/default/wasi.wasm: with 16 background threads time: [154.73 us 155.36 us 156.03 us] parallel/pooling/wasi.wasm: with 1 background thread time: [3.8161 us 3.8195 us 3.8231 us] parallel/pooling/wasi.wasm: with 16 background threads time: [60.806 us 61.850 us 62.927 us] sequential/default/spidermonkey.wasm time: [15.974 us 15.983 us 15.993 us] sequential/pooling/spidermonkey.wasm time: [6.0185 us 6.0215 us 6.0248 us] parallel/default/spidermonkey.wasm: with 1 background thread time: [16.189 us 16.201 us 16.215 us] parallel/default/spidermonkey.wasm: with 16 background threads time: [165.91 us 167.16 us 168.51 us] parallel/pooling/spidermonkey.wasm: with 1 background thread time: [5.9293 us 5.9348 us 5.9403 us] parallel/pooling/spidermonkey.wasm: with 16 background threads time: [55.862 us 57.049 us 58.373 us] ================================================================================ With explicit memset: ("default"-policy was not changed, so excluding) ================================================================================ sequential/pooling/empty.wat time: [1.2088 us 1.2111 us 1.2136 us] change: [-51.445% -51.270% -51.113%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/empty.wat: with 1 background thread time: [1.1838 us 1.1853 us 1.1869 us] change: [-50.952% -50.778% -50.613%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/empty.wat: with 16 background threads time: [21.178 us 21.353 us 21.515 us] change: [-18.533% -17.405% -16.183%] (p = 0.00 < 0.05) Performance has improved. sequential/pooling/small_memory.wat time: [1.7699 us 1.7719 us 1.7740 us] change: [-41.479% -41.392% -41.293%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/small_memory.wat: with 1 background thread time: [1.7915 us 1.7930 us 1.7945 us] change: [-40.593% -40.487% -40.383%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/small_memory.wat: with 16 background threads time: [65.378 us 66.775 us 68.080 us] change: [-62.639% -61.536% -60.394%] (p = 0.00 < 0.05) Performance has improved. sequential/pooling/data_segments.wat time: [1.5276 us 1.5288 us 1.5301 us] change: [-45.622% -45.551% -45.475%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/data_segments.wat: with 1 background thread time: [1.5664 us 1.5716 us 1.5776 us] change: [-42.192% -41.806% -41.407%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/data_segments.wat: with 16 background threads time: [30.378 us 30.951 us 31.554 us] change: [-24.975% -23.479% -21.963%] (p = 0.00 < 0.05) Performance has improved. sequential/pooling/wasi.wasm time: [2.0274 us 2.0401 us 2.0519 us] change: [-46.488% -46.206% -45.904%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/wasi.wasm: with 1 background thread time: [1.9543 us 1.9559 us 1.9579 us] change: [-48.881% -48.790% -48.700%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/wasi.wasm: with 16 background threads time: [49.629 us 50.532 us 51.431 us] change: [-21.128% -19.627% -18.124%] (p = 0.00 < 0.05) Performance has improved. sequential/pooling/spidermonkey.wasm time: [12.469 us 12.556 us 12.635 us] change: [+108.02% +109.03% +110.10%] (p = 0.00 < 0.05) Performance has regressed. parallel/pooling/spidermonkey.wasm: with 1 background thread time: [12.523 us 12.562 us 12.595 us] change: [+109.29% +110.20% +111.03%] (p = 0.00 < 0.05) Performance has regressed. parallel/pooling/spidermonkey.wasm: with 16 background threads time: [254.13 us 277.81 us 304.77 us] change: [+410.73% +449.63% +496.28%] (p = 0.00 < 0.05) Performance has regressed.
So, switching to an explicit memset, we see performance improvements in all cases except the large SpiderMonkey module (31k functions in my local build), where using explicit memset is 5.49x slower (+449%).
Either way, it's clear to me that we'll feel some pain on the low end (madvise) or high end (memset), so the ultimate design probably has to be to incorporate this flash-clearing into an madvise we're already doing (single-madvise idea). I can implement that as soon as we confirm that we want to remove uffd. I guess the question is just which we settle on in the meantime :-)
cfallin commented on issue #3733:
Hmm, actually, the other option is to bring back the bitmap approach from the first versions of this PR above. I think the atomics scared everyone off but by now I've propagated
&mut self
everywhere it needs to be, so this should be pretty straightforward. Then we can always memset the bitmap, but that's relatively small and should always be cheaper than an madvise, even on SpiderMonkey...
alexcrichton commented on issue #3733:
I'm a bit perplexed with the numbers I'm seeing locally. Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is:
threads memset madvise 1 691ns 234ns 4 752ns 7995ns 8 766ns 15991ns This is what I'd expect which is that memset has a relatively constant cost where
madvise
gets worse as you add more threads since there's more IPIs. I don't understand why memset gets slightly worse when you add more threads, however. (these are all threads operating on disjoint pages of memory).So it doesn't really make sense to me that using
memset
in your benchmarks actually makes things slower rather than faster.I don't have the
spidermonkey.wasm
you're testing with but a random "big module" that I had on hand was rustpython.wasm. This has ~12k functions as a ~6k element table. I ran theinstantiation
benchmark with this PR as-is via:$ cargo bench --features memfd --bench instantiation 'parallel.*pool.*rustpy'
After I applied this patch which I believe avoids madvise for instance pages entirely and uses memset. Using the same benchmarking command the timings I got for the two approaches are:
threads madvise memset 1 16us 8us 2 113us 15us 3 174us 40us 4 234us 44us This is more in line with what I am expecting. We're seeing that madvise is atrocious for concurrent performance, so I'm highly surprised that you're seeing madvise as faster and I'm seeing memset as faster. I'm working on the arm64 server that we have but I don't think x86_64 vs aarch64 would explain this difference.
I've confirmed with
perf
that the hottest symbol in themadvise
build istlb_flush_mmu
(as expected), and the hottest symbol in thememset
build ismemset
(also as expected).Can you try to reproduce these results locally? If not I think we need to figure out what the difference is?
cfallin commented on issue #3733:
I'm working on the arm64 server that we have but I don't think x86_64 vs aarch64 would explain this difference.
Ah, I think it would actually, or more precisely the server specs will: the aarch64 machine we have is 128 cores, so an "IPI to every core" is exceedingly expensive. For reference I'm doing tests on my Ryzen 3900X, 12 cores; big but not "never do a broadcast to all cores or you will suffer" big.
It seems to me that past a certain tradeoff point, which varies based on the system, madvise is faster to flash-zero sparsely accessed memory. (In the limit, the IPI is a fixed cost, flushing the whole TLB is a fixed cost, and actually zeroing the page tables is faster than zeroing whole pages.) Perhaps on arm2-ci that point is where we're zeroing 1MB, or 10MB, or 100MB; it would be interesting to sweep your experiment in that axis.
In any case, the numbers are the numbers; I suspect if I ssh'd to the arm64 machine and ran with your wasm module, I'd replicate yours, and if you ssh'd to my workstation and ran with my wasm module, you'd replicate mine. The interesting bits are the differences in platform configuration and workload I think.
For reference, my spidermonkey.wasm (gzipped) has 31894 functions, so the anyfunc array is 765 kilobytes that need to be zeroed. Three-quarters of a megabyte! We can definitely do better than that I think. (Possibly-exported functions, as you've suggested before -- I still need to build this -- comes to 7420 functions, or 178 KB; better but still too much.)
Given all of that, I do have a proposed direction that I think will solve all of the above issues: I believe that an initialization-bitmap can help us here. If we keep a bitmap to indicate when an anyfunc is not initialized, we need only 3992 bytes (499
u64
s) for spidermonkey.wasm. This seems unambiguously better with no downsides, but please do let me know if you disagree :-) I'd be happy to split that part off into a separate PR with its own speedup measurements, leaving this PR with a memset-only approach (which should not be the final word for anyone running SpiderMonkey).
cfallin commented on issue #3733:
I was curious to quantify the difference between our systems a bit more so:
Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is:
On a 12-core system I get:
64KiB zeroing:
threads madvise memset 1 136ns 470ns 2 1.16µs 582ns 4 3.79µs 625ns 8 6µs 646ns 1 MiB zeroing:
threads madvise memset 1 363ns 10.635µs 2 1.724µs 11.438µs 4 3.857µs 11.694µs 8 5.554µs 12.308µs So on my system at least, madvise is a clear loss at. 64KiB, but is a win in all cases for 1 MiB. (Wildly enough, it becomes more of a win at higher thread counts, because the total cache footprint of all threads touching their memory exceeds my LLC size.) I haven't bisected the breakeven point between those two.
I'm also kind of flabbergasted at the cost of madvise on arm2-ci; for 64KIB, 8 threads, you see 16 µs while I see 646 ns, about 25 times faster (!).
So given that my canonical "big module" requires close to a megabyte of VMContext zeroing, and given that we're testing on systems with wildly different TLB-shootdown cost, I'm not surprised at all that we've been led to different conclusions! The variance of workloads and systems "in the field" is all the more reason to not have to zero data at all, IMHO, by using a bitmap :-)
cfallin edited a comment on issue #3733:
I was curious to quantify the difference between our systems a bit more so:
Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is:
On a 12-core system I get:
64KiB zeroing:
threads madvise memset 1 136ns 470ns 2 1.16µs 582ns 4 3.79µs 625ns 8 6µs 646ns 1 MiB zeroing:
threads madvise memset 1 363ns 10.635µs 2 1.724µs 11.438µs 4 3.857µs 11.694µs 8 5.554µs 12.308µs So on my system at least, madvise is a clear loss at. 64KiB, but is a win in all cases for 1 MiB. (Wildly enough, it becomes more of a win at higher thread counts, because the total cache footprint of all threads touching their memory exceeds my LLC size.) I haven't bisected the breakeven point between those two.
I'm also kind of flabbergasted at the cost of madvise on arm2-ci; for 64KIB, 8 threads, you see 16 µs while I see <strike>646 ns, about 25 times faster (!).</strike> EDIT: 6 µs, about 2.6x faster (table columns are hard sorry).
So given that my canonical "big module" requires close to a megabyte of VMContext zeroing, and given that we're testing on systems with wildly different TLB-shootdown cost, I'm not surprised at all that we've been led to different conclusions! The variance of workloads and systems "in the field" is all the more reason to not have to zero data at all, IMHO, by using a bitmap :-)
alexcrichton commented on issue #3733:
Hm there's still more I'd like to dig into performance-wise here, but I don't want to over-rotate and dedicate this whole thread to a few lines of code that are pretty inconsequential. Additionally I was thinking last night and concluded "why even zero at all?" For the anyfunc array I don't think there's actually any need to zero since it's not accessed in a loop really right now. For example table elements are sort of a next layer of cache so if you hammer on table elements any computation to create the anyfunc is cached at that layer. Otherwise the only other uses of anyfuncs are exports (cached inherently as you typically pull out the export and don't pull it out again-and-again) and as
ref.func
which isn't really used by modules today. "Always compute anyfunc constructors" may makeref.func
slower one day but it seems like we could cross that bridge when we get there.Effectively I think we could get away with doing nothing to the anyfunc array on instantiation. When an anyfunc is asked for, and every time it's asked for, we construct the anyfunc into the appropriate slot and return it. These aren't ever used concurrently, as we've seen, so there's no need to worry about concurrent writes and it's fine to pave over what's previously there with the same content.
In the future when we have the instance/table/memory all adjacent in the pooling allocator we may as well zero out the memory with one
madvise
since it's pretty easy to do (and we're already require to do it for memories/tables). For now though until that happens it should be fine to leave the contents undefined until it's used. We could also implement a sort of "hardening" at some point to usecalloc
to alloc aVMContext
in the on-demand allocator one day, but that doesn't need to be how it's done for now either.If that seems reasonable then I think it's fine to shelve performance things for later since there's no longer a question of how to zero since we wouldn't need to zero at all.
cfallin commented on issue #3733:
I was thinking last night and concluded "why even zero at all?" For the anyfunc array I don't think there's actually any need to zero since it's not accessed in a loop really right now. For example table elements are sort of a next layer of cache so if you hammer on table elements any computation to create the anyfunc is cached at that layer. Otherwise the only other uses of anyfuncs are exports (cached inherently as you typically pull out the export and don't pull it out again-and-again) and as
ref.func
which isn't really used by modules today. "Always compute anyfunc constructors" may makeref.func
slower one day but it seems like we could cross that bridge when we get there.Huh, that is a really interesting idea; I very much like the simplicity of it! Thanks!
I can do this, then file an issue to record the "maybe an is-initialized bitmap, or maybe madvise-zero anyfuncs along with the other bits" ideas when if/when we do eventually come across
ref.func
used in the wild in a hotpath.
cfallin commented on issue #3733:
@alexcrichton I've done all of the refactors you suggested (
wasmtime::ModuleInner
implementing runtime trait directly, no separate Arc-held thing; and moving table init precomputation to compilation inwasmtime_environ
) -- things are a lot cleaner now, thanks!Tomorrow I'll squash this down and look at the complete diff with fresh eyes, and clean it up -- undoubtedly some small diffs have snuck in from attempting and undoing bits that I'll want to remove. I'll do a final round of before/after benchmarking as well so provide a good top-level summary of this PR's effects.
cfallin commented on issue #3733:
OK, I think this is ready for the next (hopefully final? happy to keep going of course) round of review.
Here's a benchmark (in-tree benches/instantiation.rs, using above spidermonkey.wasm, memfd enabled, {1, 16} threads):
<details>
BEFORE: sequential/default/spidermonkey.wasm time: [68.569 us 68.696 us 68.856 us] sequential/pooling/spidermonkey.wasm time: [69.406 us 69.435 us 69.465 us] parallel/default/spidermonkey.wasm: with 1 background thread time: [69.444 us 69.470 us 69.497 us] parallel/default/spidermonkey.wasm: with 16 background threads time: [183.72 us 184.31 us 184.89 us] parallel/pooling/spidermonkey.wasm: with 1 background thread time: [69.018 us 69.070 us 69.136 us] parallel/pooling/spidermonkey.wasm: with 16 background threads time: [326.81 us 337.32 us 347.01 us] WITH THIS PR: sequential/default/spidermonkey.wasm time: [6.7821 us 6.8096 us 6.8397 us] change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05) Performance has improved. sequential/pooling/spidermonkey.wasm time: [3.0410 us 3.0558 us 3.0724 us] change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05) Performance has improved. parallel/default/spidermonkey.wasm: with 1 background thread time: [7.2643 us 7.2689 us 7.2735 us] change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05) Performance has improved. parallel/default/spidermonkey.wasm: with 16 background threads time: [147.36 us 148.99 us 150.74 us] change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/spidermonkey.wasm: with 1 background thread time: [3.1009 us 3.1021 us 3.1033 us] change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05) Performance has improved. parallel/pooling/spidermonkey.wasm: with 16 background threads time: [49.449 us 50.475 us 51.540 us] change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05) Performance has improved.
</details>
tl;dr: 80-95% faster, or 69µs -> 7µs (1-threaded, on-demand) / 337µs -> 50µs (16-threaded, pooling).
cfallin commented on issue #3733:
Alright, everything addressed and CI green -- time to land this. Thanks again @alexcrichton for all the feedback!
Last updated: Jan 24 2025 at 00:11 UTC