alexcrichton opened PR #5661 from lru to main:
This commit is a reimplementation of the strategy by which the pooling instance allocator selects a slot for a module. Previously there was a choice amongst three different algorithms: "reuse affinity", "next available", and "random". The default was "reuse affinity" but some new data has come to light which shows that this may not always be a good default.
Notably the pooling instance allocator retains some memory per slot, for example instance data or memory data if so configured. This means that a currently unused, but previously used, slot can contribute to the RSS usage of a program using Wasmtime. Consequently the RSS impact here is O(max slots), which can be counter-intuitive for embedders. This particularly affects "reuse affinity" because the algorithm for picking a slot when there are no affine slots is "pick a random slot", which means that eventually all slots get used.
In discussions about possible ways to tackle this, an alternative to "pick a strategy" arose and is now implemented in this commit. Concretely the new allocation algorithm for a slot is now:
- First pick the most recently used affine slot, if one exists.
- Otherwise if the number of affine slots to other modules is above some threshold N then pick the least-recently used affine slot.
- Otherwise pick a slot that's affine to nothing.
The "N" in this algorithm is configurable and setting it to 0 is the same as the old "next available" strategy while setting it to infinity is the same as the "reuse affinity" algorithm. Setting it to something in the middle provides a knob to allow a modest "cache" of affine slots while not allowing the total set of slots used to grow too much beyond the maximal concurrent set of modules. The "random" strategy is now no longer possible and was removed to help simplify the allocator.
alexcrichton updated PR #5661 from lru to main.
alexcrichton requested cfallin for a review on PR #5661.
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
This is just the number of slots, right? Might be slightly clearer named as `num_slots` (otherwise the name could imply that the range of cold indices considered is a subset of the whole range), or maybe `max_instances` to match the constructor's arg name below.
cfallin created PR review comment:
Can we name this `affine_list_link` and the below `warm_list_link`? Otherwise the names are easy to confuse IMHO (and e.g. `unused` I would expect to be an unused padding field). Also `warm` in `warm_list` to match `Inner::warm` above.
cfallin created PR review comment:
I started off wondering if this should be `<=` instead, and eventually convinced myself that this is correct, but for a pretty subtle reason that I think should be documented (if it's right).

My initial uncertainty came because we aren't creating new unused warm slots here (rather, we're creating a used slot, by allocating it). So if our unused-warm count is at the maximum (the `=` case), we don't violate our promise by taking a new cold slot.

The subtle bit, though, is that once this slot is eventually terminated, we're already committed to having a new unused warm slot, unless we actively reclaim down to the limit (which would take more plumbing). So really we're conservatively accounting for that later effect.

Could we add a (short) comment to this effect?
cfallin created PR review comment:
Similarly, I had initially thought here that `pick_warm` should always succeed so the `pick_cold` should probably be omitted in lieu of a panic (or at least a debug_assert); but this is not true in the case that `max_unused_warm_slots` is 0 and we're starting out with `unused_warm_slots` at 0 too. Because of the `<` rather than `<=` (see above) we'll come here even though we have no warm slots. So this is correct as written, but we should comment why.
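(For context, a rough, self-contained sketch of the shape of the branch the two comments above discuss. The names `pick_warm`, `pick_cold`, `unused_warm_slots`, and `max_unused_warm_slots` come from the comments themselves; everything else is invented for the example and need not match the real allocator.)

```rust
struct Inner {
    unused_warm_slots: usize,
    max_unused_warm_slots: usize,
    warm: Vec<usize>, // unused slots still affine to some module, in LRU order
    cold: Vec<usize>, // unused slots affine to nothing
}

impl Inner {
    fn pick_warm(&mut self) -> Option<usize> {
        self.warm.pop()
    }

    fn pick_cold(&mut self) -> Option<usize> {
        self.cold.pop()
    }

    fn pick(&mut self) -> Option<usize> {
        if self.unused_warm_slots < self.max_unused_warm_slots {
            // Strictly `<` rather than `<=`: at `==` taking a cold slot does
            // not violate the limit right now, but once this instance is
            // later deallocated the slot becomes one more unused warm slot,
            // so `<` conservatively accounts for that future effect.
            self.pick_cold()
        } else {
            // `pick_warm` is not guaranteed to succeed: with both counters at
            // 0 the `<` above sends us here even though no warm slots exist,
            // so fall back to a cold slot.
            self.pick_warm().or_else(|| self.pick_cold())
        }
    }
}
```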
cfallin created PR review comment:
Can we debug-assert here that `unused_warm` is correct, since we're collecting the linked list into a `Vec` anyway?
cfallin created PR review comment:
`debug_assert!(max_unused_warm_slots <= max_instances)`?
cfallin created PR review comment:
Tiny efficiency tweak, but should we monomorphize here rather than take a fn ptr? It's a small function and potentially hot in high-allocation scenarios...
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah good point! This can actually be removed entirely in favor of just looking at the length of `slot_state`.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Adding this assertion would require adding validation to the `PoolingAllocatorConfig` in `wasmtime` as well to provide a better error message than tripping the assertion. Thinking about that though, I think it may not be worth it since it's not really a problem if `max_unused_warm_slots` is bigger than the number of slots. It's a bit silly, but it can also perhaps be helpful to always pass a large value here to say "always keep everything warm".
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah, so full disclosure: I didn't think about `<` or `<=` at all here. I wrote one down, it passed tests, and I never looked back.

To be clear though, `max_unused_warm_slots` isn't a hard contract or anything, it's more of a guideline than anything else. If the max warm slots is set to 5, for example, but you create 10 instances concurrently, then when those 10 instances are deallocated there will be 10 warm slots.

In that sense I don't think that the `<` or `<=` really matters here since off-by-one is expected to happen naturally when the concurrency limit goes high.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Personally I think we should defer to a future profiler, if necessary. My guess is that LLVM is already devirtualizing, and if not the mutex lock/unlock is likely much more expensive.
In general, I personally think Rust doesn't rely enough on devirtualization and relies much too heavily on monomorphization, leading to numerous compile-time issues.
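(As a side note for readers, the tradeoff being weighed in this exchange is roughly the following; this is a generic Rust illustration, not code from the PR.)

```rust
// Two equivalent ways to accept a callback, illustrating the tradeoff above.

// Function pointer: a single compiled copy, at the cost of a (potentially
// devirtualized) indirect call.
fn apply_fn_ptr(f: fn(usize) -> usize, x: usize) -> usize {
    f(x)
}

// Generic parameter: monomorphized per caller, so the call can always be
// inlined, at the cost of extra codegen and compile time.
fn apply_generic<F: Fn(usize) -> usize>(f: F, x: usize) -> usize {
    f(x)
}

fn main() {
    let double = |x: usize| x * 2;
    assert_eq!(apply_fn_ptr(double, 4), 8); // non-capturing closure coerces to a fn pointer
    assert_eq!(apply_generic(double, 4), 8);
}
```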
alexcrichton updated PR #5661 from lru to main.
alexcrichton requested cfallin for a review on PR #5661.
cfallin submitted PR review.
cfallin created PR review comment:
Yep, that's fine!
cfallin submitted PR review.
cfallin created PR review comment:
Yeah, that's a good point, we don't do reclamation so it's really heuristic rather than a true invariant. Seems fine then.
cfallin submitted PR review.
alexcrichton merged PR #5661.