cfallin commented on issue #3733:
Benchmarking: the delta is hard to see with memory initialization taking most of the time (pre-#3699); a few percent improvement in the mean with
spidermonkey.wasm
but Criterion says not statistically significant.However, in #3697 I measured approximately a factor-of-two improvement in the
initialize_vmcontext
hotpath (which was most of instantiation, in a scenario where we have a memory technique to reuse mappings).
cfallin edited a comment on issue #3733:
Benchmarking: the delta is hard to see with memory initialization taking most of the time (pre-#3697); a few percent improvement in the mean with
spidermonkey.wasm
but Criterion says not statistically significant.However, in #3697 I measured approximately a factor-of-two improvement in the
initialize_vmcontext
hotpath (which was most of instantiation, in a scenario where we have a memory technique to reuse mappings).
github-actions[bot] commented on issue #3733:
Subscribe to Label Action
cc @peterhuene
<details>
This issue or pull request has been labeled: "wasmtime:api"Thus the following users have been cc'd because of the following labels:
- peterhuene: wasmtime:api
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
alexcrichton commented on issue #3733:
Thanks for splitting this out! I've mentioned a few things here in other contexts but I think it's good to get them all in once place. The tl;dr; is that I don't personally feel like this is the best approach, but I think we can still get the same wins.
Due to a number of implementation details I don't think that this PR is really getting all that much of a benefit from laziness. The anyfunc array is used as storage for a module's
*mut VMCallerCheckedAnyfunc
which is used as the representation of afuncref
. This is subsequently used by things likeref.func
, tables, and exports. Today the anyfunc array is the size of the module's index space of functions, but this is actually much larger than necessary because we only use it for those items, and not all functions make their way intoref.func
, tables, or exports. This PR also doesn't lazily initialize tables, only the anyfunc array, which means we still initialize all anyfuncs used as part of tables.Adding all that together we still pay the full cost of anyfunc initialization for all table entries when a module is instantiated. The cost of initializing exports is done lazily and most modules nowadays don't have
ref.func
instructions so there's not much cost there. Overall this means that this PR is improving the amount of work done, but I think it's doing it in a bit of a roundabout way. Effectively we're immediately initializing a static subset of the anyfunc array on initialization and we're never actually initializing anything else after that (modulo exports but that's such a small set I don't think it really matters all that much).Given all that personally think that a better implementation would be to avoid the laziness complications and instead just go ahead and keep doing what we do today, only over the same set of functions that this PR is doing it for. Basically I think we should shrink the anyfunc array to the size of the
escaped_funcs
array. That I think means that the same magnitude of work would be done on instantiation and there wouldn't be any need to deal with laziness anywhere.
As a few other side things:
- For "benchmarking" this I think it might be good to take a module with 1k functions and a table with 100 of those functions and instantiate that. That should get memory out of the picture and should help focus on just the time it takes to deal with all the anyfuncs.
- For the race comment, I'm under the impression that because the writes are non-atomic that any concurrent access to
get_caller_checked_anyfunc
would result in a data race. I don't think we have concurrent accesses today though so this is probably fine, but if we were to stick with this PR I'd prefer to either makeget_caller_checked_anyfunc
unsafe or make it take&mut self
.
Ok now all that being said, one thing you mentioned the other day which I'm coming around to is that long-term I think it would be best to lazily initialize tables instead of just this anyfunc array. With CoW and uffd we're already effectively lazily initializing memory (just relying on the kernel to do so), and I think it would be best to do that for tables as well. I don't really know how we'd do this off the top of my head though.
Alternatively we could try to figure out how to make the table initialization much more "static" where work is done at compile time instead of instantiation time. I had one idea over here but that's sort of half-baked and only shrinks
VMCallerCheckedAnyfunc
, it doesn't offload too much to compile time. Overall doing something at compile-time I think would also be better than the lazy approach when we can.
cfallin commented on issue #3733:
@alexcrichton thanks for your thoughts here!
Top-level: I agree that in practice, eagerly initializing a smaller array-of-anyfuncs which is only "anyfuncs referenced in tables or by ref.func" is more-or-less equivalent to this PR's lazy approach. (In the presence of many
ref.func
s referring to functions that are not in tables, this PR should still win a bit, I think.)The main attraction I had to this approach was simplicity and small impact -- indexing via
escaped_funcs
indices is definitely workable too but implies a bit more indirection.The other thing about a pervasively lazy approach is that it paves the way for more gains later, as I think you're getting at with your second-to-last paragraph. If we're lazy about requesting anyfunc pointers (ie not initializing table elements until accessed), then the laziness in filling in the pointed-to anyfuncs finally pays off. And I think the change to get to this point is pretty simple -- just a null check when loading from the table and a slowpath that calls
get_caller_checked_anyfunc
from inside a libcall, then initializes the table with the result.Of course we could do both, and that has benefits too -- both shrinking the size of the vmcontext, and maybe doing things eagerly at first, does not preclude us from lazily initializing later...
A specific note too:
For the race comment, I'm under the impression that because the writes are non-atomic that any concurrent access to get_caller_checked_anyfunc would result in a data race.
Yes, exactly, the goal is to take advantage of a benign race (and come out with correct results regardless). Maybe to make the compiler happy we need to do this all through atomics and
Ordering::Relaxed
, as long as the bitmap update compiles down to something without thelock
prefix on x86-64; thelock or
was the major hotspot in my tests otherwise.The comment is attempting to give a safety argument why this is still OK in the presence of concurrent accesses; we should do whatever we need to at the language semantics level but I think I've convinced myself that the resulting machine code / access pattern is sound :-)
cfallin commented on issue #3733:
@alexcrichton I've reformulated the benign-racy init path now to use atomics for the bitmap accesses at least. There is still a (very intentional!) race, in that the bit-set is not done with an atomic
fetch_or
but rather by just storing the loaded value with the new bit set, and no compare-exchange or anything like that. This is to avoid any pipeline serializations, bus locking, etc in the hot-path. The comment explains why this is safe; the tl;dr is that it depends on the fact that all initializations write the same values (idempotent writes) so spurious re-initializations are acceptable.
alexcrichton commented on issue #3733:
I would personally not say that this PR is simple with a small impact, due to a few consequences:
- There's a new
InstanceAllocationInfo
structure to juggle which should theoretically be managed with other module-level resources likeArc<Module>
but isn't. This means while it logically has the same lifetime as other metadata it practically doesn't because it was developed at a different time.- This inflates the size of
VMContext
with extra flag bits for whether types are initialized.- This, as currently written, tries to deal with a race that never actually happens in practice. I don't believe that the
get_caller_checked_anyfunc
function is ever actually called concurrently (in that Wasmtime's APIs I don't think provide any means of doing so) so the extra synchronization/comments aren't actually necessary.Overall I continue to feel that this is a fair bit of complication, especially with
InstanceAllocationInfo
, for what ends up being not a huge benefit. This current lazy strategy really isn't all that lazy, it basically immediately initializes everything used for tables/exports and then never initializes anything else in the anyfunc array, so the laziness is never actually leveraged (it just makes initial initialization slower since the flags are all checked and written).I'm less certain about the idea of paving the way for gains later as well. I understand how this is sort of theoretically applicable but we're talking about nanoseconds-per-element "laziness" which feels like too small of a granularity to manage to me. If we get around to implementing a fully-lazily-initialized table approach (which I'm not entirely sure how we'd do that) it seems like it might be at a more bulk-level instead of at a per-element level. I would also expect that whatever laziness mechanism that would use would subsume this one since there's not really much need for cascading laziness.
Naively I would expect that using
escaped_funcs
would be workable to implement. It seems like we'd assign entries inescaped_funcs
an index (which would probably be a side table stored inModule
) and thenref.func
would use this index to get to the anyfunc array as well as table element initialization going through this (perhaps pre-computing something else inModule
to avoid too many levels of array acceses during initialization).I've reformulated the benign-racy init path now to use atomics for the bitmap accesses at least
I believe this is still a theoretical data race that Rust disallows. Despite the machine code working just fine at the LLVM layer any concurrent unsynchronized writes are a data race and undefined behavior, so any concurrent writes would have to be atomic. (again though I don't think it's worthwhile trying to fix this because I don't think this overall laziness approach is what we want either)
cfallin commented on issue #3733:
@alexcrichton thanks for your comments!
I'm less certain about the idea of paving the way for gains later as well. I understand how this is sort of theoretically applicable but we're talking about nanoseconds-per-element "laziness" which feels like too small of a granularity to manage to me. If we get around to implementing a fully-lazily-initialized table approach (which I'm not entirely sure how we'd do that) it seems like it might be at a more bulk-level instead of at a per-element level. I would also expect that whatever laziness mechanism that would use would subsume this one since there's not really much need for cascading laziness.
So, a few things:
I think the per-element lazy initialization is going to be an important part of any lazy table init. As I think @fitzgen mentioned earlier as well, there's not necessarily any spatial correlation between uses of adjacent table elements. So imagine I have 1000 funcs in a table, and I pull out index 553 to return to you, and never touch anything else. Do we initialize the whole table at that point? Or a block of 100 surrounding the accessed index? The larger the "bulk init", the greater the waste -- accessing index 553 probably doesn't mean I'm more likely to access 554 or 552 in the future than any other index.
Given that, I think that "cascading laziness" is not really the right way to characterize this change, plus a lazy-table-init change on top. It's more like this change is a necessary implementation piece of the latter. If the top layer is lazy (at the whole-table, or table-chunk, or individual-element level) but we don't have a way of avoiding setting up the anyfuncs the top layer points to, then we haven't carried through the work savings.
Re: "I'm not entirely sure how we'd do that" (lazy table init), do you have any thoughts on the scheme I proposed, with a null check on each table.get that branches to a slowpath libcall?
Reducing to the list of escaped-funcs only in the anyfuncs table is definitely possible, and I'm happy to go that way if you'd prefer for now. But re: above, I think it doesn't really do anything to prepare us for a lazy-table-init world. This change was designed to sort of push us in that direction, and avoid the need for side-tables and indirections.
Otherwise, simplicity is subjective, I guess :-) The factoring of state here is I think going to be necessary for most approaches that initialize this general state after instantiation -- we probably can't get away from having e.g. signature info at libcall time.
I won't argue the concurrency bit here is complex -- maybe the better answer is to put together an actual lazy-table-init implementation, and then get rid of the initialization bitmap and go back to the zero-func-ptr-means-uninitialized design I had earlier.
Maybe it would help if I "drew the rest of the owl" here and actually sketched lazy table initialization as well? This would perhaps make the picture a bit clearer how all of this could work together, at least as I'm imagining it.
alexcrichton commented on issue #3733:
Ah so my thinking of a more bulk-style initialization is me being inspired by what we're doing for memories where we fault in by page rather than all at once. I'm sort of naively assuming that some level of granularity would also be similar on tables, but byte access patterns may not really have any correlation to wasm-table-access-patterns so it may just be a flawed assumption on my part.
Also I think I understand now more how something like this would be required for full laziness. I'm still somewhat skeptical that it would need to look precisely like this but I see how my thinking of cascading laziness being unnecessary is flawed.
And finally sorry I meant to respond to the table point but I forgot! That does sound workable but I think we still need to be careful. For example if an element segment initializes an imported table I don't think that we can skip that. Additionally we'd have to be careful to implement
Table::get
properly in the embedding API for lazily-initialized tables (I don't think that all the access to do the lazy initialization is trivially there today).
I think that my preference at this point is to largely let performance guide us. The performance win in this PR is to not initialize anyfuncs for everything that never needs an anyfunc. There's still some unrealized performance wins though:
- Fully lazily initializating anyfuncs (requires lazy tables since table init today force-initializes the anyfuncs)
- Shrinking the vmcontext to only have anyfuncs for what's necessary
Given that this is all peformance work effectively I think I'd prefer to be guided by numbers. I would be curious to benchmark this implementation here vs one that shrinks the anyfunc array like I've been thinking. The performance difference may come out in the wash but given how small the numbers are here I could imagine it being significant. I think we could then use a similar benchmark for measuring a fully-lazy-table approach.
Overall I'm still hesitant to commit to design work which will largely only come to fruition in the future when the future is still somewhat uncertain (e.g. the precise design of a fully-lazy-table approach). In that sense I'd prefer to get the win we'll always want in any case, shrinking the VMContext, here and work on the laziness afterwards. (or doing it all at once is fine, but I'd prefer to not be in an inbetween state where we still have a big VMContext and some of it is initialized at instantiation time)
cfallin commented on issue #3733:
I believe this is still a theoretical data race that Rust disallows. Despite the machine code working just fine at the LLVM layer any concurrent unsynchronized writes are a data race and undefined behavior, so any concurrent writes would have to be atomic. (again though I don't think it's worthwhile trying to fix this because I don't think this overall laziness approach is what we want either)
I was almost done with my Friday, then a thought occurred to me -- we can just make the anyfunc fields atomics as well and do relaxed-ordering loads/stores. This satisfies the language's memory model; now all racing accesses are to atomics, and there is a Release-to-Acquire edge on the bitmap update that ensures the relaxed stores during init are seen by relaxed loads during normal use. This part at least should be fully theoretically sound now, I think.
I'll do the rest of the lazy-table init as it is in my head on Monday, just to sketch it and be able to measure, I think. This work will be compatible with / complementary to any work that reduces the number of anyfuncs as well...
alexcrichton commented on issue #3733:
Er sorry but to reiterate again, I do not think anything _should_ be atomic here. This method is statically not possible to call concurrently (or so I believe) so I do not think we should be "working around" the compiler in this case. Instead I think the method should change to
&mut self
or otherwise beunsafe
and indicate that callers must be&mut self
or otherwise prevent concurrent calls.
cfallin commented on issue #3733:
Ah, OK; I had been assuming that concurrent calls actually are possible via the "get an export" path. Within
wasmtime_runtime
at least, that's all via&self
methods. But I see that at thewasmtime
layer this takes a mut Store (orAsContextMut
or whatnot) so we actually do have mutual exclusion. I agree that given this, just using "unsafe" is way simpler.
Last updated: Jan 24 2025 at 00:11 UTC