fitzgen opened Issue #2639:
Right now, a 1000ft overview of our instantiation process (ignoring creating import-able functions in the linker, which shared host functions addresses) looks something like this:
1. look up imports in linker's hash table and flatten them to an array
2. allocate space for memories/tables/globals/etc
3. fill the vmctx with pointers to the imports, etc
4. initialize globals by interpreting global initializers from the wasm module
5. initialize tables by interpreting element initializer segments from the wasm module
6. initialize memory by interpreting data initializer segments from the wasm module
The instance allocator pool work makes (2) super fast! :check_mark:️
Although steps (4) through (6) should usually take relatively little time, we can make them even faster by using Cranelift to compile an initialization function. Instead of running an interpreter loop that iterates over each initializer and checks that it is in bounds, we would emit code with that loop unrolled and with the per-iteration bounds checks de-duplicated into a single check for everything. Then we just call this JIT code during instantiation, rather than initializing these things ourselves!
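To make the idea concrete, here is a minimal sketch in plain Rust of the two shapes of initialization for data segments. It is not Wasmtime or Cranelift code: `DataSegment`, `OutOfBounds`, `example_segments`, `init_interpreted`, and `init_specialized` are all hypothetical names, and `init_specialized` is hand-written to stand in for what a compiler could emit for one specific module.

```rust
/// A data segment: copy `bytes` into memory starting at `offset`.
/// Hypothetical type for illustration, not Wasmtime's representation.
pub struct DataSegment {
    pub offset: usize,
    pub bytes: &'static [u8],
}

#[derive(Debug, PartialEq)]
pub struct OutOfBounds;

/// The segments of one imaginary module.
pub fn example_segments() -> Vec<DataSegment> {
    vec![
        DataSegment { offset: 0, bytes: b"wasm" },
        DataSegment { offset: 16, bytes: &[1, 2, 3, 4] },
    ]
}

/// Interpreter-style: loop over the segments and bounds-check each one.
pub fn init_interpreted(
    memory: &mut [u8],
    segments: &[DataSegment],
) -> Result<(), OutOfBounds> {
    for seg in segments {
        let end = seg.offset.checked_add(seg.bytes.len()).ok_or(OutOfBounds)?;
        let dst = memory.get_mut(seg.offset..end).ok_or(OutOfBounds)?;
        dst.copy_from_slice(seg.bytes);
    }
    Ok(())
}

/// Specialized for `example_segments()`: the loop is unrolled and the
/// largest end offset is a constant, so one up-front check covers every copy.
pub fn init_specialized(memory: &mut [u8]) -> Result<(), OutOfBounds> {
    const MAX_END: usize = 20; // max over all segments of offset + len
    if memory.len() < MAX_END {
        return Err(OutOfBounds);
    }
    memory[0..4].copy_from_slice(b"wasm");
    memory[16..20].copy_from_slice(&[1, 2, 3, 4]);
    Ok(())
}
```

Safe Rust still performs its own checks on the slice indexing in `init_specialized`; real JIT code would emit raw stores after the single check. Note also that the two functions differ when a later segment is out of bounds: the interpreted version has already written the earlier segments, while the specialized one writes nothing.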
Of course the amount of speed up we'll get by doing this is going to be a function of how many global/table element/data segment initializers a module has. Usually it isn't too many. But some modules, particularly those generated by Wizer, might have a good amount of them, and this could potentially save us a few microseconds on instantiation (great to be at the level where we are counting microseconds here :smiley:). Also, funcref tables can get pretty big and generally every index is initialized with an element.
We could potentially also JIT code for (3) but this seems slightly more complicated because it is more heterogeneous and also the vmctx fields/layout change more frequently than globals/tables/memory so it may have a larger maintenance burden.
Finally, we've talked about using virtual memory tricks to make page-aligned and -sized data segments
- lazily initialized (via userfaultfd) and
- copy-on-write (via mapping them with MAP_PRIVATE)
This JIT-initialization approach should technically be complementary to these things. Even if (6) effectively goes away from our instantiation times by becoming lazy, (4) and (5) will still need initializing at instantiation time, as will any non-page-aligned and -sized data segments. But it might make the potential speed ups that much smaller, and bump this optimization pretty far down the priority list. Something to consider.
Aside: it is worth thinking about speeding up (1) as well. If we are repeatedly instantiating the same module with the same imports (e.g. once for each HTTP request that a server receives) then it seems like we could do (1) just the one time and then reuse the flattened imports array for every instantiation. Not totally sure what this would look like at the API level. I think it might be possible to implement without Wasmtime API changes, but maybe we don't want to force everyone to implement this same optimization by hand? Another thing to mull over.
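One possible shape for this, sketched in plain Rust under heavy assumptions: `Extern`, `Linker`, `PreResolved`, and `Instance` below are simplified hypothetical stand-ins, not Wasmtime's types. The hash lookups of step (1) happen once in `PreResolved::new`, and every later `instantiate` call just reuses the flattened array.

```rust
use std::collections::HashMap;

/// Stand-in for a resolved import (hypothetical; a real one would hold
/// a function, memory, table, or global).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Extern(pub usize);

/// Toy linker: maps (module, name) pairs to definitions.
pub struct Linker {
    defs: HashMap<(String, String), Extern>,
}

impl Linker {
    pub fn new() -> Self {
        Linker { defs: HashMap::new() }
    }

    pub fn define(&mut self, module: &str, name: &str, item: Extern) {
        self.defs.insert((module.to_string(), name.to_string()), item);
    }

    /// Step (1): look up each import and flatten the results to an array.
    /// Returns `None` if any import is undefined.
    pub fn resolve(&self, imports: &[(&str, &str)]) -> Option<Vec<Extern>> {
        imports
            .iter()
            .map(|(m, n)| self.defs.get(&(m.to_string(), n.to_string())).copied())
            .collect()
    }
}

/// The flattened imports, computed once and reused across instantiations.
pub struct PreResolved {
    imports: Vec<Extern>,
}

/// Toy instance that only borrows the shared imports array.
pub struct Instance<'a> {
    pub imports: &'a [Extern],
}

impl PreResolved {
    pub fn new(linker: &Linker, imports: &[(&str, &str)]) -> Option<Self> {
        Some(PreResolved { imports: linker.resolve(imports)? })
    }

    /// No hash lookups here: steps (2) onward would start from `imports`.
    pub fn instantiate(&self) -> Instance<'_> {
        Instance { imports: &self.imports }
    }
}
```

A server would build one `PreResolved` at startup and call `instantiate` per request. The sketch assumes the linker's definitions do not change afterwards; a real design would have to decide what happens if they do.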
cc @tschneidereit @lukewagner @alexcrichton since we talked about this yesterday
cc @peterhuene because this is related to instantiation performance
fitzgen labeled Issue #2639:
pchickey commented on Issue #2639:
This is a very neat idea. I'll note that Lucet ships page-aligned and sized data segments in the shared object because the userfaultfd memory manager needs it.
cfallin commented on Issue #2639:
A useful question might be: how densely or sparsely used are tables of imported functions? In other words, does it make sense to consider a design where we have a bit more indirection, and lazily resolve a real function pointer (something like a PLT/GOT in a traditional linker world)? This is coming from "the fastest initialization is no initialization at all"-type thoughts; no idea if the data would actually support it though!
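As a rough illustration of the lazy-resolution idea, here is a minimal sketch in plain Rust. `LazyImports` and its `resolver` callback are hypothetical names invented for this example, and a `usize` stands in for a function pointer. Each slot is resolved on first use and cached afterwards, loosely in the spirit of lazy PLT/GOT binding.

```rust
use std::cell::{Cell, OnceCell};

/// A table of import slots that are resolved on first use.
/// Hypothetical type; the `usize` values stand in for function pointers.
pub struct LazyImports<F: Fn(usize) -> usize> {
    slots: Vec<OnceCell<usize>>,
    resolver: F,
    resolutions: Cell<usize>, // counts how often the resolver actually ran
}

impl<F: Fn(usize) -> usize> LazyImports<F> {
    /// Creating the table performs no resolution work at all.
    pub fn new(count: usize, resolver: F) -> Self {
        LazyImports {
            slots: (0..count).map(|_| OnceCell::new()).collect(),
            resolver,
            resolutions: Cell::new(0),
        }
    }

    /// Returns the resolved value for `index`, resolving it on first access
    /// and reusing the cached value on every later access.
    pub fn get(&self, index: usize) -> usize {
        *self.slots[index].get_or_init(|| {
            self.resolutions.set(self.resolutions.get() + 1);
            (self.resolver)(index)
        })
    }

    pub fn resolved_count(&self) -> usize {
        self.resolutions.get()
    }
}
```

With a sparsely used table, only the touched slots ever pay for resolution; the cost is the extra check on each access, which is the indirection the comment above mentions. Whether that trade is worthwhile depends on the density data the comment asks about.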
Last updated: Jan 24 2025 at 00:11 UTC