jameysharp added the enhancement label to Issue #8195.
jameysharp added the wasmtime label to Issue #8195.
jameysharp opened issue #8195:
Just to add more complexity to the discussions in #8002 and #8160:
Feature
Current WebAssembly toolchains represent source-language function pointers by constructing a single WebAssembly funcref table, initializing it with an element section, and never writing to it again. That pattern looks something like the following. (Note that the examples I've looked at do not export the table, although we could potentially handle a variation which does as well.)
```wat
(module
  (type $fn (func (result i32)))
  (table $fnptrs 2 2 funcref)
  (func $callee (result i32)
    i32.const 42)
  (func $caller (param i32) (result i32)
    local.get 0
    call_indirect $fnptrs (type $fn))
  (elem $fnptrs (i32.const 1) func $callee)
)
```
I would like Wasmtime to recognize this pattern and get as close as possible to the `call_indirect` being just a bounds-check, type-check, and native indirect function call. I would also like to place these pre-initialized function tables in read-only data in the compiled module.

Benefit
Currently, Wasmtime generally compiles a WebAssembly `call` instruction to a native call instruction, so there's almost no sandboxing overhead to direct function calls. But indirect calls through a function pointer incur a pretty heavy penalty with our current codegen, due primarily to lazy table initialization, and also because it impedes various optimizations.

As far as I understand, this is the primary use of tables in WebAssembly today, so optimizing this case should have a significant impact.
If we can place the table contents in read-only data, that eliminates table-initialization cost both at instantiation and at runtime.
And because the loads of that data are non-trapping and from read-only memory, they can participate in GVN and LICM optimizations. So for a `call_indirect` in a loop on the same index, we would hoist everything except a conditional branch and the call instruction itself out of the hot path of the loop. We can already hoist most of the bounds check and address computation now, but loading the base pointer and the `VMFuncRef` fields can't be hoisted today.
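To make the benefit concrete, here is a host-level Rust sketch of the shape a loop could reach after hoisting; the table layout, type IDs, and functions are hypothetical stand-ins, not Wasmtime internals:

```rust
// Hypothetical caller-checked type IDs and a read-only function table;
// none of these names are Wasmtime internals.
const TYPE_FN_I32: u32 = 1;

#[derive(Clone, Copy)]
struct Entry {
    type_id: u32,
    func: fn() -> i32,
}

fn null_stub() -> i32 {
    unreachable!("null funcref is never called")
}

fn forty_two() -> i32 {
    42
}

// Immutable data, so loading an entry is non-trapping and loop-invariant.
static TABLE: [Entry; 2] = [
    Entry { type_id: 0, func: null_stub },
    Entry { type_id: TYPE_FN_I32, func: forty_two },
];

fn hot_loop(idx: usize, iters: u32) -> i32 {
    // Hoisted out of the loop: bounds check, entry load, address math.
    let entry = TABLE[idx];
    let mut acc = 0;
    for _ in 0..iters {
        // Left in the loop: one conditional branch plus the call itself.
        assert_eq!(entry.type_id, TYPE_FN_I32, "bad signature");
        acc += (entry.func)();
    }
    acc
}

fn main() {
    assert_eq!(hot_loop(1, 3), 126);
}
```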
Implementation

Assume we know a funcref table is only written to using active element segments. That is, never with `table.set` or similar instructions, and never modified by the host. (I propose defining a custom section with optimization hints, where the runtime may ignore the hints, but if it supports them then it enables additional validation rules which prohibit `table.set` and the like.)

Given that constraint, the only funcrefs which may be present in such a table are either `ref.null`, a `ref.func` with a declared or imported function, or a `global.get` from an imported global, according to the validation rules for Constant Expressions.

For declared functions, the callee's vmctx argument is the vmctx for the instance where the table is declared, which means we don't need to store it alongside the function pointer. (If the table is exported and might be used from a different context, the exported representation can be the table base and size, plus the declaring instance's vmctx.) This is the most common case, so ensuring these intra-module indirect calls are fast is important.
For imported functions, we know where to look in the declaring module's vmctx to find the callee's function pointer and vmctx. So we can emit a trampoline which loads those two pointers from fixed offsets in the caller's vmctx and tail-calls the real callee.
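As a rough illustration of what that trampoline does, modeled in plain Rust with hypothetical vmctx layouts (the real thing would be generated machine code ending in a tail call):

```rust
// Hypothetical callee-side instance state.
struct CalleeVmctx;

// What the caller's vmctx records for each imported function: the callee's
// entry point and the callee's own vmctx, at a compile-time-known offset.
struct FunctionImport {
    entry: fn(&CalleeVmctx, i32) -> i32,
    vmctx: &'static CalleeVmctx,
}

struct CallerVmctx {
    imports: Vec<FunctionImport>,
}

static CALLEE_VMCTX: CalleeVmctx = CalleeVmctx;

fn imported_add_one(_vmctx: &CalleeVmctx, x: i32) -> i32 {
    x + 1
}

// Trampoline for import slot 0: load two pointers from fixed offsets in
// the caller's vmctx, then forward (in generated code: tail-call).
fn trampoline_import0(caller: &CallerVmctx, arg: i32) -> i32 {
    let import = &caller.imports[0];
    (import.entry)(import.vmctx, arg)
}

fn main() {
    let caller = CallerVmctx {
        imports: vec![FunctionImport {
            entry: imported_add_one,
            vmctx: &CALLEE_VMCTX,
        }],
    };
    assert_eq!(trampoline_import0(&caller, 41), 42);
}
```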
In both cases we know the type of the callee, so we can build an array of (type, pointer) pairs. Both values are constants, so this array can be stored in read-only data, except that the function pointers need relocation. To avoid relocation, we can store the function pointers as offsets relative to the location of the array; before calling through the pointer, add the appropriate base. That also means we can compress the table, using 32-bit offsets even on 64-bit targets. (I believe on x86 this addition can even fold into the addressing mode of the indirect call instruction.)
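A sketch of that entry layout in Rust, with hypothetical names; the point is that each entry is pure constant data, and recovering the absolute callee address costs one add:

```rust
// One read-only table entry: 8 bytes even on 64-bit targets, and no
// absolute addresses, so it needs no runtime relocation.
#[repr(C)]
#[derive(Clone, Copy)]
struct TableEntry {
    type_id: u32,  // caller-checked signature ID; 0 reserved for null
    func_off: i32, // callee address minus the address of the array itself
}

// Recovering the callee: a single add, which on x86 can fold into the
// addressing mode of the indirect call.
fn callee_addr(array_base: usize, entry: TableEntry) -> usize {
    array_base.wrapping_add(entry.func_off as isize as usize)
}

fn main() {
    let e = TableEntry { type_id: 1, func_off: -0x1000 };
    // If the array lives at 0x2000_0000, this callee is 0x1000 before it.
    assert_eq!(callee_addr(0x2000_0000, e), 0x1FFF_F000);
}
```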
With the array being in rodata alongside the compiled code, references to that array can be relocated ahead of time with PC-relative offsets. So in the sequence for `call_indirect`, we don't need any runtime relocations or any loads from vmctx. Implementing this requires adding support in CLIF for read-only data, including relocations for functions emitted at the same time.

We can handle null funcrefs by making the type check serve two purposes. A null funcref should never match the type of any call, so once the type check fails, we just need to figure out whether we're trapping due to a bad signature or a null indirect call. If the type ID for null is reserved as 0 and the callee's type is loaded in `v1`, then this is just `trapz v1, icall_null; trap bad_sig`.
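In host-level Rust, the check that sequence performs looks roughly like this; the `Trap` enum and type IDs are illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Trap {
    IndirectCallToNull,
    BadSignature,
}

const TYPE_NULL: u32 = 0; // reserved: no real signature gets ID 0

fn check_callee_type(entry_type: u32, expected: u32) -> Result<(), Trap> {
    if entry_type == expected {
        return Ok(()); // hot path: a single compare-and-branch
    }
    // Cold path: the one check failed; only now decide which of the two
    // traps to report, mirroring `trapz v1, icall_null; trap bad_sig`.
    if entry_type == TYPE_NULL {
        Err(Trap::IndirectCallToNull)
    } else {
        Err(Trap::BadSignature)
    }
}

fn main() {
    assert_eq!(check_callee_type(7, 7), Ok(()));
    assert_eq!(check_callee_type(0, 7), Err(Trap::IndirectCallToNull));
    assert_eq!(check_callee_type(3, 7), Err(Trap::BadSignature));
}
```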
The `global.get` case is complicated because we don't know the type of the imported funcref at compile time. Fortunately, that case doesn't seem to be used by clang or rustc right now, so we could just disable this optimization if it happens. If it becomes important, we could switch to callee-checked types; see alternatives below.

If `table.get` is ever used on a table which has been optimized this way, we may need to synthesize a `VMFuncRef` from the compressed representation. We might want an additional optimization hint that a table is only used for `call_indirect`, or we could lazily build up a table in the common representation whenever `table.get` is used.
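To illustrate the lazy option, here is a hypothetical runtime-side sketch in which the compressed array stays the source of truth and a conventional table is materialized on the first `table.get`; `FuncRef` here is a stand-in, not Wasmtime's actual `VMFuncRef`:

```rust
// Stand-in for a conventional funcref representation.
struct FuncRef {
    type_id: u32,
    addr: usize,
}

struct OptimizedTable {
    base: usize,                    // address of the read-only array
    entries: Vec<(u32, i32)>,       // compressed (type_id, offset) pairs
    expanded: Option<Vec<FuncRef>>, // built lazily, on first table.get
}

impl OptimizedTable {
    fn table_get(&mut self, idx: usize) -> &FuncRef {
        let OptimizedTable { base, entries, expanded } = self;
        let expanded = expanded.get_or_insert_with(|| {
            entries
                .iter()
                .map(|&(type_id, off)| FuncRef {
                    type_id,
                    addr: base.wrapping_add(off as isize as usize),
                })
                .collect()
        });
        &expanded[idx]
    }
}

fn main() {
    let mut t = OptimizedTable {
        base: 0x1000,
        entries: vec![(1, 0x20)],
        expanded: None,
    };
    assert_eq!(t.table_get(0).addr, 0x1020);
}
```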
The end result of all this is that, after the bounds check, the code for `call_indirect` is a trusted read-only load of the type and the function offset; a conditional branch on the type; and a call through the pointer, which often points directly to the right callee.

Alternatives
#8002 is about avoiding some of the same lazy-initialization work but does so in a wider range of cases, and is probably best to do first. Even with that change though, I think this is still worth doing later.
Rather than storing the type alongside the function offset in the array, we could use callee-checked types, and emit a trampoline for every table entry. That means we don't need to emit code to check types at call sites. For locally-declared functions the trampoline would be just a type-check and tail-call. For null pointers the trampoline would immediately trap, and could be shared between all tables. This is a little tricky because we need to pass the type the caller expected to the trampoline in a way that does not depend on the signature of the callee.
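For concreteness, a hypothetical sketch of the callee-checked shape, with the caller's expected type ID modeled as an extra argument (in real code it would travel in a fixed location precisely so it doesn't depend on the callee's signature):

```rust
#[derive(Debug)]
enum Trap {
    BadSignature,
    IndirectCallToNull,
}

const TYPE_FN_I32: u32 = 1; // illustrative type ID

fn real_callee() -> i32 {
    42
}

// Per-entry trampoline for a locally declared function: check the type the
// caller expected, then (in generated code) tail-call the real callee.
fn entry_trampoline(expected: u32) -> Result<i32, Trap> {
    if expected != TYPE_FN_I32 {
        return Err(Trap::BadSignature);
    }
    Ok(real_callee())
}

// The shared trampoline a null entry points at: it traps no matter what
// type the caller expected.
fn null_trampoline(_expected: u32) -> Result<i32, Trap> {
    Err(Trap::IndirectCallToNull)
}

fn main() {
    assert_eq!(entry_trampoline(TYPE_FN_I32).unwrap(), 42);
    assert!(matches!(entry_trampoline(9), Err(Trap::BadSignature)));
    assert!(matches!(null_trampoline(TYPE_FN_I32), Err(Trap::IndirectCallToNull)));
}
```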
I considered implementing `ref.null` in terms of a tiny function that immediately traps, and for callee-checked types that's exactly what we should do. But with caller-checked types we have to do the type check before the call anyway, so it's better to just catch the error that way.

Rather than introducing a custom section for optimization hints, we could have Wasmtime check which instructions are used before deciding what table representation to use. However, we have avoided doing that in Wasmtime so far, and this doesn't seem like a good time to start.
The optimization hints could be a modification to core wasm, rather than stored in a custom section. However, these modules should produce the exact same results if you strip out the optimization hints, so I think it's actually best to use a custom section that can be ignored by implementations which don't care.
cc: @alexcrichton @fitzgen
alexcrichton commented on issue #8195:
This sounds like a neat idea to me, and if we could make it work I really like the idea of making allocation of a table effectively a noop during instantiation, no vm tricks required at all. That not only makes instantiation faster but it also lowers the per-instance resources required.
That being said, I fear that the conditions for this optimization might be too restrictive. For example, one use case that won't fit here is `componentize-py` or anything dealing with shared-everything dynamic linking. In that situation a root module will export a function table, but the main hard part would be that further instantiations of later modules would all append to this function table, since it's imported and then initialized with local element segments. At a high level this optimization would still be possible in that we could simulate running the instantiation early on, but that's not something we've tried to do yet.

It might also be worth looking at some of the GC-based toolchains to see how they're using funcref tables as well. I'm not sure if there's liberal use of `table.get` and globals for functions; if there is, it might make this less attractive if it's seen as purely a Rust/C/C++ optimization.