cfallin requested jameysharp for a review on PR #8158.
cfallin opened PR #8158 from cfallin:typed-funcref-ics-test
to bytecodealliance:main
:
In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast.
This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time.
In particular, the following opts are still desirable:
- With the use of non-nullable typed funcrefs, there shouldn't be a null check (there currently is, as noted by a comment in the code due to lack of type information at the right spot).
- With the use of a constant table size and a constant index to the
table.get
, we should be able to load from the table without a bounds-check or any Spectre masking.Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested wasmtime-core-reviewers for a review on PR #8158.
jameysharp submitted PR review:
"Chris," I whine to myself, "stop working on a Saturday evening," as I approve your PR at 10pm
Thanks for this! It's definitely the kind of thing I was hoping for.
This example module initializes element zero of the table to point to a function, then calls through elements one and two. What is that supposed to do? Or am I misreading it?
alexcrichton commented on PR #8158:
(table $ic-sites 100 100 (ref $ic-stub) (ref.func $ic1))
This example module initializes element zero of the table to point to a function
Oh this syntax is actually initializing the entire table with
(ref.func $ic1)
, it's new syntax/binary stuff introduced as part of the function-references proposal (basically folded into the gc proposal)
Somewhat orthogonal question as well: I forget if I asked this in the past, but is the reason to use a typed table instead of typed globals to get the lazy initialization? If I change the above test to use two typed function globals it ends up codegenning (today)
function u0:1(i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast { gv0 = vmctx gv1 = load.i64 notrap aligned readonly gv0+8 gv2 = load.i64 notrap aligned gv1 gv3 = vmctx sig0 = (i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast sig1 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v sig2 = (i64 vmctx, i32 uext) -> i32 uext system_v stack_limit = gv2 block0(v0: i64, v1: i64, v2: i32, v3: i32, v4: i32, v5: i32): v8 -> v0 v14 -> v0 @003b v9 = load.i64 notrap aligned table v0+80 @003d brif v9, block3, block2 block2 cold: @003d trap null_reference block3: @003d v10 = load.i64 notrap aligned readonly v9+16 @003d v11 = load.i64 notrap aligned readonly v9+32 @003d v12 = call_indirect sig0, v10(v11, v0, v2, v3, v4, v5) @004c v15 = load.i64 notrap aligned table v0+96 @004e brif v15, block5, block4 block4 cold: @004e trap null_reference block5: @004e v16 = load.i64 notrap aligned readonly v15+16 @004e v17 = load.i64 notrap aligned readonly v15+32 @004e v18 = call_indirect sig0, v16(v17, v0, v2, v3, v4, v5) @0057 jump block1 block1: @0052 v19 = iadd.i32 v18, v12 v6 -> v19 @0057 return v19 }
which is pretty good given that we've got https://github.com/bytecodealliance/wasmtime/issues/5291 in our pocket
jameysharp commented on PR #8158:
Thanks Alex, now I understand!
I'd forgotten about #5291; we just added a different use of memflags to indicate trap code so we have precedent now. But besides that, if I understood Chris correctly, this example is supposed to be typed as a non-nullable function reference. So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?
cfallin commented on PR #8158:
Indeed, this initializes everything then calls ICs 1 and 2, which were meant to be arbitrary -- I'll add some comments to make it clearer!
is the reason to use a typed table instead of typed globals to get the lazy initialization?
not quite; I realized once I progressed past "IC caller" logic to "IC update" logic that while AOT-compiling can produce bodies that have statically different code locations per IC head, the update logic is shared (polymorphic over IC index) and there's no "set global N to V" instruction. I could potentially finagle the IC-stub ABI to return a new funcref and always update in the statically-unique callsite sequence or something, but that's a lot of overhead if these are frequent; IMHO this is what tables are made for :-)
cfallin updated PR #8158.
cfallin has enabled auto merge for PR #8158.
cfallin merged PR #8158.
alexcrichton commented on PR #8158:
Ah so the
call_ref
instructions are always done with a constant index, but thetable.set
instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that allcall_ref
is done with a constant index or otherwise this codegen wouldn't be representative.So if we thread that type information through correctly then we should be able to elide the null-check even without https://github.com/bytecodealliance/wasmtime/issues/5291, right?
True!
Thinking a bit more on this, if the
table.get
pluscall_ref
pair were replaced withcall_indirect
we wouldn't have to solve the problem of propagating validator type information into lowering. In such a situation we could inspect the table's type and notice that it's non-nullable and skip the null check. I think it should be the case thattable.get
pluscall_ref
should be the same codegen ascall_indirect
at least ...
alexcrichton commented on PR #8158:
This has inspired me to go off and do https://github.com/bytecodealliance/wasmtime/pull/8159 to handle the null check
cfallin commented on PR #8158:
Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.
Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.
(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)
call_indirect
insteadAh! I hadn't realized that
call_indirect
also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.
Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.
IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at very IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.
I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the
ref.func
in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the0
bit-pattern mean null rather than non-initialized-pointer-to-anyfunc.
cfallin edited a comment on PR #8158:
Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.
Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.
(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)
call_indirect
insteadAh! I hadn't realized that
call_indirect
also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.
Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.
IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.
I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the
ref.func
in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the0
bit-pattern mean null rather than non-initialized-pointer-to-anyfunc.
cfallin edited a comment on PR #8158:
Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.
Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.
(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)
call_indirect
insteadAh! I hadn't realized that
call_indirect
also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.
Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.
IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.
I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the
ref.func
in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the0
bit-pattern mean null rather than non-initialized-pointer-to-anyfunc. (EDIT: and this combines with #8159 to eliminate the null-check branch, to be clear.)
alexcrichton commented on PR #8158:
Makes sense! It's always possible we can use the same optimization techniques on globals as well as tables, nothing saying we have to keep everything as-is for example. If tables work then there's no need to change though.
could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value
I like this idea!
Last updated: Jan 24 2025 at 00:11 UTC