Stream: git-wasmtime

Topic: wasmtime / PR #8158 Wasm tests: add typed-funcref test sh...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 03:29):

cfallin requested jameysharp for a review on PR #8158.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 03:29):

cfallin opened PR #8158 from cfallin:typed-funcref-ics-test to bytecodealliance:main:

In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast.

This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time.

In particular, the following opts are still desirable:

Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.

<!--
Please make sure you include the following information:

Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.html

Please ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 03:29):

cfallin requested wasmtime-core-reviewers for a review on PR #8158.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 05:08):

jameysharp submitted PR review:

"Chris," I whine to myself, "stop working on a Saturday evening," as I approve your PR at 10pm

Thanks for this! It's definitely the kind of thing I was hoping for.

This example module initializes element zero of the table to point to a function, then calls through elements one and two. What is that supposed to do? Or am I misreading it?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 16:14):

alexcrichton commented on PR #8158:

(table $ic-sites 100 100 (ref $ic-stub) (ref.func $ic1))

This example module initializes element zero of the table to point to a function

Oh this syntax is actually initializing the entire table with (ref.func $ic1), it's new syntax/binary stuff introduced as part of the function-references proposal (basically folded into the gc proposal)


Somewhat orthogonal question as well: I forget if I asked this in the past, but is the reason to use a typed table instead of typed globals to get the lazy initialization? If I change the above test to use two typed function globals it ends up codegenning (today)

function u0:1(i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    sig0 = (i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast
    sig1 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v
    sig2 = (i64 vmctx, i32 uext) -> i32 uext system_v
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32, v3: i32, v4: i32, v5: i32):
                                    v8 -> v0
                                    v14 -> v0
@003b                               v9 = load.i64 notrap aligned table v0+80
@003d                               brif v9, block3, block2

                                block2 cold:
@003d                               trap null_reference

                                block3:
@003d                               v10 = load.i64 notrap aligned readonly v9+16
@003d                               v11 = load.i64 notrap aligned readonly v9+32
@003d                               v12 = call_indirect sig0, v10(v11, v0, v2, v3, v4, v5)
@004c                               v15 = load.i64 notrap aligned table v0+96
@004e                               brif v15, block5, block4

                                block4 cold:
@004e                               trap null_reference

                                block5:
@004e                               v16 = load.i64 notrap aligned readonly v15+16
@004e                               v17 = load.i64 notrap aligned readonly v15+32
@004e                               v18 = call_indirect sig0, v16(v17, v0, v2, v3, v4, v5)
@0057                               jump block1

                                block1:
@0052                               v19 = iadd.i32 v18, v12
                                    v6 -> v19
@0057                               return v19
}

which is pretty good given that we've got https://github.com/bytecodealliance/wasmtime/issues/5291 in our pocket

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 16:45):

jameysharp commented on PR #8158:

Thanks Alex, now I understand!

I'd forgotten about #5291; we just added a different use of memflags to indicate trap code so we have precedent now. But besides that, if I understood Chris correctly, this example is supposed to be typed as a non-nullable function reference. So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 17:09):

cfallin commented on PR #8158:

Indeed, this initializes everything then calls ICs 1 and 2, which were meant to be arbitrary -- I'll add some comments to make it clearer!

is the reason to use a typed table instead of typed globals to get the lazy initialization?

not quite; I realized once I progressed past "IC caller" logic to "IC update" logic that while AOT-compiling can produce bodies that have statically different code locations per IC head, the update logic is shared (polymorphic over IC index) and there's no "set global N to V" instruction. I could potentially finagle the IC-stub ABI to return a new funcref and always update in the statically-unique callsite sequence or something, but that's a lot of overhead if these are frequent; IMHO this is what tables are made for :-)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 17:13):

cfallin updated PR #8158.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 17:16):

cfallin has enabled auto merge for PR #8158.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 17:56):

cfallin merged PR #8158.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 18:03):

alexcrichton commented on PR #8158:

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

So if we thread that type information through correctly then we should be able to elide the null-check even without https://github.com/bytecodealliance/wasmtime/issues/5291, right?

True!

Thinking a bit more on this, if the table.get plus call_ref pair were replaced with call_indirect we wouldn't have to solve the problem of propagating validator type information into lowering. In such a situation we could inspect the table's type and notice that it's non-nullable and skip the null check. I think it should be the case that table.get plus call_ref should be the same codegen as call_indirect at least ...

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 18:31):

alexcrichton commented on PR #8158:

This has inspired me to go off and do https://github.com/bytecodealliance/wasmtime/pull/8159 to handle the null check

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 20:09):

cfallin commented on PR #8158:

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.

(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)

call_indirect instead

Ah! I hadn't realized that call_indirect also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.


Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.

IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at very IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.

I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the ref.func in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the 0 bit-pattern mean null rather than non-initialized-pointer-to-anyfunc.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 20:10):

cfallin edited a comment on PR #8158:

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.

(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)

call_indirect instead

Ah! I hadn't realized that call_indirect also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.


Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.

IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.

I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the ref.func in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the 0 bit-pattern mean null rather than non-initialized-pointer-to-anyfunc.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 17 2024 at 20:12):

cfallin edited a comment on PR #8158:

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.

(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)

call_indirect instead

Ah! I hadn't realized that call_indirect also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.


Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.

IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.

I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the ref.func in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the 0 bit-pattern mean null rather than non-initialized-pointer-to-anyfunc. (EDIT: and this combines with #8159 to eliminate the null-check branch, to be clear.)

view this post on Zulip Wasmtime GitHub notifications bot (Mar 18 2024 at 14:49):

alexcrichton commented on PR #8158:

Makes sense! It's always possible we can use the same optimization techniques on globals as well as tables, nothing saying we have to keep everything as-is for example. If tables work then there's no need to change though.

could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value

I like this idea!


Last updated: Nov 22 2024 at 16:03 UTC