Stream: wasmtime

Topic: Optimizing Pulley execution time


view this post on Zulip Alex Crichton (Jan 17 2025 at 18:27):

In the wasmtime meeting yesterday we dove a bit into pulley and some theoretical optimizations. One thing you mentioned, @Chris Fallin, was reducing the number of loads per instruction, since the current load-per-instruction-operand decoding might be overly punishing us by saturating the load/store ports in a CPU

I was thinking about that last night and this morning I implemented a strategy where whenever we decode an instruction the decoder unconditionally loads a pointer's worth of bytes (e.g. 8 bytes on x64). Operands are then decoded from that via bit shifts. If the instruction is longer than 8 bytes it'll continue to load more afterwards. Basically I didn't change the pulley bytecode format at all, but I optimized the loads such that on average, on x64, there's only a single 64-bit load per instruction.
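
In code, the strategy looks roughly like this (a hedged sketch; the operand layout and names here are hypothetical, not Pulley's actual encoding):

/// Sketch of single-wide-load decoding (hypothetical layout: opcode in
/// the low byte, one-byte register operands above it, low 5 bits used).
///
/// Safety: the caller must guarantee at least 8 readable bytes at `pc`.
unsafe fn decode(pc: *const u8) -> (u8, u8, u8) {
    // One unaligned 64-bit load covers the opcode and all the operands...
    let word = u64::from_le(unsafe { pc.cast::<u64>().read_unaligned() });
    // ...and the operands fall out via shifts and masks, not more loads.
    let opcode = word as u8;
    let dst = (word >> 8) as u8 & 0x1f;
    let src = (word >> 16) as u8 & 0x1f;
    (opcode, dst, src)
}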

For the tail-call interpreter loop this meant that at the end of an instruction handler you'd load 64 bits for the next instruction. The low 8 bits are used for the dynamic dispatch and the upper 56 bits are passed in a register to the next handler, which then uses them to decode its operands.
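
A hedged sketch of such a handler (hypothetical Vm, table, and instruction lengths -- not Pulley's actual code):

struct Vm {
    regs: [u64; 32],
}

type Handler = unsafe fn(vm: &mut Vm, pc: *const u8, operands: u64);

unsafe fn trap(_vm: &mut Vm, _pc: *const u8, _operands: u64) {
    unimplemented!("unknown opcode")
}

// A real table would have one entry per opcode; `trap` fills it here.
static OPCODE_HANDLERS: [Handler; 256] = [trap as Handler; 256];

/// `xmov dst, src`: on entry `pc` points at this instruction's operand
/// bytes and `operands` already holds them (the upper 56 bits of the
/// word the previous handler loaded while dispatching here).
unsafe fn xmov(vm: &mut Vm, pc: *const u8, operands: u64) {
    let dst = (operands & 0x1f) as usize;
    let src = ((operands >> 8) & 0x1f) as usize;
    vm.regs[dst] = vm.regs[src];

    // xmov has two operand bytes, so the next instruction starts at
    // pc + 2. One 64-bit load grabs its opcode and operands together:
    // the low 8 bits pick the next handler, the upper 56 bits ride
    // along in a register as pre-loaded operand bytes.
    let next = u64::from_le(unsafe { pc.add(2).cast::<u64>().read_unaligned() });
    let opcode = (next & 0xff) as usize;
    // A guaranteed tail call would need Rust's unstable `become`; a
    // plain call stands in for it here.
    unsafe { OPCODE_HANDLERS[opcode](vm, pc.add(3), next >> 8) }
}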

I've confirmed by looking at the disassembly of the tail-call interpreter that everything looks as expected. There's bit-twiddling where I'd expect it, no extra overhead from the abstractions I'm using, and at the end of most handlers there's only a single load of the next instruction's data.

So despite all this, the performance results are relatively mixed. For example, this is a performance comparison of the "mach loop" interpreter with this load-64-bits trick vs the previous commit on main:

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 42746797.10 ± 4341093.72 (confidence = 99%)

  main.so is 1.06x to 1.07x faster than pulley-less-instruction-loads.so!

  [645100214 648469690.10 653317578] main.so
  [687997108 691216487.20 698783018] pulley-less-instruction-loads.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 899905066.70 ± 26179495.28 (confidence = 99%)

  pulley-less-instruction-loads.so is 1.05x to 1.05x faster than main.so!

  [20179755580 20198967934.80 20236773749] main.so
  [19278160035 19299062868.10 19355235070] pulley-less-instruction-loads.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3095623.30 ± 2128376.59 (confidence = 99%)

  pulley-less-instruction-loads.so is 1.01x to 1.07x faster than main.so!

  [81785649 83368590.70 87006738] main.so
  [79519941 80272967.40 84093446] pulley-less-instruction-loads.so

and this is a performance comparison of the tail-call-based interpreter loop:

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 85980346.40 ± 4256370.91 (confidence = 99%)

  main-tai.so is 1.08x to 1.09x faster than pulley-less-instruction-loads-tai.so!

  [979172509 982182980.40 991260483] main-tai.so
  [1067376846 1068163326.80 1071737075] pulley-less-instruction-loads-tai.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6865385.20 ± 2141737.48 (confidence = 99%)

  main-tai.so is 1.05x to 1.09x faster than pulley-less-instruction-loads-tai.so!

  [100417359 101239118.00 106632440] main-tai.so
  [107581286 108104503.20 111904624] pulley-less-instruction-loads-tai.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 741291892.60 ± 11310370.33 (confidence = 99%)

  main-tai.so is 1.03x to 1.04x faster than pulley-less-instruction-loads-tai.so!

  [20890036242 20898157837.90 20915333190] main-tai.so
  [21632673509 21639449730.50 21660859731] pulley-less-instruction-loads-tai.so

the tl;dr here being that this is a universal loss for the tail-call-dispatch interpreter loop, and for the "mach loop" dispatch it's not always a win.

I wanted to confirm though, @Chris Fallin is this the sort of optimization you were imagining?

view this post on Zulip Alex Crichton (Jan 17 2025 at 18:30):

As an example, this is the disassembly of xmov after the optimization I applied

0000000001ffb110 <pulley_interpreter::interp::tail_loop::xmov>:
  push   %rbp
  mov    %rsp,%rbp
  mov    %edx,%eax                      ;; move our own instruction's bytes into eax
  shr    $0x5,%eax
  and    $0xf8,%eax
  mov    0x200(%rdi,%rax,1),%rax        ;; bit tricks/load the src register indexed by `eax`
  and    $0x1f,%edx
  mov    %rax,0x200(%rdi,%rdx,8)        ;; bit tricks/store into dst register indexed by `edx`
  mov    -0x5(%rsi),%rdx                ;; load the next instruction's bytes (8 bytes)
  add    $0x3,%rsi
  movzbl %dl,%eax                       ;; extract opcode to dispatch on
  shr    $0x8,%rdx
  lea    pulley_interpreter::interp::tail_loop::OPCODE_HANDLER_TABLE,%rcx
  pop    %rbp
  jmp    *(%rcx,%rax,8)                ;; tail call dispatch

view this post on Zulip Chris Fallin (Jan 17 2025 at 18:41):

basically yeah that's it; so this data point shows that the loads aren't the bottleneck -- interesting!

view this post on Zulip Chris Fallin (Jan 17 2025 at 18:42):

at this point looking at uarch perf counters to see what the bottleneck is seems worthwhile; but honestly the biggest lever is probably reducing opcode count with macro-ops, exactly as you've been doing...
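
(To illustrate the macro-op lever with a hedged, hypothetical example -- the names and ops below are made up, not Pulley's: fusing a common add/compare/branch sequence into one bytecode op trades three dispatches for one.)

struct Vm {
    regs: [u64; 32],
}

// Fine-grained op: each one costs a full decode + indirect dispatch.
fn xadd(vm: &mut Vm, dst: usize, a: usize, b: usize) {
    vm.regs[dst] = vm.regs[a].wrapping_add(vm.regs[b]);
}

// Hypothetical macro-op: add, then compare against a limit register and
// report whether the loop back-edge is taken -- the work of an add, a
// compare, and a branch op for a single dispatch.
fn xadd_br_if_ne(vm: &mut Vm, dst: usize, a: usize, b: usize, limit: usize) -> bool {
    vm.regs[dst] = vm.regs[a].wrapping_add(vm.regs[b]);
    vm.regs[dst] != vm.regs[limit]
}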

view this post on Zulip Alex Crichton (Jan 17 2025 at 18:43):

alas, well I might try to land this work behind a #[cfg] since it might end up helping in the future

view this post on Zulip Chris Fallin (Jan 17 2025 at 18:46):

oh, one more optimization I didn't mention yesterday: there's the concept of a "threaded interpreter" (the name predates threads-as-parallelism-primitives) where you build a table of function pointers as a translation of the bytecode; that would replace the movzbl/shr/lea/jmp dispatching through the opcode table with probably just an add $8, %ptr; jmp *(%ptr). It's a little tricky to do control flow (you have to translate the branch dests to indices in the handler-func-table) but possible. I'd check how much time is spent on those last 5 instructions first before pursuing it, of course

view this post on Zulip Alex Crichton (Jan 17 2025 at 18:47):

so sort of like a relocation pass after we've loaded the bytecode?

view this post on Zulip Chris Fallin (Jan 17 2025 at 18:48):

yeah, I suppose you would want to build the whole thing as a side-table computation after loading bytecode, since it depends on runtime func-pointer values
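
A hedged sketch of that side-table translation (hypothetical names; this is the shape of the idea, not Pulley code):

struct Vm {
    regs: [u64; 32],
    halted: bool,
}

type Handler = fn(vm: &mut Vm, threaded: &[Handler], idx: usize);

/// One-time side-table computation after loading bytecode: map each
/// opcode byte to its handler's runtime function pointer. (Operands are
/// elided here; a real translation would carry them in a parallel table
/// and rewrite branch targets from byte offsets to table indices.)
fn translate(bytecode: &[u8], table: &[Handler; 256]) -> Vec<Handler> {
    bytecode.iter().map(|&op| table[op as usize]).collect()
}

fn nop(vm: &mut Vm, threaded: &[Handler], idx: usize) {
    // Dispatch is now just "bump the index, jump through the pointer":
    // no opcode load/mask/handler-table lookup on the hot path.
    if idx + 1 < threaded.len() {
        threaded[idx + 1](vm, threaded, idx + 1);
    } else {
        vm.halted = true;
    }
}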

view this post on Zulip Alex Crichton (Jan 17 2025 at 18:57):

apparently vtune says:

Microarchitecture Usage: 0.0% of Pipeline Slots
 | Your code efficiency on this platform is too low.
 |
 | Possible cause: memory stalls, instruction starvation, branch misprediction
 | or long latency instructions.
 |
 | Next steps: Run Microarchitecture Exploration analysis to identify the cause
 | of the low microarchitecture usage efficiency.
 |
    Performance-core (P-core)
        Retiring: 0.0% of Pipeline Slots
        Front-End Bound: 0.0% of Pipeline Slots
        Bad Speculation: 100.0% of Pipeline Slots
         | A significant proportion of pipeline slots containing useful work are
         | being cancelled. This can be caused by mispredicting branches or by
         | machine clears. Note that this metric value may be highlighted due to
         | Branch Resteers issue.
         |

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:05):

"you are using 0.0% of your CPU, try harder"

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:05):

mispredicts from the indirect predictor don't surprise me but 100% is very weird.

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:05):

Spectre mitigations turning off indirect predictor entirely?

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:05):

(I'd try perf too, fwiw)

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:05):

this is the uarch-exploration report I think https://gist.github.com/alexcrichton/75eca31603706054fb2dcc83a7c0b5be


view this post on Zulip Alex Crichton (Jan 17 2025 at 19:06):

but I can't really make heads or tails of this

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:06):

I've only ever really done perf sampling, is there like a general "perf show me a uarch summary" command?

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:07):

perf list shows all the events, then perf stat -e EVENT ... to count it (multiple -e are supported; I usually do cycles, instructions, and whatever events are of interest)

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:07):

heh perf list has 6500 events

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:08):

go Intel perf counter team! 10k is close, you can do it

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:10):

the summary above doesn't tell me too much but does have some high-level data points: issue port utilization is low, so we're not bottlenecked on, say, too many ALU instructions; L1 stalls happen 16% of the time; and we're running out of the DSB (the predecoded uop cache) most of the time, so it's not an L1i / too-much-code issue

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:10):

main thing is that IPC is > 2, so this is really a "too many instructions" kind of issue I think

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:10):

in other words, main remedy is fewer bytecode ops, since each bytecode op body is pretty minimal

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:11):

hm ok that makes sense to me yeah

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:11):

with perf stat I got:

     <not counted>      cpu_atom/br_misp_retired.all_branches/                                        (0.00%)
       202,601,699      cpu_core/br_misp_retired.all_branches/
     <not counted>      cpu_atom/br_inst_retired.all_branches/                                        (0.00%)
    21,691,599,697      cpu_core/br_inst_retired.all_branches/

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:11):

which doesn't look like a ~90% misprediction rate

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:12):

I suppose even a ~1% mispredict rate is high enough to consume 90% of pipeline slots
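
(For reference, the arithmetic from the perf stat numbers above:

$$\frac{202{,}601{,}699\ \text{mispredicts}}{21{,}691{,}599{,}697\ \text{branches}} \approx 0.93\%$$

or, against a presumed ~20 billion retired instructions, roughly $202.6\text{M} / 20.3\text{B} \times 1000 \approx 10$ misses per thousand instructions.)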

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:12):

10 MPKI (misses per thousand instructions) is somewhat high given the giant instruction windows of today's CPUs, but it doesn't scream "major problem" to me

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:13):

right, I think it's about as good as one can hope for: not pathological (every indirect missing for some reason), predictor is doing its job, it's just hard to predict a bunch of indirects in sequence

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:13):

can wasm switch to an opcode-per-program

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:14):

like can my wasm module just be one opcode

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:14):

(func $fib (param i32) (result i32) fib

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:14):

pulley op serve_http_request_using_program1.wasm

view this post on Zulip Alex Crichton (Jan 17 2025 at 19:14):

in any case fun exploration!

view this post on Zulip Chris Fallin (Jan 17 2025 at 19:15):

yeah, good to confirm nothing too wonky going on

view this post on Zulip Robin Freyler (Jan 17 2025 at 21:12):

Have you done the perf instrumentation on switch-based or tail-call-based dispatch? It would be kinda interesting to know if they differ with respect to branch mispredictions. At least in theory tail-call dispatch should be simpler for a branch predictor to predict. A 90% branch hit rate is low but also kinda normal for an interpreter.

view this post on Zulip Alex Crichton (Jan 17 2025 at 21:27):

not yet, no, but so far most of the benchmarking hasn't shown either as a clear winner; sometimes one is faster and sometimes the other is

view this post on Zulip Alex Crichton (Jan 17 2025 at 21:27):

which might just mean that something else is the bigger bottleneck right now

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:13):

Along the lines of Pulley optimizations and/or Cranelift. I've got some Rust code:

#[unsafe(no_mangle)]
pub unsafe extern "C" fn foo(a: &[u32; 100]) -> u32 {
    a.iter().sum()
}

which compiles to wasm and then to clif (for Pulley), which looks like:

function u0:0(i64 vmctx, i64, i32) -> i32 tail {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned gv0+104
    gv2 = load.i64 notrap aligned readonly checked gv0+96

block0(v0: i64, v1: i64, v2: i32):
    v4 = iconst.i32 0
    v12 = iconst.i64 4
    v15 = load.i64 notrap aligned readonly checked v0+96
    v20 = iconst.i32 4
    v22 = iconst.i32 400
    jump block2(v4, v4)  ; v4 = 0, v4 = 0

block2(v8: i32, v18: i32):
    v11 = load.i64 notrap aligned v0+104
    v9 = iadd.i32 v2, v8
    v10 = uextend.i64 v9
    v27 = iconst.i64 4
    v28 = isub v11, v27  ; v27 = 4
    v14 = icmp ugt v10, v28
    trapnz v14, heap_oob
    v16 = iadd.i64 v15, v10
    v17 = load.i32 little heap v16
    v29 = iconst.i32 4
    v30 = iadd v8, v29  ; v29 = 4
    v31 = iconst.i32 400
    v32 = icmp ne v30, v31  ; v31 = 400
    v24 = uextend.i32 v32
    v19 = iadd v17, v18
    brif v24, block2(v30, v19), block4

block4:
    jump block3

block3:
    jump block1

block1:
    return v19
}

Is it expected that v11 = load.i64 notrap aligned v0+104 isn't hoisted out of the loop?

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:13):

I know we do LICM but does it not touch load/stores/side-effectful instructions?

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:13):

oh d'oh I see the readonly bit which is needed

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:14):

now I see the base pointer being hoisted

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:17):

Yes, exactly, and I guess the bit of reasoning that prevents it semantically is that the memory could grow due to some other op in the loop body, so we need to reload the bound

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:18):

We could define a separate alias-analysis region for memory limits (or for the limit for each memory), and annotate libcalls for memory.grow as setting it

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:18):

but then we'd still need to do something interprocedural, or else only do this opt in loops without calls

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:18):

I think it's already in a different region though b/c it's "none" vs heap?

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:18):

or rather does licm consult aliasing info?

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:19):

ah, sorry, yeah, implicit in above is "and then we could use that to make LICM smarter"

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:19):

currently it doesn't analyze the loop body for writes to alias regions at all

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:19):

kk makes sense

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:19):

seems like a totally reasonable thing to do, though, especially for bounds-checked perf (makes a note)

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:20):

yeah I'm thinking I'll probably open a meta-issue for pulley perf soon

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:20):

there's a few little things like this that I don't know how to fix myself, and there's also some regalloc stuff where I don't understand what's happening

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:20):

the fruit for macro ops seems to be higher and higher, alas

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:21):

good to write it down at least -- then maybe if I get spare cycles, or @fitzgen (he/him), or someone else, we've got a solid queue of perf ideas

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:21):

Screenshot 2025-01-22 at 5.21.18 PM.png

e.g. this is one tight loop (screenshot for colors)

and I'm not sure why the entry to the loop and the back-edge both have xmov x0, x28

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:22):

ideally it'd just be a single conditional branch at the end, but I think regalloc decisions are thwarting that

view this post on Zulip Alex Crichton (Jan 22 2025 at 23:22):

I've seen this on native too but never bottomed it out

view this post on Zulip Chris Fallin (Jan 22 2025 at 23:22):

weird!

