In the wasmtime meeting yesterday we dove a bit into pulley and some theoretical optimizations. One thing you mentioned, @Chris Fallin, was reducing the number of loads per instruction: the current load-per-instruction-operand pattern might be overly punishing us by saturating a CPU's load/store ports.
I was thinking about that last night and this morning I implemented a strategy where whenever we decode an instruction the decoder unconditionally loads a pointer's worth of bytes (e.g. 8 bytes on x64). Operands are then decoded from that via bit shifts. If the instruction is longer than 8 bytes it'll continue to load more afterwards. Basically I didn't change the pulley bytecode format at all, but I optimized the loads such that on average, on x64, there's only a single 64-bit load per instruction.
For the tail-call interpreter loop this meant that at the end of an instruction handler you'd load 64-bits for the next instruction. The low 8 bits are used for a dynamic dispatch and the upper 56 bits are passed in a register to the next opcode. The next opcode then uses these 56 bits to decode the operands.
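In sketch form the decode ends up shaped roughly like this (simplified Rust with made-up names, not the actual Pulley code):

type Regs = [u64; 32];
type Handler = fn(pc: *const u8, operands: u64, regs: &mut Regs);

fn dispatch(pc: *const u8, regs: &mut Regs) {
    // one unaligned 64-bit load covers the opcode byte plus up to 7 operand bytes
    let word = unsafe { (pc as *const u64).read_unaligned() };
    let opcode = (word & 0xff) as usize; // low 8 bits pick the handler
    let operands = word >> 8;            // upper 56 bits ride along in a register
    HANDLERS[opcode](pc, operands, regs);
}

// e.g. a register-to-register move: both operands come out of the pre-loaded
// word via shifts/masks, so the handler itself does no extra load from the
// bytecode stream
fn xmov(pc: *const u8, operands: u64, regs: &mut Regs) {
    let dst = (operands & 0x1f) as usize;
    let src = ((operands >> 8) & 0x1f) as usize;
    regs[dst] = regs[src];
    // skip the 3 encoded bytes (opcode + two operand bytes) and decode the next
    // instruction; instructions longer than 8 bytes load more bytes as needed
    dispatch(unsafe { pc.add(3) }, regs);
}

fn done(_pc: *const u8, _operands: u64, _regs: &mut Regs) {}

static HANDLERS: [Handler; 2] = [done, xmov];

(the real handlers use guaranteed tail calls for the dispatch rather than plain recursion, but the shift/mask decode is the same idea)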
I've confirmed looking at the disassembly of the tail-call interpreter that everything looks as expected. There's bit-twiddling where I'd expect, no extra overhead from the abstractions I'm using, and there's only a single load at the end of most handlers of data for the next instruction.
So despite all this, the performance results are relatively mixed. For example, this is a performance comparison of the "mach loop" interpreter with this load-64-bits trick vs the previous commit on main:
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 42746797.10 ± 4341093.72 (confidence = 99%)
main.so is 1.06x to 1.07x faster than pulley-less-instruction-loads.so!
[645100214 648469690.10 653317578] main.so
[687997108 691216487.20 698783018] pulley-less-instruction-loads.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 899905066.70 ± 26179495.28 (confidence = 99%)
pulley-less-instruction-loads.so is 1.05x to 1.05x faster than main.so!
[20179755580 20198967934.80 20236773749] main.so
[19278160035 19299062868.10 19355235070] pulley-less-instruction-loads.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 3095623.30 ± 2128376.59 (confidence = 99%)
pulley-less-instruction-loads.so is 1.01x to 1.07x faster than main.so!
[81785649 83368590.70 87006738] main.so
[79519941 80272967.40 84093446] pulley-less-instruction-loads.so
and this is a performance comparison of the tail-call-based interpreter loop:
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 85980346.40 ± 4256370.91 (confidence = 99%)
main-tai.so is 1.08x to 1.09x faster than pulley-less-instruction-loads-tai.so!
[979172509 982182980.40 991260483] main-tai.so
[1067376846 1068163326.80 1071737075] pulley-less-instruction-loads-tai.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 6865385.20 ± 2141737.48 (confidence = 99%)
main-tai.so is 1.05x to 1.09x faster than pulley-less-instruction-loads-tai.so!
[100417359 101239118.00 106632440] main-tai.so
[107581286 108104503.20 111904624] pulley-less-instruction-loads-tai.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 741291892.60 ± 11310370.33 (confidence = 99%)
main-tai.so is 1.03x to 1.04x faster than pulley-less-instruction-loads-tai.so!
[20890036242 20898157837.90 20915333190] main-tai.so
[21632673509 21639449730.50 21660859731] pulley-less-instruction-loads-tai.so
the tl;dr here being that this is a universal loss for the tail-call-dispatch interpreter loop, and for the "mach loop" dispatch it's not always a win.
I wanted to confirm though, @Chris Fallin is this the sort of optimization you were imagining?
As an example, this is the disassembly of xmov after the optimization I applied:
0000000001ffb110 <pulley_interpreter::interp::tail_loop::xmov>:
push %rbp
mov %rsp,%rbp
mov %edx,%eax ;; move our own instruction's bytes into eax
shr $0x5,%eax
and $0xf8,%eax
mov 0x200(%rdi,%rax,1),%rax ;; bit tricks/load the src register indexed by `eax`
and $0x1f,%edx
mov %rax,0x200(%rdi,%rdx,8) ;; bit tricks/store into dst register indexed by `edx`
mov -0x5(%rsi),%rdx ;; load the next instruction's bytes (8 bytes)
add $0x3,%rsi
movzbl %dl,%eax ;; extract opcode to dispatch on
shr $0x8,%rdx
lea pulley_interpreter::interp::tail_loop::OPCODE_HANDLER_TABLE,%rcx
pop %rbp
→ jmp *(%rcx,%rax,8) ;; tail call dispatch
basically yeah that's it; so this data point shows that the loads aren't the bottleneck -- interesting!
at this point looking at uarch perf counters to see what is the bottleneck seems worthwhile; but honestly the biggest lever is probably reducing opcode count with macro-ops, exactly as you've been doing...
alas. Well, I might try to land this work behind a #[cfg] since it might end up helping in the future
oh, one more optimization I didn't mention yesterday: there's the concept of a "threaded interpreter" (the name predates threads-as-parallelism-primitives) where you build a table of function pointers as a translation of the bytecode; that would replace the movzbl/shr/lea/jmp dispatching through the opcode table with probably an add $8, %ptr; jmp *(%ptr). It's a little tricky to do control flow (you have to translate the destinations to indices in the handler-func table), but possible. I'd check how much time is spent on those last 5 instructions first before pursuing it, of course
so sort of like a relocation pass after we've loaded the bytecode?
yeah, I suppose you would want to build the whole thing as a side-table computation after loading bytecode, since it depends on runtime func-pointer values
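as a rough illustration of that side-table idea (hypothetical types, not an actual Pulley design), the translation could look something like:

type Regs = [u64; 32];

// built once after loading the bytecode, since it bakes in runtime fn pointers
struct Op {
    handler: fn(&mut Regs, u64) -> Control,
    operands: u64, // pre-decoded operand bits
}

enum Control {
    Next,        // fall through to the following entry
    Jump(usize), // branch targets pre-translated to indices into the table
    Done,
}

fn run(ops: &[Op], regs: &mut Regs) {
    let mut i = 0;
    loop {
        let op = &ops[i];
        // dispatch is just "index the table, indirect call" -- no opcode byte
        // load or opcode-table lookup per instruction
        match (op.handler)(regs, op.operands) {
            Control::Next => i += 1,
            Control::Jump(target) => i = target,
            Control::Done => return,
        }
    }
}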
apparently vtune says:
Microarchitecture Usage: 0.0% of Pipeline Slots
| Your code efficiency on this platform is too low.
|
| Possible cause: memory stalls, instruction starvation, branch misprediction
| or long latency instructions.
|
| Next steps: Run Microarchitecture Exploration analysis to identify the cause
| of the low microarchitecture usage efficiency.
|
Performance-core (P-core)
Retiring: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Bad Speculation: 100.0% of Pipeline Slots
| A significant proportion of pipeline slots containing useful work are
| being cancelled. This can be caused by mispredicting branches or by
| machine clears. Note that this metric value may be highlighted due to
| Branch Resteers issue.
|
"you are using 0.0% of your CPU, try harder"
mispredicts from the indirect predictor don't surprise me but 100% is very weird.
Spectre mitigations turning off indirect predictor entirely?
(I'd try perf too, fwiw)
this is the uarch-exploration report I think https://gist.github.com/alexcrichton/75eca31603706054fb2dcc83a7c0b5be
but I can't really make heads or tails of this
I've only ever really done perf sampling, is there like a general "perf show me a uarch summary" command?
perf list shows all the events, then perf stat -e EVENT ... to count it (multiple -e supported; I usually do cycles, instructions, and whatever events of interest)
heh, perf list has 6500 events
go Intel perf counter team! 10k is close, you can do it
the summary above doesn't tell me too much but does have some high-level data points: issue-port utilization is low, so we're not bottlenecked on, say, too many ALU instructions; L1 stalls 16% of the time; and we're running out of the DSB (the predecoded cache) most of the time, so it's not an L1i / too-much-code issue
main thing is that IPC is > 2, so this is really a "too many instructions" kind of issue I think
in other words, the main remedy is fewer bytecode ops, since each bytecode op body is already pretty minimal
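to make "fewer bytecode ops" concrete, the macro-op idea is to fuse a common sequence into one handler so the dispatch cost (the load plus indirect branch) is paid once; a made-up example, not an actual Pulley opcode:

type Regs = [u64; 32];

// hypothetical fused op: "add two 32-bit registers, then branch if the result
// is zero" -- one dispatch instead of an add dispatch plus a branch dispatch.
// Returns whether the branch should be taken.
fn xadd32_br_if_zero(regs: &mut Regs, dst: u8, a: u8, b: u8) -> bool {
    let sum = (regs[a as usize] as u32).wrapping_add(regs[b as usize] as u32);
    regs[dst as usize] = u64::from(sum);
    sum == 0
}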
hm ok that makes sense to me yeah
with perf stat I got:
<not counted> cpu_atom/br_misp_retired.all_branches/ (0.00%)
202,601,699 cpu_core/br_misp_retired.all_branches/
<not counted> cpu_atom/br_inst_retired.all_branches/ (0.00%)
21,691,599,697 cpu_core/br_inst_retired.all_branches/
which doesn't look like a ~90% misprediction rate
I suppose 10% mispredict is a high enough rate to consume 90% of pipeline slots
10 MPKI (misses per thousand instructions) is somewhat high given today's giant instruction windows, but it doesn't scream "major problem" to me
right, I think it's about as good as one can hope for: not pathological (every indirect missing for some reason), predictor is doing its job, it's just hard to predict a bunch of indirects in sequence
can wasm switch to an opcode-per-program
like can my wasm module just be one opcode
(func $fib (param i32) (result i32) fib
pulley op serve_http_request_using_program1.wasm
in any case fun exploration!
yeah, good to confirm nothing too wonky going on
Have you done the perf instrumentation on switch or tail-call based dispatch? Would be kinda interesting to know if they differ with respect to branch mispredictions. At least in theory tail-call dispatch should be simpler for a branch predictor to predict. A 90% branch hit rate is low but also kinda normal for an interpreter.
not yet, no, but so far most of the benchmarking hasn't shown either as a clear winner; sometimes one is faster and sometimes the other is
which might just mean that something else is the bigger bottleneck right now
Along the lines of Pulley and/or Cranelift optimizations: I've got some Rust code:
#[unsafe(no_mangle)]
pub unsafe extern "C" fn foo(a: &[u32; 100]) -> u32 {
a.iter().sum()
}
which compiles to wasm and then to clif (for Pulley), which looks like:
function u0:0(i64 vmctx, i64, i32) -> i32 tail {
gv0 = vmctx
gv1 = load.i64 notrap aligned gv0+104
gv2 = load.i64 notrap aligned readonly checked gv0+96
block0(v0: i64, v1: i64, v2: i32):
v4 = iconst.i32 0
v12 = iconst.i64 4
v15 = load.i64 notrap aligned readonly checked v0+96
v20 = iconst.i32 4
v22 = iconst.i32 400
jump block2(v4, v4) ; v4 = 0, v4 = 0
block2(v8: i32, v18: i32):
v11 = load.i64 notrap aligned v0+104
v9 = iadd.i32 v2, v8
v10 = uextend.i64 v9
v27 = iconst.i64 4
v28 = isub v11, v27 ; v27 = 4
v14 = icmp ugt v10, v28
trapnz v14, heap_oob
v16 = iadd.i64 v15, v10
v17 = load.i32 little heap v16
v29 = iconst.i32 4
v30 = iadd v8, v29 ; v29 = 4
v31 = iconst.i32 400
v32 = icmp ne v30, v31 ; v31 = 400
v24 = uextend.i32 v32
v19 = iadd v17, v18
brif v24, block2(v30, v19), block4
block4:
jump block3
block3:
jump block1
block1:
return v19
}
Is it expected that v11 = load.i64 notrap aligned v0+104 isn't hoisted out of the loop?
I know we do LICM, but does it not touch loads/stores/side-effectful instructions?
oh d'oh, I see the readonly bit which is needed
now I see the base pointer being hoisted
Yes, exactly, and I guess the bit of reasoning that prevents it semantically is that the memory could grow due to some other op in the loop body so we need to reload the bound
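a rough Rust analogy of that reasoning (hypothetical code, just to mirror the argument):

// the heap bound is like buf.len() here: because the loop body might grow the
// buffer (the moral equivalent of memory.grow), the length has to be re-read
// every iteration; only if nothing in the body can grow it may the bound be
// hoisted out of the loop
fn sum_with_possible_growth(buf: &mut Vec<u32>, maybe_grow: impl Fn(&mut Vec<u32>)) -> u32 {
    let mut total = 0u32;
    let mut i = 0;
    while i < buf.len() {                  // bound reloaded each iteration
        total = total.wrapping_add(buf[i]);
        maybe_grow(buf);                   // could change buf.len()
        i += 1;
    }
    total
}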
We could define a separate alias-analysis region for memory limits (or for the limit of each memory), and annotate libcalls for memory.grow as setting it
but then we'd still need to do something interprocedural, or else only do this opt in loops without calls
I think it's already in a different region though b/c it's "none" vs heap?
or rather does licm consult aliasing info?
ah, sorry, yeah, implicit in above is "and then we could use that to make LICM smarter"
currently it doesn't analyze the loop body for writes to alias regions at all
kk makes sense
seems like a totally reasonable thing to do, though, especially for bounds-checked perf (makes a note)
yeah I'm thinking I'll probably open a meta-issue for pulley perf soon
there are a few little things like this that I don't know how to fix myself, and there's some regalloc stuff where I don't understand what's happening either
the fruit for macro ops seems to be higher and higher, alas
good to write it down at least -- then if I get spare cycles, or @fitzgen (he/him), or someone else does, we've got a solid queue of perf ideas
[screenshot attachment: Screenshot 2025-01-22 at 5.21.18 PM.png]
e.g. this is one tight loop (screenshot for colors)
and I'm not sure why both the entry to the loop and the back-edge have xmov x0, x28
ideally it'd just be a single conditional branch at the end, but I think regalloc decisions are thwarting that
I've seen this on native too but never bottomed it out
weird!