In the wasmtime meeting yesterday we dove a bit into pulley and some theoretical optimizations. One thing you mentioned, @Chris Fallin, was reducing the number of loads per instruction: the current load-per-instruction-operand pattern might be overly punishing us by saturating a CPU's load/store ports.
I was thinking about that last night and this morning I implemented a strategy where whenever we decode an instruction the decoder unconditionally loads a pointer's worth of bytes (e.g. 8 bytes on x64). Operands are then decoded from that via bit shifts. If the instruction is longer than 8 bytes it'll continue to load more afterwards. Basically I didn't change the pulley bytecode format at all, but I optimized the loads such that on average, on x64, there's only a single 64-bit load per instruction.
For the tail-call interpreter loop this meant that at the end of an instruction handler you'd load 64-bits for the next instruction. The low 8 bits are used for a dynamic dispatch and the upper 56 bits are passed in a register to the next opcode. The next opcode then uses these 56 bits to decode the operands.
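In sketch form the decode ends up shaped roughly like this (simplified Rust with made-up names, not the actual Pulley code):

type Regs = [u64; 32];
type Handler = fn(pc: *const u8, operands: u64, regs: &mut Regs);

fn dispatch(pc: *const u8, regs: &mut Regs) {
    // one unaligned 64-bit load covers the opcode byte plus up to 7 operand bytes
    let word = unsafe { (pc as *const u64).read_unaligned() };
    let opcode = (word & 0xff) as usize; // low 8 bits pick the handler
    let operands = word >> 8;            // upper 56 bits ride along in a register
    HANDLERS[opcode](pc, operands, regs);
}

// e.g. a register-to-register move: both operands come out of the pre-loaded
// word via shifts/masks, so the handler itself does no extra load from the
// bytecode stream
fn xmov(pc: *const u8, operands: u64, regs: &mut Regs) {
    let dst = (operands & 0x1f) as usize;
    let src = ((operands >> 8) & 0x1f) as usize;
    regs[dst] = regs[src];
    // skip the 3 encoded bytes (opcode + two operand bytes) and decode the next
    // instruction; instructions longer than 8 bytes load more bytes as needed
    dispatch(unsafe { pc.add(3) }, regs);
}

fn done(_pc: *const u8, _operands: u64, _regs: &mut Regs) {}

static HANDLERS: [Handler; 2] = [done, xmov];

(the real handlers use guaranteed tail calls for the dispatch rather than plain recursion, but the shift/mask decode is the same idea)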
I've confirmed looking at the disassembly of the tail-call interpreter that everything looks as expected. There's bit-twiddling where I'd expect, no extra overhead from the abstractions I'm using, and there's only a single load at the end of most handlers of data for the next instruction.
So despite all this, the performance results are relatively mixed. For example, this is a performance comparison of the "mach loop" interpreter with this load-64-bits trick vs the previous commit on main:
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 42746797.10 ± 4341093.72 (confidence = 99%)
main.so is 1.06x to 1.07x faster than pulley-less-instruction-loads.so!
[645100214 648469690.10 653317578] main.so
[687997108 691216487.20 698783018] pulley-less-instruction-loads.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 899905066.70 ± 26179495.28 (confidence = 99%)
pulley-less-instruction-loads.so is 1.05x to 1.05x faster than main.so!
[20179755580 20198967934.80 20236773749] main.so
[19278160035 19299062868.10 19355235070] pulley-less-instruction-loads.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 3095623.30 ± 2128376.59 (confidence = 99%)
pulley-less-instruction-loads.so is 1.01x to 1.07x faster than main.so!
[81785649 83368590.70 87006738] main.so
[79519941 80272967.40 84093446] pulley-less-instruction-loads.so
and this is a performance comparison of the tail-call-based interpreter loop:
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 85980346.40 ± 4256370.91 (confidence = 99%)
main-tai.so is 1.08x to 1.09x faster than pulley-less-instruction-loads-tai.so!
[979172509 982182980.40 991260483] main-tai.so
[1067376846 1068163326.80 1071737075] pulley-less-instruction-loads-tai.so
execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 6865385.20 ± 2141737.48 (confidence = 99%)
main-tai.so is 1.05x to 1.09x faster than pulley-less-instruction-loads-tai.so!
[100417359 101239118.00 106632440] main-tai.so
[107581286 108104503.20 111904624] pulley-less-instruction-loads-tai.so
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 741291892.60 ± 11310370.33 (confidence = 99%)
main-tai.so is 1.03x to 1.04x faster than pulley-less-instruction-loads-tai.so!
[20890036242 20898157837.90 20915333190] main-tai.so
[21632673509 21639449730.50 21660859731] pulley-less-instruction-loads-tai.so
the tl;dr here being that this is a universal loss for the tail-call-dispatch interpreter loop, and for the "mach loop" dispatch it's not always a win.
I wanted to confirm though, @Chris Fallin is this the sort of optimization you were imagining?
As an example, this is the disassembly of xmov after the optimization I applied:
0000000001ffb110 <pulley_interpreter::interp::tail_loop::xmov>:
push %rbp
mov %rsp,%rbp
mov %edx,%eax ;; move our own instruction's bytes into eax
shr $0x5,%eax
and $0xf8,%eax
mov 0x200(%rdi,%rax,1),%rax ;; bit tricks/load the src register indexed by `eax`
and $0x1f,%edx
mov %rax,0x200(%rdi,%rdx,8) ;; bit tricks/store into dst register indexed by `edx`
mov -0x5(%rsi),%rdx ;; load the next instruction's bytes (8 bytes)
add $0x3,%rsi
movzbl %dl,%eax ;; extract opcode to dispatch on
shr $0x8,%rdx
lea pulley_interpreter::interp::tail_loop::OPCODE_HANDLER_TABLE,%rcx
pop %rbp
→ jmp *(%rcx,%rax,8) ;; tail call dispatch
basically yeah that's it; so this data point shows that the loads aren't the bottleneck -- interesting!
at this point looking at uarch perf counters to see what is the bottleneck seems worthwhile; but honestly the biggest lever is probably reducing opcode count with macro-ops, exactly as you've been doing...
alas. Well, I might try to land this work behind a #[cfg] since it might end up helping in the future
oh, one more optimization I didn't mention yesterday: there's the concept of a "threaded interpreter" (the name predates threads-as-parallelism-primitives) where you build a table of function pointers as a translation of the bytecode; that would replace the movzbl/shr/lea/jmp dispatching through the opcode table with probably an add $8, %ptr; jmp *(%ptr). It's a little tricky to do control flow (you have to translate the destinations to indices in the handler-func table), but possible. I'd check how much time is spent on those last 5 instructions first before pursuing it, of course
so sort of like a relocation pass after we've loaded the bytecode?
yeah, I suppose you would want to build the whole thing as a side-table computation after loading bytecode, since it depends on runtime func-pointer values
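as a rough illustration of that side-table idea (hypothetical types, not an actual Pulley design), the translation could look something like:

type Regs = [u64; 32];

// built once after loading the bytecode, since it bakes in runtime fn pointers
struct Op {
    handler: fn(&mut Regs, u64) -> Control,
    operands: u64, // pre-decoded operand bits
}

enum Control {
    Next,        // fall through to the following entry
    Jump(usize), // branch targets pre-translated to indices into the table
    Done,
}

fn run(ops: &[Op], regs: &mut Regs) {
    let mut i = 0;
    loop {
        let op = &ops[i];
        // dispatch is just "index the table, indirect call" -- no opcode byte
        // load or opcode-table lookup per instruction
        match (op.handler)(regs, op.operands) {
            Control::Next => i += 1,
            Control::Jump(target) => i = target,
            Control::Done => return,
        }
    }
}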
apparently vtune says:
Microarchitecture Usage: 0.0% of Pipeline Slots
| Your code efficiency on this platform is too low.
|
| Possible cause: memory stalls, instruction starvation, branch misprediction
| or long latency instructions.
|
| Next steps: Run Microarchitecture Exploration analysis to identify the cause
| of the low microarchitecture usage efficiency.
|
Performance-core (P-core)
Retiring: 0.0% of Pipeline Slots
Front-End Bound: 0.0% of Pipeline Slots
Bad Speculation: 100.0% of Pipeline Slots
| A significant proportion of pipeline slots containing useful work are
| being cancelled. This can be caused by mispredicting branches or by
| machine clears. Note that this metric value may be highlighted due to
| Branch Resteers issue.
|
"you are using 0.0% of your CPU, try harder"
mispredicts from the indirect predictor don't surprise me but 100% is very weird.
Spectre mitigations turning off indirect predictor entirely?
(I'd try perf too, fwiw)
this is the uarch-exploration report I think https://gist.github.com/alexcrichton/75eca31603706054fb2dcc83a7c0b5be
but I can't really make heads or tails of this
I've only ever really done perf sampling, is there like a general "perf show me a uarch summary" command?
perf list shows all the events, then perf stat -e EVENT ... to count it (multiple -e supported; I usually do cycles, instructions, and whatever events of interest)
heh, perf list has 6500 events
go Intel perf counter team! 10k is close, you can do it
the summary above doesn't tell me too much but does have some high-level data points: issue-port utilization is low, so we're not bottlenecked on, say, too many ALU instructions; L1 stalls 16% of the time; and we're running out of the DSB (the predecoded cache) most of the time, so it's not an L1i / too-much-code issue
main thing is that IPC is > 2, so this is really a "too many instructions" kind of issue I think
in other words, the main remedy is fewer bytecode ops, since each bytecode op body is already pretty minimal
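to make "fewer bytecode ops" concrete, the macro-op idea is to fuse a common sequence into one handler so the dispatch cost (the load plus indirect branch) is paid once; a made-up example, not an actual Pulley opcode:

type Regs = [u64; 32];

// hypothetical fused op: "add two 32-bit registers, then branch if the result
// is zero" -- one dispatch instead of an add dispatch plus a branch dispatch.
// Returns whether the branch should be taken.
fn xadd32_br_if_zero(regs: &mut Regs, dst: u8, a: u8, b: u8) -> bool {
    let sum = (regs[a as usize] as u32).wrapping_add(regs[b as usize] as u32);
    regs[dst as usize] = u64::from(sum);
    sum == 0
}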
hm ok that makes sense to me yeah
with perf stat I got:
<not counted> cpu_atom/br_misp_retired.all_branches/ (0.00%)
202,601,699 cpu_core/br_misp_retired.all_branches/
<not counted> cpu_atom/br_inst_retired.all_branches/ (0.00%)
21,691,599,697 cpu_core/br_inst_retired.all_branches/
which doesn't look like a ~90% misprediction rate
I suppose 10% mispredict is a high enough rate to consume 90% of pipeline slots
10 MPKI (misses per thousand instructions) is somewhat high given today's giant instruction windows, but it doesn't scream "major problem" to me
right, I think it's about as good as one can hope for: not pathological (every indirect missing for some reason), predictor is doing its job, it's just hard to predict a bunch of indirects in sequence
can wasm switch to an opcode-per-program
like can my wasm module just be one opcode
(func $fib (param i32) (result i32) fib
pulley op serve_http_request_using_program1.wasm
in any case fun exploration!
yeah, good to confirm nothing too wonky going on
Have you done the perf instrumentation on switch or tail-call based dispatch? Would be kinda interesting to know if they differ with respect to branch mispredictions. At least in theory tail-call dispatch should be simpler for a branch predictor to predict. A 90% branch hit rate is low but also kinda normal for an interpreter.
not yet, no, but so far most of the benchmarking hasn't shown either as a clear winner; sometimes one is faster and sometimes the other is
which might just mean that something else is the bigger bottleneck right now
Along the lines of Pulley and/or Cranelift optimizations: I've got some Rust code:
#[unsafe(no_mangle)]
pub unsafe extern "C" fn foo(a: &[u32; 100]) -> u32 {
a.iter().sum()
}
which compiles to wasm and then to clif (for Pulley), which looks like:
function u0:0(i64 vmctx, i64, i32) -> i32 tail {
gv0 = vmctx
gv1 = load.i64 notrap aligned gv0+104
gv2 = load.i64 notrap aligned readonly checked gv0+96
block0(v0: i64, v1: i64, v2: i32):
v4 = iconst.i32 0
v12 = iconst.i64 4
v15 = load.i64 notrap aligned readonly checked v0+96
v20 = iconst.i32 4
v22 = iconst.i32 400
jump block2(v4, v4) ; v4 = 0, v4 = 0
block2(v8: i32, v18: i32):
v11 = load.i64 notrap aligned v0+104
v9 = iadd.i32 v2, v8
v10 = uextend.i64 v9
v27 = iconst.i64 4
v28 = isub v11, v27 ; v27 = 4
v14 = icmp ugt v10, v28
trapnz v14, heap_oob
v16 = iadd.i64 v15, v10
v17 = load.i32 little heap v16
v29 = iconst.i32 4
v30 = iadd v8, v29 ; v29 = 4
v31 = iconst.i32 400
v32 = icmp ne v30, v31 ; v31 = 400
v24 = uextend.i32 v32
v19 = iadd v17, v18
brif v24, block2(v30, v19), block4
block4:
jump block3
block3:
jump block1
block1:
return v19
}
Is it expected that v11 = load.i64 notrap aligned v0+104 isn't hoisted out of the loop?
I know we do LICM, but does it not touch loads/stores/side-effectful instructions?
oh d'oh, I see the readonly bit which is needed
now I see the base pointer being hoisted
Yes, exactly, and I guess the bit of reasoning that prevents it semantically is that the memory could grow due to some other op in the loop body so we need to reload the bound
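a rough Rust analogy of that reasoning (hypothetical code, just to mirror the argument):

// the heap bound is like buf.len() here: because the loop body might grow the
// buffer (the moral equivalent of memory.grow), the length has to be re-read
// every iteration; only if nothing in the body can grow it may the bound be
// hoisted out of the loop
fn sum_with_possible_growth(buf: &mut Vec<u32>, maybe_grow: impl Fn(&mut Vec<u32>)) -> u32 {
    let mut total = 0u32;
    let mut i = 0;
    while i < buf.len() {                  // bound reloaded each iteration
        total = total.wrapping_add(buf[i]);
        maybe_grow(buf);                   // could change buf.len()
        i += 1;
    }
    total
}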
We could define a separate alias-analysis region for memory limits (or for the limit of each memory), and annotate libcalls for memory.grow as setting it
but then we'd still need to do something interprocedural, or else only do this opt in loops without calls
I think it's already in a different region though b/c it's "none" vs heap?
or rather does licm consult aliasing info?
ah, sorry, yeah, implicit in above is "and then we could use that to make LICM smarter"
currently it doesn't analyze the loop body for writes to alias regions at all
kk makes sense
seems like a totally reasonable thing to do, though, especially for bounds-checked perf (makes a note)
yeah I'm thinking I'll probably open a meta-issue for pulley perf soon
there are a few little things like this that I don't know how to fix myself, and there's some regalloc stuff where I don't understand what's happening either
the fruit for macro ops seems to be higher and higher, alas
good to write it down at least -- then if I get spare cycles, or @fitzgen (he/him), or someone else does, we've got a solid queue of perf ideas
[screenshot attachment: Screenshot 2025-01-22 at 5.21.18 PM.png]
e.g. this is one tight loop (screenshot for colors)
and I'm not sure why both the entry to the loop and the back-edge have xmov x0, x28
ideally it'd just be a single conditional branch at the end, but I think regalloc decisions are thwarting that
I've seen this on native too but never bottomed it out
weird!