alexcrichton opened issue #4291:
This WebAssembly file, which is reduced to a single function from this issue, compiles like this on main:

$ /usr/bin/time -v ./target/release/wasmtime compile extract.wasm
...
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.58
...
    Maximum resident set size (kbytes): 6565472
...
    Exit status: 0

Compared to wasmtime 0.36.0, which is pre-regalloc2, the same file yields:

$ /usr/bin/time -v ./wasmtime-v0.36.0-aarch64-linux/wasmtime compile ./extract.wasm
...
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.86
...
    Maximum resident set size (kbytes): 215264
...
    Exit status: 0
I think this means that what previously took ~200M to compile is now taking upwards of 6.5G.
cfallin commented on issue #4291:
I did some investigation on this yesterday and today (not quite full-time, I'm still under the weather a bit, but regalloc hacking is still the best way to pass the time...). I found three distinct things I could improve:

Most importantly, some ugly quadratic behavior with liverange splitting. The heuristic has always been "split at first conflict", and a split is always a 2-for-1 deal, not N-for-1. The test program above has a single vreg that is passed as arg0, then arg1, then arg0, then arg1, ... through a long sequence of callsites, which means it has to be split into N pieces, each of which can be put in the appropriate register. Unfortunately, each split had cost O(|bundle|), i.e. proportional to the total length of the bundle, so the N splits together were quadratic. Bad news! My fix is to "bottom out" at a limit: if a single original bundle has already been split more than K times (10, for now), go ahead and do an N-for-1 split into minimal pieces.
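As a rough illustration of that splitting policy, here is a minimal Rust sketch; the types, names, and constant are hypothetical stand-ins, not the actual regalloc2 data structures or heuristics:

```rust
// Hypothetical illustration of the split policy described above; the real
// regalloc2 code is more involved.

const MAX_SPLITS_PER_ORIG_BUNDLE: usize = 10; // the "K" limit mentioned above

struct Bundle {
    ranges: Vec<(u32, u32)>, // live ranges as (from, to) program points
    splits_so_far: usize,    // how many times the original bundle has been split
}

/// Split a bundle at the first conflicting program point.
fn split_bundle(b: Bundle, conflict_point: u32) -> Vec<Bundle> {
    let next_splits = b.splits_so_far + 1;
    if b.splits_so_far < MAX_SPLITS_PER_ORIG_BUNDLE {
        // Normal policy: a 2-for-1 split at the first conflict. Each such
        // split walks the whole range list, so splitting a long bundle N
        // times costs O(N * |bundle|) -- the quadratic behavior above.
        let (first, second): (Vec<_>, Vec<_>) = b
            .ranges
            .into_iter()
            .partition(|&(from, _)| from < conflict_point);
        vec![
            Bundle { ranges: first, splits_so_far: next_splits },
            Bundle { ranges: second, splits_so_far: next_splits },
        ]
    } else {
        // "Bottom out": after K splits of the same original bundle, do one
        // N-for-1 split into minimal pieces so we never walk it again.
        b.ranges
            .into_iter()
            .map(|r| Bundle { ranges: vec![r], splits_so_far: next_splits })
            .collect()
    }
}

fn main() {
    // A bundle that has already been split K times bottoms out into one
    // piece per live range.
    let b = Bundle {
        ranges: (0u32..20).map(|i| (i * 10, i * 10 + 5)).collect(),
        splits_so_far: MAX_SPLITS_PER_ORIG_BUNDLE,
    };
    println!("split into {} pieces", split_bundle(b, 55).len());
}
```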
Also, during splitting, we were copying the Use list over to the new second half and truncating it in the first, but not shrink_to_fit'ing. So we had O(n^2) memory at the end of the run too. D'oh.
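For context on why the missing shrink_to_fit matters: Vec::truncate drops elements but keeps the allocation, so copying the tail into a fresh Vec and truncating the front at every split leaves each earlier piece holding its original, mostly-unused capacity. A small standalone demonstration with a plain Vec (not the actual Use-list code):

```rust
fn main() {
    // Simulate "split at first conflict": each split peels a few uses off the
    // front (the truncated first piece) and copies the long tail into a fresh
    // Vec, as described above.
    let n: u32 = 4096;
    let mut pieces: Vec<Vec<u32>> = vec![(0..n).collect()];
    while pieces.last().unwrap().len() > 8 {
        let last = pieces.last_mut().unwrap();
        let tail = last[4..].to_vec(); // fresh, exactly-sized allocation for the tail
        last.truncate(4);              // drops elements but keeps the old, large capacity!
        // last.shrink_to_fit();       // <- the fix: give the unused space back
        pieces.push(tail);
    }
    let total_len: usize = pieces.iter().map(|p| p.len()).sum();
    let total_cap: usize = pieces.iter().map(|p| p.capacity()).sum();
    // Without shrink_to_fit, the summed capacity ends up quadratic in the
    // original list length; with the fix it stays close to the summed length.
    println!("total len = {total_len}, total capacity = {total_cap}");
}
```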
Finally, handling of call-ABI clobbers had a bit too much overhead because clobbers were treated as normal defs; I went ahead and resolved an old TODO and used the proper clobbers API, and also adopted a bitmask-based clobbers representation rather than a list. On the Cranelift side, the clobbers list is now a const bitmask for each ABI rather than a dynamically-built thing with allocations and all the rest (see the sketch at the end of this comment).

This moved the needle on compilation of the above significantly:
% perf stat ../wasmtime/target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for '../wasmtime/target/release/wasmtime compile /home/cfallin/testfile.wasm':

          4,206.10 msec task-clock                # 1.053 CPUs utilized
              4,340      context-switches         # 1.032 K/sec
                822      cpu-migrations           # 195.431 /sec
          1,163,585      page-faults              # 276.643 K/sec
     16,856,753,781      cycles                   # 4.008 GHz                     (83.36%)
      1,621,615,014      stalled-cycles-frontend  # 9.62% frontend cycles idle    (83.11%)
      3,111,090,359      stalled-cycles-backend   # 18.46% backend cycles idle    (83.35%)
     28,553,303,978      instructions             # 1.69 insn per cycle
                                                  # 0.11 stalled cycles per insn  (83.38%)
      6,475,239,780      branches                 # 1.539 G/sec                   (83.50%)
         16,905,250      branch-misses            # 0.26% of all branches         (83.33%)

        3.995578486 seconds time elapsed

        2.763566000 seconds user
        1.382605000 seconds sys

% perf stat target/release/wasmtime compile ~/testfile.wasm

 Performance counter stats for 'target/release/wasmtime compile /home/cfallin/testfile.wasm':

          1,006.23 msec task-clock                # 1.267 CPUs utilized
              3,825      context-switches         # 3.801 K/sec
                745      cpu-migrations           # 740.388 /sec
             46,823      page-faults              # 46.533 K/sec
      4,000,880,722      cycles                   # 3.976 GHz                     (83.93%)
        285,506,402      stalled-cycles-frontend  # 7.14% frontend cycles idle    (83.77%)
        302,458,733      stalled-cycles-backend   # 7.56% backend cycles idle     (82.24%)
      4,816,665,288      instructions             # 1.20 insn per cycle
                                                  # 0.06 stalled cycles per insn  (83.49%)
        869,534,746      branches                 # 864.151 M/sec                 (83.48%)
         11,265,004      branch-misses            # 1.30% of all branches         (83.27%)

        0.794473768 seconds time elapsed

        0.844001000 seconds user
        0.143025000 seconds sys
Or in other words, 4x faster compilation and 24x fewer page faults (~= 24x less anon memory used).
In comparison, Wasmtime v0.36 (pre-regalloc2) is:
% perf stat ~/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile ~/testfile.wasm

 Performance counter stats for '/home/cfallin/Downloads/wasmtime-v0.36.0-x86_64-linux/wasmtime compile /home/cfallin/testfile.wasm':

            959.79 msec task-clock                # 1.233 CPUs utilized
              5,047      context-switches         # 5.258 K/sec
                697      cpu-migrations           # 726.199 /sec
             58,171      page-faults              # 60.608 K/sec
      3,792,924,189      cycles                   # 3.952 GHz                     (83.95%)
        234,549,074      stalled-cycles-frontend  # 6.18% frontend cycles idle    (82.94%)
        258,495,205      stalled-cycles-backend   # 6.82% backend cycles idle     (82.15%)
      5,110,076,091      instructions             # 1.35 insn per cycle
                                                  # 0.05 stalled cycles per insn  (83.41%)
      1,102,335,350      branches                 # 1.149 G/sec                   (83.58%)
         11,660,266      branch-misses            # 1.06% of all branches         (84.11%)

        0.778638937 seconds time elapsed

        0.772824000 seconds user
        0.166435000 seconds sys
So v0.36 is ever-so-slightly faster (by ~5%), but curiously, current main-with-fixes runs ~5% fewer instructions during compilation and just gets a lower IPC. There are also fewer page faults (== less memory) with current main. These numbers are close enough to "within noise" that I'd want to measure more carefully before making strong claims here, but given the above I do feel comfortable saying "anomaly fixed and back to parity".

I suspect this may be the same issue we saw in #4045 as well, but I haven't verified that.
I'll put up proper PRs next week, when I'm fully back; for now the branches are here (regalloc2) and here (Cranelift).
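To illustrate the clobbers representation change mentioned above, here is a hypothetical sketch; the register numbers are invented and this is not the actual regalloc2/Cranelift clobbers API. The point is that a const bitmask is computed once per ABI and iterated with bit tricks, instead of building an allocated list of clobber defs at every callsite:

```rust
// Hypothetical sketch: a calling convention's clobbered registers as a const
// bitmask rather than a Vec of defs built per call. Register numbers here are
// made up and do not correspond to any real ABI.

/// One bit per physical register.
type PRegSet = u64;

const fn bit(preg: u32) -> PRegSet {
    1u64 << preg
}

/// Caller-saved registers for some hypothetical ABI, computed once at
/// compile time instead of allocating a fresh list at every callsite.
const HYPOTHETICAL_ABI_CLOBBERS: PRegSet =
    bit(0) | bit(1) | bit(2) | bit(8) | bit(9) | bit(10) | bit(11);

/// Old-style approach for comparison: a dynamically built list of clobbers,
/// allocated for every call instruction.
fn clobber_list_dynamic() -> Vec<u32> {
    vec![0, 1, 2, 8, 9, 10, 11]
}

fn main() {
    // The allocator can iterate the bitmask cheaply, with no allocation.
    let mut clobbered = Vec::new();
    let mut set = HYPOTHETICAL_ABI_CLOBBERS;
    while set != 0 {
        clobbered.push(set.trailing_zeros());
        set &= set - 1; // clear the lowest set bit
    }
    assert_eq!(clobbered, clobber_list_dynamic());
    println!("clobbers: {clobbered:?}");
}
```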
cfallin closed issue #4291: