Hello, I'm a (W)ASM novice attempting to figure out why a precompiled WASM binary seems to have additional memory accesses compared to the LLVM version.
I am not the author of the program; however, I am in communication with them.
These are two perf reports of the same function.
LLVM
WASM
The WASM version seems to have additional memory accesses that aren't present in the LLVM version, and more overall time spent shuffling memory.
This is in a virtual machine interpreter loop, so I might have to do some digging to find an isolated, reproducible case.
Was the WASM binary compiled with SIMD enabled? x86_64 mandates at least SSE2 (128-bit SIMD vectors), so you get basic SIMD even without doing anything special. WASM only introduced SIMD some time after its first release, so clang and rustc by default compile without SIMD usage enabled for better compatibility. This somewhat reduces performance and, in general, doubles the number of memory accesses needed to copy values larger than 64 bits.
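(For reference, a minimal sketch of enabling the Wasm SIMD proposal in each toolchain; `kernel.c` and the cargo invocation are placeholders, not this project's actual build:)

```shell
# clang: target wasm32-wasi with the 128-bit SIMD proposal enabled
clang --target=wasm32-wasi -msimd128 -O2 -c kernel.c -o kernel.o

# rustc: the same feature, spelled as a target feature
RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-wasi --release
```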
I compiled a 2nd time with SIMD enabled.
Also try with https://github.com/bytecodealliance/wasmtime/pull/3989. This PR switches to a new register allocator, which should produce faster code than the current one.
It doesn't seem to make a difference. However, this build doesn't include a lot of SIMD code, so I might try to enable that and recompile as WASM with SIMD enabled.
Oh, thanks! I'll try regalloc2.
@Rowan Cannaday when you say "LLVM version", do you mean a build to a native binary? And when you say "additional memory access", it sounds like you mean the static count (number of loads/stores in the disassembly), rather than some runtime measurement?
The reason I ask for clarification on the first is that comparing Wasm vs. native is quite different than comparing compiler backends. In other words this isn't so much (or just) a Cranelift vs. LLVM question as it is a "running in a sandbox" vs. "running in a native runtimeless environment" question, if I am reading this correctly.
The reason I ask for clarification on the second is that running Wasm code under Wasmtime involves code paths in the runtime as well, which will naturally access memory; but if we're just looking at static disassemblies then we don't have to worry about that.
The Wasm-vs-native comparison will at least imply additional memory accesses for indirect (function pointer or virtual method) calls, as these go through tables; and some memory accesses to get the Wasm memory info from the VM context; and memory accesses to the VM context for stack-limit checking, if configured; and a little more metadata handling on calls to imported functions; and probably some other stuff I'm forgetting.
Then when we get to the actual compiler comparison, even with identical IR (i.e. in a hypothetical world where we compare a native compiler using Cranelift vs clang+LLVM, or where we compare Wasmtime+Cranelift to a Wasm frontend + LLVM), I wouldn't be surprised if we do a bit worse, because LLVM can do redundant load elimination, store-to-load forwarding, dead store elimination, and in general reason about memory operations more fully than we can. Some of this is on our TODO list for optimizations to build, but some of it is also not possible for Wasm code due to strict trapping semantics.
Anyway, depending on clarifications above I'd be happy to discuss further and see what we can do!
Thanks @Chris Fallin, this is helpful. As I said, I'm an ASM novice, so I'm probably not asking the right questions. I'm also not the person who wrote the application. I'm mostly trying to figure out where the bottlenecks are specific to my use-case so that over time they can be contextualized.
I'm pasting dzaima's comment from another channel (I'm acting as the go-between: WASM optimization is low on his priority list right now).
Cranelift isn't meant to be only JITted; its calling convention is such that it's easy to use for non-JITted code, i.e. everything is passed on the stack. Then there's the fact that every memory read and write must have bounds checks added, which requires knowing the bounds in the first place, which you have to get by reading some global variable in RAM at the start of the function. Usually that's fine, as most functions are large, but scalar code in CBQN can call a ton of tiny functions, where that overhead is pretty big.
So it sounds like this is currently just the reality of compiling to and running as WASM, with the biggest contributors being:
CBQN doesn't have SIMD operations for WASM (only x86), so this is still a potential improvement too.
The following compile-time options are being used (pasted as-is, truncated):

```
--target=wasm32-wasi" LDFLAGS="-lwasi-emulated-mman --target=wasm32-wasi -Wl,-z,stack-size=8388608 -Wl,--initial-memory=67108864" LD_LIBS= PIE= c
```
Thanks again!
> Cranelift isn't meant to be only JITted; its calling convention is such that it's easy to use for non-JITted code, i.e. everything is passed on the stack.
Hmm, that's interesting -- Wasm-to-Wasm calls use the standard ABI on the platform (on Linux/x86-64, this is the System V ABI, which puts the first 6 int args / first 8 float args in registers). It's not clear to me what JIT vs non-JIT modes of use would have to do with this -- in either case code is ultimately invoked by a function call, and it either has the right ABI or you use a trampoline.
Anyway, that's a small point, and the bigger point of Wasm imposing some overheads stands. It's worth noting that this overhead isn't for nothing: it's the cost of software-enforced sandboxing, which guarantees that Wasm code cannot touch memory or corrupt state outside its heap. Whether that is more important than the last bit of performance is up to the particular application, though we'll keep looking at ways to shrink the gap!
Last updated: Dec 23 2024 at 12:05 UTC