Cranelift generating additional memory access opcodes? · wasm

Rowan Cannaday (Apr 09 2022 at 20:53):

Hello, I'm a (W)ASM novice attempting to figure out why a precompiled WASM binary seems to have additional memory accesses compared to the LLVM version.
I am not the author of the program, however I communicate with them.

The WASM version seems to have additional memory accesses that aren't present in the LLVM version, and more overall time spent shuffling memory.

This is in a virtual machine interpreter loop, so I might have to do some digging to find an isolated, reproducible case.

bjorn3 (Apr 09 2022 at 21:15):

Was the WASM binary compiled with SIMD enabled? x86_64 mandates at least SSE2 (128-bit SIMD vectors), so you get basic SIMD even without doing anything special. WASM only introduced SIMD some time after the first release, so clang and rustc compile without SIMD by default for better compatibility. This somewhat reduces performance and in general doubles the number of memory accesses needed to copy values larger than 64 bits.
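To make the point about copies concrete, here is a minimal sketch in C. The `Value128` type and `copy_value` function are hypothetical names made up for illustration, not taken from CBQN. The comments describe what a compiler would typically emit, not a guarantee for any particular toolchain version.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte value, standing in for anything wider than 64 bits. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Value128;

static_assert(sizeof(Value128) == 16, "expected a 128-bit value");

/* Copies one 128-bit value.
 * Native x86_64: can be one 16-byte SSE2 load plus one 16-byte store.
 * wasm32 without SIMD: typically two 64-bit loads plus two 64-bit stores,
 * because the widest scalar memory access is 64 bits. */
void copy_value(Value128 *dst, const Value128 *src) {
    memcpy(dst, src, sizeof *dst);
}
```

With clang targeting Wasm, passing `-msimd128` is the usual way to opt in to 128-bit vector instructions. Whether that removes the extra accesses in a given build would have to be checked in the disassembly.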

Rowan Cannaday (Apr 09 2022 at 21:20):

bjorn3 (Apr 09 2022 at 21:21):

Switch Cranelift over to regalloc2. by cfallin · Pull Request #3989 · bytecodealliance/wasmtime

This is a draft PR for now, meant to serve as a discussion-starter. I'll work on splitting this into logically separate commits next week, but wanted to get the initial thing up first. All tests pa...

Rowan Cannaday (Apr 09 2022 at 21:21):

It doesn't seem to make a difference. However, this build doesn't include a lot of SIMD code, so I might try to enable that and recompile as WASM with SIMD enabled.

Rowan Cannaday (Apr 09 2022 at 21:21):

Chris Fallin (Apr 10 2022 at 00:15):

@Rowan Cannaday when you say "LLVM version", do you mean a build to a native binary? And when you say "additional memory access", it sounds like you mean the static count (number of loads/stores in the disassembly), rather than some runtime measurement?

The reason I ask for clarification on the first is that comparing Wasm vs. native is quite different than comparing compiler backends. In other words this isn't so much (or just) a Cranelift vs. LLVM question as it is a "running in a sandbox" vs. "running in a native runtimeless environment" question, if I am reading this correctly.

The reason I ask for clarification on the second is that running Wasm code under Wasmtime involves code paths in the runtime as well, which will naturally access memory; but if we're just looking at static disassemblies then we don't have to worry about that.

The Wasm-vs-native comparison will at least imply additional memory accesses for indirect (function pointer or virtual method) calls, as these go through tables; and some memory accesses to get the Wasm memory info from the VM context; and memory accesses to the VM context for stack-limit checking, if configured; and a little more metadata handling on calls to imported functions; and probably some other stuff I'm forgetting.
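As a rough model of why an indirect call costs extra memory accesses under Wasm, the sketch below spells out the steps in C. Every name here (`VmContext`, `TableEntry`, `call_indirect`, `SIG_BINOP_I32`) is hypothetical, and the layout is an assumption for illustration. It is not Wasmtime's actual data structure, and a real engine traps on failure instead of returning an error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*BinOp)(int32_t, int32_t);

enum { SIG_BINOP_I32 = 1 };  /* hypothetical signature id */

/* Hypothetical table entry: a signature id plus the code pointer. */
typedef struct {
    uint32_t sig_id;
    BinOp func;
} TableEntry;

/* Hypothetical stand-in for the per-instance VM context. */
typedef struct {
    const TableEntry *table;
    uint32_t table_len;
} VmContext;

int32_t demo_add(int32_t a, int32_t b) { return a + b; }
int32_t demo_mul(int32_t a, int32_t b) { return a * b; }

/* Returns 0 on success, -1 where a real engine would trap. */
int call_indirect(const VmContext *vm, uint32_t index,
                  int32_t a, int32_t b, int32_t *out) {
    assert(vm != NULL && out != NULL);
    if (index >= vm->table_len) return -1;        /* load length, bounds check */
    const TableEntry *entry = &vm->table[index];  /* load table base */
    if (entry->func == NULL || entry->sig_id != SIG_BINOP_I32)
        return -1;                                /* load and check signature */
    *out = entry->func(a, b);                     /* load code pointer, call */
    return 0;
}
```

A native indirect call through a C function pointer needs only the last of those loads, which is one way the extra accesses show up in a static disassembly.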

Then when we get to the actual compiler comparison, even with identical IR (ie in a hypothetical world where we compare a native compiler using Cranelift vs clang+LLVM, or where we compare Wasmtime+Cranelift to a Wasm frontend + LLVM), I wouldn't be surprised if we do a bit worse, because LLVM can do redundant load elimination, store-to-load forwarding, dead store elimination, and in general reason about memory operations more fully than we can. Some of this is on our TODO list for optimizations to build, but some of it is also not possible for Wasm code due to strict trapping semantics.
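For readers unfamiliar with the optimizations named above, here is a small hand-written illustration of store-to-load forwarding in C. The function names are hypothetical. The second function is what an optimizer could turn the first into once it knows the reload reads the address that was just written. It is a sketch of the idea, not output from LLVM or Cranelift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* As written: three memory accesses on *counter (load, store, load). */
int32_t bump_and_sum_naive(int32_t *counter) {
    assert(counter != NULL);
    int32_t first = *counter;   /* load */
    *counter = first + 1;       /* store */
    int32_t second = *counter;  /* reload of the value just stored */
    return first + second;
}

/* After store-to-load forwarding: the reload is replaced by the stored
 * value, leaving two memory accesses (load, store). */
int32_t bump_and_sum_forwarded(int32_t *counter) {
    assert(counter != NULL);
    int32_t first = *counter;
    int32_t updated = first + 1;
    *counter = updated;
    return first + updated;
}
```

Both functions return the same result and leave the same value in memory. The difference is only in how many loads remain in the generated code.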

Anyway, depending on clarifications above I'd be happy to discuss further and see what we can do!

Rowan Cannaday (Apr 10 2022 at 14:34):

Thanks @Chris Fallin, this is helpful. As I said, I'm an ASM novice, so I'm probably not asking the right questions. I'm also not the person who wrote the application. I'm mostly trying to figure out where the bottlenecks specific to my use case are, so that over time they can be put in context.

I'm pasting dzaima's comment from another channel (I'm instigating as the go-between: WASM optimization is low on his priority list right now).

so it sounds like this is currently just the reality of compiling and running to wasm with the biggest contributors being:

CBQN doesn't have SIMD operations for WASM (only x86), so this is still a potential improvement too.

 --target=wasm32-wasi" LDFLAGS="-lwasi-emulated-mman --target=wasm32-wasi -Wl,-z,stack-size=8388608 -Wl,--initial-memory=67108864" LD_LIBS= PIE= c

GitHub - dzaima/CBQN: a BQN implementation in C

Chris Fallin (Apr 10 2022 at 19:31):

Hmm, that's interesting -- Wasm-to-Wasm calls use the standard ABI on the platform (on Linux/x86-64, this is the System V ABI, which puts the first 6 int args / first 8? float args in registers). It's not clear to me what JIT vs non-JIT modes of use would have to do with this -- in either case code is ultimately invoked by a function call, and it either has the right ABI or you use a trampoline.

Anyway, that's a small point, and the bigger point of Wasm imposing some overheads stands. It's worth noting that this overhead isn't for nothing: it's the cost of software-enforced sandboxing, which guarantees that Wasm code cannot touch memory or corrupt state outside its heap. Whether that is more important than the last bit of performance is up to the particular application, though we'll keep looking at ways to shrink the gap!

Stream: wasm

Topic: Cranelift generating additional memory access opcodes?

Rowan Cannaday (Apr 09 2022 at 20:53):
