Stream: wasm

Topic: Cranelift generating additional memory access opcodes?


Rowan Cannaday (Apr 09 2022 at 20:53):

Hello, I'm a (W)ASM novice and am attempting to figure out why a precompiled WASM binary seems to have additional memory accesses compared to the LLVM version.
I am not the author of the program, however I communicate with them.

These are two perf reports of the same function, one for the LLVM build and one for the WASM build.

the WASM version seems to have additional memory accesses that aren't present in the LLVM version, and more overall time spent shuffling memory.

this is in a virtual machine interpreter loop, so I might have to do some digging to find an isolated, reproducible case.

bjorn3 (Apr 09 2022 at 21:15):

Was the WASM binary compiled with SIMD enabled? x86_64 mandates at least SSE2 (128-bit SIMD vectors), so you get basic SIMD even without doing anything special. WASM only introduced SIMD some time after the first release, so clang and rustc by default compile without SIMD enabled for better compatibility. This somewhat reduces performance and in general doubles the number of memory accesses needed to copy values larger than 64 bits.
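A rough illustration of the arithmetic behind that last point, not a measurement of any real compiler output: copying a value takes one load and one store per chunk, and the chunk size is capped by the widest access available. The function name and the byte widths are assumptions chosen for the example.

```python
import math

def copy_accesses(value_bytes: int, max_access_bytes: int) -> int:
    """Count loads plus stores needed to copy a value chunk by chunk."""
    chunks = math.ceil(value_bytes / max_access_bytes)
    return 2 * chunks  # one load and one store per chunk

# A 16-byte (128-bit) value, e.g. a small struct:
with_simd = copy_accesses(16, 16)    # one 128-bit load + one 128-bit store
without_simd = copy_accesses(16, 8)  # two 64-bit loads + two 64-bit stores
print(with_simd, without_simd)
```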

Rowan Cannaday (Apr 09 2022 at 21:20):

I compiled a 2nd time with SIMD enabled.

bjorn3 (Apr 09 2022 at 21:21):

Also try with https://github.com/bytecodealliance/wasmtime/pull/3989 This PR switches to a new register allocator which should produce faster code than the current register allocator.

Rowan Cannaday (Apr 09 2022 at 21:21):

it doesn't seem to make a difference, however this build doesn't include a lot of SIMD code, so I might try to enable that and recompile as wasm w/ SIMD enabled

Rowan Cannaday (Apr 09 2022 at 21:21):

oh thx! I'll try regalloc2

Chris Fallin (Apr 10 2022 at 00:15):

@Rowan Cannaday when you say "LLVM version", do you mean a build to a native binary? And when you say "additional memory access", it sounds like you mean the static count (number of loads/stores in the disassembly), rather than some runtime measurement?

The reason I ask for clarification on the first is that comparing Wasm vs. native is quite different than comparing compiler backends. In other words this isn't so much (or just) a Cranelift vs. LLVM question as it is a "running in a sandbox" vs. "running in a native runtimeless environment" question, if I am reading this correctly.

The reason I ask for clarification on the second is that running Wasm code under Wasmtime involves code paths in the runtime as well, which will naturally access memory; but if we're just looking at static disassemblies then we don't have to worry about that.

The Wasm-vs-native comparison will at least imply additional memory accesses for indirect (function pointer or virtual method) calls, as these go through tables; and some memory accesses to get the Wasm memory info from the VM context; and memory accesses to the VM context for stack-limit checking, if configured; and a little more metadata handling on calls to imported functions; and probably some other stuff I'm forgetting.
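To make the indirect-call point concrete, here is a toy model of a table-based call. It is only a sketch: the `Table` class, the `metadata_reads` counter, and the integer type ids are hypothetical and do not reflect Wasmtime's actual data layout. It shows the checks Wasm semantics require before the callee runs, each of which needs a read that a plain native call through a function pointer would skip.

```python
class Trap(Exception):
    """Stands in for a Wasm trap."""

class Table:
    def __init__(self, entries):
        self.entries = entries   # each entry: (type_id, function) or None
        self.metadata_reads = 0  # reads a direct native call would not need

    def call_indirect(self, index, expected_type, *args):
        self.metadata_reads += 1  # read the table length
        if not 0 <= index < len(self.entries):
            raise Trap("undefined element")
        self.metadata_reads += 1  # read the table entry
        entry = self.entries[index]
        if entry is None:
            raise Trap("uninitialized element")
        type_id, func = entry
        self.metadata_reads += 1  # read the callee's signature id
        if type_id != expected_type:
            raise Trap("indirect call type mismatch")
        return func(*args)

def add(a, b):
    return a + b

table = Table([(0, add), None])
print(table.call_indirect(0, 0, 2, 3), table.metadata_reads)
```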

Then when we get to the actual compiler comparison, even with identical IR (i.e. in a hypothetical world where we compare a native compiler using Cranelift vs clang+LLVM, or where we compare Wasmtime+Cranelift to a Wasm frontend + LLVM), I wouldn't be surprised if we do a bit worse, because LLVM can do redundant load elimination, store-to-load forwarding, dead store elimination, and in general reason about memory operations more fully than we can. Some of this is on our TODO list for optimizations to build, but some of it is also not possible for Wasm code due to strict trapping semantics.
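For readers unfamiliar with those optimizations, the following toy pass sketches redundant load elimination and store-to-load forwarding on straight-line code. It is not how LLVM or Cranelift implement them. The tuple-based IR and the `forward_loads` name are made up for illustration, and the pass assumes each register is defined once, that two addresses alias only if their names are equal, and that a call may overwrite any memory. It ignores traps entirely.

```python
def forward_loads(ops):
    """Replace loads whose value is already in a register with copies."""
    known = {}  # address -> register currently holding that memory's value
    out = []
    for op in ops:
        kind = op[0]
        if kind == "load":
            _, dst, addr = op
            if addr in known:
                # Value already available: no memory access needed.
                out.append(("copy", dst, known[addr]))
                continue
            known[addr] = dst
        elif kind == "store":
            _, addr, src = op
            known[addr] = src  # a later load of addr can reuse src
        elif kind == "call":
            known.clear()      # the callee may have written anywhere
        out.append(op)
    return out

ops = [
    ("load", "v0", "p"),
    ("load", "v1", "p"),   # redundant: same address as the previous load
    ("store", "q", "v1"),
    ("load", "v2", "q"),   # forwarded from the store just above
    ("call",),
    ("load", "v3", "p"),   # must stay: the call may have changed p
]
for op in forward_loads(ops):
    print(op)
```

In this example four loads become two, which is the kind of difference that shows up as a lower static count of memory accesses in a disassembly.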

Anyway, depending on clarifications above I'd be happy to discuss further and see what we can do!

Rowan Cannaday (Apr 10 2022 at 14:34):

Thanks @Chris Fallin, this is helpful. As I said, I'm an ASM novice, so I'm probably not asking the right questions. I'm also not the person who wrote the application. I'm mostly trying to figure out where the bottlenecks are specific to my use-case such that over time they can be contextualized.

I'm pasting dzaima's comment from another channel (I'm instigating as the go-between: WASM optimization is low on his priority list right now).

"cranelift isn't meant to be only JITted, its calling convention is such that it's easy to use for non-JITted code, i.e. everything is passed on the stack. Then there's the fact that every memory read & write must have bounds checks added, which requires knowing the bounds in the first place, which you'll have to get from reading some global variable in RAM at the start of the function. Usually that's fine as most functions are large, but scalar code in CBQN can call a ton of tiny functions, where that overhead is pretty big"

so it sounds like this is currently just the reality of compiling and running to wasm with the biggest contributors being:

CBQN doesn't have SIMD operations for WASM (only x86), so this is still a potential improvement too.

the following compile time options are being used:

 --target=wasm32-wasi" LDFLAGS="-lwasi-emulated-mman --target=wasm32-wasi -Wl,-z,stack-size=8388608 -Wl,--initial-memory=67108864" LD_LIBS= PIE= c

Thanks again!
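The bounds-check overhead described in the quoted comment can be modeled roughly as follows. This is a sketch of the explicit-check model only: `VMContext`, `heap_bound`, and the `context_reads` counter are hypothetical names, not Wasmtime's real structures. Each access first fetches the memory limit from a per-instance context and then compares before touching the heap.

```python
import struct

class Trap(Exception):
    """Stands in for a Wasm trap."""

class VMContext:
    def __init__(self, size):
        self.memory = bytearray(size)  # the sandboxed linear memory
        self.context_reads = 0         # reads needed just to learn the bound

    def heap_bound(self):
        self.context_reads += 1
        return len(self.memory)

def load_i32(ctx, addr):
    bound = ctx.heap_bound()           # extra read before the real access
    if addr < 0 or addr + 4 > bound:
        raise Trap("out of bounds memory access")
    return struct.unpack_from("<i", ctx.memory, addr)[0]  # little-endian

def store_i32(ctx, addr, value):
    bound = ctx.heap_bound()
    if addr < 0 or addr + 4 > bound:
        raise Trap("out of bounds memory access")
    struct.pack_into("<i", ctx.memory, addr, value)

ctx = VMContext(64)
store_i32(ctx, 8, 1234)
print(load_i32(ctx, 8), ctx.context_reads)
```

Engines can sometimes replace the explicit comparison with virtual-memory guard regions, so treat this as a model of the cost being described rather than of the code Wasmtime actually emits.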

Chris Fallin (Apr 10 2022 at 19:31):

"cranelift isn't meant to be only JITted, its calling convention is such that it's easy to use for non-JITted code, i.e. everything is passed on the stack."

Hmm, that's interesting -- Wasm-to-Wasm calls use the standard ABI on the platform (on Linux/x86-64, this is the System V ABI, which puts the first 6 int args and the first 8 float args in registers). It's not clear to me what JIT vs non-JIT modes of use would have to do with this -- in either case code is ultimately invoked by a function call, and it either has the right ABI or you use a trampoline.

Anyway, that's a small point, and the bigger point of Wasm imposing some overheads stands. It's worth noting that this overhead isn't for nothing: it's the cost of software-enforced sandboxing, which guarantees that Wasm code cannot touch memory or corrupt state outside its heap. Whether that is more important than the last bit of performance is up to the particular application, though we'll keep looking at ways to shrink the gap!


Last updated: Jan 24 2025 at 00:11 UTC