I'm trying to figure out if there's a built-in way in cranelift to copy a memory region larger than what would fit in a value. For instance, I have a user struct, and a value pointing to its base address, and I want to copy it to a stack slot (for which I also have its base address of course). Do I have to emit individual load / store instructions for each of the primitive fields in the flattened structure? Or is there a more optimal way to express my intent to copy a block of memory with the IR?
Another thing I'm wondering about is the distinction between stack_load / stack_store and load / store. For simplicity, I am currently generating all memory operations as load / store, because it's easy to generate code that way. Every time I have a stack slot, I simply use stack_addr to obtain its address. Will emitting the IR like this lead to fewer optimization opportunities? Or is stack_load / stack_store just an alias for convenience?
There's an open issue about adding Cranelift instructions for memory copies (https://github.com/bytecodealliance/wasmtime/issues/5479). If you're using the cranelift-frontend crate there are also helpers there for emitting sequences of loads/stores for you (emit_small_mem* in https://docs.rs/cranelift-frontend/latest/cranelift_frontend/struct.FunctionBuilder.html), although they don't necessarily produce good code.
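To make "a sequence of loads/stores" concrete, here is a minimal sketch in plain Rust of how a block copy can be split into power-of-two accesses, each of which would become one load plus one store. It is only an illustration: `Chunk`, `plan_copy` and the `max_width` parameter are hypothetical names made up for this example, and this is not the cranelift-frontend implementation, which may pick access sizes differently.

```rust
/// One load/store pair in a hypothetical copy plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Chunk {
    /// Byte offset from the start of both the source and the destination.
    offset: u64,
    /// Access width in bytes (always a power of two).
    width: u64,
}

/// Split a `size`-byte copy into power-of-two chunks.
///
/// `align` is the alignment guaranteed for both base addresses and
/// `max_width` is the widest access we are willing to emit (an assumption
/// of this sketch, e.g. 8 for 64-bit integer loads/stores).
fn plan_copy(size: u64, align: u64, max_width: u64) -> Vec<Chunk> {
    assert!(align.is_power_of_two() && max_width.is_power_of_two());
    let mut chunks = Vec::new();
    let mut offset = 0;
    // Never use an access wider than the guaranteed alignment.
    let mut width = align.min(max_width);
    while offset < size {
        // Shrink the access until it fits into the remaining bytes.
        while width > size - offset {
            width /= 2;
        }
        chunks.push(Chunk { offset, width });
        offset += width;
    }
    chunks
}
```

Under these assumptions a 12-byte struct with 8-byte alignment turns into one 8-byte access at offset 0 followed by one 4-byte access at offset 8, while the same struct with only 1-byte alignment needs twelve single-byte accesses. That is one reason the alignment arguments matter for the quality of the generated code.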
I don't think there's anything wrong with always using stack_addr. I can't immediately think of any optimization opportunities we'd miss as a result, but I'm not guaranteeing anything. :laughing: If you find that Cranelift is generating silly code sequences as a result, we might be able to fix that with either mid-end optimizations or better back-end lowering rules.
distinction between stack_load / stack_store and load / store
FWIW, stack_load / stack_store are legalized to stack_addr followed by regular loads/stores, so there's no need to worry about suboptimal codegen from that particular aspect
There's an open issue about adding Cranelift instructions for memory copies (https://github.com/bytecodealliance/wasmtime/issues/5479)
I'll keep an eye on that :eyes:
emit_small_mem*
Thanks! That seems just what I needed. I was already halfway implementing something myself, but I'd rather use a builtin solution.
although they don't necessarily produce good code.
Good enough for now. Make it run, then make it fast :)
FWIW, stack_load / stack_store are legalized to stack_addr followed by regular loads/stores
I'm so glad to hear this! :sweat_smile: I wasn't looking forward to juggling two different representations of memory during codegen.
On that note, I was taking a look at emit_small_memory_copy and I see it has some alignment requirements for the provided src and dst addresses. When allocating memory by reserving a stack slot, how can I make sure the stack slot will have a specific alignment? create_sized_stack_slot only takes a size parameter, but no alignment
stack_{load,store} might automatically set the "stack" bit on the mem flags, which might give a benefit for our simple alias analysis, but I'm not totally sure if that happens here or is up to the front ends
We had some discussion and consensus in https://github.com/bytecodealliance/wasmtime/issues/5922 that stack slots should have an alignment specifier but I don't think anybody is working on that yet. In that issue bjorn3 noted that "if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends." So that's a workaround you can use for stack-slot alignment for now.
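A minimal sketch of that workaround in plain Rust, assuming the frontend pads every requested slot size before creating the slot. `padded_slot_size` and `slot_offsets` are hypothetical helpers written for this example, and the back-to-back layout in `slot_offsets` is only a toy model to show why every slot has to be padded; it is not how Cranelift actually lays out the frame.

```rust
/// Alignment we want for every stack slot, in bytes.
const SLOT_ALIGN: u32 = 16;

/// Round a requested slot size up to the next multiple of `SLOT_ALIGN`.
/// Assumes `size` is small enough that adding 15 does not overflow.
fn padded_slot_size(size: u32) -> u32 {
    (size + (SLOT_ALIGN - 1)) & !(SLOT_ALIGN - 1)
}

/// Toy model: slots packed back to back, starting at a 16-aligned base.
/// Returns the offset of each slot from that base.
fn slot_offsets(sizes: &[u32]) -> Vec<u32> {
    let mut next = 0;
    sizes
        .iter()
        .map(|&size| {
            let offset = next;
            next += size;
            offset
        })
        .collect()
}
```

In this model, slots of 24, 8 and 40 bytes land at offsets 0, 24 and 32, so the second slot is not 16-aligned. After padding the sizes to 32, 16 and 48 the offsets become 0, 32 and 48. A single unpadded slot is enough to misalign the ones that follow it, which is why the workaround has to be applied to every slot.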
if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends.
Right, thanks a lot! I'll use this workaround too in the meantime :+1:
Why 16 specifically though? Isn't most data 8-byte aligned on 64-bit CPUs?
16-alignment shows up both in ABIs (x86-64 and aarch64 both require SP to be 16-aligned) and in vector ISAs (some 128-bit-vector loads/stores -- actually usually most, except "unaligned" variants -- require 16-alignment)
the former may be because of the latter and stack-spills, I'm not sure
Right, makes sense! Just to make sure I understand: This workaround is just to ensure that my own stack slots are always 16-aligned, but cranelift still ensures the stack pointer is set to a multiple of 16 when I call a function using e.g. the SysV calling convention (or any other that requires it) right? I'm guessing I don't need to manually insert any padding stack slots myself for that
Yeah, cranelift will always ensure that the calling convention is followed with respect to stack alignment. It does so even if you use stack slots of odd sizes. Only if you want your own stack slots to be 16-byte aligned do you need to make them all a multiple of 16 bytes in size.
Last updated: Dec 23 2024 at 12:05 UTC