Stream: cranelift

Topic: Copying memory larger than a single Value (memcpy?)


view this post on Zulip Setzer22 (May 01 2023 at 10:57):

I'm trying to figure out if there's a built-in way in cranelift to copy a memory region larger than what would fit in a value. For instance, I have a user struct, and a value pointing to its base address, and I want to copy it to a stack slot (for which I also have its base address of course). Do I have to emit individual load / store instructions for each of the primitive fields in the flattened structure? Or is there a more optimal way to express my intent to copy a block of memory with the IR?

Another thing I'm wondering about is the distinction between stack_load / stack_store and load / store. For simplicity, I am currently generating all memory operations as load / store, because it's easy to generate code that way. Every time I have a stack slot, I simply use stack_addr to obtain its address. Will emitting the IR like this lead to fewer optimization opportunities? Or is stack_load / stack_store just a convenience alias?

view this post on Zulip Jamey Sharp (May 01 2023 at 16:12):

There's an open issue about adding Cranelift instructions for memory copies (https://github.com/bytecodealliance/wasmtime/issues/5479). If you're using the cranelift-frontend crate there are also helpers there for emitting sequences of loads/stores for you (emit_small_mem* in https://docs.rs/cranelift-frontend/latest/cranelift_frontend/struct.FunctionBuilder.html), although they don't necessarily produce good code.

I don't think there's anything wrong with always using stack_addr. I can't immediately think of any optimization opportunities we'd miss as a result, but I'm not guaranteeing anything. :laughing: If you find that Cranelift is generating silly code sequences as a result, we might be able to fix that with either mid-end optimizations or better back-end lowering rules.
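
For reference, a rough sketch of what calling that helper could look like; everything here (builder, src_addr, size, pointer_ty, target_config) is a placeholder, and the exact parameter list should be checked against the cranelift-frontend docs for the version in use:

```rust
use cranelift_codegen::ir::{MemFlags, StackSlotData, StackSlotKind};

// Reserve a stack slot big enough for the struct and take its address.
let slot = builder.create_sized_stack_slot(StackSlotData::new(
    StackSlotKind::ExplicitSlot,
    size as u32,
));
let dst_addr = builder.ins().stack_addr(pointer_ty, slot, 0);

// Let the frontend helper emit a short sequence of loads/stores that
// copies `size` bytes from `src_addr` into the slot.
builder.emit_small_memory_copy(
    target_config, // TargetFrontendConfig, e.g. from Module::target_config()
    dst_addr,      // destination address (the stack slot)
    src_addr,      // source address (base of the user struct)
    size,          // number of bytes to copy (u64)
    8,             // assumed destination alignment
    8,             // assumed source alignment
    true,          // the two regions do not overlap
    MemFlags::new(),
);
```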


view this post on Zulip Chris Fallin (May 01 2023 at 16:14):

distinction between stack_load / stack_store and load / store

FWIW, stack_load / stack_store are legalized to stack_addr followed by regular loads/stores, so there's no need to worry about suboptimal codegen from that particular aspect
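
To make that equivalence concrete, the two forms below would be expected to lower to essentially the same code (a sketch only; builder, slot, and ptr_ty are placeholders):

```rust
use cranelift_codegen::ir::{types, MemFlags};

// Using the stack_load convenience instruction...
let a = builder.ins().stack_load(types::I64, slot, 0);

// ...versus taking the slot's address explicitly and doing a plain load,
// which is roughly what stack_load is legalized into.
let addr = builder.ins().stack_addr(ptr_ty, slot, 0);
let b = builder.ins().load(types::I64, MemFlags::trusted(), addr, 0);
```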

view this post on Zulip Setzer22 (May 01 2023 at 16:33):

There's an open issue about adding Cranelift instructions for memory copies (https://github.com/bytecodealliance/wasmtime/issues/5479)

I'll keep an eye on that :eyes:

emit_small_mem*

Thanks! That seems like just what I needed. I was already halfway through implementing something myself, but I'd rather use a built-in solution.

although they don't necessarily produce good code.

Good enough for now. Make it run, then make it fast :)

FWIW, stack_load / stack_store are legalized to stack_addr followed by regular loads/stores

I'm so glad to hear this! :sweat_smile: I wasn't looking forward to juggling two different representations of memory during codegen.

On that note, I was taking a look at emit_small_memory_copy and I see it has some requirements on the alignment of the provided src and dst addresses. When allocating memory by reserving a stack slot, how can I make sure the stack slot will have a specific alignment? create_sized_stack_slot only takes a size parameter, but no alignment.

view this post on Zulip fitzgen (he/him) (May 01 2023 at 17:15):

stack_{load,store} might automatically set the "stack" bit on the mem flags, which might give a benefit for our simple alias analysis, but I'm not totally sure if that happens here or is up to the front ends

view this post on Zulip Jamey Sharp (May 01 2023 at 17:15):

We had some discussion and consensus in https://github.com/bytecodealliance/wasmtime/issues/5922 that stack slots should have an alignment specifier but I don't think anybody is working on that yet. In that issue bjorn3 noted that "if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends." So that's a workaround you can use for stack-slot alignment for now.
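
A minimal sketch of that workaround (builder and wanted_size are placeholders):

```rust
use cranelift_codegen::ir::{StackSlotData, StackSlotKind};

// Round the requested size up to the next multiple of 16 so that, per the
// workaround above, the slot ends up 16-byte aligned on current backends.
let padded_size = (wanted_size + 15) & !15u32;
let slot = builder.create_sized_stack_slot(StackSlotData::new(
    StackSlotKind::ExplicitSlot,
    padded_size,
));
```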


view this post on Zulip Setzer22 (May 01 2023 at 18:08):

if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends.

Right, thanks a lot! I'll use this workaround too in the meantime :+1:

view this post on Zulip Setzer22 (May 01 2023 at 18:10):

Why 16 specifically, though? Isn't most data 8-byte aligned on 64-bit CPUs?

view this post on Zulip Chris Fallin (May 01 2023 at 18:12):

16-byte alignment shows up both in ABIs (x86-64 and aarch64 both require SP to be 16-aligned) and in vector ISAs (most 128-bit vector loads/stores, except the "unaligned" variants, require 16-byte alignment)

view this post on Zulip Chris Fallin (May 01 2023 at 18:12):

the former may be because of the latter and stack-spills, I'm not sure

view this post on Zulip Setzer22 (May 01 2023 at 18:30):

Right, makes sense! Just to make sure I understand: this workaround is only to ensure that my own stack slots are always 16-aligned, but cranelift still ensures the stack pointer is a multiple of 16 when I call a function using e.g. the SysV calling convention (or any other that requires it), right? I'm guessing I don't need to manually insert any padding stack slots myself for that.

view this post on Zulip bjorn3 (May 01 2023 at 19:24):

Yeah, cranelift will always ensure that the calling convention is followed with respect to stack alignment, even if you use stack slots of odd sizes. Only if you want your own stack slots to be 16-byte aligned do you need to give them all sizes that are multiples of 16 bytes.

