bjorn3 opened issue #5479:
Feature
Introduce instructions that behave like memcpy and memset. These should lower to repe movsb and repe stosb for memcpy and memset respectively on x86_64 with the ermsb feature. According to https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-a-profile-architecture-developments-2021 there is also an AArch64 extension for this, but I couldn't find more details.
Benefit
Using a native instruction reduces instruction cache bloat and may be faster in some cases. It may also help future optimizations recognize these operations as such, allowing them to be optimized away in some cases. This is very important for the runtime performance of Rust code, as rustc generates a lot of unnecessary copies of locals.
Implementation
The instructions should take an immediate as size argument and be lowered to native instructions if available, or to libcalls to an external memcpy or memset function.
bjorn3 edited issue #5479:
Feature
Introduce instructions that behave like memcpy and memset. These should lower to repe movsb and repe stosb for memcpy and memset respectively on x86_64 with the ermsb feature. According to https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-a-profile-architecture-developments-2021 there is also an AArch64 extension for this, but I couldn't find more details.
Benefit
Using a native instruction reduces instruction cache bloat and may be faster in some cases. It may also help future optimizations recognize these operations as such, allowing them to be optimized away in some cases. This is very important for the runtime performance of Rust code, as rustc generates a lot of unnecessary copies of locals.
Implementation
The instructions should take an immediate as size argument and be lowered to native instructions if available, or to libcalls to an external memcpy or memset function if not available.
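For readers unfamiliar with the string instructions named above, here is a minimal Rust sketch of what the two lowerings would amount to at run time. It is not Cranelift code: the names raw_copy, raw_fill, copy_bytes and fill_bytes are made up for illustration. The inline assembly is only compiled on x86_64, and other targets fall back to the standard library, loosely mirroring the libcall path. The rep spelling used here assembles to the same prefix byte as repe. The instructions execute on every x86_64 CPU; the ermsb feature only indicates that they are fast, and this sketch does not check for it.

```rust
/// x86_64 path: copy `len` bytes with `rep movsb`.
/// Safety: both ranges must be valid for `len` bytes and must not overlap.
#[cfg(target_arch = "x86_64")]
unsafe fn raw_copy(dst: *mut u8, src: *const u8, len: usize) {
    unsafe {
        std::arch::asm!(
            "rep movsb",
            inout("rcx") len => _, // byte count, decremented to zero
            inout("rdi") dst => _, // destination pointer, advanced
            inout("rsi") src => _, // source pointer, advanced
            options(nostack, preserves_flags)
        );
    }
}

/// x86_64 path: fill `len` bytes with `value` using `rep stosb`.
/// Safety: the range must be valid for writes of `len` bytes.
#[cfg(target_arch = "x86_64")]
unsafe fn raw_fill(dst: *mut u8, value: u8, len: usize) {
    unsafe {
        std::arch::asm!(
            "rep stosb",
            inout("rcx") len => _, // byte count, decremented to zero
            inout("rdi") dst => _, // destination pointer, advanced
            in("al") value,        // byte to store
            options(nostack, preserves_flags)
        );
    }
}

/// Fallback for other targets (stand-in for the libcall path).
#[cfg(not(target_arch = "x86_64"))]
unsafe fn raw_copy(dst: *mut u8, src: *const u8, len: usize) {
    unsafe { std::ptr::copy_nonoverlapping(src, dst, len) }
}

/// Fallback for other targets (stand-in for the libcall path).
#[cfg(not(target_arch = "x86_64"))]
unsafe fn raw_fill(dst: *mut u8, value: u8, len: usize) {
    unsafe { std::ptr::write_bytes(dst, value, len) }
}

/// Safe wrapper: copies `src` into `dst`; panics if the lengths differ.
pub fn copy_bytes(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    // A `&mut` slice and a `&` slice cannot overlap, so the copy is sound.
    unsafe { raw_copy(dst.as_mut_ptr(), src.as_ptr(), src.len()) }
}

/// Safe wrapper: sets every byte of `dst` to `value`.
pub fn fill_bytes(dst: &mut [u8], value: u8) {
    unsafe { raw_fill(dst.as_mut_ptr(), value, dst.len()) }
}
```

Calling copy_bytes(&mut dst, &src) on two equal-length buffers leaves dst equal to src, and fill_bytes(&mut buf, 0xAB) sets every byte of buf to 0xAB.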
jameysharp commented on issue #5479:
I have a few notes if somebody else takes this on. (I suspect bjorn3 already knows all this, having written some of it.)
In Cranelift, the ABIMachineSpec trait implemented by backends has a gen_memcpy method. It's used during codegen for function calls by Caller::emit_copy_regs_to_buffer, so I suspect it's well-covered by tests. On x86 and aarch64 it always calls the library memcpy function, but since it's backend-specific it could do something else. (I didn't check the other backends, but they're probably the same.) Making this code available to frontends like cg-clif sounds good to me.
The cranelift_frontend::FunctionBuilder type has methods call_mem{cmp,cpy,move,set} for unconditionally emitting a library call. It also offers emit_small_{memory_compare,memory_copy,memset} to generate an unrolled CLIF loop for buffers smaller than a threshold, falling back to using call_* for larger buffers. I don't see any significant uses of any of these functions in Cranelift or Wasmtime, at any time in the git history. So they probably aren't well tuned and might not even work in general.
I think you're right that the best way to expose these is with CLIF instructions, rather than cranelift_frontend methods. But I'm curious if anybody (like @cfallin?) has other suggestions.
bjorn3 commented on issue #5479:
I use emit_small_* once in cg_clif. It is definitely not well tuned though. It doesn't support copying 128-bit chunks using xmm registers, for example.
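To make the threshold-and-unroll idea from the last two comments concrete, here is a small Rust sketch of how a copy of known size could be split into load/store chunks, with a libcall above a threshold. This is a simplified model, not the algorithm emit_small_memory_copy actually uses: CopyPlan, plan_copy and both parameters are hypothetical names, and alignment is ignored entirely. Passing 16 as max_width models the 128-bit xmm chunks mentioned above.

```rust
/// Hypothetical description of how a fixed-size copy would be emitted.
#[derive(Debug, PartialEq)]
pub enum CopyPlan {
    /// Inline sequence of (byte offset, access width in bytes) load/store pairs.
    Inline(Vec<(u64, u8)>),
    /// Too large to unroll: call the external memcpy instead.
    LibCall,
}

/// Plans a copy of `size` bytes.
///
/// `threshold` is the largest size that is still unrolled, and `max_width`
/// is the widest single access in bytes (assumed to be a power of two,
/// e.g. 8 for a 64-bit register or 16 for a 128-bit vector register).
pub fn plan_copy(size: u64, threshold: u64, max_width: u8) -> CopyPlan {
    assert!(max_width.is_power_of_two());
    if size > threshold {
        return CopyPlan::LibCall;
    }
    let mut ops = Vec::new();
    let mut offset = 0u64;
    let mut width = u64::from(max_width);
    while offset < size {
        // Shrink the access until it fits in the bytes that remain.
        while width > size - offset {
            width /= 2;
        }
        ops.push((offset, width as u8));
        offset += width;
    }
    CopyPlan::Inline(ops)
}
```

With a threshold of 64, a 13-byte copy limited to 8-byte accesses becomes three pairs (8, 4 and 1 bytes wide), while a 65-byte copy becomes a libcall.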
cfallin commented on issue #5479:
Yeah, I think it's reasonable to create CLIF opcodes for these. memcpy/memset are among the canonical primitives you usually get in a compiler's intrinsics; we don't have a separate notion of intrinsic calls, so new opcodes are the way forward. This would then give us one central implementation we could use where needed (e.g. for struct args on the stack, as noted above) and that we could optimize well.
jameysharp commented on issue #5479:
In #5564, @Kixiron suggested that these cranelift-frontend functions ought to take separate MemFlags for each address operand. I think that's a good suggestion, but that we should do it with these new proposed instructions instead of putting any more development into the cranelift-frontend versions.
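As an illustration of why separate flags per address operand matter, the following Rust sketch models a memcpy-like operation whose loads carry the source flags and whose stores carry the destination flags. Everything here is hypothetical: Flags is a three-field stand-in for Cranelift's MemFlags, and MemcpyOperands, Access and expand are invented for this example rather than taken from any proposed instruction format.

```rust
/// Hypothetical stand-in for Cranelift's `MemFlags`; only three flags are modelled.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Flags {
    pub aligned: bool,
    pub notrap: bool,
    pub readonly: bool,
}

/// Hypothetical operands of a memcpy-like instruction with per-address flags.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct MemcpyOperands {
    pub dst_flags: Flags,
    pub src_flags: Flags,
    pub size: u64,
}

/// One memory access produced by expanding the copy.
#[derive(Debug, PartialEq)]
pub enum Access {
    Load { offset: u64, flags: Flags },
    Store { offset: u64, flags: Flags },
}

/// Expands the copy into byte-sized load/store pairs. Each load inherits
/// the source flags and each store inherits the destination flags, so a
/// property such as `readonly` on the source never leaks onto the stores.
pub fn expand(op: MemcpyOperands) -> Vec<Access> {
    (0..op.size)
        .flat_map(|offset| {
            [
                Access::Load { offset, flags: op.src_flags },
                Access::Store { offset, flags: op.dst_flags },
            ]
        })
        .collect()
}
```

With a single shared flag set, a readonly source could not be expressed at all, because the same flags would also apply to the stores into the destination.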
Kixiron commented on issue #5479:
I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely an overall better approach.
Kixiron edited a comment on issue #5479:
I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely an overall better approach, since they'd allow everything that the current approach does and then some, like automatically lowering calls with known lengths to unrolled versions (essentially giving us emit_small_* for free, but also applicable to const-eval'd and dataflow scenarios).
Last updated: Jan 24 2025 at 00:11 UTC