Stream: git-wasmtime

Topic: wasmtime / issue #5479 Cranelift: Introduce memcpy and me...


Wasmtime GitHub notifications bot (Dec 20 2022 at 09:46):

bjorn3 opened issue #5479:

Feature

Introduce instructions that behave like memcpy and memset. These should lower to rep movsb and rep stosb for memcpy and memset respectively on x86_64 with the ermsb feature. According to https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-a-profile-architecture-developments-2021 there is also an AArch64 extension for this, but I couldn't find more details.

Benefit

Using a native instruction reduces instruction cache bloat and may be faster in some cases. Having dedicated instructions may also help future optimizations recognize these operations, so that some of them can be optimized away. This is very important for runtime performance of Rust code, as rustc generates a lot of unnecessary copies of locals.

Implementation

The instructions should take the size as an immediate argument and be lowered to native instructions if available, or to libcalls to an external memcpy or memset function otherwise.
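
A minimal sketch of the lowering decision described above, written as a standalone Rust model rather than real Cranelift code. The names MemcpyLowering, TargetFeatures, has_fast_rep_movsb and choose_memcpy_lowering are hypothetical, and the zero-size case is an added assumption that the issue does not spell out.

```rust
/// Hypothetical model of the ways a backend could lower a fixed-size memcpy.
#[derive(Debug, PartialEq, Eq)]
pub enum MemcpyLowering {
    /// Assumption: a zero-byte copy needs no code at all.
    Nop,
    /// Emit an inline string instruction (for example `rep movsb` on x86_64).
    NativeString,
    /// Call the external `memcpy` symbol.
    LibCall,
}

/// Hypothetical target description holding only the flag this sketch needs.
pub struct TargetFeatures {
    /// True when the CPU advertises fast `rep movsb` / `rep stosb` (ermsb).
    pub has_fast_rep_movsb: bool,
}

/// Pick a lowering for a copy whose size is known at compile time.
pub fn choose_memcpy_lowering(size: u64, target: &TargetFeatures) -> MemcpyLowering {
    if size == 0 {
        MemcpyLowering::Nop
    } else if target.has_fast_rep_movsb {
        MemcpyLowering::NativeString
    } else {
        MemcpyLowering::LibCall
    }
}
```

Because the size is an immediate, the choice can be made entirely at compile time; a real backend would probably also weigh the size against a tuned cutoff before picking the string instruction.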

Wasmtime GitHub notifications bot (Dec 20 2022 at 21:42):

jameysharp commented on issue #5479:

I have a few notes if somebody else takes this on. (I suspect bjorn3 already knows all this, having written some of it.)

In Cranelift, the ABIMachineSpec trait implemented by backends has a gen_memcpy method. It's used during codegen for function calls by Caller::emit_copy_regs_to_buffer, so I suspect it's well-covered by tests. On x86 and aarch64 it always calls the library memcpy function, but since it's backend-specific it could do something else. (I didn't check the other backends, but they're probably the same.) Making this code available to frontends like cg-clif sounds good to me.

The cranelift_frontend::FunctionBuilder type has methods call_mem{cmp,cpy,move,set} for unconditionally emitting a library call. It also offers emit_small_{memory_compare,memory_copy,memset} to generate an unrolled CLIF loop for buffers smaller than a threshold, falling back to using call_* for larger buffers. I don't see any significant uses of any of these functions in Cranelift or Wasmtime, at any time in the git history. So they probably aren't well tuned and might not even work in general.
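
The "unroll below a threshold, otherwise call the library" strategy can be sketched as a standalone model. This is not the cranelift_frontend implementation: CopyPlan, plan_small_copy and the max_access and max_ops parameters are hypothetical names, and the rule for picking the access size is an assumption made for illustration.

```rust
/// Hypothetical result of planning a copy of a known size.
#[derive(Debug, PartialEq, Eq)]
pub enum CopyPlan {
    /// Emit `count` load/store pairs, each `access_size` bytes wide.
    Unrolled { access_size: u64, count: u64 },
    /// Too many operations would be needed: call the library routine.
    LibCall,
}

/// Plan a copy of `size` bytes.
///
/// `align` and `max_access` are assumed to be powers of two. `max_ops` is
/// the unrolling threshold: the largest number of load/store pairs we are
/// willing to emit inline.
pub fn plan_small_copy(size: u64, align: u64, max_access: u64, max_ops: u64) -> CopyPlan {
    if size == 0 {
        return CopyPlan::Unrolled { access_size: 1, count: 0 };
    }
    // Largest power of two that divides `size`, so every access has the
    // same width and no tail handling is needed.
    let divides_size = 1u64 << size.trailing_zeros();
    // Never exceed the known alignment or the widest access allowed.
    let access_size = divides_size.min(align.max(1)).min(max_access.max(1));
    let count = size / access_size;
    if count > max_ops {
        CopyPlan::LibCall
    } else {
        CopyPlan::Unrolled { access_size, count }
    }
}
```

With an 8-byte maximum access and a threshold of 4 operations, a 32-byte aligned copy is unrolled into four 8-byte pairs, while a 64-byte copy falls back to the library call.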

I think you're right that the best way to expose these is with CLIF instructions, rather than cranelift_frontend methods. But I'm curious if anybody (like @cfallin?) has other suggestions.

Wasmtime GitHub notifications bot (Dec 20 2022 at 21:50):

bjorn3 commented on issue #5479:

I use emit_small_* once in cg_clif. It is definitely not well tuned though. It doesn't support copying 128-bit chunks using xmm registers, for example.
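
To illustrate what wider chunks would buy, here is a hedged sketch that splits a copy into the fewest power-of-two accesses, starting from a configurable widest access (16 bytes standing in for an xmm register). The function chunk_widths is hypothetical, and the sketch assumes unaligned accesses are allowed.

```rust
/// Split a copy of `size` bytes into `(offset, width)` accesses, greedily
/// using the widest access first. `widest` caps the access width in bytes
/// (16 models a 128-bit vector register, 8 a general-purpose register).
/// Assumption: the target tolerates unaligned accesses of every width.
pub fn chunk_widths(size: u64, widest: u64) -> Vec<(u64, u64)> {
    let mut chunks = Vec::new();
    let mut offset = 0u64;
    let mut remaining = size;
    for width in [16u64, 8, 4, 2, 1] {
        if width > widest {
            continue;
        }
        while remaining >= width {
            chunks.push((offset, width));
            offset += width;
            remaining -= width;
        }
    }
    chunks
}
```

For a 27-byte copy, allowing 16-byte accesses gives four operations (16 + 8 + 2 + 1), while capping at 8 bytes needs five (8 + 8 + 8 + 2 + 1).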

Wasmtime GitHub notifications bot (Dec 20 2022 at 23:09):

cfallin commented on issue #5479:

Yeah, I think it's reasonable to create CLIF opcodes for these. memcpy/memset are among the canonical primitives you usually get in a compiler's intrinsics; we don't have a separate notion of intrinsic calls, so new opcodes are the way forward. This would then give us one central implementation we could use where needed (e.g. for struct args on the stack, as noted above) and that we could optimize well.

Wasmtime GitHub notifications bot (Jan 12 2023 at 19:15):

jameysharp commented on issue #5479:

In #5564, @Kixiron suggested that these cranelift-frontend functions ought to take separate MemFlags for each address operand. I think that's a good suggestion, but that we should do it with these new proposed instructions instead of putting any more development into the cranelift-frontend versions.
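
A rough sketch of what "separate flags for each address operand" could look like. AccessFlags is a stand-in for Cranelift's MemFlags, not the real type, and MemcpyOperands with its methods is purely hypothetical; the point is only that the load from the source and the store to the destination can carry different properties.

```rust
/// Stand-in for Cranelift's `MemFlags` (hypothetical, reduced to three bits).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct AccessFlags {
    pub aligned: bool,
    pub notrap: bool,
    pub readonly: bool,
}

/// Hypothetical operand bundle for a memcpy-like instruction that keeps
/// the source and destination flags apart instead of sharing one set.
pub struct MemcpyOperands {
    pub dest_flags: AccessFlags,
    pub src_flags: AccessFlags,
    pub size: u64,
}

impl MemcpyOperands {
    /// Flags to attach to the loads generated for the source operand.
    pub fn load_flags(&self) -> AccessFlags {
        self.src_flags
    }

    /// Flags to attach to the stores generated for the destination operand.
    pub fn store_flags(&self) -> AccessFlags {
        self.dest_flags
    }
}
```

With a single shared set of flags, a source that is known to be read-only and aligned could not be described without also claiming the same about the destination.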

Wasmtime GitHub notifications bot (Jan 12 2023 at 19:18):

Kixiron commented on issue #5479:

I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely a better approach overall.

Wasmtime GitHub notifications bot (Jan 12 2023 at 19:20):

Kixiron edited a comment on issue #5479:

I agree, cranelift-native instructions for memcpy/memset (probably memcmp too, though that's not mentioned here?) are definitely a better approach overall, since they'd allow everything that the current approach does and then some, like automatically lowering calls with known lengths to unrolled versions (essentially giving us emit_small_* for free, but also applicable to const-eval'd and dataflow scenarios).


Last updated: Jan 24 2025 at 00:11 UTC