alexcrichton opened PR #13382 from alexcrichton:refactor-array-copy to bytecodealliance:main:
This commit is a step towards optimizing the codegen and implementation of the
array.fillandarray.copyGC instructions. This builds on the intrinsics refactored in #13367 and #13368 to be used in the GC implementation as well. For examplearray.fillandarray.new_defaultfori8arrays will now usememsetto fill an array. The larger change here is the reimplementation ofarray.copyto inline the loop within generated CLIF.Prior to this commit the
array.copyfunction was exclusively built as a libcall. There were two versions of this libcall -- one for GC references and one for non GC references. Even with this minor optimization, however, the benchmark in #13279 is still quite slow. The main reason for this is that the GC references version of the libcall has an extremely high constant overhead per element copied. This is due to the nature of the implementation using completely safe APIs which engages a lot more dynamic type checks and GC heap tracking than necessary.The implementation of
array.copyin this commit now looks like:
- First the bounds of the copy operation are all validated. Traps happen if either array is out of bounds.
- Next raw heap addresses are calculated for the start/end of the src/dst.
- For copyable types (e.g.
i8,i32, etc), the implementation then delegates to the same intrinsic asmemory.copy, a host-definedmemmove.- For GC types (e.g.
anyref), the implementation instead has an inline loop which copies elements from the source to the destination. This copy is structured as two loops, either a forwards copy or a backwards copy, and only one is taken.The result of this implementation is that the benchmark in #13279 is massively faster. What previously took minutes-to-hours now completes in seconds, even with the DRC collector.
One caveat with this implementation is that when delegating to
memory.copyormemory.fillthere's an
unnecessary-if-wasmtime-has-no-bugs bounds check double-check the host will not fault. While wasm code will gracefully catch segfaults the host will not, so this is necessary to ensure that in the face of heap corruption the host never faults.In the future it might make sense to split out functionality like this into separate helper CLIF functions to keep the source relatively lean. In the meantime though the massive performance benefits seem to outweigh the size costs (to me at least).
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
alexcrichton requested cfallin for a review on PR #13382.
alexcrichton requested wasmtime-compiler-reviewers for a review on PR #13382.
alexcrichton requested wasmtime-core-reviewers for a review on PR #13382.
alexcrichton edited PR #13382:
This commit is a step towards optimizing the codegen and implementation of the
array.fillandarray.copyGC instructions. This builds on the intrinsics refactored in #13367 and #13368 to be used in the GC implementation as well. For examplearray.fillandarray.new_defaultfori8arrays will now usememsetto fill an array. The larger change here is the reimplementation ofarray.copyto inline the loop within generated CLIF.Prior to this commit the
array.copyfunction was exclusively built as a libcall. There were two versions of this libcall -- one for GC references and one for non GC references. Even with this minor optimization, however, the benchmark in #13279 is still quite slow. The main reason for this is that the GC references version of the libcall has an extremely high constant overhead per element copied. This is due to the nature of the implementation using completely safe APIs which engages a lot more dynamic type checks and GC heap tracking than necessary.The implementation of
array.copyin this commit now looks like:
- First the bounds of the copy operation are all validated. Traps happen if either array is out of bounds.
- Next raw heap addresses are calculated for the start/end of the src/dst.
- For copyable types (e.g.
i8,i32, etc), the implementation then delegates to the same intrinsic asmemory.copy, a host-definedmemmove.- For GC types (e.g.
anyref), the implementation instead has an inline loop which copies elements from the source to the destination. This copy is structured as two loops, either a forwards copy or a backwards copy, and only one is taken.The result of this implementation is that the benchmark in #13279 is massively faster. What previously took minutes-to-hours now completes in seconds, even with the DRC collector.
One caveat with this implementation is that when delegating to
memory.copyormemory.fillthere's an
unnecessary-if-wasmtime-has-no-bugs bounds check double-check the host will not fault. While wasm code will gracefully catch segfaults the host will not, so this is necessary to ensure that in the face of heap corruption the host never faults.In the future it might make sense to split out functionality like this into separate helper CLIF functions to keep the source relatively lean. In the meantime though the massive performance benefits seem to outweigh the size costs (to me at least).
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->Closes #13279
alexcrichton unassigned cfallin from PR #13382 Optimize array.fill and array.copy.
alexcrichton requested fitzgen for a review on PR #13382.
alexcrichton updated PR #13382.
github-actions[bot] added the label wasmtime:api on PR #13382.
github-actions[bot] added the label wasmtime:ref-types on PR #13382.
github-actions[bot] commented on PR #13382:
Subscribe to Label Action
cc @fitzgen
<details>
This issue or pull request has been labeled: "wasmtime:api", "wasmtime:ref-types"Thus the following users have been cc'd because of the following labels:
- fitzgen: wasmtime:ref-types
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
:thumbs_up: fitzgen submitted PR review.
:speech_balloon: fitzgen created PR review comment:
Do we want to special case all POD types here? Like why not do this for
i16andv128andf32and eveni31ref?I guess
memory_fillonly takes a byte value to fill? Might make sense to have 8/16/32/64/128 versions. File a follow up issue?
:speech_balloon: fitzgen created PR review comment:
Maybe move this after
array_boundsis called (and havearray_boundsreturn itsone_elem_sizethatemit_array_size_infocreates) so that
- We aren't emitting
one_elem_sizean extra time, and- we can do the multiply after we know it won't overflow (which doesn't really matter but is one less comment to read / thing to consider and double check for code readers)
:speech_balloon: fitzgen created PR review comment:
We are checking fuel before this, do we also want to check epochs at these loop headers? Or introduce an outer loop around this that does epoch checks and only process chunks of a maximum size in this inner loop?
:speech_balloon: fitzgen created PR review comment:
I think this should be slightly better:
let corrupt_src = builder .ins() .icmp(IntCC::UnsignedGreaterThan, src_end_addr, heap_end); let corrupt_dst = builder .ins() .icmp(IntCC::UnsignedGreaterThan, dst_end_addr, heap_end); let corrupt = builder.ins().bor(corrupt_src, corrupt_dst); func_env.trapnz(builder, corrupt, TRAP_GC_HEAP_CORRUPT);
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
I think codegen for that will be pretty significantly worse because it requires taking the results of the comparisons, or'ing them together, and then testing that value against zero, and then branching. Code-wise I think the best we can do here is compare, conditional branch, compare, conditional branch. IR-wise it's a bit larger, but machine-code-wise it'll be tighter
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
Oh the
translate_loop_headerat the top does both fuel and epoch checks so these loops should be good.
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
I was wondering the same, yeah, and you're right the reason is that it'd require new libcalls and I wasn't looking to expand the set of libcalls here. I also figure that languages will probably expect byte-arrays to be the most optimal so I figured we'd at least have a fast path for that. Can file a follow-up issue yeah.
alexcrichton updated PR #13382.
:thumbs_up: fitzgen submitted PR review.
:speech_balloon: fitzgen created PR review comment:
Hm yeah I guess we don't even legalize trapz into a branch in the midend anymore, so my original implied concern about this pattern being harder to GVN/LICM/etc is moot anyways
:speech_balloon: fitzgen created PR review comment:
Ah, I missed that! thanks
alexcrichton updated PR #13382.
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
I've filed an issue at https://github.com/bytecodealliance/wasmtime/issues/13386, and upon considering this more I realized that one thing we'd want to do, even in the future world of larger bit-widths, is that we still want to specialize based on what's actually being filled in. For example
array.new_defaultwith zero-initialization should be amemsetfor pretty much all types.I've gone ahead and implemented that here, too. Mind re-reviewing the latest commit?
:thumbs_up: fitzgen submitted PR review:
Ship it!
fitzgen added PR #13382 Optimize array.fill and array.copy to the merge queue
github-merge-queue[bot] removed PR #13382 Optimize array.fill and array.copy from the merge queue
alexcrichton updated PR #13382.
alexcrichton has enabled auto merge for PR #13382.
alexcrichton added PR #13382 Optimize array.fill and array.copy to the merge queue
:check: alexcrichton merged PR #13382.
alexcrichton removed PR #13382 Optimize array.fill and array.copy from the merge queue
:thumbs_up: Ari4ka submitted PR review.
Last updated: Jun 01 2026 at 09:49 UTC