gfx requested Copilot for a review on PR #13460.
gfx opened PR #13460 from wado-lang:array-copy-inline to bytecodealliance:main:
Small, statically-sized
array.copys of scalar elements currently go through thememory_copylibcall, just like every other size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actualmemmove, soarray.copyis far slower than it should be for small fixed-size copies (enough that a hand-writtenarray.get/array.setloop can beat it).This PR expands such copies inline, as a sequence of loads followed by stores, when the element type is scalar and the length is a compile-time constant of at most 8 elements.
Speed
array.copyof Ni32elements, ns per copy, release build, lower is better:
N before (libcall) after (inline) speedup 1 4.60 1.27 3.6× 2 3.36 1.27 2.6× 4 4.17 1.43 2.9× 8 4.88 2.00 2.4× The inline path also beats the manual element-by-element loop that was previously the faster workaround. The gap is larger still in unoptimized builds, where the libcall body itself is not optimized (~24 ns/call drops to a couple ns).
<details><summary>Methodology</summary>
ns per whole-array copy = (wall time at N iterations − wall time at 0 iterations) / N, taking the min of several repetitions, default config, x86_64. The 0-iteration baseline cancels out module compilation and array allocation. The "before" column is measured with a dynamic length (forcing the libcall) and the "after" column with a constant length (inline), at the same N.
</details>Correctness & scope
- Every element is loaded before any is stored, so overlapping source/destination ranges keep
array.copy'smemmovesemantics.- Dynamic or larger lengths, and tables, still use the
memory_copylibcall — there itsmemmoveamortizes the call overhead.- The bound of 8 elements comes from measurement: the inline-vs-libcall crossover is around 16 elements, but the largest wins cluster at ≤ 8 while inline code size grows with the element count.
Tests
tests/disas/array-copy-inline.watlocks the codegen (the libcall is gone; loads-then-stores remain).tests/misc_testsuite/gc/array-copy-inline.wastchecks correctness for every element size (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (8 inline vs 9 libcall), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.:robot: Generated with Claude Code
gfx requested cfallin for a review on PR #13460.
gfx requested wasmtime-compiler-reviewers for a review on PR #13460.
gfx requested wasmtime-core-reviewers for a review on PR #13460.
:memo: Copilot submitted PR review:
Pull request overview
[!NOTE]
Copilot was unable to run its full agentic suite in this review.Adds an optimization to inline small, statically-sized
array.copyoperations in Cranelift to avoid the fixed overhead of thememory_copylibcall, and introduces tests to lock down both codegen shape and runtime semantics.Changes:
- Inline-expand constant-length
array.copyfor small copies (<= 8 elements) into explicit loads+stores withmemmovesemantics.- Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
- Add a disassembly-based “shape” test to ensure the inline expansion remains stable.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File Description tests/misc_testsuite/gc/array-copy-inline.wast New runtime tests validating correctness across element sizes, threshold behavior, and overlap ( memmove) semantics.tests/disas/array-copy-inline.wat New disassembly test to lock in the expected inline load-then-store codegen sequence. crates/cranelift/src/func_environ.rs Implements the inline expansion for small constant-length array.copyand a helper to detect constant lengths.
:light_bulb: <a href="/bytecodealliance/wasmtime/new/main?filename=.github/instructions/*.instructions.md" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Add Copilot custom instructions</a> for smarter, more guided reviews. <a href="https://docs.github.com/en/copilot/customizing-copilot/adding-repository-custom-instructions-for-github-copilot" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Learn how to get started</a>.
:speech_balloon: Copilot created PR review comment:
The doc comment says \"scalar\"
array.copy, but the implementation also supports 16-byte elements (v128 viaI8X16) and copies f32/f64 by using integer load/store types of the same width. Please adjust the wording to reflect that this is a bitwise copy for fixed-size element widths (including v128), not strictly \"scalar\" in the wasm type sense.
:speech_balloon: Copilot created PR review comment:
The local
elem_sizeshadows theelem_sizeparameter, which makes the code harder to read and easier to misuse during future edits. Rename the local to something likeelem_size_i32/elem_stride(and update uses) to avoid shadowing.
gfx updated PR #13460.
gfx updated PR #13460.
:memo: gfx submitted PR review.
:speech_balloon: gfx created PR review comment:
Done in 023d5e5 — the local is now
stride(and the element countcount), which also removes the same shadowing of thenparameter.
:memo: gfx submitted PR review.
:speech_balloon: gfx created PR review comment:
Fixed in 023d5e5 — the doc now describes a bitwise copy by element width (
f32/f64go throughi32/i64,v128throughi8x16) instead of calling it "scalar".
alexcrichton commented on PR #13460:
Thanks for this! If you're willing I believe there's a better, more general, but slightly-more-difficult-to-work-with, location for this optimization to be located. Specifically within
raw_bulk_memory_operationI think it'd be reasonable to test for the byte length and see if it's a constant. That way this would uniformly handle all constant-length bulk copies rather than justarray.copy. Additionally it's also fine to perform "wrong width" loads/stores where, for example,i8x16could be used to load values all the time when the size is > 16 bytes as opposed to only when the array element isv128.Does that sound reasonable to you? And would you be interested to move this optimization to there?
gfx updated PR #13460.
Done in d9a6a14467 — moved into
raw_bulk_memory_operation, keyed on a constant byte length, somemory.copyandtable.copyget it too, not justarray.copy. Loads/stores are now width-agnostic (greedyi8x16→i64→…→i8), andBulkOp::MemoryCopycarries entity-appropriate flags (GC corrupt-trap / little+heap / table region; none assume alignment).The threshold is now byte-based at 128 B. Measured on aarch64, inline is ~1.7–2.7× faster than the libcall up to 128 B; at 256 B the load-all-then-store-all sequence ties the libcall, so it falls back there. Inlined-copy fuel is charged by byte length to match the libcall.
gfx updated PR #13460.
gfx edited PR #13460:
Small, constant-length bulk copies —
memory.copy,table.copy, andarray.copy— currently go through thememory_copylibcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actualmemmove, so they are far slower than they need to be.This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.
Updated after code review (thanks @alexcrichton): this started out
array.copy-specific and element-count-based. Following the review suggestion it now lives inraw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedyi8x16→i64→ … →i8) instead of one load/store per element.Speed
i32array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:
bytes libcall inline speedup 8 3.9 1.9 2.0× 16 4.8 2.0 2.4× 32 5.2 2.4 2.2× 64 4.1 2.3 1.8× 128 4.6 2.7 1.7× 256 5.9 5.9 1.0× (libcall) Because the inline path is width-agnostic,
i8/i16arrays andmemory.copyof the same byte length behave the same.<details><summary>Methodology</summary>
ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.
</details>Correctness & scope
- The byte range is covered greedily with the widest convenient access; every chunk is loaded before any is stored, so overlapping source/destination ranges keep
memmovesemantics.BulkOp::MemoryCopycarries entity-appropriate access flags (GC-heap corruption trap for arrays, little-endian heap region for linear memory, table region for tables); none assume alignment, since wide chunks may straddle element boundaries.- Dynamic lengths, and constant lengths above 128 bytes, keep the libcall, whose
memmoveamortizes the call cost — past 128 bytes inline merely ties it (see the 256-byte row). Reference-typed array elements and lazily-initialized funcref tables still take the per-element loop, unchanged.- The 128-byte bound is from measurement: inline wins by ~1.7–2.7× through 128 bytes and ties the libcall by 256, where the cost of loading every chunk before storing catches up.
- Fuel for an inlined copy is charged by byte length, matching the libcall.
Tests
tests/disas/array-copy-inline.watandtests/disas/memory-copy-inline.watlock the codegen (the libcall is gone; wide loads-then-stores remain, including the greedyi8x16+i64+i32split).tests/misc_testsuite/gc/array-copy-inline.wastchecks correctness for every element width (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (16 inline vs 17 libcalli64s, i.e. 128 vs 136 bytes), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.:robot: Generated with Claude Code
gfx edited PR #13460:
Small, constant-length bulk copies —
memory.copy,table.copy, andarray.copy— currently go through thememory_copylibcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actualmemmove, so they are far slower than they need to be.This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.
Updated after code review (thanks @alexcrichton): this started out
array.copy-specific and element-count-based. Following the review suggestion it now lives inraw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedyi8x16→i64→ … →i8) instead of one load/store per element.Speed
i32array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:
bytes libcall inline speedup 8 3.9 1.9 2.0× 16 4.8 2.0 2.4× 32 5.2 2.4 2.2× 64 4.1 2.3 1.8× 128 4.6 2.7 1.7× 256 5.9 5.9 1.0× (libcall) Because the inline path is width-agnostic,
i8/i16arrays andmemory.copyof the same byte length behave the same.<details><summary>Methodology</summary>
ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.
</details>Correctness & scope
- The byte range is covered greedily with the widest convenient access; every chunk is loaded before any is stored, so overlapping source/destination ranges keep
memmovesemantics.BulkOp::MemoryCopycarries entity-appropriate access flags (GC-heap corruption trap for arrays, little-endian heap region for linear memory, table region for tables); none assume alignment, since wide chunks may straddle element boundaries.- The length must be a compile-time constant: the expansion is fully unrolled, branch-free straight-line code, so the chunk widths and offsets are fixed when the code is emitted. Dynamic lengths (and constant lengths above 128 bytes) keep the libcall, whose
memmoveis the right tool when the length is unknown or large — past 128 bytes inline merely ties it (see the 256-byte row). Reference-typed array elements and lazily-initialized funcref tables still take the per-element loop, unchanged.- The 128-byte bound is from measurement: inline wins by ~1.7–2.7× through 128 bytes and ties the libcall by 256, where the cost of loading every chunk before storing catches up.
- Fuel for an inlined copy is charged by byte length, matching the libcall.
Tests
tests/disas/array-copy-inline.watandtests/disas/memory-copy-inline.watlock the codegen (the libcall is gone; wide loads-then-stores remain, including the greedyi8x16+i64+i32split).tests/misc_testsuite/gc/array-copy-inline.wastchecks correctness for every element width (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (16 inline vs 17 libcalli64s, i.e. 128 vs 136 bytes), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.:robot: Generated with Claude Code
gfx updated PR #13460.
gfx updated PR #13460.
gfx edited PR #13460.
- [x] improve the code based on the review
- [x] read the AI guidline of BA
gfx deleted a comment on PR #13460:
- [x] improve the code based on the review
- [x] read the AI guidline of BA
alexcrichton unassigned cfallin from PR #13460 cranelift: inline small constant-length array.copy for performance.
alexcrichton requested alexcrichton for a review on PR #13460.
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
One thing that I'd be a bit worried about here is that types are ignored and the operations aren't actually performed, this is just plumbing
iconstvalues. I think that this'll require actually handling the input/output types of the operation, for exampleireducewould alter the value of a immediate with the upper bits set.One possible alternative to this, however, is to plumb the raw length of the operation through the
BulkOp::MemoryCopyvalue. It's already a bit unfortunate to duplicate const-eval here and while understandable we might be able to sidestep it. Ideally, for example, this would purely matchiconstand nothing else, and the need to matchuextend/ireduce/imul_immis due to the code generated here otherwise. If, though, the original length were plumbed (as-is present in wasm) through theBulkOp::MemoryCopythat might mean thaticonstwould only be necessary here.To handle that, could you see if this could be updated to only match
iconstand then the necessary variable plumbed through to here? Failing that, though, I think this'll need to be a more complicated implementation which handles the types that are happening in each operation.
:speech_balloon: alexcrichton created PR review comment:
For this I'd recommend having
if bytes <= INLINE_COPY_MAX_BYTESwith perhaps an early-return of 0 to avoid generating any code. (using a range withcontainsfeels a bit overkill here)
:speech_balloon: alexcrichton created PR review comment:
Threading
flagshere can be a bit finnicky -- would you be up for benchmarking without threadingflagsto here? For example usingMemFlagsData::trusted()here for loads/stores. While it makes sense that we could do it and theoretically optimize further I'd ideally like to start simpler if we can and skip passingflagsaround. If there's a noticable performance boost threadingflagsaround, though, that's a different situation to be in.
gfx updated PR #13460.
:memo: gfx submitted PR review.
:speech_balloon: gfx created PR review comment:
Done in fa4a65f: the gate is now
bytes <= INLINE_COPY_MAX_BYTES. The zero-length case falls out naturally —emit_inline_memcpy(0)decomposes into zero chunks, so the load/store loops emit nothing.
:memo: gfx submitted PR review.
:speech_balloon: gfx created PR review comment:
Done in fa4a65f: plumbed the wasm length through
BulkOp::MemoryCopy { const_len: Option<u64> }, sovalue_as_const_intnow matches onlyiconst(no moreuextend/ireduce/imul_imm). The byte length is computed in Rust at the call site ascount.checked_mul(elem_size).
:memo: gfx submitted PR review.
:speech_balloon: gfx created PR review comment:
Done in fa4a65f: dropped the threaded flags;
emit_inline_memcpynow usesMemFlagsData::trusted(). Verified on x86_64 + aarch64 that the asm is identical (movdqu / ldr q). Each load only feeds its paired store, so the x64alignedflag never lets a load fold into an alignment-requiring SSE operand — there is a comment in the body explaining this.
:speech_balloon: gfx edited PR review comment.
:thumbs_up: alexcrichton submitted PR review:
Looks great, thanks again for this!
alexcrichton added PR #13460 cranelift: inline small constant-length array.copy for performance to the merge queue.
github-merge-queue[bot] removed PR #13460 cranelift: inline small constant-length array.copy for performance from the merge queue.
alexcrichton commented on PR #13460:
Locally the test is failing as:
running 1 test test fuel::craneliftpulley_run ... FAILED failures: ---- fuel::craneliftpulley_run stdout ---- thread '<unnamed>' (3621049) panicked at cranelift/codegen/src/isa/pulley_shared/inst/emit.rs:469:13: assertion `left == right` failed left: Big right: Little note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace failures: fuel::craneliftpulley_runI'll try to dig in tomorrow to take a look
alexcrichton commented on PR #13460:
In the meantime, if you're willing to test, unconditionally doing little endian loads/stores would probably fix this since I think the issue is Pulley doesn't implement big-endian v128 loads/stores. I think at least, would still need to confirm.
gfx updated PR #13460.
Thanks! Added a
pulley64beregression test and fixed it in 20708699b4 (red-green). Pinning the inline chunk flags toLittleis safe since each load feeds only its paired store and is never observed as a number.
gfx edited a comment on PR #13460:
Thanks! Added a
pulley64beregression test and fixed it in 20708699b4 (with red-green TDD). Pinning the inline chunk flags toLittleis safe since each load feeds only its paired store and is never observed as a number.
:thumbs_up: alexcrichton submitted PR review.
alexcrichton added PR #13460 cranelift: inline small constant-length array.copy for performance to the merge queue.
:check: alexcrichton merged PR #13460.
alexcrichton removed PR #13460 cranelift: inline small constant-length array.copy for performance from the merge queue.
Last updated: Jun 01 2026 at 09:49 UTC