Stream: git-wasmtime

Topic: wasmtime / PR #13460 cranelift: inline small constant-len...


view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:01):

gfx requested Copilot for a review on PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:01):

gfx opened PR #13460 from wado-lang:array-copy-inline to bytecodealliance:main:

Small, statically-sized array.copys of scalar elements currently go through the memory_copy libcall, just like every other size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actual memmove, so array.copy is far slower than it should be for small fixed-size copies (enough that a hand-written array.get/array.set loop can beat it).

This PR expands such copies inline, as a sequence of loads followed by stores, when the element type is scalar and the length is a compile-time constant of at most 8 elements.

Speed

array.copy of N i32 elements, ns per copy, release build, lower is better:

N before (libcall) after (inline) speedup
1 4.60 1.27 3.6×
2 3.36 1.27 2.6×
4 4.17 1.43 2.9×
8 4.88 2.00 2.4×

The inline path also beats the manual element-by-element loop that was previously the faster workaround. The gap is larger still in unoptimized builds, where the libcall body itself is not optimized (~24 ns/call drops to a couple ns).

<details><summary>Methodology</summary>

ns per whole-array copy = (wall time at N iterations − wall time at 0 iterations) / N, taking the min of several repetitions, default config, x86_64. The 0-iteration baseline cancels out module compilation and array allocation. The "before" column is measured with a dynamic length (forcing the libcall) and the "after" column with a constant length (inline), at the same N.
</details>

Correctness & scope

Tests

:robot: Generated with Claude Code

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:01):

gfx requested cfallin for a review on PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:01):

gfx requested wasmtime-compiler-reviewers for a review on PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:01):

gfx requested wasmtime-core-reviewers for a review on PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:03):

:memo: Copilot submitted PR review:

Pull request overview

[!NOTE]
Copilot was unable to run its full agentic suite in this review.

Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.

Changes:

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/misc_testsuite/gc/array-copy-inline.wast New runtime tests validating correctness across element sizes, threshold behavior, and overlap (memmove) semantics.
tests/disas/array-copy-inline.wat New disassembly test to lock in the expected inline load-then-store codegen sequence.
crates/cranelift/src/func_environ.rs Implements the inline expansion for small constant-length array.copy and a helper to detect constant lengths.

:light_bulb: <a href="/bytecodealliance/wasmtime/new/main?filename=.github/instructions/*.instructions.md" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Add Copilot custom instructions</a> for smarter, more guided reviews. <a href="https://docs.github.com/en/copilot/customizing-copilot/adding-repository-custom-instructions-for-github-copilot" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Learn how to get started</a>.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:03):

:speech_balloon: Copilot created PR review comment:

The doc comment says \"scalar\" array.copy, but the implementation also supports 16-byte elements (v128 via I8X16) and copies f32/f64 by using integer load/store types of the same width. Please adjust the wording to reflect that this is a bitwise copy for fixed-size element widths (including v128), not strictly \"scalar\" in the wasm type sense.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:03):

:speech_balloon: Copilot created PR review comment:

The local elem_size shadows the elem_size parameter, which makes the code harder to read and easier to misuse during future edits. Rename the local to something like elem_size_i32/elem_stride (and update uses) to avoid shadowing.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:06):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:13):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:14):

:memo: gfx submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:14):

:speech_balloon: gfx created PR review comment:

Done in 023d5e5 — the local is now stride (and the element count count), which also removes the same shadowing of the n parameter.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:14):

:memo: gfx submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 01:14):

:speech_balloon: gfx created PR review comment:

Fixed in 023d5e5 — the doc now describes a bitwise copy by element width (f32/f64 go through i32/i64, v128 through i8x16) instead of calling it "scalar".

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2026 at 15:34):

alexcrichton commented on PR #13460:

Thanks for this! If you're willing I believe there's a better, more general, but slightly-more-difficult-to-work-with, location for this optimization to be located. Specifically within raw_bulk_memory_operation I think it'd be reasonable to test for the byte length and see if it's a constant. That way this would uniformly handle all constant-length bulk copies rather than just array.copy. Additionally it's also fine to perform "wrong width" loads/stores where, for example, i8x16 could be used to load values all the time when the size is > 16 bytes as opposed to only when the array element is v128.

Does that sound reasonable to you? And would you be interested to move this optimization to there?

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:44):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:44):

gfx commented on PR #13460:

Done in d9a6a14467 — moved into raw_bulk_memory_operation, keyed on a constant byte length, so memory.copy and table.copy get it too, not just array.copy. Loads/stores are now width-agnostic (greedy i8x16i64→…→i8), and BulkOp::MemoryCopy carries entity-appropriate flags (GC corrupt-trap / little+heap / table region; none assume alignment).

The threshold is now byte-based at 128 B. Measured on aarch64, inline is ~1.7–2.7× faster than the libcall up to 128 B; at 256 B the load-all-then-store-all sequence ties the libcall, so it falls back there. Inlined-copy fuel is charged by byte length to match the libcall.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:48):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:49):

gfx edited PR #13460:

Small, constant-length bulk copies — memory.copy, table.copy, and array.copy — currently go through the memory_copy libcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actual memmove, so they are far slower than they need to be.

This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.

Updated after code review (thanks @alexcrichton): this started out array.copy-specific and element-count-based. Following the review suggestion it now lives in raw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedy i8x16i64 → … → i8) instead of one load/store per element.

Speed

i32 array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:

bytes libcall inline speedup
8 3.9 1.9 2.0×
16 4.8 2.0 2.4×
32 5.2 2.4 2.2×
64 4.1 2.3 1.8×
128 4.6 2.7 1.7×
256 5.9 5.9 1.0× (libcall)

Because the inline path is width-agnostic, i8/i16 arrays and memory.copy of the same byte length behave the same.

<details><summary>Methodology</summary>

ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.
</details>

Correctness & scope

Tests

:robot: Generated with Claude Code

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:52):

gfx edited PR #13460:

Small, constant-length bulk copies — memory.copy, table.copy, and array.copy — currently go through the memory_copy libcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actual memmove, so they are far slower than they need to be.

This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.

Updated after code review (thanks @alexcrichton): this started out array.copy-specific and element-count-based. Following the review suggestion it now lives in raw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedy i8x16i64 → … → i8) instead of one load/store per element.

Speed

i32 array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:

bytes libcall inline speedup
8 3.9 1.9 2.0×
16 4.8 2.0 2.4×
32 5.2 2.4 2.2×
64 4.1 2.3 1.8×
128 4.6 2.7 1.7×
256 5.9 5.9 1.0× (libcall)

Because the inline path is width-agnostic, i8/i16 arrays and memory.copy of the same byte length behave the same.

<details><summary>Methodology</summary>

ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.
</details>

Correctness & scope

Tests

:robot: Generated with Claude Code

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 00:58):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 01:02):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 01:20):

gfx edited PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 11:29):

gfx commented on PR #13460:

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 11:29):

gfx deleted a comment on PR #13460:

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 14:58):

alexcrichton unassigned cfallin from PR #13460 cranelift: inline small constant-length array.copy for performance.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 14:58):

alexcrichton requested alexcrichton for a review on PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 15:09):

:memo: alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 15:09):

:speech_balloon: alexcrichton created PR review comment:

One thing that I'd be a bit worried about here is that types are ignored and the operations aren't actually performed, this is just plumbing iconst values. I think that this'll require actually handling the input/output types of the operation, for example ireduce would alter the value of a immediate with the upper bits set.

One possible alternative to this, however, is to plumb the raw length of the operation through the BulkOp::MemoryCopy value. It's already a bit unfortunate to duplicate const-eval here and while understandable we might be able to sidestep it. Ideally, for example, this would purely match iconst and nothing else, and the need to match uextend/ireduce/imul_imm is due to the code generated here otherwise. If, though, the original length were plumbed (as-is present in wasm) through the BulkOp::MemoryCopy that might mean that iconst would only be necessary here.

To handle that, could you see if this could be updated to only match iconst and then the necessary variable plumbed through to here? Failing that, though, I think this'll need to be a more complicated implementation which handles the types that are happening in each operation.

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 15:09):

:speech_balloon: alexcrichton created PR review comment:

For this I'd recommend having if bytes <= INLINE_COPY_MAX_BYTES with perhaps an early-return of 0 to avoid generating any code. (using a range with contains feels a bit overkill here)

view this post on Zulip Wasmtime GitHub notifications bot (May 24 2026 at 15:09):

:speech_balloon: alexcrichton created PR review comment:

Threading flags here can be a bit finnicky -- would you be up for benchmarking without threading flags to here? For example using MemFlagsData::trusted() here for loads/stores. While it makes sense that we could do it and theoretically optimize further I'd ideally like to start simpler if we can and skip passing flags around. If there's a noticable performance boost threading flags around, though, that's a different situation to be in.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:memo: gfx submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:speech_balloon: gfx created PR review comment:

Done in fa4a65f: the gate is now bytes <= INLINE_COPY_MAX_BYTES. The zero-length case falls out naturally — emit_inline_memcpy(0) decomposes into zero chunks, so the load/store loops emit nothing.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:memo: gfx submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:speech_balloon: gfx created PR review comment:

Done in fa4a65f: plumbed the wasm length through BulkOp::MemoryCopy { const_len: Option<u64> }, so value_as_const_int now matches only iconst (no more uextend/ireduce/imul_imm). The byte length is computed in Rust at the call site as count.checked_mul(elem_size).

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:memo: gfx submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:51):

:speech_balloon: gfx created PR review comment:

Done in fa4a65f: dropped the threaded flags; emit_inline_memcpy now uses MemFlagsData::trusted(). Verified on x86_64 + aarch64 that the asm is identical (movdqu / ldr q). Each load only feeds its paired store, so the x64 aligned flag never lets a load fold into an alignment-requiring SSE operand — there is a comment in the body explaining this.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 02:54):

:speech_balloon: gfx edited PR review comment.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 17:56):

:thumbs_up: alexcrichton submitted PR review:

Looks great, thanks again for this!

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 17:56):

alexcrichton added PR #13460 cranelift: inline small constant-length array.copy for performance to the merge queue.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 20:15):

github-merge-queue[bot] removed PR #13460 cranelift: inline small constant-length array.copy for performance from the merge queue.

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 21:55):

alexcrichton commented on PR #13460:

Locally the test is failing as:

running 1 test
test fuel::craneliftpulley_run ... FAILED

failures:

---- fuel::craneliftpulley_run stdout ----

thread '<unnamed>' (3621049) panicked at cranelift/codegen/src/isa/pulley_shared/inst/emit.rs:469:13:
assertion `left == right` failed
  left: Big
 right: Little
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    fuel::craneliftpulley_run

I'll try to dig in tomorrow to take a look

view this post on Zulip Wasmtime GitHub notifications bot (May 26 2026 at 21:56):

alexcrichton commented on PR #13460:

In the meantime, if you're willing to test, unconditionally doing little endian loads/stores would probably fix this since I think the issue is Pulley doesn't implement big-endian v128 loads/stores. I think at least, would still need to confirm.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 02:30):

gfx updated PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 02:34):

gfx commented on PR #13460:

Thanks! Added a pulley64be regression test and fixed it in 20708699b4 (red-green). Pinning the inline chunk flags to Little is safe since each load feeds only its paired store and is never observed as a number.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 02:36):

gfx edited a comment on PR #13460:

Thanks! Added a pulley64be regression test and fixed it in 20708699b4 (with red-green TDD). Pinning the inline chunk flags to Little is safe since each load feeds only its paired store and is never observed as a number.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 13:33):

:thumbs_up: alexcrichton submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 13:33):

alexcrichton added PR #13460 cranelift: inline small constant-length array.copy for performance to the merge queue.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 13:59):

:check: alexcrichton merged PR #13460.

view this post on Zulip Wasmtime GitHub notifications bot (May 27 2026 at 13:59):

alexcrichton removed PR #13460 cranelift: inline small constant-length array.copy for performance from the merge queue.


Last updated: Jun 01 2026 at 09:49 UTC