matthargett edited PR #13259.
alexcrichton unassigned dicej from PR #13259 Fix a couple of issues that prevent wasmtime for compiling/running on arm64_32 (Apple Watch).
alexcrichton requested alexcrichton for a review on PR #13259.
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
This may not be necessary? The `region` crate below is already in `skip-tree`, and locally if I update to 0.6.0 and run `cargo deny check bans licenses` it says everything is ok.
:speech_balloon: alexcrichton created PR review comment:
Are you sure that this zero-extend happens? I don't know how to compile for this target but this snippet on godbolt is in theory somewhat similar and shows no zero extension from what I can tell.
Personally I'd feel a bit more comfortable if the various variables used here in this file were switched to `u64` instead of `usize` to handle this since I think that would mean that the `:x` isn't required, right? Using a `u64` I think would also be a good place to put documentation and basically explain how on `arm64_32` targets we're operating on full registers, not the half-width that `usize` takes.
:speech_balloon: alexcrichton created PR review comment:
For the failing CI about missing vets I'll push a commit to this PR once the inline asm bits are worked out
alexcrichton commented on PR #13259:
Wanted to say again thanks for the porting work here and even the benchmark work as well, it's much appreciated!
:memo: matthargett submitted PR review.
:speech_balloon: matthargett created PR review comment:
ok, had to re-read the docs on this to wrap my brain around it, and you're right ofc:
https://doc.rust-lang.org/reference/inline-assembly.html#r-asm.register-operands.smaller-value
The aarch64 ISA does zero-extend on `mov w<N>, ...` as a side effect, but rustc's contract for inline asm is "upper bits undefined": it's free to materialize a `u32` operand using any sequence that leaves the high bits arbitrary, and the comment claimed otherwise. thanks for catching this!
:memo: matthargett submitted PR review.
:speech_balloon: matthargett created PR review comment:
sorry, this was vestigial -- when I updated to mach2 0.6, it had a huge sweeping effect on the lockfile and this sponge got left inside the patient when I was trying to narrow the lockfile changes. another good catch! :D
matthargett updated PR #13259.
matthargett edited PR #13259:
Two-commit series enabling `wasmtime` to build for `arm64_32-apple-watchos` (Apple Watch Series 4+ ILP32 ABI). Verified end-to-end on Apple Watch SE 2 (S8 SoC, watchOS 11) and iPhone XS (A12, iOS 18) running an 11-workload Pulley benchmark, with WAMR fast-interp as a side-by-side comparison runtime.

Commit 1: `unwinder: use u64 for register-width values in stack walk`

The unwinder's `Unwind::get_next_older_pc_from_fp` and `assert_fp_is_aligned` (and the per-arch `get_stack_pointer` / `get_next_older_pc_from_fp` / `resume_to_exception_handler` / `assert_fp_is_aligned`) take and return register-width quantities, not pointer-width quantities. They were typed `usize`, which works correctly on every aarch64 LP64 target and on x86_64 / riscv64 / s390x. On arm64_32-apple-watchos (ILP32 ABI: 64-bit registers, 32-bit pointers) `usize` is `u32`, which produced two issues in the aarch64 inline-asm bodies:

- The `mov {0}, sp` / `mov lr, {pc}` operands took `usize` values, so on `arm64_32` rustc emits an `asm_sub_register` warning: the operand is a 32-bit value going into the 64-bit `reg` class, and the default rendering is ambiguous between `w<N>` (32-bit lane) and `x<N>` (the 64-bit GPR view we need for `sp` / `lr`, neither of which has a 32-bit alias).
- The Rust Reference is explicit that the upper bits of a register holding a sub-register-width input are undefined (see <https://doc.rust-lang.org/reference/inline-assembly.html#r-asm.register-operands.smaller-value>), so the previous behaviour (which happened to work on aarch64 hardware because writes to `w<N>` zero-extend `x<N>` as a side effect of the ISA) was relying on a property the language doesn't promise.

Switch the register-bearing values throughout the unwinder to `u64` on every architecture. For `aarch64-*` LP64 targets `u64` and `usize` are the same width, so this is a benign rename; for `arm64_32` it makes the asm operands unambiguously 64-bit-register-class and removes the dependency on undefined upper-bit behaviour. For x86_64 / riscv64 / s390x `u64` and `usize` are also the same width, so the change is a rename plus a couple of cast boundaries where the unwinder talks to host-pointer code.

In `aarch64.rs`, also switch the saved-LR load from `*(fp as *mut usize).offset(1)` to `*(fp as *mut u64).offset(1)`. AAPCS64 reserves two 64-bit slots for the frame record on every aarch64 ABI variant, including `arm64_32`, so an 8-byte stride is correct regardless of pointer width.

The `Unwind` trait method signatures are updated; the only callers were `FrameCursor::advance` (in this crate) and `UnwindPulley` (in `wasmtime/src/runtime/vm/interpreter.rs`), which now widen/narrow between the host-pointer `usize` they store internally and the `u64` the trait uses. `Handler::resume_tailcc` widens its `payload1` / `payload2` and the saved `pc` / `sp` / `fp` to `u64` at the call to the per-arch `resume_to_exception_handler`. The public `get_stack_pointer()` keeps its `usize` return shape via a thin wrapper.

Commit 2: `Bump mach2 dep from 0.4.2 to 0.6`

`mach2 v0.4.2` emits `compile_error!("mach requires macOS or iOS")` on any target where neither `target_os = "macos"` nor `ios` matches, plus a matching narrow `target_vendor` gate on its `libc` build-dep. That blocks Apple watchOS / tvOS / visionOS targets: wasmtime's `runtime` feature pulls `mach2` in unconditionally so the build fails with both `error: mach requires macOS or iOS` and `error[E0463]: can't find crate for libc`.

The fix has been upstream in mach2 since 0.6.0 (commit 538ce75, 2025-08-16, "Add support for tvOS, watchOS and visionOS"): both gates widen to `cfg(target_vendor = "apple")`. The mach2 module API wasmtime imports (`exc`, `exception_types`, `kern_return`, `mach_init`, `mach_port`, `message`, `ndr`, `port`, `thread_act`, `thread_status`) is unchanged between 0.4.2 and 0.6.0; only internal libc/core::ffi type-plumbing differs. Bumping the workspace dep is sufficient, with no changes in `machports.rs`.

Verified by building `wasmtime` as a `staticlib` for `arm64_32-apple-watchos` under `nightly-2026-01-25 + -Z build-std=std,panic_abort` with `--features pulley,runtime,std,cranelift,anyhow`. The dev-only path (cranelift-jit -> region -> mach2 0.4.x) keeps an older mach2 in the lockfile for cranelift-jit's own host tests; that path is not part of any production embedder build and stays unchanged. `cargo deny` flags the resulting two `mach2` versions; we considered an explicit `skip` entry in `deny.toml` but reverted it on review feedback since `region` is already in `skip-tree` and the duplicate is contained to cranelift-jit's dev-deps. The right long-term fix is for `region` to update; alex.crichton is preparing a `cargo vet` audit update separately.

End-to-end verification

This 2-commit stack + the companion `target-lexicon` `Arm64_32` patch (submitted separately to bytecodealliance/target-lexicon) is enough to build a Pulley-only static library for arm64_32-apple-watchos and link it into a watchOS app. On real hardware:

Apple Watch SE 2 (S8 SoC, watchOS 11, arm64_32-apple-watchos)

| workload | Pulley | WAMR fast-interp | winner |
| --- | --- | --- | --- |
| fib(30) | 132.04 ms | 165.05 ms | Pulley +25% |
| fib_tail(100000) [return_call] | 0.566 ms | 0.886 ms | Pulley +57% |
| factorial(20) | <1 µs | <1 µs | tie |
| sieve(10000) | 0.762 ms | 0.938 ms | Pulley +23% |
| crc32(64 KiB) | 5.301 ms | 5.153 ms | WAMR +3% |
| matmul simd128 64×64 | 3.986 ms | 9.398 ms | Pulley +136% |
| matmul relaxed-simd FMA | 3.143 ms | err (not in WAMR) | Pulley |
| convolution 256×256 | 10.549 ms | 11.789 ms | Pulley +12% |
| audio DSP 1000×512 | 1471.56 ms | 1060.26 ms | WAMR +39% |
| bulk_memory (memory.copy/fill) | 31.564 ms | 15.644 ms | WAMR +102% |
| call_indirect (200 K dispatches) | 36.727 ms | 23.260 ms | WAMR +58% |

iPhone XS (A12, iOS 18, aarch64-apple-ios)

| workload | Pulley | WAMR fast-interp | winner |
| --- | --- | --- | --- |
| fib(30) | 41.147 ms | 49.371 ms | Pulley +20% |
| fib_tail(100000) [return_call] | 0.252 ms | 0.382 ms | Pulley +52% |
| sieve(10000) | 0.340 ms | 0.269 ms | WAMR +26% |
| matmul simd128 64×64 | 1.697 ms | 3.228 ms | Pulley +90% |
| matmul relaxed-simd FMA | 1.339 ms | err (not in WAMR) | Pulley |
| audio DSP 1000×512 | 536.97 ms | 418.73 ms | WAMR +28% |
| bulk_memory (memory.copy/fill) | 10.824 ms | 4.962 ms | WAMR +118% |
| call_indirect (200 K dispatches) | 17.621 ms | 8.861 ms | WAMR +99% |

All results match the host-Rust reference function byte-for-byte across both runtimes.
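To make the saved-LR stride point from Commit 1 concrete, here is a minimal portable sketch, not the code from `aarch64.rs`. It models the two-slot frame record with a hypothetical `FrameRecord` struct and uses `u32` to stand in for `usize` on an ILP32 target; the helper names `load_lr_u64` and `load_slot_u32` are assumptions for illustration only, and the "upper half" observation assumes a little-endian host.

```rust
/// Hypothetical model of an AAPCS64 frame record: two 64-bit slots,
/// the saved frame pointer followed by the saved link register.
#[repr(C)]
struct FrameRecord {
    saved_fp: u64,
    saved_lr: u64,
}

/// Reads the second 64-bit slot, mirroring `*(fp as *mut u64).offset(1)`.
///
/// Safety: `fp` must point at a valid, 8-byte-aligned `FrameRecord`.
unsafe fn load_lr_u64(fp: *const u8) -> u64 {
    unsafe { *(fp as *const u64).offset(1) }
}

/// Reads one 32-bit element past `fp`. This models a `usize`-typed load on a
/// target where `usize` is 32 bits: the stride is 4 bytes, not 8.
///
/// Safety: same requirements as `load_lr_u64`.
unsafe fn load_slot_u32(fp: *const u8) -> u32 {
    unsafe { *(fp as *const u32).offset(1) }
}

fn main() {
    let record = FrameRecord {
        saved_fp: 0x1111_2222_3333_4444,
        saved_lr: 0x0000_0000_4000_1234,
    };
    let fp = &record as *const FrameRecord as *const u8;

    // The 8-byte stride lands on the saved LR slot.
    let lr = unsafe { load_lr_u64(fp) };
    println!("u64 stride: {lr:#x}");

    // The 4-byte stride lands inside the saved FP slot instead.
    let wrong = unsafe { load_slot_u32(fp) };
    println!("u32 stride: {wrong:#x}");
}
```

On a little-endian host the second load returns the upper 32 bits of `saved_fp` rather than anything from the LR slot, which is the failure mode the `u64` load avoids.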
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
For the signatures of these functions could they continue to use `usize` instead of `u64`? While internally aarch64 should use `u64` to handle the ILP32 ABI, if we eventually add a 32-bit platform to the repertoire here we ideally wouldn't want it to move around `u64` values as opposed to `usize` values. In that sense could the `u64` be an implementation detail of this aarch64 module? For example there'd be explicit `u64::try_from(thing).unwrap()` upcasts (which'd never actually panic and would document no loss of precision) coupled with explicit `as usize` downcasts which would be documented as "this is expected to lose precision on ILP32".
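The cast pattern suggested in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that landed: `widen`, `narrow`, `narrow_ilp32`, and `align_down_16` are hypothetical names, and `narrow_ilp32` uses `u32` to model what `as usize` does on a target where `usize` is 32 bits.

```rust
/// Widens a pointer-width value to a full 64-bit register value.
/// The `unwrap` cannot fail while `usize` is at most 64 bits wide.
fn widen(v: usize) -> u64 {
    u64::try_from(v).unwrap()
}

/// Narrows a register-width value back to pointer width.
/// Identity on LP64; drops the upper 32 bits where `usize` is 32 bits.
fn narrow(v: u64) -> usize {
    v as usize
}

/// Model of `narrow` on an ILP32 target, with `u32` standing in for `usize`.
fn narrow_ilp32(v: u64) -> u32 {
    v as u32
}

/// Hypothetical function with a `usize` signature that keeps `u64`
/// as an internal detail, casting only at the boundaries.
fn align_down_16(sp: usize) -> usize {
    let sp64: u64 = widen(sp); // boundary in: usize -> u64
    let aligned: u64 = sp64 & !0xF; // internal work on the full register width
    narrow(aligned) // boundary out: u64 -> usize
}

fn main() {
    // Widening then narrowing round-trips any pointer-width value.
    println!("{:#x}", narrow(widen(0x1000)));
    // The signature stays usize while the arithmetic runs on u64.
    println!("{:#x}", align_down_16(0x1234_567F));
    // On ILP32 the narrowing keeps only the low 32 bits.
    println!("{:#x}", narrow_ilp32(0xFFFF_FFFF_0000_1234));
}
```

Keeping the casts explicit at the two boundaries means callers never see `u64`, and each cast site is a natural place for a comment about ILP32.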
matthargett updated PR #13259.
matthargett edited PR #13259:
Two-commit series enabling `wasmtime` to build for `arm64_32-apple-watchos` (Apple Watch Series 4+ ILP32 ABI). Verified end-to-end on Apple Watch SE 2 (S8 SoC, watchOS 11) and iPhone XS (A12, iOS 18) running an 11-workload Pulley benchmark, with WAMR fast-interp as a side-by-side comparison runtime.

Commit 1: `unwinder: type aarch64 register-bearing locals as u64`

`crates/unwinder/src/arch/aarch64.rs` has inline-asm operands that take register-width values. They were typed `usize`, which works on the usual `aarch64-*` LP64 targets where `usize` is `u64` and the operand class is unambiguously the 64-bit GPR view. On `arm64_32-apple-watchos` (ILP32 ABI: 64-bit registers, 32-bit pointers) `usize` is `u32`, which makes the same operands ambiguous between the `w<N>` (32-bit lane) and `x<N>` (64-bit GPR) views, which is exactly what rustc's `asm_sub_register` lint flags. Relying on the ISA-side zero-extend that aarch64 happens to perform on `mov w<N>, ...` would also be relying on a property the language doesn't promise: the Rust Reference is explicit that the upper bits of a register holding a sub-register-width input are undefined (see <https://doc.rust-lang.org/reference/inline-assembly.html#r-asm.register-operands.smaller-value>).

Rather than leak `u64` into the public surface (the `Unwind` trait, the shared `arch/mod.rs` dispatch, and the per-arch backends in `x86.rs` / `riscv64.rs` / `s390x.rs`), keep the public function signatures `usize`. That's the existing convention shared with the other backends, and the `u64`-vs-pointer-width split is unique to aarch64-on-ILP32. Inside `aarch64.rs` only, type any register-bearing local that participates in inline asm as `u64`, and cast at the boundaries:

- `u64::try_from(v).unwrap()` widens `usize` → `u64` (infallible on every supported Rust target; the `.unwrap()` documents that any failure would be a target-property issue rather than a runtime one).
- `as usize` narrows `u64` → `usize` at the return. It truncates on `arm64_32` by design (the saved PC/SP there is a 32-bit host pointer that fits exactly in the low 32 bits) and is the identity on aarch64 LP64.

Also switch the saved-LR load from `*(fp as *mut usize).offset(1)` to `*(fp as *mut u64).offset(1)`. AAPCS64 reserves two 64-bit slots for the frame record on every aarch64 ABI variant, including `arm64_32`, so an 8-byte stride is correct regardless of pointer width. With `*mut usize` on `arm64_32`, `.offset(1)` would advance by only 4 bytes and read the upper half of the saved-FP slot. This is a latent correctness fix; today the unwinder isn't exercised on `arm64_32` (which runs Pulley, not Cranelift-compiled native code), but the corrected form is the right one to land alongside the type change.

Diff is one file (`crates/unwinder/src/arch/aarch64.rs`, +69 / -4). No behaviour change on existing aarch64 LP64 targets; silences two `asm_sub_register` warnings on a future `arm64_32-apple-watchos` build.

Commit 2: `Bump mach2 dep from 0.4.2 to 0.6`

`mach2 v0.4.2` emits `compile_error!("mach requires macOS or iOS")` on any target where neither `target_os = "macos"` nor `ios` matches, plus a matching narrow `target_vendor` gate on its `libc` build-dep. That blocks Apple watchOS / tvOS / visionOS targets: wasmtime's `runtime` feature pulls `mach2` in unconditionally so the build fails with both `error: mach requires macOS or iOS` and `error[E0463]: can't find crate for libc`.

The fix has been upstream in mach2 since 0.6.0 (commit 538ce75, 2025-08-16, "Add support for tvOS, watchOS and visionOS"): both gates widen to `cfg(target_vendor = "apple")`. The mach2 module API wasmtime imports (`exc`, `exception_types`, `kern_return`, `mach_init`, `mach_port`, `message`, `ndr`, `port`, `thread_act`, `thread_status`) is unchanged between 0.4.2 and 0.6.0; only internal libc/core::ffi type-plumbing differs. Bumping the workspace dep is sufficient, with no changes in `machports.rs`.

Verified by building `wasmtime` as a `staticlib` for `arm64_32-apple-watchos` under `nightly-2026-01-25 + -Z build-std=std,panic_abort` with `--features pulley,runtime,std,cranelift,anyhow`. The dev-only path (cranelift-jit -> region -> mach2 0.4.x) keeps an older mach2 in the lockfile for cranelift-jit's own host tests; that path is not part of any production embedder build and stays unchanged. `cargo deny` flags the resulting two `mach2` versions but `region` is already in `skip-tree`, so no `deny.toml` change is needed; the right long-term fix is for `region` to update. @alexcrichton is preparing a `cargo vet` audit update for the new mach2 0.6.0 separately.

End-to-end verification

This 2-commit stack + the companion `target-lexicon` `Arm64_32` patch (submitted separately to bytecodealliance/target-lexicon) is enough to build a Pulley-only static library for arm64_32-apple-watchos and link it into a watchOS app. On real hardware:

Apple Watch SE 2 (S8 SoC, watchOS 11, arm64_32-apple-watchos)

| workload | Pulley | WAMR fast-interp | winner |
| --- | --- | --- | --- |
| fib(30) | 132.04 ms | 165.05 ms | Pulley +25% |
| fib_tail(100000) [return_call] | 0.566 ms | 0.886 ms | Pulley +57% |
| factorial(20) | <1 µs | <1 µs | tie |
| sieve(10000) | 0.762 ms | 0.938 ms | Pulley +23% |
| crc32(64 KiB) | 5.301 ms | 5.153 ms | WAMR +3% |
| matmul simd128 64×64 | 3.986 ms | 9.398 ms | Pulley +136% |
| matmul relaxed-simd FMA | 3.143 ms | err (not in WAMR) | Pulley |
| convolution 256×256 | 10.549 ms | 11.789 ms | Pulley +12% |
| audio DSP 1000×512 | 1471.56 ms | 1060.26 ms | WAMR +39% |
| bulk_memory (memory.copy/fill) | 31.564 ms | 15.644 ms | WAMR +102% |
| call_indirect (200 K dispatches) | 36.727 ms | 23.260 ms | WAMR +58% |

iPhone XS (A12, iOS 18, aarch64-apple-ios)

| workload | Pulley | WAMR fast-interp | winner |
| --- | --- | --- | --- |
| fib(30) | 41.147 ms | 49.371 ms | Pulley +20% |
| fib_tail(100000) [return_call] | 0.252 ms | 0.382 ms | Pulley +52% |
| sieve(10000) | 0.340 ms | 0.269 ms | WAMR +26% |
| matmul simd128 64×64 | 1.697 ms | 3.228 ms | Pulley +90% |
| matmul relaxed-simd FMA | 1.339 ms | err (not in WAMR) | Pulley |
| audio DSP 1000×512 | 536.97 ms | 418.73 ms | WAMR +28% |
| bulk_memory (memory.copy/fill) | 10.824 ms | 4.962 ms | WAMR +118% |
| call_indirect (200 K dispatches) | 17.621 ms | 8.861 ms | WAMR +99% |

All results match the host-Rust reference function byte-for-byte across both runtimes.
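The PR does not say how the "winner" percentages in the benchmark tables are computed. The figures are consistent with taking the slower time divided by the faster time, minus one, rounded to the nearest percent. The sketch below encodes that inferred formula; the function name `winner_margin` is hypothetical, and the formula itself is an assumption reverse-engineered from the reported numbers.

```rust
/// Returns the faster runtime and its margin in whole percent, assuming the
/// margin is (slower / faster - 1) * 100, rounded to the nearest integer.
/// This formula is inferred from the reported figures, not stated in the PR.
fn winner_margin(pulley_ms: f64, wamr_ms: f64) -> (&'static str, i64) {
    if pulley_ms <= wamr_ms {
        ("Pulley", ((wamr_ms / pulley_ms - 1.0) * 100.0).round() as i64)
    } else {
        ("WAMR", ((pulley_ms / wamr_ms - 1.0) * 100.0).round() as i64)
    }
}

fn main() {
    // Timings taken from the Apple Watch SE 2 results.
    let rows = [
        ("fib(30)", 132.04, 165.05),
        ("crc32(64 KiB)", 5.301, 5.153),
        ("bulk_memory (memory.copy/fill)", 31.564, 15.644),
    ];
    for (name, pulley_ms, wamr_ms) in rows {
        let (winner, pct) = winner_margin(pulley_ms, wamr_ms);
        println!("{name}: {winner} +{pct}%");
    }
}
```

Read this way, "WAMR +102%" on `bulk_memory` means Pulley took roughly twice as long as WAMR on that workload.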
matthargett edited PR #13259.
:memo: matthargett submitted PR review.
:speech_balloon: matthargett created PR review comment:
ok, I tried to rethink this from a modularity/encapsulation/element-of-least-surprise perspective. I moved comments around a bit and expanded them, but lmk if it's an over-correction or too verbose now.
:memo: alexcrichton submitted PR review.
:speech_balloon: alexcrichton created PR review comment:
Nah looks perfect, thanks!
alexcrichton commented on PR #13259:
For the vets I typically push directly to a PR, which by-default works most of the time, but I think the origin of this fork, the rebeckerspecialties organization, doesn't allow that. In lieu of that @matthargett could you cherry-pick https://github.com/alexcrichton/wasmtime/commit/4c193dda87f7c4c29e055fc3af39e88fec4b5a39 into this PR and then I can do an approve-and-merge?
matthargett updated PR #13259.
matthargett commented on PR #13259:
Done, cherry-picked your `4c193dda87f7c4c29e055fc3af39e88fec4b5a39` (Add vets for mach2) onto the head of this PR's branch. New tip is `3c0c73fbde`.

Verified locally:
- `cargo vet check` succeeds: _Vetting Succeeded (482 fully audited, 32 partially audited, 53 exempted)_.
- The two pre-existing wildcard-expiry / unnecessary-import warnings are unrelated to this commit.
Ready for approve-and-merge whenever you have a moment. Thanks for the offer to push directly — the rebeckerspecialties org's branch protections do block third-party pushes, so the cherry-pick path is the cleanest workaround.
:thumbs_up: alexcrichton submitted PR review.
alexcrichton added PR #13259 Fix a couple of issues that prevent wasmtime for compiling/running on arm64_32 (Apple Watch) to the merge queue
:check: alexcrichton merged PR #13259.
alexcrichton removed PR #13259 Fix a couple of issues that prevent wasmtime for compiling/running on arm64_32 (Apple Watch) from the merge queue
Last updated: Jun 01 2026 at 09:49 UTC