arm64_32 Target · cranelift · Zulip Chat Archive

Stream: cranelift

Topic: arm64_32 Target

Ralph Küpper (Apr 08 2026 at 03:09):

Hi all,

Long-time admirer of Cranelift here. I'm curious whether arm64_32 (the ILP32 ABI used by Apple Watch Series 4–8) is on the roadmap at all, or whether it's considered out of scope.

I'm working on a TypeScript-to-native compiler that targets Apple Watch via Cranelift, and arm64 (Series 9+) works great. The older watches are the only gap, and I'd love to know whether it's worth waiting for upstream support or if I should plan around it.

Thanks!

Chris Fallin (Apr 08 2026 at 03:54):

Hi! The short answer is that it definitely is not going to happen without someone contributing it.

The longer answer is that Cranelift is actively maintained by the ~3-5 of us who work fulltime on Wasmtime+Cranelift, but there is not really much active Cranelift work going on these days except when motivated by Wasmtime stuff (though some of us still have ideas we want to explore). There is definitely not a team with a "roadmap" of big features like new targets to implement -- we're stretched far too thin for that. So unfortunately there's no one here who will see this and add a 3-month project to their timeline, sorry. But PRs welcome!

Ralph Küpper (Apr 08 2026 at 04:15):

Thanks for the answer, that makes sense :slight_smile:

Matt Hargett (May 04 2026 at 05:02):

@Ralph Küpper I have PRs up that contribute support for the interpreter. They're linked from this meta-issue I created: https://github.com/bytecodealliance/wasmtime/issues/13255

Matt Hargett (May 04 2026 at 05:03):

I've done a fair amount of benchmarking on Apple Watch 6/SE2, including targeting A12 (S8) and whole-program LTO. From my notes:

iPhone XS (A12) — Pulley wins: matmul SIMD (+90%), matmul FMA (only Pulley supports relaxed-simd), tail-call (+52%), convolution (+9%). WAMR wins: bulk_memory (2.2×), call_indirect (2×), audio_dsp (+28%), sieve (+26%).
Apple Watch SE2 (S8) — Pulley wins: matmul SIMD (2.4×), tail-call (+57%), sieve (+23%), fib (+25%). WAMR wins: bulk_memory (2×), call_indirect (+58%), audio_dsp (+39%).

Key shifts from M4 host:

fib and sieve flip on weaker cores — Pulley's tighter dispatch loop wins on smaller D-cache.
matmul SIMD widens on SE2 (+90% iPhone → 2.4× on watch). Pulley's vector lowering scales down well.
call_indirect gap shrinks on SE2 (2× → +58%) — fixed per-dispatch cost gets amortized when the function body is slower.
audio_dsp widens on SE2 (+28% → +39%) — the motivating workload, 410ms gap per 1000-frame block.
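
The amortization argument for call_indirect can be made concrete with a toy model. This is only a sketch with made-up unit costs, not measured data: `slowdown` is a hypothetical helper, and it assumes both engines spend the same time in the function body while one of them pays a fixed extra cost per indirect call.

```rust
/// Ratio of the slower engine's time to the faster engine's time for one
/// indirect call. Assumption: both spend `body` units in the callee, and
/// the slower engine pays `extra` additional units of fixed dispatch cost.
fn slowdown(body: f64, extra: f64) -> f64 {
    (body + extra) / body
}
```

With illustrative values, `slowdown(10.0, 10.0)` is 2.0 and `slowdown(20.0, 10.0)` is 1.5: the same fixed overhead weighs less once the body itself takes longer, which is the shape of the 2× → +58% shift noted above.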

Matt Hargett (May 04 2026 at 05:04):

because I'm targeting deployment on the App Store, I'm sticking to the interpreter and not working on the JIT aspect. (I could under paid contract, but I don't need it for my immediate purposes.)

Chris Fallin (May 04 2026 at 13:50):

@Matt Hargett interesting numbers -- could you clarify what the baseline is? e.g. in this

Pulley wins: matmul SIMD (+90%) [ ... ]

+90% over what? another interpreter?

Chris Fallin (May 04 2026 at 13:51):

Ah, I just saw WAMR in sibling thread? or WasmEdge?

Matt Hargett (May 04 2026 at 18:13):

against WAMR. I can add WasmEdge to my benchmarking app, since I know it works on arm64_32, if that's useful for the Bytecode Alliance / global community.

Matt Hargett (May 08 2026 at 06:21):

@Chris Fallin btw, versus your original IC branch I made the IC ARMv8-portable for non-Apple-silicon deploy targets, which appeared to fix a latent torn-pair race that the original IC had on weakly-ordered cores

Matt Hargett (May 08 2026 at 06:22):

I have all the proof points on my local hardware, I just need some feedback on how you all would like me to proceed from here. an OK answer would be "we don't want this, please keep it in your fork" -- just lmk! :D

Chris Fallin (May 08 2026 at 16:01):

(I'm assuming these messages are replies to the neighboring "call_indirect optimization" topic -- replying as such)

btw, versus your original IC branch I made the IC ARMv8-portable for non-Apple-silicon deploy targets, which appeared to fix a latent torn-pair race that the original IC had on weakly-ordered cores

I don't think that's right (or, say more please!): my approach was to put the cache in the vmctx, which is locally owned by the running instance. Thus there cannot be any racy accesses because there is only one thread touching the state at a time. (When an instance is running it holds a &mut Store borrow; a store cannot run multiple threads)

On the other hand, your prototyped approach of caching targets in the bytecode by making it mutable is absolutely racy, and you'll run into issues as soon as you have more than one instance running in a multithreaded context. (The cache-related explanation is also somewhat dubious to me as explained)

So my advice remains: the state has to be stored in vmctx, not the bytecode, and if there are good wins with that, we'd be interested in taking it. Thanks!
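
A minimal sketch of the ownership argument above, in Rust. All names here (`VmCtx`, `InlineCache`, `call_indirect`, `resolve`, `run_two_instances`) are hypothetical and are not Wasmtime's or Pulley's actual types or functions. The sketch only illustrates the general point: a cache stored in per-instance state and reached through `&mut` cannot be touched by two threads at once, while the shared data (standing in for the bytecode) stays immutable.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical one-entry inline cache for a single call_indirect site.
#[derive(Clone, Copy, Default)]
struct InlineCache {
    /// Table index seen on the last call, if any.
    key: Option<u32>,
    /// Resolved target for that index (a stand-in for a code pointer).
    target: usize,
}

/// Hypothetical per-instance state; stands in for the vmctx.
struct VmCtx {
    caches: Vec<InlineCache>,
    hits: u64,
    misses: u64,
}

impl VmCtx {
    fn new(call_sites: usize) -> Self {
        VmCtx { caches: vec![InlineCache::default(); call_sites], hits: 0, misses: 0 }
    }
}

/// Slow path: stand-in for the table lookup and signature check.
fn resolve(table: &[usize], index: u32) -> usize {
    table[index as usize]
}

/// The cache is reached only through `&mut VmCtx`, so the borrow checker
/// guarantees exclusive access; `table` (like the bytecode) is read-only.
fn call_indirect(ctx: &mut VmCtx, table: &[usize], site: usize, index: u32) -> usize {
    let cache = &mut ctx.caches[site];
    if cache.key == Some(index) {
        let target = cache.target;
        ctx.hits += 1;
        return target;
    }
    let target = resolve(table, index);
    *cache = InlineCache { key: Some(index), target };
    ctx.misses += 1;
    target
}

/// Each thread owns its own VmCtx; only the immutable table is shared.
/// Returns the summed (hits, misses) across both threads.
fn run_two_instances() -> (u64, u64) {
    let table: Arc<Vec<usize>> = Arc::new(vec![0x1000, 0x2000, 0x3000]);
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                let mut ctx = VmCtx::new(1);
                for _ in 0..100 {
                    call_indirect(&mut ctx, &table, 0, 1);
                }
                (ctx.hits, ctx.misses)
            })
        })
        .collect();
    let mut total = (0, 0);
    for handle in handles {
        let (hits, misses) = handle.join().unwrap();
        total.0 += hits;
        total.1 += misses;
    }
    total
}
```

In this sketch no atomics or locks are needed, because no cache entry is ever visible to more than one thread. Writing the cached target into a shared, mutable bytecode buffer would instead require synchronization between every instance running that code.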

Matt Hargett (May 12 2026 at 18:23):

haven't forgotten about this, just doing a bunch of benchmarking so I have defensible/credible data

Last updated: Jun 01 2026 at 09:49 UTC