wasmtime / PR #12101 Cranelift: add patchable call instru... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #12101 Cranelift: add patchable call instru...

Wasmtime GitHub notifications bot (Dec 01 2025 at 06:05):

cfallin opened PR #12101 from cfallin:magical-disappearing-patchable-call-now-you-see-it-now-you-dont to bytecodealliance:main:

The new patchable_call CLIF instruction pairs with the patchable ABI, and emits a callsite with one new key property: the MachBuffer carries metadata that describes exactly which byte range to "NOP out" (overwrite with NOP instructions) to disable that callsite. Doing so is semantically valid and explicitly supported.

This enables patching of code at runtime to dynamically turn on and off features such as instrumentation or debugging hooks. We plan to use this to implement breakpoints in Wasmtime's guest debugging support.

As part of this change, I added a notion of "unit of NOP bytes" to the MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code compilation pipeline and metadata-producing logic) can handle patchable callsites without any other special knowledge of the ISA.

For the "real metal" ISAs there are perfectly well-defined NOPs to use, but for Pulley, where all opcodes are assigned at compile time by macro magic, I explicitly defined NOP as opcode byte 0 by moving Nop's definition to the top of the list and adding a unit test asserting its encoding.

A design note: in principle it would be possible, as an alternative, to treat "patchability" as an orthogonal dimension of all callsites, and emit the metadata describing the instruction-offset range for any callsite with the flag set. The only truly necessary semantic restriction is that there are no return values (because if we turn the callsite off, nothing writes to them); we could support patchability for other ABIs and for the other kinds of call instructions. The patchable ABI would then be better described as something like the "no clobbers ABI". I opted not to generalize in this way because it creates some less-tested corners and the generalized form, at least at the MachInst level, is not really much simpler in the end.

A testing note: I opted not to implement actual code patching in the cranelift-tools filetest runner and test patching callsites in/out via some actuation (e.g. a magic hostcall, like we do for throws) because (i) that's a lot of new plumbing and (ii) we are going to test this very shortly in Wasmtime anyway and (iii) the correctness (or not) of the location-and-length metadata is easy enough to verify in the disassemblies in the compile-tests.

Wasmtime GitHub notifications bot (Dec 01 2025 at 06:05):

cfallin requested wasmtime-compiler-reviewers for a review on PR #12101.

Wasmtime GitHub notifications bot (Dec 01 2025 at 06:05):

cfallin requested wasmtime-default-reviewers for a review on PR #12101.

Wasmtime GitHub notifications bot (Dec 01 2025 at 06:05):

cfallin requested abrown for a review on PR #12101.

Wasmtime GitHub notifications bot (Dec 01 2025 at 06:05):

cfallin requested fitzgen for a review on PR #12101.

Wasmtime GitHub notifications bot (Dec 01 2025 at 10:54):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Dec 01 2025 at 10:54):

bjorn3 created PR review comment:

Would it be possible to do something like:

let mut bytes = smallvec::smallvec![];
pulley_interpreter::op::Nop {}.encode(&mut bytes);
bytes

Wasmtime GitHub notifications bot (Dec 01 2025 at 20:05):

alexcrichton submitted PR review:

For Pulley, I'd agree it'd be nice to not have to rely on opcode 0 since that'd enable us to move it around and add payloads as necessary too. I vaguely recall seeing somewhere that some ISA (maybe wasm?) has opcode 0 as a trap which means if you accidentally start executing a bunch of zeros it'll die quickly as opposed to sliding into whatever's next and start executing that.

Otherwise though, I'm realizing that in the context of epochs-with-signal-handlers this is a pretty powerful instruction since, in theory, a patchable load (with the right ABI) could be replaced with a load-from-thing. More generally everything we could want to do with some sort of resumable exception seems like it would ideally want to go through this mechanism (ish) where IR-wise it's a call but runtime-wise we patch it to something that's a context-specific nop and signal handlers get to "rewrite" it to a call.

Given this, would it be worth fleshing out the ABI a bit here? For example is there a downside to saying that the patchable ABI clobbers 1 or 2 registers? Or something along those lines in lieu of attempting to encode the ABI in the function itself or something like that.

I've not thought about this too too deeply per se, but it sort of feels like this is on the cusp of being a framework at least for catch-a-thing-with-a-signal-and-go-back-to-wasm albeit not precisely suitable as-is (due to no clobbers). I realize though that this goal is orthogonal to debugging stuff, so if you'd prefer to not try to think about this now that's also fine too

Wasmtime GitHub notifications bot (Dec 01 2025 at 20:50):

cfallin commented on PR #12101:

I agree that code-patching could be the basis for a lot of other mechanisms that push cost away from the common case (leaving it to be "just the cost of fetching some NOPs") and toward an uncommon case. In particular I like the idea (that I think you're suggesting?) of having a trap context rewrite all epoch-yield callsites into actual calls, while they're NOPs by default.

The one difficulty to keep in mind wrt anything involving signals is that patching the code will also require ownership of the Store, which is something that we don't yet have in a signal context; I guess that's where the modified ABI comes in, so "vmctx is an implicit argument in a known fixed register" is a way around that.

... but now that I think about that, the existing instruction is already suitable: the ABI is "no clobbers but there are still register args", so a patchable_call %epoch_yield(v0) where v0 is vmctx will already have vmctx in rdi or x0 or whatever, even when the callsite is in its default NOP'd-out state. So from that we could get the Store (Instance::enter_from_wasm etc) and from that patch all sites.

@erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further.

I'm not sure what other generalizations we could consider here other than the ones mentioned in my commit message above (any callsite / any ABI vs patchable only) but for that dimension I think I want to stick with what we have for the time being. If there's some other instruction we should have here too I'm happy to discuss further of course (maybe as part of a follow-on).

Wasmtime GitHub notifications bot (Dec 01 2025 at 20:51):

cfallin commented on PR #12101:

Ah, actually, I'm thinking in terms of a signal taken at the site itself; a signal taken from an arbitrary point (say SIGALRM on a periodic basis) still has no way to get Store ownership in the general case, so scratch that...

Wasmtime GitHub notifications bot (Dec 01 2025 at 20:54):

cfallin commented on PR #12101:

So I guess the instruction one would actually want for the epochs case is patchable_call_or_load %epoch_yield(v0), v1 with v1 being the page that gets unmapped to cause a SIGSEGV, then in that context do the patch (and there we do have the Store`. Happy to sketch how that should look if @erikrose is curious.

Wasmtime GitHub notifications bot (Dec 01 2025 at 20:54):

cfallin edited a comment on PR #12101:

So I guess the instruction one would actually want for the epochs case is patchable_call_or_load %epoch_yield(v0), v1 with v1 being the page that gets unmapped to cause a SIGSEGV, then in that context do the patch (and there we do have the Store. Happy to sketch how that should look if @erikrose is curious.

Wasmtime GitHub notifications bot (Dec 01 2025 at 21:00):

alexcrichton commented on PR #12101:

I should clarify I'm also stretching the imagination by using the term "patchable" here for what I'm saying. What I'd roughly envision is that we could have instructions (like epoch yields) which are translated to CLIF as a patchable_call instruction which gets machine code for "do the call". During compilation finalization, however, we'd rewrite all those call instructions to "do a load from rdi" or whatever register has the address of the load. At runtime nothing would actually end up being patched (avoids code copies which make instantiation expensive) and the signal handler would sort of do the patch for the code itself. The only thing that doesn't work in this scheme is the return address on x64 which must be located on the stack and we can't modify the stack in Windows vectored exception handlers (I even double-checked and apparently the red zone is 0 bytes on x64).

If we got this working though this would all naturally extend to breakpoints where we could change the size of the patchable call to the size of a breakpoint instruction for that purpose.

In general my rough hope is that we wouldn't need a CLIF-instruction-per-use-case and could funnel them through one relatively narrow waist ideally. Until there's a solution for x64 Windows exceptions, however, I don't think it's worth pushing too much on this because anything non-portable is only of niche value to have lots of infrastructure supporting it

Wasmtime GitHub notifications bot (Dec 01 2025 at 21:09):

cfallin commented on PR #12101:

A few thoughts:

a "Call or something else" doesn't need one to mutate the stack (fortunately) (I think!) because within the signal handler we just update the code we're about to resume at; then the actual call executes and does the return-address push; so this is strictly better than the difficulties we had with the various call-injection trampolines

I am imagining that when used for breakpoints, we do the patching to NOP during code finalization, so the image on disk is all NOPs and we have the bytes to patch in for a breakpoint call stored in metadata; so there's no CoW cost for any of this in the default case

That said, I'm not totally sure I get the full picture you have for what you'd have a breakpoint be -- currently my thought is NOP-by-default, patched-to-call. Where does the load come in?

Making this a more general framework where we could have (say) any two instructions, or patch one to another, or whatever, seems difficult from a compiler architecture point of view. I think the direction I see as feasible would be a "swiss army knife patchable instruction" that knows how to (provides metadata to) become (i) a call, (ii) a load, (iii) a NOP, (iv) maybe something else. Then any use case that needs to patch between any subset of these possible side-effects is subsumed by it.

All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

Wasmtime GitHub notifications bot (Dec 01 2025 at 21:16):

alexcrichton commented on PR #12101:

All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!

:+1: from me. I've added this to the Cranelift meeting where we can continue discussion

Wasmtime GitHub notifications bot (Dec 01 2025 at 22:07):

fitzgen submitted PR review:

LGTM modulo the pulley nop thing.

Fine with not exercising the actual patching in cranelift-tools, since we have filetests, and all the actual patching will come in time as Wasmtime starts using this stuff.

Wasmtime GitHub notifications bot (Dec 01 2025 at 22:07):

fitzgen created PR review comment:

+1 to avoiding the hard-coded assumption that nop == 0 in Pulley.

Wasmtime GitHub notifications bot (Dec 01 2025 at 22:07):

fitzgen created PR review comment:

Shouldn't it be the opposite? The function address we patch in is not necessarily in the same compilation unit as the current function, so it seems like we should always use !colocated for the most flexibility?

Wasmtime GitHub notifications bot (Dec 01 2025 at 23:06):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Dec 01 2025 at 23:06):

cfallin created PR review comment:

Ah, interesting -- I built it this way because (i) colocated calls are smaller and thus the better choice (all else equal) when there are going to be a bunch of them -- 4 bytes on aarch64, 5 bytes on x86-64 -- and (ii) the expected use in Wasmtime is that there's a patchable-to-host trampoline for this hostcall compiled into the object, so it really is colocated. I think that if the destination were patched what you say would be true but here the target is statically known still and it's just the enabled/disabled status that's dynamically patched.

Wasmtime GitHub notifications bot (Dec 02 2025 at 00:35):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Dec 02 2025 at 00:35):

cfallin created PR review comment:

Sure, yep. It does still need to be a single-byte opcode (not an extended opcode) so it can be used to NOP out a sequence of any length -- the pre-existing definition as an extended opcode (two bytes) meant that wouldn't work. So this PR now moves it and adjusts the unit test to assert a single-byte encoding (value doesn't matter).

Wasmtime GitHub notifications bot (Dec 02 2025 at 00:35):

cfallin updated PR #12101.

Wasmtime GitHub notifications bot (Dec 02 2025 at 00:36):

cfallin updated PR #12101.

Wasmtime GitHub notifications bot (Dec 02 2025 at 00:37):

cfallin has enabled auto merge for PR #12101.

Wasmtime GitHub notifications bot (Dec 02 2025 at 01:22):

cfallin merged PR #12101.

Last updated: Jun 01 2026 at 09:49 UTC