cfallin opened PR #12101 from cfallin:magical-disappearing-patchable-call-now-you-see-it-now-you-dont to bytecodealliance:main:
The new
patchable_callCLIF instruction pairs with thepatchableABI, and emits a callsite with one new key property: the MachBuffer carries metadata that describes exactly which byte range to "NOP out" (overwrite with NOP instructions) to disable that callsite. Doing so is semantically valid and explicitly supported.This enables patching of code at runtime to dynamically turn on and off features such as instrumentation or debugging hooks. We plan to use this to implement breakpoints in Wasmtime's guest debugging support.
As part of this change, I added a notion of "unit of NOP bytes" to the MachBuffer so that the consumer (e.g., Wasmtime's Cranelift-based code compilation pipeline and metadata-producing logic) can handle patchable callsites without any other special knowledge of the ISA.
For the "real metal" ISAs there are perfectly well-defined NOPs to use, but for Pulley, where all opcodes are assigned at compile time by macro magic, I explicitly defined NOP as opcode byte 0 by moving
Nop's definition to the top of the list and adding a unit test asserting its encoding.A design note: in principle it would be possible, as an alternative, to treat "patchability" as an orthogonal dimension of all callsites, and emit the metadata describing the instruction-offset range for any callsite with the flag set. The only truly necessary semantic restriction is that there are no return values (because if we turn the callsite off, nothing writes to them); we could support patchability for other ABIs and for the other kinds of call instructions. The
patchableABI would then be better described as something like the "no clobbers ABI". I opted not to generalize in this way because it creates some less-tested corners and the generalized form, at least at the MachInst level, is not really much simpler in the end.A testing note: I opted not to implement actual code patching in the
cranelift-toolsfiletest runner and test patching callsites in/out via some actuation (e.g. a magic hostcall, like we do for throws) because (i) that's a lot of new plumbing and (ii) we are going to test this very shortly in Wasmtime anyway and (iii) the correctness (or not) of the location-and-length metadata is easy enough to verify in the disassemblies in the compile-tests.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested wasmtime-compiler-reviewers for a review on PR #12101.
cfallin requested wasmtime-default-reviewers for a review on PR #12101.
cfallin requested abrown for a review on PR #12101.
cfallin requested fitzgen for a review on PR #12101.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Would it be possible to do something like:
let mut bytes = smallvec::smallvec![]; pulley_interpreter::op::Nop {}.encode(&mut bytes); bytes
alexcrichton submitted PR review:
For Pulley, I'd agree it'd be nice to not have to rely on opcode 0 since that'd enable us to move it around and add payloads as necessary too. I vaguely recall seeing somewhere that some ISA (maybe wasm?) has opcode 0 as a trap which means if you accidentally start executing a bunch of zeros it'll die quickly as opposed to sliding into whatever's next and start executing that.
Otherwise though, I'm realizing that in the context of epochs-with-signal-handlers this is a pretty powerful instruction since, in theory, a patchable load (with the right ABI) could be replaced with a load-from-thing. More generally everything we could want to do with some sort of resumable exception seems like it would ideally want to go through this mechanism (ish) where IR-wise it's a call but runtime-wise we patch it to something that's a context-specific nop and signal handlers get to "rewrite" it to a call.
Given this, would it be worth fleshing out the ABI a bit here? For example is there a downside to saying that the
patchableABI clobbers 1 or 2 registers? Or something along those lines in lieu of attempting to encode the ABI in the function itself or something like that.I've not thought about this too too deeply per se, but it sort of feels like this is on the cusp of being a framework at least for catch-a-thing-with-a-signal-and-go-back-to-wasm albeit not precisely suitable as-is (due to no clobbers). I realize though that this goal is orthogonal to debugging stuff, so if you'd prefer to not try to think about this now that's also fine too
cfallin commented on PR #12101:
I agree that code-patching could be the basis for a lot of other mechanisms that push cost away from the common case (leaving it to be "just the cost of fetching some NOPs") and toward an uncommon case. In particular I like the idea (that I think you're suggesting?) of having a trap context rewrite all epoch-yield callsites into actual calls, while they're NOPs by default.
The one difficulty to keep in mind wrt anything involving signals is that patching the code will also require ownership of the
Store, which is something that we don't yet have in a signal context; I guess that's where the modified ABI comes in, so "vmctx is an implicit argument in a known fixed register" is a way around that.... but now that I think about that, the existing instruction is already suitable: the ABI is "no clobbers but there are still register args", so a
patchable_call %epoch_yield(v0)wherev0isvmctxwill already havevmctxinrdiorx0or whatever, even when the callsite is in its default NOP'd-out state. So from that we could get theStore(Instance::enter_from_wasmetc) and from that patch all sites.@erikrose there's a potential epoch-yield mechanism that requires no dead loads/stores at all -- happy to discuss further.
I'm not sure what other generalizations we could consider here other than the ones mentioned in my commit message above (any callsite / any ABI vs
patchableonly) but for that dimension I think I want to stick with what we have for the time being. If there's some other instruction we should have here too I'm happy to discuss further of course (maybe as part of a follow-on).
cfallin commented on PR #12101:
Ah, actually, I'm thinking in terms of a signal taken at the site itself; a signal taken from an arbitrary point (say
SIGALRMon a periodic basis) still has no way to getStoreownership in the general case, so scratch that...
cfallin commented on PR #12101:
So I guess the instruction one would actually want for the epochs case is
patchable_call_or_load %epoch_yield(v0), v1withv1 being the page that gets unmapped to cause a SIGSEGV, then in that context do the patch (and there we do have theStore`. Happy to sketch how that should look if @erikrose is curious.
cfallin edited a comment on PR #12101:
So I guess the instruction one would actually want for the epochs case is
patchable_call_or_load %epoch_yield(v0), v1withv1being the page that gets unmapped to cause a SIGSEGV, then in that context do the patch (and there we do have theStore. Happy to sketch how that should look if @erikrose is curious.
alexcrichton commented on PR #12101:
I should clarify I'm also stretching the imagination by using the term "patchable" here for what I'm saying. What I'd roughly envision is that we could have instructions (like epoch yields) which are translated to CLIF as a
patchable_callinstruction which gets machine code for "do the call". During compilation finalization, however, we'd rewrite all those call instructions to "do a load from rdi" or whatever register has the address of the load. At runtime nothing would actually end up being patched (avoids code copies which make instantiation expensive) and the signal handler would sort of do the patch for the code itself. The only thing that doesn't work in this scheme is the return address on x64 which must be located on the stack and we can't modify the stack in Windows vectored exception handlers (I even double-checked and apparently the red zone is 0 bytes on x64).If we got this working though this would all naturally extend to breakpoints where we could change the size of the patchable call to the size of a breakpoint instruction for that purpose.
In general my rough hope is that we wouldn't need a CLIF-instruction-per-use-case and could funnel them through one relatively narrow waist ideally. Until there's a solution for x64 Windows exceptions, however, I don't think it's worth pushing too much on this because anything non-portable is only of niche value to have lots of infrastructure supporting it
cfallin commented on PR #12101:
A few thoughts:
- a "Call or something else" doesn't need one to mutate the stack (fortunately) (I think!) because within the signal handler we just update the code we're about to resume at; then the actual call executes and does the return-address push; so this is strictly better than the difficulties we had with the various call-injection trampolines
- I am imagining that when used for breakpoints, we do the patching to NOP during code finalization, so the image on disk is all NOPs and we have the bytes to patch in for a breakpoint call stored in metadata; so there's no CoW cost for any of this in the default case
- That said, I'm not totally sure I get the full picture you have for what you'd have a breakpoint be -- currently my thought is NOP-by-default, patched-to-call. Where does the load come in?
- Making this a more general framework where we could have (say) any two instructions, or patch one to another, or whatever, seems difficult from a compiler architecture point of view. I think the direction I see as feasible would be a "swiss army knife patchable instruction" that knows how to (provides metadata to) become (i) a call, (ii) a load, (iii) a NOP, (iv) maybe something else. Then any use case that needs to patch between any subset of these possible side-effects is subsumed by it.
All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!
alexcrichton commented on PR #12101:
All that said, I think this discussion is well ahead of what we need to get breakpoints-MVP, so I'd prefer to defer it if that's OK and get review on the PR as-is, unless you think it needs to change in the short term!
:+1: from me. I've added this to the Cranelift meeting where we can continue discussion
fitzgen submitted PR review:
LGTM modulo the pulley nop thing.
Fine with not exercising the actual patching in
cranelift-tools, since we have filetests, and all the actual patching will come in time as Wasmtime starts using this stuff.
fitzgen created PR review comment:
+1 to avoiding the hard-coded assumption that
nop == 0in Pulley.
fitzgen created PR review comment:
Shouldn't it be the opposite? The function address we patch in is not necessarily in the same compilation unit as the current function, so it seems like we should always use
!colocatedfor the most flexibility?
cfallin submitted PR review.
cfallin created PR review comment:
Ah, interesting -- I built it this way because (i) colocated calls are smaller and thus the better choice (all else equal) when there are going to be a bunch of them -- 4 bytes on aarch64, 5 bytes on x86-64 -- and (ii) the expected use in Wasmtime is that there's a patchable-to-host trampoline for this hostcall compiled into the object, so it really is colocated. I think that if the destination were patched what you say would be true but here the target is statically known still and it's just the enabled/disabled status that's dynamically patched.
cfallin submitted PR review.
cfallin created PR review comment:
Sure, yep. It does still need to be a single-byte opcode (not an extended opcode) so it can be used to NOP out a sequence of any length -- the pre-existing definition as an extended opcode (two bytes) meant that wouldn't work. So this PR now moves it and adjusts the unit test to assert a single-byte encoding (value doesn't matter).
cfallin updated PR #12101.
cfallin updated PR #12101.
cfallin has enabled auto merge for PR #12101.
cfallin merged PR #12101.
Last updated: Dec 06 2025 at 06:05 UTC