wasmtime / issue #11964 Debug: plan for simple libcall/in... · git-wasmtime · Zulip Chat Archive

Stream: git-wasmtime

Topic: wasmtime / issue #11964 Debug: plan for simple libcall/in...

Wasmtime GitHub notifications bot (Nov 01 2025 at 19:11):

cfallin opened issue #11964:

After offline discussion with @alexcrichton and @fitzgen about the design choices raised in #11826, #11921, #11930, and elsewhere, we've settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to:

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 and followups, so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space. We could make maximal use of hardware traps and redirect them to the debugger, including creating new scenarios where a hardware trap occurs purely for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. We already have pointers to a few pieces of the store in our TLS structure, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this adds an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.
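To make the first cost concrete, here is a minimal sketch of what an explicit, software-checked memory access looks like when an out-of-bounds access becomes an ordinary call rather than a hardware fault. Everything named here (`Trap`, `trap_libcall`, `checked_load_u32`, the `on_trap` hook) is hypothetical and only illustrates the shape of the idea; it is not Wasmtime's actual builtin or API.

```rust
/// Hypothetical trap kinds; not Wasmtime's actual `Trap` type.
#[derive(Debug, PartialEq)]
enum Trap {
    MemoryOutOfBounds,
}

/// Stand-in for the "trap libcall": a plain function call, so a debugger
/// hook can observe the trap before the instance is torn down.
fn trap_libcall(trap: Trap, on_trap: &mut dyn FnMut(&Trap)) -> Trap {
    on_trap(&trap);
    trap
}

/// A 32-bit little-endian load with an explicit software bounds check,
/// instead of relying on a guard page and a signal.
fn checked_load_u32(
    memory: &[u8],
    addr: u32,
    offset: u32,
    on_trap: &mut dyn FnMut(&Trap),
) -> Result<u32, Trap> {
    // Compute the effective address in 64 bits so it cannot wrap around.
    let start = addr as u64 + offset as u64;
    let end = start + 4;
    if end > memory.len() as u64 {
        // This compare-and-branch on every access is the extra runtime cost.
        return Err(trap_libcall(Trap::MemoryOutOfBounds, on_trap));
    }
    let start = start as usize;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&memory[start..start + 4]);
    Ok(u32::from_le_bytes(bytes))
}
```

The check itself is cheap, but it is paid on every memory access, which is where the slowdown quoted above comes from.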

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (done in #11895).

[ ] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin.

[ ] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments.

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement:

  [ ] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (I have a version of this in a private branch that I'll extract soon.)

  [ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

  [ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.
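The breakpoint and "step" items in the checklist above can be modeled roughly as follows. This is only a sketch under stated assumptions: `PatchSite`, `PrivateCode`, and all method names are hypothetical, the code buffer is a plain `Vec<u8>` rather than executable memory, and the per-site byte sequences stand in for whatever metadata the MachBuffer would eventually carry.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical metadata for one patchable callsite.
struct PatchSite {
    /// Byte offset of the callsite within the code buffer.
    offset: usize,
    /// Bytes of the call instruction (breakpoint enabled).
    call_bytes: Vec<u8>,
    /// NOP bytes of the same length (breakpoint disabled).
    nop_bytes: Vec<u8>,
}

/// A private, patchable copy of a module's code for one instance.
struct PrivateCode {
    code: Vec<u8>,
    /// Patchable callsites keyed by Wasm PC (one per sequence point).
    sites: BTreeMap<u32, PatchSite>,
    /// Breakpoints the user asked for explicitly.
    user_breakpoints: BTreeSet<u32>,
}

impl PrivateCode {
    /// Write either the call or the NOPs at the site for `wasm_pc`.
    /// Returns false if there is no sequence point at that PC.
    fn patch(&mut self, wasm_pc: u32, enabled: bool) -> bool {
        let site = match self.sites.get(&wasm_pc) {
            Some(site) => site,
            None => return false,
        };
        let bytes = if enabled { &site.call_bytes } else { &site.nop_bytes };
        // A real runtime would flip the pages RX -> RW before this write and
        // back to RX afterwards; a Vec<u8> needs no such step.
        self.code[site.offset..site.offset + bytes.len()].copy_from_slice(bytes);
        true
    }

    /// Enable a single breakpoint by Wasm PC and remember it.
    fn set_breakpoint(&mut self, wasm_pc: u32) -> bool {
        let ok = self.patch(wasm_pc, true);
        if ok {
            self.user_breakpoints.insert(wasm_pc);
        }
        ok
    }

    /// "Step" mode: patch in every breakpoint call.
    fn enter_step_mode(&mut self) {
        let pcs: Vec<u32> = self.sites.keys().copied().collect();
        for pc in pcs {
            self.patch(pc, true);
        }
    }

    /// Back to the sparse set: only user breakpoints stay enabled.
    fn leave_step_mode(&mut self) {
        let pcs: Vec<u32> = self.sites.keys().copied().collect();
        for pc in pcs {
            let keep = self.user_breakpoints.contains(&pc);
            self.patch(pc, keep);
        }
    }
}
```

Because the copy is private to one instance, nothing here needs to synchronize with other instances of the same module; the cost of entering or leaving "step" mode is linear in the number of sequence points.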

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64; on x86-64, a 5-byte call instead of a 2-byte ud2; the ABI is key to ensuring only the call instruction itself is needed). So the "only" loss is that we need explicit bounds-checking, and we can probably live with that.
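For the x86-64 size comparison, the following sketch shows the two byte sequences a patchable callsite could toggle between: a 5-byte `call rel32` (opcode `E8` plus a 32-bit displacement relative to the next instruction) and the standard 5-byte NOP. The function names and the choice of this particular NOP encoding are assumptions for illustration, not what Cranelift emits.

```rust
/// Length of `call rel32` on x86-64: one opcode byte plus a 32-bit displacement.
const CALL_LEN: usize = 5;

/// The standard 5-byte NOP (`nopl 0x0(%rax,%rax,1)`), same length as the call.
const NOP5: [u8; CALL_LEN] = [0x0F, 0x1F, 0x44, 0x00, 0x00];

/// Encode `call target` for a callsite located at address `site`.
/// Returns None if the target is out of range of a 32-bit displacement.
fn encode_call_rel32(site: u64, target: u64) -> Option<[u8; CALL_LEN]> {
    // The displacement is relative to the address of the next instruction.
    let next = site.checked_add(CALL_LEN as u64)?;
    let disp = target as i128 - next as i128;
    if disp < i32::MIN as i128 || disp > i32::MAX as i128 {
        return None;
    }
    let d = (disp as i32).to_le_bytes();
    Some([0xE8, d[0], d[1], d[2], d[3]])
}

/// Overwrite the callsite at `offset` with either the call or the NOP.
/// Assumes the caller has already made the code writable.
fn patch_site(code: &mut [u8], offset: usize, bytes: &[u8; CALL_LEN]) {
    code[offset..offset + CALL_LEN].copy_from_slice(bytes);
}
```

On aarch64 the corresponding instructions (`bl`, `brk`, `nop`) are all 4 bytes, which is why the patchable call costs nothing extra in code size there.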

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Nov 01 2025 at 19:14):

cfallin assigned cfallin to issue #11964.

Wasmtime GitHub notifications bot (Nov 01 2025 at 19:14):

cfallin added the wasmtime:debugging label to Issue #11964.

Wasmtime GitHub notifications bot (Nov 03 2025 at 17:03):

cfallin edited issue #11964:

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

Wasmtime GitHub notifications bot (Nov 04 2025 at 23:40):

cfallin edited issue #11964:

[ ] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (pending in #11982).

Wasmtime GitHub notifications bot (Nov 05 2025 at 01:38):

cfallin edited issue #11964:

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (done in #11982).

Wasmtime GitHub notifications bot (Nov 20 2025 at 00:37):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.
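To make the first cost concrete, here is a minimal sketch of what an explicit bounds check on a load looks like when the out-of-bounds case goes through a trap libcall instead of a hardware fault. The `Trap` enum, `load_u32`, and the `trap_libcall` callback are hypothetical stand-ins for this example and do not mirror Wasmtime's actual builtins.

```rust
/// Trap kinds relevant to this sketch (hypothetical, not Wasmtime's real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Trap {
    MemoryOutOfBounds,
}

/// Load a little-endian u32 from `memory` at `addr + offset`, checking bounds
/// explicitly. On failure, call `trap_libcall` (standing in for the trap
/// builtin, where a debugger callback could run) and return `None`.
pub fn load_u32(
    memory: &[u8],
    addr: u32,
    offset: u32,
    trap_libcall: &mut dyn FnMut(Trap),
) -> Option<u32> {
    // Compute in u64 so that `addr + offset + 4` cannot wrap around.
    let start = u64::from(addr) + u64::from(offset);
    let end = start + 4;
    if end > memory.len() as u64 {
        trap_libcall(Trap::MemoryOutOfBounds);
        return None;
    }
    let start = start as usize;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&memory[start..start + 4]);
    Some(u32::from_le_bytes(bytes))
}
```

The compare-and-branch on every access is the extra work that signal-based traps normally avoid by relying on guard pages.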

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (done in #11895).

[ ] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin.

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (done in #11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[ ] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (I have a version of this in a private branch that I'll extract soon.)

[ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, where a call and a brk are each one 4-byte instruction; a 2-byte ud2 vs. a 5-byte call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Nov 20 2025 at 01:43):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (done in #11895).

[ ] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin. (pending in #12052)

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (done in #11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[ ] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (I have a version of this in a private branch that I'll extract soon.)

[ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.
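The "breakpoint by Wasm PC" and "step" items in the checklist above could be modeled roughly as follows. This is only a sketch of the bookkeeping: it assumes a map from Wasm PC to the native offset of that sequence point's patchable callsite, and the `Breakpoints` type and its methods are hypothetical names, not the planned debugger API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical bookkeeping for breakpoints in one private copy of code.
pub struct Breakpoints {
    /// Wasm PC of each sequence point -> native offset of its patchable call.
    sites: BTreeMap<u32, usize>,
    /// Breakpoints explicitly requested by the user ("sparse" mode).
    user: BTreeSet<u32>,
    /// When true, every sequence point is treated as enabled ("step" mode).
    stepping: bool,
}

impl Breakpoints {
    pub fn new(sites: BTreeMap<u32, usize>) -> Self {
        Self { sites, user: BTreeSet::new(), stepping: false }
    }

    /// Request a breakpoint; returns false if no sequence point has this PC.
    pub fn add(&mut self, wasm_pc: u32) -> bool {
        if self.sites.contains_key(&wasm_pc) {
            self.user.insert(wasm_pc);
            true
        } else {
            false
        }
    }

    /// Remove a breakpoint; returns whether it was previously set.
    pub fn remove(&mut self, wasm_pc: u32) -> bool {
        self.user.remove(&wasm_pc)
    }

    pub fn set_stepping(&mut self, on: bool) {
        self.stepping = on;
    }

    /// Native offsets that should currently hold a call rather than a NOP,
    /// in ascending Wasm PC order.
    pub fn patched_offsets(&self) -> Vec<usize> {
        if self.stepping {
            self.sites.values().copied().collect()
        } else {
            self.user.iter().map(|pc| self.sites[pc]).collect()
        }
    }
}
```

Keeping the user's sparse set separate from the stepping flag means that leaving "step" mode restores exactly the breakpoints that were set before.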

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, where a call and a brk are each one 4-byte instruction; a 2-byte ud2 vs. a 5-byte call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Nov 22 2025 at 02:16):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[ ] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.
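For the memory-watchpoint item, one possible shape of a shadow memory is sketched below: one flag per byte of linear memory, consulted on every store. The RFC's actual design may differ (for example in granularity or in how the check is compiled), so the `WatchedMemory` type and its methods are assumptions made purely for illustration.

```rust
/// Error for accesses outside the memory (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct OutOfBounds;

/// Linear memory paired with a shadow array marking watched bytes.
pub struct WatchedMemory {
    data: Vec<u8>,
    /// `shadow[i]` is true when byte `i` of `data` is covered by a watchpoint.
    shadow: Vec<bool>,
}

impl WatchedMemory {
    pub fn new(size: usize) -> Self {
        Self { data: vec![0; size], shadow: vec![false; size] }
    }

    /// Mark `len` bytes starting at `start` as watched.
    pub fn watch(&mut self, start: usize, len: usize) -> Result<(), OutOfBounds> {
        let end = start.checked_add(len).ok_or(OutOfBounds)?;
        let range = self.shadow.get_mut(start..end).ok_or(OutOfBounds)?;
        range.fill(true);
        Ok(())
    }

    /// Perform a store; returns `Ok(true)` if any written byte is watched,
    /// which is where a debugger callback would be invoked.
    pub fn store(&mut self, addr: usize, bytes: &[u8]) -> Result<bool, OutOfBounds> {
        let end = addr.checked_add(bytes.len()).ok_or(OutOfBounds)?;
        let dst = self.data.get_mut(addr..end).ok_or(OutOfBounds)?;
        dst.copy_from_slice(bytes);
        Ok(self.shadow[addr..end].iter().any(|&watched| watched))
    }
}
```

Since explicit bounds checks are already required in this plan, the shadow lookup can sit on the same software path as the bounds check.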

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, where a call and a brk are each one 4-byte instruction; a 2-byte ud2 vs. a 5-byte call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Nov 22 2025 at 04:33):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[ ] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.
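The "next" item in the checklist above (step over calls) can be sketched as depth tracking on top of step mode: with every sequence point enabled, function entry and exit hooks adjust a relative call depth, and the debugger pauses only at sequence points that are not deeper than where "next" was issued. The `Event` and `NextStepper` names below are hypothetical and only illustrate the logic.

```rust
/// Events the instrumented code would report (hypothetical for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Event {
    FunctionEntry,
    FunctionExit,
    /// A Wasm sequence point, identified by its Wasm PC.
    SequencePoint(u32),
}

/// State for one "next" request. Depth is relative to the frame in which
/// the request was issued, so it goes negative once that frame returns.
pub struct NextStepper {
    depth: i64,
}

impl NextStepper {
    pub fn new() -> Self {
        Self { depth: 0 }
    }

    /// Returns `Some(pc)` when execution should pause at this event.
    pub fn on_event(&mut self, event: Event) -> Option<u32> {
        match event {
            Event::FunctionEntry => {
                self.depth += 1;
                None
            }
            Event::FunctionExit => {
                self.depth -= 1;
                None
            }
            // Pause only in the original frame or one of its callers.
            Event::SequencePoint(pc) if self.depth <= 0 => Some(pc),
            // Sequence points inside callees are skipped.
            Event::SequencePoint(_) => None,
        }
    }
}
```

Plain "step" is the same logic without the depth condition: pause at the very next sequence point regardless of frame.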

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, where a call and a brk are each one 4-byte instruction; a 2-byte ud2 vs. a 5-byte call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Nov 23 2025 at 18:50):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial: we need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead; and also we already have pointers to a few pieces of the store, but not the whole thing, in our TLS structure; and also some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase; and also all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64; 2 bytes for ud2 vs. 5 bytes for a call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.
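
The relationship between "enable a breakpoint by Wasm PC" and "step" mode described in the checklist can be sketched as follows. This is only a model under stated assumptions: `Breakpoints` and its methods are hypothetical names, and a `bool` per sequence point stands in for the actual patched code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-store breakpoint state; names are illustrative, not Wasmtime's API.
struct Breakpoints {
    /// Every patchable sequence point, keyed by Wasm PC; the value models
    /// whether the call is currently patched in.
    sites: BTreeMap<u32, bool>,
    /// Breakpoints the user explicitly requested.
    user: BTreeSet<u32>,
    /// In "step" mode every site is enabled regardless of `user`.
    stepping: bool,
}

impl Breakpoints {
    fn new(pcs: &[u32]) -> Self {
        Breakpoints {
            sites: pcs.iter().map(|&pc| (pc, false)).collect(),
            user: BTreeSet::new(),
            stepping: false,
        }
    }

    /// Request a breakpoint at `pc`; returns false if no sequence point exists there.
    fn add(&mut self, pc: u32) -> bool {
        if !self.sites.contains_key(&pc) {
            return false;
        }
        self.user.insert(pc);
        true
    }

    /// Switch between "step" mode and the sparse user-defined set.
    fn set_stepping(&mut self, on: bool) {
        self.stepping = on;
    }

    /// Bring every site in line with the current mode; returns how many sites
    /// changed state (i.e. how many callsites would need re-patching).
    fn sync(&mut self) -> usize {
        let mut patched = 0;
        for (pc, enabled) in self.sites.iter_mut() {
            let want = self.stepping || self.user.contains(pc);
            if *enabled != want {
                *enabled = want;
                patched += 1;
            }
        }
        patched
    }
}
```

The count returned by `sync` is the quantity the "human speed" hypothesis is about: entering or leaving "step" mode touches nearly every site, while adding a single breakpoint touches one.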

Wasmtime GitHub notifications bot (Nov 23 2025 at 18:50):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial: we need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead; and also we already have pointers to a few pieces of the store, but not the whole thing, in our TLS structure; and also some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase; and also all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (pending in #12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself.

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64; 2 bytes for ud2 vs. 5 bytes for a call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.
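
One way to picture "next" built from breakpoints plus function entry and exit hooks is a call-depth counter, sketched below. The `StepOver` type is a hypothetical illustration, not part of Wasmtime; it assumes every sequence point is patched in (as in "step" mode) and that the entry and exit hooks fire once per frame push and pop.

```rust
/// Hypothetical state for a "next" (step-over) request; not Wasmtime's API.
struct StepOver {
    /// Call depth relative to the frame where "next" was requested.
    depth: i64,
}

impl StepOver {
    fn new() -> Self {
        StepOver { depth: 0 }
    }

    /// Function-entry hook: a callee frame was pushed.
    fn on_entry(&mut self) {
        self.depth += 1;
    }

    /// Function-exit hook: a frame was popped.
    fn on_exit(&mut self) {
        self.depth -= 1;
    }

    /// Breakpoint hook: pause only when execution is back in the original
    /// frame or has returned to one of its callers.
    fn should_pause(&self) -> bool {
        self.depth <= 0
    }
}
```

Breakpoints that fire inside callees are skipped because the depth is positive there, which is what distinguishes "next" from "step".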

Wasmtime GitHub notifications bot (Dec 02 2025 at 04:00):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial: we need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead; and also we already have pointers to a few pieces of the store, but not the whole thing, in our TLS structure; and also some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase; and also all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[ ] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64; 2 bytes for ud2 vs. 5 bytes for a call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.
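
The memory-watchpoint item relies on a shadow memory; the RFC's actual layout is not reproduced in this thread, so the following is only a sketch under the assumption of one flag per byte of linear memory. `ShadowMemory`, `watch`, and `hits` are hypothetical names.

```rust
/// Hypothetical shadow memory: one flag per byte of Wasm linear memory.
/// This layout is an assumption for illustration, not the RFC's design.
struct ShadowMemory {
    watched: Vec<bool>,
}

impl ShadowMemory {
    fn new(memory_len: usize) -> Self {
        ShadowMemory { watched: vec![false; memory_len] }
    }

    /// Mark `len` bytes starting at `addr` as watched (clamped to the memory size).
    fn watch(&mut self, addr: usize, len: usize) {
        let end = addr.saturating_add(len).min(self.watched.len());
        let start = addr.min(end);
        self.watched[start..end].fill(true);
    }

    /// Check an access of `len` bytes at `addr`; returns true if it touches
    /// any watched byte. Out-of-range parts of the access are ignored here.
    fn hits(&self, addr: usize, len: usize) -> bool {
        let end = addr.saturating_add(len).min(self.watched.len());
        let start = addr.min(end);
        self.watched[start..end].iter().any(|&w| w)
    }
}
```

In this model, instrumented loads and stores would consult `hits` and call into the debugger only when it returns true.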

Wasmtime GitHub notifications bot (Dec 02 2025 at 04:00):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial: we need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead; and also we already have pointers to a few pieces of the store, but not the whole thing, in our TLS structure; and also some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase; and also all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[ ] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64; 2 bytes for ud2 vs. 5 bytes for a call on x86-64; the ABI is key to ensuring only the call instruction itself is needed), so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.
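
The "explicit bounds-checking" cost mentioned above comes from replacing a hardware fault with a software check followed by a call into the runtime. A minimal sketch of that shape is below; the `Trap` enum, `load_u32`, and the `on_trap` callback are hypothetical stand-ins, not Wasmtime's trap builtin or the code Cranelift emits.

```rust
/// Hypothetical trap code; Wasmtime's real trap type is not used here.
#[derive(Debug, PartialEq)]
enum Trap {
    MemoryOutOfBounds,
}

/// Load a little-endian u32 with an explicit bounds check. On failure, call
/// `on_trap` (standing in for the trap libcall and its debugger callback
/// point) before returning the trap.
fn load_u32(memory: &[u8], addr: u64, on_trap: &mut dyn FnMut(&Trap)) -> Result<u32, Trap> {
    match addr.checked_add(4) {
        Some(end) if end <= memory.len() as u64 => {
            let a = addr as usize;
            let bytes = [memory[a], memory[a + 1], memory[a + 2], memory[a + 3]];
            Ok(u32::from_le_bytes(bytes))
        }
        _ => {
            let trap = Trap::MemoryOutOfBounds;
            on_trap(&trap);
            Err(trap)
        }
    }
}
```

Since the failing path is an ordinary call, the debugger receives control without any signal handling or per-ISA assembly stubs.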

Wasmtime GitHub notifications bot (Dec 03 2025 at 01:45):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. We already have pointers to a few pieces of the store in our TLS structure, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this adds an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.
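To make the first cost above concrete, here is a minimal sketch of what "explicit bounds checking plus a libcall-style trap" amounts to, written as plain Rust rather than generated code. Every name in it (`Trap`, `DebugHandler`, `trap_libcall`, `load_u32`, `RecordingDebugger`) is hypothetical and chosen for illustration; none of them are Wasmtime's actual types or builtins.

```rust
/// Hypothetical trap code; a real runtime has many more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trap {
    MemoryOutOfBounds,
}

/// Hypothetical debugger hook, notified before a trap unwinds the instance.
pub trait DebugHandler {
    fn on_trap(&mut self, trap: Trap, wasm_pc: u32);
}

/// Stand-in for a trap builtin: an ordinary call into the "runtime" that
/// gives the debugger control first and then hands the trap back to unwind.
fn trap_libcall(dbg: &mut dyn DebugHandler, trap: Trap, wasm_pc: u32) -> Trap {
    dbg.on_trap(trap, wasm_pc);
    trap
}

/// A 32-bit little-endian load with an explicit (software) bounds check,
/// instead of relying on a guard page and a hardware fault.
pub fn load_u32(
    memory: &[u8],
    addr: u32,
    offset: u32,
    wasm_pc: u32,
    dbg: &mut dyn DebugHandler,
) -> Result<u32, Trap> {
    // Compute the effective range in u64 so the additions cannot wrap.
    let start = addr as u64 + offset as u64;
    let end = start + 4;
    if end > memory.len() as u64 {
        return Err(trap_libcall(dbg, Trap::MemoryOutOfBounds, wasm_pc));
    }
    let s = start as usize;
    Ok(u32::from_le_bytes([
        memory[s],
        memory[s + 1],
        memory[s + 2],
        memory[s + 3],
    ]))
}

/// A trivial debugger that only records the traps it was told about.
#[derive(Default)]
pub struct RecordingDebugger {
    pub seen: Vec<(Trap, u32)>,
}

impl DebugHandler for RecordingDebugger {
    fn on_trap(&mut self, trap: Trap, wasm_pc: u32) {
        self.seen.push((trap, wasm_pc));
    }
}
```

The compare-and-branch on every access is where the slowdown quoted above would come from; the upside is that the trap path is a normal call, so no signal handler or per-ISA stub is involved.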

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint.

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC.

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks.

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.
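The "breakpoint by Wasm PC" and "step by patching everything" items in the checklist above can be sketched as a small bookkeeping structure. This is only an illustration under stated assumptions: `Breakpoints` and its methods are hypothetical names, a boolean per sequence point stands in for actually patching code, and the `patches` counter exists only to make the cost of switching modes visible.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-store breakpoint bookkeeping.
pub struct Breakpoints {
    /// Wasm PC of each sequence point -> is the call currently patched in?
    patched: BTreeMap<u32, bool>,
    /// Breakpoints explicitly requested by the user (the sparse set).
    user: BTreeSet<u32>,
    /// In step mode every sequence point is patched in.
    stepping: bool,
    /// Number of patch operations performed so far.
    pub patches: usize,
}

impl Breakpoints {
    /// Start with every patchable call disabled.
    pub fn new(sequence_points: &[u32]) -> Self {
        Breakpoints {
            patched: sequence_points.iter().map(|&pc| (pc, false)).collect(),
            user: BTreeSet::new(),
            stepping: false,
            patches: 0,
        }
    }

    /// Enable a breakpoint; returns false if `pc` is not a sequence point.
    pub fn add(&mut self, pc: u32) -> bool {
        if !self.patched.contains_key(&pc) {
            return false;
        }
        self.user.insert(pc);
        self.sync();
        true
    }

    /// Remove a user breakpoint (a no-op if it was never set).
    pub fn remove(&mut self, pc: u32) {
        self.user.remove(&pc);
        self.sync();
    }

    /// Switch between "step" mode and "sparse breakpoint set" mode.
    pub fn set_stepping(&mut self, on: bool) {
        self.stepping = on;
        self.sync();
    }

    pub fn is_patched(&self, pc: u32) -> bool {
        self.patched.get(&pc).copied().unwrap_or(false)
    }

    /// Bring every callsite to its desired state, counting actual changes.
    fn sync(&mut self) {
        for (pc, state) in self.patched.iter_mut() {
            let want = self.stepping || self.user.contains(pc);
            if *state != want {
                *state = want;
                self.patches += 1;
            }
        }
    }
}
```

Entering or leaving step mode touches every sequence point that is not already in the desired state, which is the "few megabytes of calls" worst case mentioned in the checklist; a function-at-a-time variant would restrict `sync` to one function's range.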

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.
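As a rough illustration of the size comparison above, the sketch below toggles one x86-64 callsite between a 5-byte `call rel32` and a 5-byte NOP inside a plain byte buffer. The `PatchableCallsite` struct and the helper functions are hypothetical and are not Cranelift's MachBuffer metadata; a real runtime would also need to flip page permissions (RW, then back to RX) around the write, which is omitted here.

```rust
/// Hypothetical description of one patchable callsite.
#[derive(Debug, Clone, PartialEq)]
pub struct PatchableCallsite {
    /// Byte offset of the callsite within the code buffer.
    pub offset: usize,
    /// Bytes that encode the call itself.
    pub enabled: Vec<u8>,
    /// NOP bytes of the same length that disable the call.
    pub disabled: Vec<u8>,
}

/// The recommended x86-64 5-byte NOP (`0F 1F 44 00 00`).
pub const X64_NOP5: [u8; 5] = [0x0f, 0x1f, 0x44, 0x00, 0x00];

/// Encode `call rel32` located at code offset `at`, targeting offset `target`.
pub fn x64_call_rel32(at: usize, target: usize) -> [u8; 5] {
    // rel32 is measured from the end of the 5-byte call instruction.
    let rel = (target as i64 - (at as i64 + 5)) as i32;
    let r = rel.to_le_bytes();
    [0xe8, r[0], r[1], r[2], r[3]]
}

/// Patch the callsite in `code` to its enabled or disabled encoding.
pub fn set_enabled(
    code: &mut [u8],
    site: &PatchableCallsite,
    on: bool,
) -> Result<(), String> {
    if site.enabled.len() != site.disabled.len() {
        return Err("enabled/disabled encodings differ in length".into());
    }
    let bytes = if on { &site.enabled } else { &site.disabled };
    let end = site
        .offset
        .checked_add(bytes.len())
        .ok_or("callsite offset overflows")?;
    let slot = code
        .get_mut(site.offset..end)
        .ok_or("callsite is out of bounds")?;
    slot.copy_from_slice(bytes);
    Ok(())
}
```

On aarch64 the same scheme would swap a 4-byte `bl` with a 4-byte `nop`, which is why the text calls it exactly as good as a trapping instruction there; on x86-64 the call costs 5 bytes against 2 for `ud2`.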

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 06 2025 at 23:32):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. We already have pointers to a few pieces of the store in our TLS structure, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this adds an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[ ] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 12 2025 at 20:33):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. We already have pointers to a few pieces of the store in our TLS structure, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this adds an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[x] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[ ] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 12 2025 at 20:33):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. We already have pointers to a few pieces of the store in our TLS structure, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this adds an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[ ] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[x] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[x] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[ ] Then we can implement "step" using breakpoints. For simplicity, for "step", let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that turns out not to be the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 12 2025 at 20:33):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[x] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[x] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[x] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[ ] Then we can implement "step" using breakpoints. For simplicity, let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that ends up not being the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 12 2025 at 20:33):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[x] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[x] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[x] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[x] Then we can implement "step" using breakpoints. For simplicity, let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that ends up not being the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.

[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.

Wasmtime GitHub notifications bot (Dec 19 2025 at 17:48):

cfallin edited issue #11964:

After offline discussion with @alexcrichton and @fitzgen, we've discussed some of the design choices that were brought up in discussion in #11826, #11921, #11930, and elsewhere, and settled on a reasonable path for "simplest possible debug instrumentation that can work". I wanted to document that here as a meta-issue with a checklist.

Background and Main Choice: Hardware vs. Software Debugger Entries

To start: a debugger of bytecode-compiled-to-native-code needs to be able to

Inspect/recover bytecode-level program state from the native code, somehow, when observing paused stack frames;

Receive control when a trap occurs that would kill the instance;

Insert additional points ("breakpoints"/"watchpoints") where it can receive control and then eventually return it ("resumable traps" in a sense).

We have the state inspection part covered in #11769 (built on #11768 and #11783, and with #11873 and #11899 as followups), and we also have a callback/hook framework to register a debugger and listen for events (#11895), so the remaining focus is how to build the "control-flow interjection" aspect.

There is a fundamental choice in the design space: we could either make maximal use of hardware traps, and redirect them to the debugger; including creating new scenarios where a hardware trap occurs, for debugger purposes (e.g. patching in "break" instructions for breakpoints that raise SIGILL, and resuming by jumping past them). Or we could perform all checks for trapping conditions, breakpoint conditions, etc., in software, and do a "normal" libcall into the runtime.

Some aspects of the tradeoff are:

Redirecting traps is very complex and subtle, and depends on the combination of the ISA and OS/platform. We already have some specific code at each intersection point of ISA and OS to handle signals, but most of this is factored (general code for "Unix signals", and just a few lines to get the right registers on x86-64 or aarch64 or ...). In contrast, in #11930 we see that we need a full assembly stub and strategy for each ISA+OS, and sometimes ISA variants (RISC-V with or without vectors, for example) too.

This is entirely possible to build and maintain, but the complexity does imply ongoing maintenance cost, and also some additional risk as this is a core load-bearing part of the runtime (trap handling generally).

Along with that, while in principle it's possible to turn traps into "virtual libcalls", the pointer provenance story is nontrivial. We need to be able to recover the current store from TLS and plumb that into the libcall, but all libcalls today take the current instance (vmctx) as their first arg instead. Our TLS structure already has pointers to a few pieces of the store, but not the whole thing. Some points in the matrix above (macOS in particular) don't give access to TLS during the trap-redirection phase. And all of this requires careful reasoning about ownership and held/cached state of the store in the Cranelift code too (see the last bullet-point in this comment for more), as this is adding an implicit call with a mut store borrow on every trapping instruction.

On the other hand, giving up any trap-based mechanism imposes more runtime cost:

It means that we have to turn off signal-based traps (conveniently we already have and test this option!), which implies a 1.5-2x slowdown, mainly due to explicit bounds checking for memories.

It means that we can't use a patchable single instruction (ud2 on x86-64 / brk on aarch64 / ...) for breakpoints, adding some code size.
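To make the software-based option concrete, here is a minimal sketch of what "perform all checks in software and do a normal libcall" could look like for a single memory load. It is an illustration only: `Trap`, `DebugHook`, `trap_libcall`, `checked_load_u32`, and `RecordingHook` are hypothetical names invented for this example, not Wasmtime APIs, and real generated code does this in machine code rather than Rust.

```rust
/// Hypothetical trap code; not Wasmtime's actual `Trap` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trap {
    MemoryOutOfBounds,
}

/// Hypothetical debugger hook, invoked before the trap unwinds the instance.
pub trait DebugHook {
    fn on_trap(&mut self, trap: Trap, wasm_pc: u32);
}

/// Stand-in for the runtime's trap libcall. Because it is an ordinary
/// function call, the debugger gets control without any signal handling.
pub fn trap_libcall(hook: &mut dyn DebugHook, trap: Trap, wasm_pc: u32) -> Trap {
    hook.on_trap(trap, wasm_pc);
    trap
}

/// Software-checked 32-bit little-endian load from linear memory.
pub fn checked_load_u32(
    memory: &[u8],
    addr: u64,
    wasm_pc: u32,
    hook: &mut dyn DebugHook,
) -> Result<u32, Trap> {
    // Explicit bounds check: this compare-and-branch is the cost that
    // signal-based traps would otherwise avoid.
    match addr.checked_add(4) {
        Some(end) if end <= memory.len() as u64 => {
            let s = addr as usize;
            Ok(u32::from_le_bytes([
                memory[s],
                memory[s + 1],
                memory[s + 2],
                memory[s + 3],
            ]))
        }
        // Out of bounds (or address overflow): call into the "runtime".
        _ => Err(trap_libcall(hook, Trap::MemoryOutOfBounds, wasm_pc)),
    }
}

/// Example hook that records every trap it is told about.
#[derive(Default)]
pub struct RecordingHook {
    pub seen: Vec<(Trap, u32)>,
}

impl DebugHook for RecordingHook {
    fn on_trap(&mut self, trap: Trap, wasm_pc: u32) {
        self.seen.push((trap, wasm_pc));
    }
}
```

The point of the sketch is that the trapping path is just a call with explicit arguments, so no per-ISA, per-OS redirection logic is involved; the price is the bounds check on every access.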

Despite the increased cost, the complexities of call injection on traps are significant and this has pushed us to limit scope and make the software-based approach work. That, in turn, has led to some brainstorming around bringing that cost down. To that end, the plan...

Plan: Software-based with Patchable Code

Our plan for a simple yet reasonably performant debugger that can handle traps and insert breakpoints is:

[x] Augment traps raised by the trap libcall with a debugger callback point (#11895).

[x] Turn off signal-based traps when debugging is enabled. That means that every trap becomes a libcall to the trap builtin (#12052).

[x] Ensure that Pulley uses libcall traps too. Currently here Pulley is special-cased to continue to rely on trapping instructions because the "traps" are handled in the interpreter rather than with true OS signals. Modifying that conditional leads to a few cases of missing instruction lowerings that I'm working through, but otherwise this should "just work". At this point, the debugger can now catch all traps in native and Pulley environments (#11982).

[x] Modify the runtime to allow private copies of code to exist for individual instances of a module. This will allow us to flip permissions to RW, patch code, and flip back to RX whenever we have control in the runtime without fear of race conditions, and without impact on other instances of the same module. (#12051) That lets us build a more efficient mechanism for...

[x] ... breakpoints with patchable calls. The idea here is to implement

[x] A new ABI that has no clobbers (no caller-saves) and takes arg(s) only in registers; this lets callsites be single call instructions and guarantees no impact on regalloc aside from fixed-reg constraints. The idea is that we can have a function that we invoke on breakpoint that will have no impact on the common case where we do not call it. (#12061)

[x] A new "patchable call" instruction in Cranelift that emits a normal call, restricted only to functions with the above new ABI (call it the "patchable callee ABI"?). This means we don't have to use the full callsite emission implementation and can emit a single call instruction with the right register constraints. The idea is that the MachBuffer will contain new metadata that indicates "patchable callsites" and specifies the bytes to patch in to enable or disable the call; we can do so by patching in appropriate NOP(s) or the call itself. (#12101)

[x] A modification to the sequence-point emission that adds a patchable call at every Wasm sequence point so we can patch in a breakpoint. (#12133)

[x] Then we can implement a debugger API to enable any given breakpoint by Wasm PC. (#12133)

[x] Then we can implement "step" using breakpoints. For simplicity, let's start by patching in all breakpoint calls in all modules in the store. My hypothesis here is that while the debugger API may be used in an automated way driven by higher-level algorithms (e.g. reversible execution), flipping back and forth between "step" mode and "sparse breakpoint set" mode likely only happens at human speed, so it's fine to potentially patch a few megabytes of calls in the worst case. If that ends up not being the case, we can do function-at-a-time patching of all breakpoints by adding separate function entry/exit hooks. (#12133)

~~[ ] Then we can implement "next" using breakpoints as well, plus function entry and exit hooks.~~ (Not needed for gdbstub protocol)

~~[ ] Then we can implement memory watchpoints using a shadow memory as described in the RFC.~~ (Move to post-MVP)
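The relationship between "sparse breakpoint set" mode and "step" mode in the checklist above can be sketched as bookkeeping over the set of sequence points. This is a rough model under stated assumptions: the `Breakpoints` type and its methods are hypothetical, and a boolean per site stands in for actually patching a call or NOP into the code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-store breakpoint state; not Wasmtime's actual API.
pub struct Breakpoints {
    /// Wasm PC of each sequence point, mapped to whether its patchable
    /// call is currently patched in.
    sites: BTreeMap<u32, bool>,
    /// Breakpoints the user explicitly requested.
    requested: BTreeSet<u32>,
    /// When set, every site is enabled regardless of `requested`.
    single_step: bool,
}

impl Breakpoints {
    pub fn new(sequence_points: impl IntoIterator<Item = u32>) -> Self {
        Breakpoints {
            sites: sequence_points.into_iter().map(|pc| (pc, false)).collect(),
            requested: BTreeSet::new(),
            single_step: false,
        }
    }

    /// Request a breakpoint; returns false if `pc` is not a sequence point.
    pub fn add_breakpoint(&mut self, pc: u32) -> bool {
        if !self.sites.contains_key(&pc) {
            return false;
        }
        self.requested.insert(pc);
        self.repatch();
        true
    }

    pub fn remove_breakpoint(&mut self, pc: u32) {
        self.requested.remove(&pc);
        self.repatch();
    }

    /// "Step" mode: patch in every breakpoint call; leaving it restores
    /// the sparse set.
    pub fn set_single_step(&mut self, enabled: bool) {
        self.single_step = enabled;
        self.repatch();
    }

    pub fn is_enabled(&self, pc: u32) -> bool {
        self.sites.get(&pc).copied().unwrap_or(false)
    }

    pub fn enabled_count(&self) -> usize {
        self.sites.values().filter(|e| **e).count()
    }

    /// Recompute the desired state of every site. A real implementation
    /// would rewrite code bytes here instead of flipping a flag.
    fn repatch(&mut self) {
        for (pc, enabled) in self.sites.iter_mut() {
            *enabled = self.single_step || self.requested.contains(pc);
        }
    }
}
```

Repatching every site on each mode switch mirrors the "patch all breakpoint calls" simplification in the checklist; the function-at-a-time alternative mentioned there would narrow the loop in `repatch` to the sites of one function.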

To highlight the new thinking/insights here: if we don't do call injection on traps, then the two lost capabilities are using hardware to catch normal Wasm traps, and using hardware to do very efficient "patching in of breakpoints". But patchable calls are nearly as good for the latter (exactly as good on aarch64, 2 bytes vs 5 bytes on x86-64; the ABI is key to ensuring only the call instruction itself is needed); so the "only" loss is that we need explicit bounds-checking, and we can probably live with that.
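As a rough illustration of the size comparison above, the following sketch patches an x86-64 callsite between a 5-byte `call rel32` (opcode `0xE8` plus a 32-bit displacement) and a 5-byte NOP. The `PatchableCallSite` type and its fields are hypothetical stand-ins for whatever metadata the MachBuffer records, and the code buffer here is a plain byte slice; in the runtime the same write would happen while the private code copy is temporarily RW.

```rust
/// The 5-byte x86-64 NOP encoding (`nopl 0x0(%rax,%rax,1)`).
pub const X64_NOP5: [u8; 5] = [0x0f, 0x1f, 0x44, 0x00, 0x00];

/// Hypothetical metadata for one patchable callsite: where it is, and
/// the two byte sequences that can live there.
pub struct PatchableCallSite {
    pub offset: usize,
    pub call_bytes: Vec<u8>,
    pub nop_bytes: Vec<u8>,
}

impl PatchableCallSite {
    /// Build a site for an x86-64 `call rel32`. `offset` and `target` are
    /// byte offsets within the same code buffer; this sketch assumes the
    /// displacement fits in 32 bits.
    pub fn x64(offset: usize, target: usize) -> Self {
        // rel32 is relative to the end of the 5-byte call instruction.
        let rel = (target as i64 - (offset as i64 + 5)) as i32;
        let mut call_bytes = vec![0xe8];
        call_bytes.extend_from_slice(&rel.to_le_bytes());
        PatchableCallSite {
            offset,
            call_bytes,
            nop_bytes: X64_NOP5.to_vec(),
        }
    }

    /// Patch the call in (`enabled == true`) or replace it with a NOP.
    pub fn set_enabled(&self, code: &mut [u8], enabled: bool) {
        let bytes = if enabled { &self.call_bytes } else { &self.nop_bytes };
        code[self.offset..self.offset + bytes.len()].copy_from_slice(bytes);
    }
}
```

Because both sequences have the same length, enabling or disabling a breakpoint never moves any other instruction, which is what lets the callee ABI with no clobbers keep the disabled case essentially free.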

I have WIP branches for the patchable ABI, for private code, and for Pulley to use libcall-based traps universally; I'll keep working through the checklist above as time allows.
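The "private code" item can be pictured as copy-on-write over a module's code image: instances share one copy until one of them needs to patch. The sketch below uses `std::sync::Arc::make_mut` to get that behavior; `InstanceCode` is a hypothetical type for illustration, and the real change additionally has to manage executable mappings and RW/RX permission flips, which are not modeled here.

```rust
use std::sync::Arc;

/// Hypothetical handle to the code an instance executes.
pub struct InstanceCode {
    code: Arc<Vec<u8>>,
}

impl InstanceCode {
    /// Start out sharing the module's code image.
    pub fn shared(module_code: &Arc<Vec<u8>>) -> Self {
        InstanceCode {
            code: Arc::clone(module_code),
        }
    }

    /// Patch bytes into this instance's code. `Arc::make_mut` clones the
    /// buffer first if anyone else still holds a reference to it, so other
    /// instances of the module never observe the patch.
    pub fn patch(&mut self, offset: usize, bytes: &[u8]) {
        let private = Arc::make_mut(&mut self.code);
        private[offset..offset + bytes.len()].copy_from_slice(bytes);
    }

    pub fn bytes(&self) -> &[u8] {
        &self.code
    }

    /// True once this instance holds the only reference to its buffer.
    pub fn is_private(&self) -> bool {
        Arc::strong_count(&self.code) == 1
    }
}
```

An instance that never sets a breakpoint keeps sharing the original image, so the memory cost is only paid by instances that are actually being debugged.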

Wasmtime GitHub notifications bot (Dec 19 2025 at 18:10):

cfallin closed issue #11964:


Wasmtime GitHub notifications bot (Dec 19 2025 at 18:10):

cfallin commented on issue #11964:

I've pushed the watchpoints item (last bullet-point) off to #12188. With that, I believe all the other tasks here are complete and the API is more or less sufficient to build a basic gdbstub-like interface to a top half (which I now intend to do!).

Last updated: Feb 24 2026 at 05:28 UTC