wasmtime / PR #9078 Cranelift: add stack_switch CLIF inst... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #9078 Cranelift: add stack_switch CLIF inst...

Wasmtime GitHub notifications bot (Aug 05 2024 at 14:57):

frank-emrich opened PR #9078 from frank-emrich:stack_switch to bytecodealliance:main:

This PR adds a new CLIF instruction for switching stacks. While the primary motivation is to support the Wasm stack switching proposal currently under development, the CLIF instruction here is lower level and thus intended to be useful for general-purpose stackful context switching (such as implementing coroutines, fibers, etc independently from the Wasm stack switching proposal).
This PR only adds support for the instruction on x64 Linux, but I'm planning to add support for more platforms over time. The design of the instruction should be sufficiently abstract to support all the other platforms.

While work is currently under way to implement Wasm stack switching here and indeed uses the CLIF instruction introduced by this PR successfully, it seems worthwhile just upstreaming the CLIF instruction by itself. The proposal is not fully finalized yet, and this CLIF instruction seems useful on its own and independent from the remainder of the Wasm proposal.

Concretely, the CLIF instruction looks as follows:
out_payload = stack_switch(store_context_ptr, load_context_ptr, in_payload)
This causes the following to happen:

The current execution context is saved: The current frame pointer, stack pointer and PC after the stack_switch instruction are stored at store_context_ptr. All other registers are marked as clobbered and thus spilled by regalloc as needed.

We load a new (SP, FP, PC) triple from load_context_ptr, indicating the stack/context to switch to. We assume that we are either switching to a stack that was either previously switched away from by another stack_switch, or it's a newly initialized stack.

The value in_payload is passed over to the other stack. In other words, if the instruction above switches from some stack A to another stack B, then the return value of the stack_switch instruction previously executed on B will be in_payload.

Execution continues on stack B.

If we ever switch back to stack A, the value out_payload above (i.e., the return value of the stack_switch executed when leaving stack A) is the payload argument passed to the corresponding switch.

A few additional notes:

store_context_ptr and load_context_ptr can be seen as pointers to what is conceptually a three-element struct, containing SP, FP, PC.

The pointers store_context_ptr and load_context_ptr are allowed to be equal. In particular, in steps 1 and 2 above, we ensure to actually load all required data from load_context_ptr before storing to store_context_ptr.

As mentioned above, there are two cases in step 2: We either switch to code where a matching stack_switch was executed, or to a new stack

If we switch back to (right behind) a previous stack_switch instruction, then regalloc has spilled all subsequently needed SSA values for us, no need to manually restore any context besides SP, FP, PC.

If we are executing on a new stack, we assume that execution starts inside a stack-switch aware trampoline

Payloads are currently hard-coded to be a single, word-sized value. That may seem arbitrary, but it's simply what's needed for our current implementation of Wasm stack switching on top of this CLIF instruction. It could later be extended to more general cases, if necessary.

The stack switching implemented here is "one-shot": If you execute a stack_switch to switch from a stack A to a stack B, we can only switch back to A once (unless we subsequently execute another stack_switch on A again).
This is different from setjmp/longjmp, where we may store the context once using setjmp and then return to it multiple times using longjmp without needing to call setjmp again.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:22):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:26):

frank-emrich commented on PR #9078:

I'm marking this as "ready for review" now, but I have a few questions about things that will probably require additional fixes before this is actually ready:

Currently, the instruction is lowered by very simple rules in codegen/src/isa/x64/lower.isle. However, the emitted code is only valid for x64 Linux, not x64 Windows.
This means that I should probably do something similar to function calls, where my instruction is lowered using an external gen_stack_switch function. What's the cleanest way to simply refuse lowering on x64 Windows for the time being?

The layout of the data read and written by the instruction is described in the new file codegen/src/isa/x64/inst/stack_switch.rs, given that the information in that file is platform-specific. However, I'm not sure if that's the right place for that file. I imagine that the content of that file will either be split up in the future, or move somewhere else to collect the layout information for all platforms.

When emitting code for the new instruction on x64, we need two temporary registers. Currently, this is done without regalloc (see my comment in emit.rs): If I try to add any early def to my StackSwitch MInst to get temporary registers, regalloc2 fails with TooManyLiveRegs. I suspect this is just a result of the rather unique set of constraints on StackSwitch. I'm curious if there's an existing issue I could mention, a more elegant workaround than what I'm currently doing, or if I should open a new issue regarding this in the regalloc2 repo.

In instructions.rs, I'm giving stack_switch all kinds of scary side effects (namely, .other_side_effects(), .can_load(), .can_store()), given that it continues execution elsewhere, where arbitrary things may happen.
I'm wondering what a more accurate description of the side effects would be. The .call() side effect seems related, but I was unsure if using that would have unintended consequences.

I've added a filetest for the new instruction but it turned out quite monstrous: It uses stack_switch in two functions where we create enough SSA values to exceed the number of available registers before the switch, and then use all of them after the stack switch. I've manually verified the generated assembly, but I'm wondering if people prefer having a shorter test case that makes it easier to verify the assembly for the stack switching itself.

My StackSwitch MInst takes its inputs as Gprs. These values all end up in registers during code emission anyway, but I was wondering if it would be more idiomatic to make the MInst use GprMem or RegMem, then deal with moving things into registers during emission.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:27):

frank-emrich has marked PR #9078 as ready for review.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:27):

frank-emrich requested abrown for a review on PR #9078.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:27):

frank-emrich requested wasmtime-compiler-reviewers for a review on PR #9078.

Wasmtime GitHub notifications bot (Aug 05 2024 at 15:33):

frank-emrich edited PR #9078:

This PR adds a new CLIF instruction for switching stacks. While the primary motivation is to support the Wasm stack switching proposal currently under development, the CLIF instruction here is lower level and thus intended to be useful for general-purpose stackful context switching (such as implementing coroutines, fibers, etc independently from the Wasm stack switching proposal).
This PR only adds support for the instruction on x64 Linux, but I'm planning to add support for more platforms over time. The design of the instruction should be sufficiently abstract to support all the other platforms.

While work is currently under way to implement Wasm stack switching in Wasmtime here and indeed uses the CLIF instruction introduced by this PR successfully, it seems worthwhile just upstreaming the CLIF instruction by itself. The proposal is not fully finalized yet, and this CLIF instruction seems useful on its own and independent from the remainder of the Wasm proposal.

Concretely, the CLIF instruction looks as follows:
out_payload = stack_switch(store_context_ptr, load_context_ptr, in_payload)
This causes the following to happen:

The current execution context is saved: The current frame pointer, stack pointer and PC after the stack_switch instruction are stored at store_context_ptr. All other registers are marked as clobbered and thus spilled by regalloc as needed.

We load a new (SP, FP, PC) triple from load_context_ptr, indicating the stack/context to switch to. We assume that we are either switching to a stack that was either previously switched away from by another stack_switch, or it's a newly initialized stack.

The value in_payload is passed over to the other stack. In other words, if the instruction above switches from some stack A to another stack B, then the return value of the stack_switch instruction previously executed on B will be in_payload.

Execution continues on stack B.

If we ever switch back to stack A, the value out_payload above (i.e., the return value of the stack_switch executed when leaving stack A) is the payload argument passed to the corresponding switch.

A few additional notes:

store_context_ptr and load_context_ptr can be seen as pointers to what is conceptually a three-element struct, containing SP, FP, PC.

The pointers store_context_ptr and load_context_ptr are allowed to be equal. In particular, in steps 1 and 2 above, we ensure to actually load all required data from load_context_ptr before storing to store_context_ptr.

As mentioned above, there are two cases in step 2: We either switch to code where a matching stack_switch was executed, or to a new stack

If we switch back to (right behind) a previous stack_switch instruction, then regalloc has spilled all subsequently needed SSA values for us, no need to manually restore any context besides SP, FP, PC.

If we are executing on a new stack, we assume that execution starts inside a stack-switch aware trampoline

Payloads are currently hard-coded to be a single, word-sized value. That may seem arbitrary, but it's simply what's needed for our current implementation of Wasm stack switching on top of this CLIF instruction. It could later be extended to more general cases, if necessary.

The stack switching implemented here is "one-shot": If you execute a stack_switch to switch from a stack A to a stack B, we can only switch back to A once (unless we subsequently execute another stack_switch on A again).
This is different from setjmp/longjmp, where we may store the context once using setjmp and then return to it multiple times using longjmp without needing to call setjmp again.

Wasmtime GitHub notifications bot (Aug 06 2024 at 15:54):

fitzgen requested fitzgen for a review on PR #9078.

Wasmtime GitHub notifications bot (Aug 07 2024 at 16:32):

fitzgen commented on PR #9078:

Super excited for this! Haven't taken a look at the actual code yet, but here are some answers to the questions in your comment above.

However, the emitted code is only valid for x64 Linux, not x64 Windows. ... What's the cleanest way to simply refuse lowering on x64 Windows for the time being?

Will the lowering work for all posix right now or only Linux? Will macos work, for example? We should be precise about the correctness condition here.

Cranelift doesn't generally care what OS it is targeting beyond calling conventions and a few other ABI details here and there, and AFAIK we've never needed OS-specific lowering rules before, so this is sort of untrodded ground.

A TargetIsa does have a triple method, which returns a target_lexicon::Triple, which in turn has an operating_system member. You could try plumbing that stuff through to an external partial constructor for use in ISLE if-lets.

The layout of the data read and written by the instruction is described in the new file codegen/src/isa/x64/inst/stack_switch.rs, given that the information in that file is platform-specific. However, I'm not sure if that's the right place for that file. I imagine that the content of that file will either be split up in the future, or move somewhere else to collect the layout information for all platforms.

I think describing the layout of that data would be best done in the documentation for the new instruction itself.

I haven't looked at the actual code in this PR, but I'd expect that the only platform-specific bits would be pointer size, and otherwise we are always saving SP, optionally FP when frame pointers are enabled, and PC. Is that assumption incorrect? Are we, or will we be, saving more/fewer values on different platforms?

I also would expect that we wouldn't actually need to explicitly define all the details of this data and its layout. I'd imagine we would only say that it has a given size and alignment, and is otherwise only valid to be manipulated via the stack_switch instruction. That is, it should be opaque to the host.

But maybe something like stack walking means that the runtime needs to be able to peek inside this data, and we can't keep it opaque to the host?

When emitting code for the new instruction on x64, we need two temporary registers. Currently, this is done without regalloc (see my comment in emit.rs): If I try to add any early def to my StackSwitch MInst to get temporary registers, regalloc2 fails with TooManyLiveRegs. I suspect this is just a result of the rather unique set of constraints on StackSwitch. I'm curious if there's an existing issue I could mention, a more elegant workaround than what I'm currently doing, or if I should open a new issue regarding this in the regalloc2 repo.

I think we've had similar-ish issues opened in the past for regalloc2. None the less, I'd recommend opening an issue there and we can dedupe it if necessary. That will also get eyes on the problem from folks who are more familiar with our register allocator than I am.

In instructions.rs, I'm giving stack_switch all kinds of scary side effects (namely, .other_side_effects(), .can_load(), .can_store()), given that it continues execution elsewhere, where arbitrary things may happen.
I'm wondering what a more accurate description of the side effects would be. The .call() side effect seems related, but I was unsure if using that would have unintended consequences.

I think we also want to have the call() side effect because, for these instructions, we will want to do call-related things like

clear all memory aliasing info, assuming that external code we "call" could alias anything

treat the instruction as a GC safepoint, spill GC refs to the stack, and emit a stack map for this location

etc...

I've added a filetest for the new instruction but it turned out quite monstrous: It uses stack_switch in two functions where we create enough SSA values to exceed the number of available registers before the switch, and then use all of them after the stack switch. I've manually verified the generated assembly, but I'm wondering if people prefer having a shorter test case that makes it easier to verify the assembly for the stack switching itself.

I don't think having a big filetest is a problem in itself per se, but I think we should also have a very basic filetest that is effectively just the stack_switch instruction and its disassembly. Basically, a function that takes all the stack_switch operands as arguments and returns the out_payload, and is just a single basic block with a single stack_switch instruction. This gives us a smoke test for the basic lowering and its disassembly.

My StackSwitch MInst takes its inputs as Gprs. These values all end up in registers during code emission anyway, but I was wondering if it would be more idiomatic to make the MInst use GprMem or RegMem, then deal with moving things into registers during emission.

Things like GprMem are useful for x64 instructions like add where one of the operands can naturally be either in memory or a register. When we are lowering an add with a memory operand, we don't force the memory into a register during emission, we emit an instruction that operates directly on the memory. It wouldn't make sense to take a GprMem operand for add if the x64 instruction(s) we emit can't operate directly on memory.

So for the x64 MInst.StackSwitch instruction, my questions would be:

Can the instruction sequence it emits operate directly on memory, or must they have the operand in a register?

Is the emitted instruction sequence accessing the value once or multiple times?

If the answer to these questions are "can operate directly on memory" and "accessing the value once" then I'd say a GprMem makes sense. Otherwise, I think a Gpr is the way to go.

FWIW, it is also fine to start with just Gpr operands and then investigate whether it makes sense to upgrade to a GprMem in follow up PRs.

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen submitted PR review:

Looking good! I think that, with the guard rails where we only lower this instruction on target OSes for which it will work, and the various nitpicks below addressed, this will be good to merge.

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen submitted PR review:

Looking good! I think that, with the guard rails where we only lower this instruction on target OSes for which it will work, and the various nitpicks below addressed, this will be good to merge.

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen created PR review comment:

Somewhere in here we should mention the one-shottedness of this instruction, and how resuming a context twice can result in UB due to spilled values being overwritten by the first resumption, etc...

It might even be worth naming this instruction one_shot_stack_switch.

I think we should also clarify that, while this instruction performs loads and stores, those memory operations are always assumed to be aligned, non-trapping, and otherwise valid. This instruction performs no validation itself. It is as if these memory operations had MemFlags::trusted() attached to them. Therefore, it is the user's responsibility to ensure that these assumptions are upheld.

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen created PR review comment:

Is there any reason rdi was chosen in particular? Not that it is a bad choice or anything, I'm just wondering if there are constraints on the register we choose (or even if we have to choose a fixed register?) that we should document here in case we ever want to change it.

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen created PR review comment:

This is a fantastic hack

Wasmtime GitHub notifications bot (Aug 07 2024 at 17:37):

fitzgen created PR review comment:

            Err(PccError::UnimplementedInst)

Wasmtime GitHub notifications bot (Aug 07 2024 at 19:18):

frank-emrich commented on PR #9078:

Thanks for your answers and looking at the code! I'll look into the things you suggested.

In the meantime, some more thoughts regarding the layout of the contexts: On most platforms, the context can indeed look like this:
pub struct ControlContext {
    pub stack_pointer: *mut u8,
    pub frame_pointer: *mut u8,
    pub instruction_pointer: *mut u8,
}
However, there are two reasons for why the layout may be different somewhere:

1. Additional info needed on Windows

On Windows, I think we would have to update data inside the "Thread Information Block" (the stuff we briefly talked about with Ryan Hunt at the Pittsburgh CG meeting). I haven't looked too deeply into it, but it looks like it contains pointers to the beginning and end of the currently active stack, which we would need to update when switching. That means that I suspect we would have to add these to the ControlContext on Windows.

In any case, it looks like we would need Windows-specific lowering rules for x64, just to emit the logic for updating the Thread Information Block.

2. Frame pointer walking

There's another issue that I've briefly mentioned in a comment in the new file stack_switch.rs, but haven't provided much detail on (mostly because it requires dumping a bunch of extra info on you, sorry!):

I made sure that the ControlContext above is laid out in a way so that it can appear inside a frame pointer chain. This is so that for stack switching approaches where you do have parent-child relationships between stacks, these are reflected in the stack trace you get just from frame pointer walking, by creating a single continuous chain crossing stacks. In other words, tools like lldb show you the frames of the active stack, but also the frames in its parent.
(There's been a recent development in the Wasm stack switching subgroup making it very likely that at least for Wasm stack switching, we will end up with a proposal where there are parent-child relationships between all the active stacks.)

Concretely, in our implementation of Wasm stack switching, this is achieved like this:
Imagine we have allocated memory from 0x1000 to 0xB000 which we want to use for a new stack S. We store a ControlContext at the very top of that allocation at 0xAFE8, with the actual stack space starting below it. When starting execution in the new stack, we will then have RSP = 0xAFE0 in the trampoline that kicks off execution inside stack S (e.g., the trampoline calls the Wasm function that we actually want to run inside that stack).

The main idea is now that we make sure that the ControlContext at 0xAFE8 always contains the information about the parent of stack S, meaning that whenever we switch from some other stack X to S, we write information about stack X (which then becomes the parent of S) into the ControlContext at 0xAFE8.

The last missing ingredient for creating the frame pointer chain is that when we are running the trampoline that kicks of execution inside stack S, we set RBP to 0xAFF0, which is the address of the frame pointer field inside the control context.
This means that while the trampoline is the bottom-most stack frame inside S, when passing the trampoline while walking the frame pointer chain we actually get to where X stack_switch-ed to S: At 0xAFF0 we stored the frame pointer of X, and the layout of ControlContext puts the PC saved for X right next to it, exactly where lldb and friends would expect it.

Long story short: It's quite neat to make sure that ControlContext has just the right layout to enable this trick. It allows users of the stack_switch instruction to get frame pointer walking across stacks entirely for free "simply" by carefully laying out where the contexts are actually stored and setting RBP correctly. If they don't care or implement a stack switching approach where there are no parents, nothing is lost. But to make this possible, the layout of the ControlContext suddenly depends on not just the ISA, but whatever your platform dictates about stack layout (in particular, where to find the return address relative to the frame pointer).

Luckily, the layout above (i.e., frame pointer right next to PC) works on all platforms supported by Cranelift, except s390x: On the latter, there's an offset of 14 words between where the FP and PC are stored.
So on that platform my plan was to make the ControlContext look like this:
pub struct S390XControlContext {
    pub frame_pointer: *mut u8,
    pub stack_pointer: *mut u8,
    padding: [*mut u8; 12],
    pub instruction_pointer: *mut u8,
}
(Or just give up on frame pointer walking there)

So at least it seems that within the same ISA, all OSes sufficiently agree on the stack layout so that the frame pointer walking causes no extra hassle.

Documenting the layout

I also would expect that we wouldn't actually need to explicitly define all the details of this data and its layout. I'd imagine we would only say that it has a given size and alignment, and is otherwise only valid to be manipulated via the stack_switch instruction. That is, it should be opaque to the host.

But maybe something like stack walking means that the runtime needs to be able to peek inside this data, and we can't keep it opaque to the host?

Yes, the layout of the ControlContext can be kept mostly opaque, except for stack creation. As long as you only create and consume these contexts with stack_switch, you really don't need to know anything about them besides their size and (boring) alignment requirements.

The only situation where you do need to know about the layout is when initializing a new stack: You need to create a corresponding ControlContext that when stack_switch-ed to executes the right trampoline on the new stack. That's why my idea was basically to leave details about the layout of the context out of the documentation of the instruction in instruction.rs, but at least have some platform-specific documentation elsewhere.

Alternatively, we could add a stack_init instruction, or instructions that act like getters and setters on the context.

Wasmtime GitHub notifications bot (Aug 09 2024 at 13:05):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:26):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:35):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:35):

frank-emrich created PR review comment:

Ha, good question! I've added a few comments now about why this needs to be a fixed register: We only have a single payload and pass it in a register when doing a stack_switch to another stack_switch on another stack. But in order for that to work, the two involved stack_switch instructions need to agree on the register used for this, similar to a calling convention.

(In a perfect world I would extend the payload handling to arbitrarily many values of different types and make sure that the "calling convention" for that is exactly the same as for functions. If you do that, you don't have to reshuffle any arguments when switching to a new stack where the payloads given to the stack_switch become the arguments of the function to run on the new stack. But that would drastically increase the complexity and it's questionable how big the performance benefit would be)

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:36):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:36):

frank-emrich created PR review comment:

I've extended the documentation of this instruction now. Still not specifying the actual layout itself, but mentioned the one-shottedness and the fact that the instruction does not check the pointers or data.

Wasmtime GitHub notifications bot (Aug 09 2024 at 14:52):

frank-emrich commented on PR #9078:

I've implemented the restriction to only lower stack_switch on Linux now: Strictly speaking, it's each individual OS rather than the ISA that may require additional things needing to be done when switching stacks (see my remarks on Windows in my huge previous comment). I do believe that my current implementation would actually work on x64 macOS, but haven't tested it. So I would just stay on the safe side and make the support OS-dependent (instead of, say, allowing anything Unix-like), with the only allowed one being Linux for now.

I've implemented this using a partial constructor on_linux. The controversial bit may be that I'm now copying the Triple into every machinst::Callee. Is that acceptable?

Wasmtime GitHub notifications bot (Aug 09 2024 at 17:59):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Aug 09 2024 at 17:59):

fitzgen created PR review comment:

Gotcha, that makes sense. Thanks!

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:00):

fitzgen commented on PR #9078:

only allowed one being Linux for now.

SGTM

The controversial bit may be that I'm now copying the Triple into every machinst::Callee. Is that acceptable?

It doesn't seem ideal. Do we not already have access to a &dyn TargetIsa in the LowerContext?

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:04):

fitzgen commented on PR #9078:

Do we not already have access to a &dyn TargetIsa in the LowerContext?

The IsleContext contains a B: LowerBackend which for x64 is a X64Backend which contains a Triple already, and all the other LowerBackend implementations also contain a Triple. I think we could add a fn triple(&self) -> &Triple trait method to LowerBackend and then reuse that in the IsleContext's implementation of the partial constructor.

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:42):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:42):

bjorn3 created PR review comment:

There is no other place in Cranelift that looks at the target OS when determining how to lower clif ir. Even TLS handling chooses the right variant for the target OS using a codegen flag rather than by looking at the OS in the triple. Using the right calling conventions is done by the producer of the clif ir.

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:49):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:49):

cfallin created PR review comment:

I think I agree here: philosophically, this is if not quite literally calling-convention related, certainly like an OS-interface detail that should be "baked into" the CLIF as CLIF is ordinarily explicit about such details. Can we check the triple in the Wasm translation (i.e., the wasmtime FuncEnvironment hook called by cranelift-wasm, or wherever else this instruction is generated) and fail if on the wrong platform?

This is also a little more future-looking in the sense that there may be Wasmtime details at some future point related to stack-switching (e.g., what if we add our own stack-protection mitigations or have some custom kind of stack-growth scheme or ...) -- we wouldn't want to hardcode that into Cranelift lowering in the same way OS details are here.

Wasmtime GitHub notifications bot (Aug 09 2024 at 18:49):

cfallin edited PR review comment.

Wasmtime GitHub notifications bot (Aug 12 2024 at 17:43):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 12 2024 at 17:43):

frank-emrich created PR review comment:

Just to recap the context sprinkled through this PR: The reason why we need to do something platform/OS-specific here eventually is that Windows requires us to update parts of the Thread Information Block when switching stacks. Unfortunately, what exactly needs to be done is undocumented, so I've not done it, yet. The "plain" stack switching I implemented in this PR was supposed to work only on Linux, but from what I can tell should work on x64 macOS, too.

@bjorn3:

Following what you said, I guess one approach would be to make the lowering of stack_switch more similar to that of tls_value. I'd imagine this would look as follows:

Create a StackSwitchMode enum with Default and UpdateTib variants.

Encode a value of that type in shared_settings::Flags and also add it as a field to the StackSwitch MInst.

When emitting code for StackSwitch, check the value of the StackSwitchMode flag and act accordingly.

I'm also happy to have more than one MInst for stack switching, and lower the stack_switch CLIF instruction to one of them, based on the value of StackSwitchMode. That would mirror tls_value more closely, but I'm inclined to avoid that: The core stack switching code is always the same, we just sometimes need to emit some extra code on top of that.

Finally, what I described above would be the medium-term solution once I've implemented Windows support. For the time being I would just add a None variant to StackSwitchMode, use that on Windows, and panic if we see it when emitting code for StackSwitch.

@cfallin:

Can we check the triple in the Wasm translation (i.e., the wasmtime FuncEnvironment hook called by cranelift-wasm, or wherever else this instruction is generated) and fail if on the wrong platform?

Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?

In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the stack_switch during the Wasm translation?

I fully agree that for things like stack growing there's a good chance that in the future, we'd need some more customization of the generated code (e.g., customize stack probing, function preludes, ...).

Wasmtime GitHub notifications bot (Aug 12 2024 at 17:44):

frank-emrich edited PR review comment.

Wasmtime GitHub notifications bot (Aug 12 2024 at 17:57):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Aug 12 2024 at 17:57):

cfallin created PR review comment:

Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?

Both, I think -- basically, any sort of difference of behavior on that level should be reified in the CLIF, rather than implicit; that's how we've handled things like TLS and other platform dependencies.

In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the stack_switch during the Wasm translation?

One or the other, depending on what is actually needed? I don't have enough context to actually specify the full design; that's something we can discuss further; only that we should make it explicit somehow. If one platform requires a superset of the work that another platform does, factoring out the common bit as one instruction and adding more logic at the CLIF level seems reasonable. On the other hand it's totally reasonable to have stack_switch_windows and stack_switch_sysv instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.

Wasmtime GitHub notifications bot (Aug 12 2024 at 18:15):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 12 2024 at 18:15):

frank-emrich created PR review comment:

On the other hand it's totally reasonable to have stack_switch_windows and stack_switch_sysv instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.

Ah, I think that's where my confusion came from: For TLS, there is a single CLIF instruction, which is then lowered one of several per-platform MInsts (but based on a flag in the backend, not some ad hoc OS check like I did). I'm happy to do it like that, which would be similar to what I described in my response to @bjorn3.

I'd prefer that over moving the TIB update out of the stack_switch (CLIF) instruction itself.

Wasmtime GitHub notifications bot (Aug 14 2024 at 12:42):

frank-emrich requested alexcrichton for a review on PR #9078.

Wasmtime GitHub notifications bot (Aug 14 2024 at 12:42):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 14 2024 at 12:42):

frank-emrich requested wasmtime-core-reviewers for a review on PR #9078.

Wasmtime GitHub notifications bot (Aug 14 2024 at 12:48):

frank-emrich commented on PR #9078:

I've reworked the platform dependence logic based now, inspired by what happens for TLS:
There's a new enum StackSwitchModel with values None, Basic, and UpdateWindowsTib. A value of that is saved in the backend Flags.
This value is then used when lowering a stack_switch CLIF instruction: If None, we cannot lower them. If Basic, we lower to stack_switch_basic. In the future, it will be the case that UpdateWindowsTib causes lowering to a new dedicated MInst (like StackSwitchUpdateWindowsTib), but currently this will behave like None (i.e., we fail to lower).

Wasmtime GitHub notifications bot (Aug 14 2024 at 12:53):

frank-emrich commented on PR #9078:

@alexcrichton For some reason the automated reviewer assignment was triggered again, maybe because this PR is now touching a non-Cranelift file.

Wasmtime GitHub notifications bot (Aug 20 2024 at 19:45):

alexcrichton commented on PR #9078:

Do others have more thoughts on this? This all looks reaosnable enough to me to land, but I'd want to be sure to run by others too.

Wasmtime GitHub notifications bot (Aug 20 2024 at 20:28):

fitzgen submitted PR review:

LGTM with one final nitpick below

Wasmtime GitHub notifications bot (Aug 20 2024 at 20:28):

fitzgen created PR review comment:

Rather than making this an extractor from a type, can we make it a partial constructor?

Then the use would look like
(lower (stack_switch store_context_ptr load_context_ptr in_payload)
       (if-let (stack_switch_model) (StackSwitchModel.Basic))
       ...)

Wasmtime GitHub notifications bot (Aug 21 2024 at 10:49):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 21 2024 at 10:53):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 21 2024 at 10:53):

frank-emrich created PR review comment:

Done, but you meant (if-let (StackSwitchModel.Basic) (stack_switch_model)), right?

Is there a particular reason for making it a partial constructor? Following what's happening for TLS I gave my StackSwitchModel enum a None variant to indicate that no model was set, but that means that the stack_switch_model constructor can be total, or am I missing something?

Wasmtime GitHub notifications bot (Aug 21 2024 at 20:59):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Aug 21 2024 at 20:59):

fitzgen created PR review comment:

I might be wrong, but I had thought that you can only use partial constructors in if-lets. Might be worth removing the None variant from StackSwitchModel, if so.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:36):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:37):

fitzgen commented on PR #9078:

I think this is in a good enough place that we can land it and then continue with any further improvements in follow ups. Thanks @frank-emrich!

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:39):

frank-emrich updated PR #9078.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:39):

frank-emrich submitted PR review.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:39):

frank-emrich created PR review comment:

I tried it, and turning stack_switch_model into a total constructor seems to just work.
I don't understand the magic mechanism for configuring what goes into the backend Flags in detail, but from what I can tell it does not allow you to store something logically equivalent to Option<StackSwitchModel> in there, so keeping the None variant in StackSwitchModel itself seems like the way to go.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:43):

frank-emrich commented on PR #9078:

Oops, it looks like I responded to your comment and added a new commit in parallel to you approving things. But I think it should be uncontroversial, I kept None and made stack_switch_model total.

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:45):

fitzgen commented on PR #9078:

Yeah, LGTM

Wasmtime GitHub notifications bot (Aug 21 2024 at 21:52):

fitzgen merged PR #9078.

Last updated: Apr 18 2025 at 05:03 UTC