frank-emrich opened PR #9078 from frank-emrich:stack_switch
to bytecodealliance:main
:
This PR adds a new CLIF instruction for switching stacks. While the primary motivation is to support the Wasm stack switching proposal currently under development, the CLIF instruction here is lower level and thus intended to be useful for general-purpose stackful context switching (such as implementing coroutines, fibers, etc independently from the Wasm stack switching proposal).
This PR only adds support for the instruction on x64 Linux, but I'm planning to add support for more platforms over time. The design of the instruction should be sufficiently abstract to support all the other platforms.While work is currently under way to implement Wasm stack switching here and indeed uses the CLIF instruction introduced by this PR successfully, it seems worthwhile just upstreaming the CLIF instruction by itself. The proposal is not fully finalized yet, and this CLIF instruction seems useful on its own and independent from the remainder of the Wasm proposal.
Concretely, the CLIF instruction looks as follows:
out_payload = stack_switch(store_context_ptr, load_context_ptr, in_payload)
This causes the following to happen:
- The current execution context is saved: The current frame pointer, stack pointer and PC after the
stack_switch
instruction are stored atstore_context_ptr
. All other registers are marked as clobbered and thus spilled by regalloc as needed.- We load a new
(SP, FP, PC)
triple fromload_context_ptr
, indicating the stack/context to switch to. We assume that we are either switching to a stack that was either previously switched away from by anotherstack_switch
, or it's a newly initialized stack.- The value
in_payload
is passed over to the other stack. In other words, if the instruction above switches from some stack A to another stack B, then the return value of thestack_switch
instruction previously executed on B will bein_payload
.- Execution continues on stack B.
- If we ever switch back to stack A, the value
out_payload
above (i.e., the return value of thestack_switch
executed when leaving stackA
) is the payload argument passed to the corresponding switch.A few additional notes:
store_context_ptr
andload_context_ptr
can be seen as pointers to what is conceptually a three-element struct, containing SP, FP, PC.- The pointers
store_context_ptr
andload_context_ptr
are allowed to be equal. In particular, in steps 1 and 2 above, we ensure to actually load all required data fromload_context_ptr
before storing tostore_context_ptr
.- As mentioned above, there are two cases in step 2: We either switch to code where a matching
stack_switch
was executed, or to a new stack
- If we switch back to (right behind) a previous
stack_switch
instruction, then regalloc has spilled all subsequently needed SSA values for us, no need to manually restore any context besides SP, FP, PC.- If we are executing on a new stack, we assume that execution starts inside a stack-switch aware trampoline
- Payloads are currently hard-coded to be a single, word-sized value. That may seem arbitrary, but it's simply what's needed for our current implementation of Wasm stack switching on top of this CLIF instruction. It could later be extended to more general cases, if necessary.
- The stack switching implemented here is "one-shot": If you execute a
stack_switch
to switch from a stack A to a stack B, we can only switch back to A once (unless we subsequently execute anotherstack_switch
on A again).
This is different fromsetjmp
/longjmp
, where we may store the context once usingsetjmp
and then return to it multiple times usinglongjmp
without needing to callsetjmp
again.
frank-emrich updated PR #9078.
frank-emrich commented on PR #9078:
I'm marking this as "ready for review" now, but I have a few questions about things that will probably require additional fixes before this is actually ready:
Currently, the instruction is lowered by very simple rules in
codegen/src/isa/x64/lower.isle
. However, the emitted code is only valid for x64 Linux, not x64 Windows.
This means that I should probably do something similar to function calls, where my instruction is lowered using an externalgen_stack_switch
function. What's the cleanest way to simply refuse lowering on x64 Windows for the time being?The layout of the data read and written by the instruction is described in the new file
codegen/src/isa/x64/inst/stack_switch.rs
, given that the information in that file is platform-specific. However, I'm not sure if that's the right place for that file. I imagine that the content of that file will either be split up in the future, or move somewhere else to collect the layout information for all platforms.- When emitting code for the new instruction on x64, we need two temporary registers. Currently, this is done without regalloc (see my comment in
emit.rs
): If I try to add any early def to myStackSwitch
MInst
to get temporary registers,regalloc2
fails withTooManyLiveRegs
. I suspect this is just a result of the rather unique set of constraints onStackSwitch
. I'm curious if there's an existing issue I could mention, a more elegant workaround than what I'm currently doing, or if I should open a new issue regarding this in theregalloc2
repo.In
instructions.rs
, I'm givingstack_switch
all kinds of scary side effects (namely,.other_side_effects()
,.can_load()
,.can_store()
), given that it continues execution elsewhere, where arbitrary things may happen.
I'm wondering what a more accurate description of the side effects would be. The.call()
side effect seems related, but I was unsure if using that would have unintended consequences.I've added a filetest for the new instruction but it turned out quite monstrous: It uses
stack_switch
in two functions where we create enough SSA values to exceed the number of available registers before the switch, and then use all of them after the stack switch. I've manually verified the generated assembly, but I'm wondering if people prefer having a shorter test case that makes it easier to verify the assembly for the stack switching itself.- My
StackSwitch
MInst
takes its inputs asGpr
s. These values all end up in registers during code emission anyway, but I was wondering if it would be more idiomatic to make theMInst
useGprMem
orRegMem
, then deal with moving things into registers during emission.
frank-emrich has marked PR #9078 as ready for review.
frank-emrich requested abrown for a review on PR #9078.
frank-emrich requested wasmtime-compiler-reviewers for a review on PR #9078.
frank-emrich edited PR #9078:
This PR adds a new CLIF instruction for switching stacks. While the primary motivation is to support the Wasm stack switching proposal currently under development, the CLIF instruction here is lower level and thus intended to be useful for general-purpose stackful context switching (such as implementing coroutines, fibers, etc independently from the Wasm stack switching proposal).
This PR only adds support for the instruction on x64 Linux, but I'm planning to add support for more platforms over time. The design of the instruction should be sufficiently abstract to support all the other platforms.While work is currently under way to implement Wasm stack switching in Wasmtime here and indeed uses the CLIF instruction introduced by this PR successfully, it seems worthwhile just upstreaming the CLIF instruction by itself. The proposal is not fully finalized yet, and this CLIF instruction seems useful on its own and independent from the remainder of the Wasm proposal.
Concretely, the CLIF instruction looks as follows:
out_payload = stack_switch(store_context_ptr, load_context_ptr, in_payload)
This causes the following to happen:
- The current execution context is saved: The current frame pointer, stack pointer and PC after the
stack_switch
instruction are stored atstore_context_ptr
. All other registers are marked as clobbered and thus spilled by regalloc as needed.- We load a new
(SP, FP, PC)
triple fromload_context_ptr
, indicating the stack/context to switch to. We assume that we are either switching to a stack that was either previously switched away from by anotherstack_switch
, or it's a newly initialized stack.- The value
in_payload
is passed over to the other stack. In other words, if the instruction above switches from some stack A to another stack B, then the return value of thestack_switch
instruction previously executed on B will bein_payload
.- Execution continues on stack B.
- If we ever switch back to stack A, the value
out_payload
above (i.e., the return value of thestack_switch
executed when leaving stackA
) is the payload argument passed to the corresponding switch.A few additional notes:
store_context_ptr
andload_context_ptr
can be seen as pointers to what is conceptually a three-element struct, containing SP, FP, PC.- The pointers
store_context_ptr
andload_context_ptr
are allowed to be equal. In particular, in steps 1 and 2 above, we ensure to actually load all required data fromload_context_ptr
before storing tostore_context_ptr
.- As mentioned above, there are two cases in step 2: We either switch to code where a matching
stack_switch
was executed, or to a new stack
- If we switch back to (right behind) a previous
stack_switch
instruction, then regalloc has spilled all subsequently needed SSA values for us, no need to manually restore any context besides SP, FP, PC.- If we are executing on a new stack, we assume that execution starts inside a stack-switch aware trampoline
- Payloads are currently hard-coded to be a single, word-sized value. That may seem arbitrary, but it's simply what's needed for our current implementation of Wasm stack switching on top of this CLIF instruction. It could later be extended to more general cases, if necessary.
- The stack switching implemented here is "one-shot": If you execute a
stack_switch
to switch from a stack A to a stack B, we can only switch back to A once (unless we subsequently execute anotherstack_switch
on A again).
This is different fromsetjmp
/longjmp
, where we may store the context once usingsetjmp
and then return to it multiple times usinglongjmp
without needing to callsetjmp
again.
fitzgen requested fitzgen for a review on PR #9078.
fitzgen commented on PR #9078:
Super excited for this! Haven't taken a look at the actual code yet, but here are some answers to the questions in your comment above.
However, the emitted code is only valid for x64 Linux, not x64 Windows. ... What's the cleanest way to simply refuse lowering on x64 Windows for the time being?
Will the lowering work for all posix right now or only Linux? Will macos work, for example? We should be precise about the correctness condition here.
Cranelift doesn't generally care what OS it is targeting beyond calling conventions and a few other ABI details here and there, and AFAIK we've never needed OS-specific lowering rules before, so this is sort of untrodded ground.
A
TargetIsa
does have atriple
method, which returns atarget_lexicon::Triple
, which in turn has anoperating_system
member. You could try plumbing that stuff through to an external partial constructor for use in ISLEif-let
s.The layout of the data read and written by the instruction is described in the new file
codegen/src/isa/x64/inst/stack_switch.rs
, given that the information in that file is platform-specific. However, I'm not sure if that's the right place for that file. I imagine that the content of that file will either be split up in the future, or move somewhere else to collect the layout information for all platforms.I think describing the layout of that data would be best done in the documentation for the new instruction itself.
I haven't looked at the actual code in this PR, but I'd expect that the only platform-specific bits would be pointer size, and otherwise we are always saving SP, optionally FP when frame pointers are enabled, and PC. Is that assumption incorrect? Are we, or will we be, saving more/fewer values on different platforms?
I also would expect that we wouldn't actually need to explicitly define all the details of this data and its layout. I'd imagine we would only say that it has a given size and alignment, and is otherwise only valid to be manipulated via the
stack_switch
instruction. That is, it should be opaque to the host.But maybe something like stack walking means that the runtime needs to be able to peek inside this data, and we can't keep it opaque to the host?
When emitting code for the new instruction on x64, we need two temporary registers. Currently, this is done without regalloc (see my comment in
emit.rs
): If I try to add any early def to myStackSwitch
MInst
to get temporary registers,regalloc2
fails withTooManyLiveRegs
. I suspect this is just a result of the rather unique set of constraints onStackSwitch
. I'm curious if there's an existing issue I could mention, a more elegant workaround than what I'm currently doing, or if I should open a new issue regarding this in theregalloc2
repo.I think we've had similar-ish issues opened in the past for
regalloc2
. None the less, I'd recommend opening an issue there and we can dedupe it if necessary. That will also get eyes on the problem from folks who are more familiar with our register allocator than I am.In
instructions.rs
, I'm givingstack_switch
all kinds of scary side effects (namely,.other_side_effects()
,.can_load()
,.can_store()
), given that it continues execution elsewhere, where arbitrary things may happen.
I'm wondering what a more accurate description of the side effects would be. The.call()
side effect seems related, but I was unsure if using that would have unintended consequences.I think we also want to have the
call()
side effect because, for these instructions, we will want to docall
-related things like
- clear all memory aliasing info, assuming that external code we "call" could alias anything
- treat the instruction as a GC safepoint, spill GC refs to the stack, and emit a stack map for this location
- etc...
I've added a filetest for the new instruction but it turned out quite monstrous: It uses
stack_switch
in two functions where we create enough SSA values to exceed the number of available registers before the switch, and then use all of them after the stack switch. I've manually verified the generated assembly, but I'm wondering if people prefer having a shorter test case that makes it easier to verify the assembly for the stack switching itself.I don't think having a big filetest is a problem in itself per se, but I think we should also have a very basic filetest that is effectively just the
stack_switch
instruction and its disassembly. Basically, a function that takes all thestack_switch
operands as arguments and returns theout_payload
, and is just a single basic block with a singlestack_switch
instruction. This gives us a smoke test for the basic lowering and its disassembly.My
StackSwitch
MInst
takes its inputs asGpr
s. These values all end up in registers during code emission anyway, but I was wondering if it would be more idiomatic to make theMInst
useGprMem
orRegMem
, then deal with moving things into registers during emission.Things like
GprMem
are useful for x64 instructions likeadd
where one of the operands can naturally be either in memory or a register. When we are lowering anadd
with a memory operand, we don't force the memory into a register during emission, we emit an instruction that operates directly on the memory. It wouldn't make sense to take aGprMem
operand foradd
if the x64 instruction(s) we emit can't operate directly on memory.So for the x64
MInst.StackSwitch
instruction, my questions would be:
- Can the instruction sequence it emits operate directly on memory, or must they have the operand in a register?
- Is the emitted instruction sequence accessing the value once or multiple times?
If the answer to these questions are "can operate directly on memory" and "accessing the value once" then I'd say a
GprMem
makes sense. Otherwise, I think aGpr
is the way to go.FWIW, it is also fine to start with just
Gpr
operands and then investigate whether it makes sense to upgrade to aGprMem
in follow up PRs.
fitzgen submitted PR review:
Looking good! I think that, with the guard rails where we only lower this instruction on target OSes for which it will work, and the various nitpicks below addressed, this will be good to merge.
fitzgen submitted PR review:
Looking good! I think that, with the guard rails where we only lower this instruction on target OSes for which it will work, and the various nitpicks below addressed, this will be good to merge.
fitzgen created PR review comment:
Somewhere in here we should mention the one-shottedness of this instruction, and how resuming a context twice can result in UB due to spilled values being overwritten by the first resumption, etc...
It might even be worth naming this instruction
one_shot_stack_switch
.I think we should also clarify that, while this instruction performs loads and stores, those memory operations are always assumed to be aligned, non-trapping, and otherwise valid. This instruction performs no validation itself. It is as if these memory operations had
MemFlags::trusted()
attached to them. Therefore, it is the user's responsibility to ensure that these assumptions are upheld.
fitzgen created PR review comment:
Is there any reason
rdi
was chosen in particular? Not that it is a bad choice or anything, I'm just wondering if there are constraints on the register we choose (or even if we have to choose a fixed register?) that we should document here in case we ever want to change it.
fitzgen created PR review comment:
This is a fantastic hack
fitzgen created PR review comment:
Err(PccError::UnimplementedInst)
frank-emrich commented on PR #9078:
Thanks for your answers and looking at the code! I'll look into the things you suggested.
In the meantime, some more thoughts regarding the layout of the contexts: On most platforms, the context can indeed look like this:
pub struct ControlContext { pub stack_pointer: *mut u8, pub frame_pointer: *mut u8, pub instruction_pointer: *mut u8, }
However, there are two reasons for why the layout may be different somewhere:
1. Additional info needed on Windows
On Windows, I think we would have to update data inside the "Thread Information Block" (the stuff we briefly talked about with Ryan Hunt at the Pittsburgh CG meeting). I haven't looked too deeply into it, but it looks like it contains pointers to the beginning and end of the currently active stack, which we would need to update when switching. That means that I suspect we would have to add these to the
ControlContext
on Windows.In any case, it looks like we would need Windows-specific lowering rules for
x64
, just to emit the logic for updating the Thread Information Block.2. Frame pointer walking
There's another issue that I've briefly mentioned in a comment in the new file
stack_switch.rs
, but haven't provided much detail on (mostly because it requires dumping a bunch of extra info on you, sorry!):I made sure that the
ControlContext
above is laid out in a way so that it can appear inside a frame pointer chain. This is so that for stack switching approaches where you do have parent-child relationships between stacks, these are reflected in the stack trace you get just from frame pointer walking, by creating a single continuous chain crossing stacks. In other words, tools likelldb
show you the frames of the active stack, but also the frames in its parent.
(There's been a recent development in the Wasm stack switching subgroup making it very likely that at least for Wasm stack switching, we will end up with a proposal where there are parent-child relationships between all the active stacks.)Concretely, in our implementation of Wasm stack switching, this is achieved like this:
Imagine we have allocated memory from0x1000
to0xB000
which we want to use for a new stackS
. We store aControlContext
at the very top of that allocation at0xAFE8
, with the actual stack space starting below it. When starting execution in the new stack, we will then haveRSP = 0xAFE0
in the trampoline that kicks off execution inside stackS
(e.g., the trampoline calls the Wasm function that we actually want to run inside that stack).The main idea is now that we make sure that the
ControlContext
at0xAFE8
always contains the information about the parent of stackS
, meaning that whenever we switch from some other stackX
toS
, we write information about stackX
(which then becomes the parent ofS
) into theControlContext
at0xAFE8
.The last missing ingredient for creating the frame pointer chain is that when we are running the trampoline that kicks of execution inside stack S, we set
RBP
to0xAFF0
, which is the address of the frame pointer field inside the control context.
This means that while the trampoline is the bottom-most stack frame insideS
, when passing the trampoline while walking the frame pointer chain we actually get to whereX
stack_switch
-ed toS
: At0xAFF0
we stored the frame pointer ofX
, and the layout ofControlContext
puts the PC saved forX
right next to it, exactly wherelldb
and friends would expect it.Long story short: It's quite neat to make sure that
ControlContext
has just the right layout to enable this trick. It allows users of thestack_switch
instruction to get frame pointer walking across stacks entirely for free "simply" by carefully laying out where the contexts are actually stored and settingRBP
correctly. If they don't care or implement a stack switching approach where there are no parents, nothing is lost. But to make this possible, the layout of theControlContext
suddenly depends on not just the ISA, but whatever your platform dictates about stack layout (in particular, where to find the return address relative to the frame pointer).Luckily, the layout above (i.e., frame pointer right next to PC) works on all platforms supported by Cranelift, except s390x: On the latter, there's an offset of 14 words between where the FP and PC are stored.
So on that platform my plan was to make theControlContext
look like this:pub struct S390XControlContext { pub frame_pointer: *mut u8, pub stack_pointer: *mut u8, padding: [*mut u8; 12], pub instruction_pointer: *mut u8, }
(Or just give up on frame pointer walking there)
So at least it seems that within the same ISA, all OSes sufficiently agree on the stack layout so that the frame pointer walking causes no extra hassle.
Documenting the layout
I also would expect that we wouldn't actually need to explicitly define all the details of this data and its layout. I'd imagine we would only say that it has a given size and alignment, and is otherwise only valid to be manipulated via the
stack_switch
instruction. That is, it should be opaque to the host.But maybe something like stack walking means that the runtime needs to be able to peek inside this data, and we can't keep it opaque to the host?
Yes, the layout of the
ControlContext
can be kept mostly opaque, except for stack creation. As long as you only create and consume these contexts withstack_switch
, you really don't need to know anything about them besides their size and (boring) alignment requirements.The only situation where you do need to know about the layout is when initializing a new stack: You need to create a corresponding
ControlContext
that whenstack_switch
-ed to executes the right trampoline on the new stack. That's why my idea was basically to leave details about the layout of the context out of the documentation of the instruction ininstruction.rs
, but at least have some platform-specific documentation elsewhere.Alternatively, we could add a
stack_init
instruction, or instructions that act like getters and setters on the context.
frank-emrich updated PR #9078.
frank-emrich updated PR #9078.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Ha, good question! I've added a few comments now about why this needs to be a fixed register: We only have a single payload and pass it in a register when doing a
stack_switch
to anotherstack_switch
on another stack. But in order for that to work, the two involvedstack_switch
instructions need to agree on the register used for this, similar to a calling convention.(In a perfect world I would extend the payload handling to arbitrarily many values of different types and make sure that the "calling convention" for that is exactly the same as for functions. If you do that, you don't have to reshuffle any arguments when switching to a new stack where the payloads given to the
stack_switch
become the arguments of the function to run on the new stack. But that would drastically increase the complexity and it's questionable how big the performance benefit would be)
frank-emrich submitted PR review.
frank-emrich created PR review comment:
I've extended the documentation of this instruction now. Still not specifying the actual layout itself, but mentioned the one-shottedness and the fact that the instruction does not check the pointers or data.
frank-emrich commented on PR #9078:
I've implemented the restriction to only lower
stack_switch
on Linux now: Strictly speaking, it's each individual OS rather than the ISA that may require additional things needing to be done when switching stacks (see my remarks on Windows in my huge previous comment). I do believe that my current implementation would actually work on x64 macOS, but haven't tested it. So I would just stay on the safe side and make the support OS-dependent (instead of, say, allowing anything Unix-like), with the only allowed one being Linux for now.I've implemented this using a partial constructor
on_linux
. The controversial bit may be that I'm now copying theTriple
into everymachinst::Callee
. Is that acceptable?
fitzgen submitted PR review.
fitzgen created PR review comment:
Gotcha, that makes sense. Thanks!
fitzgen commented on PR #9078:
only allowed one being Linux for now.
SGTM
The controversial bit may be that I'm now copying the
Triple
into everymachinst::Callee
. Is that acceptable?It doesn't seem ideal. Do we not already have access to a
&dyn TargetIsa
in theLowerContext
?
fitzgen commented on PR #9078:
Do we not already have access to a
&dyn TargetIsa
in theLowerContext
?The
IsleContext
contains aB: LowerBackend
which for x64 is aX64Backend
which contains aTriple
already, and all the otherLowerBackend
implementations also contain aTriple
. I think we could add afn triple(&self) -> &Triple
trait method toLowerBackend
and then reuse that in theIsleContext
's implementation of the partial constructor.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
There is no other place in Cranelift that looks at the target OS when determining how to lower clif ir. Even TLS handling chooses the right variant for the target OS using a codegen flag rather than by looking at the OS in the triple. Using the right calling conventions is done by the producer of the clif ir.
cfallin submitted PR review.
cfallin created PR review comment:
I think I agree here: philosophically, this is if not quite literally calling-convention related, certainly like an OS-interface detail that should be "baked into" the CLIF as CLIF is ordinarily explicit about such details. Can we check the triple in the Wasm translation (i.e., the wasmtime
FuncEnvironment
hook called bycranelift-wasm
, or wherever else this instruction is generated) and fail if on the wrong platform?This is also a little more future-looking in the sense that there may be Wasmtime details at some future point related to stack-switching (e.g., what if we add our own stack-protection mitigations or have some custom kind of stack-growth scheme or ...) -- we wouldn't want to hardcode that into Cranelift lowering in the same way OS details are here.
cfallin edited PR review comment.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Just to recap the context sprinkled through this PR: The reason why we need to do something platform/OS-specific here eventually is that Windows requires us to update parts of the Thread Information Block when switching stacks. Unfortunately, what exactly needs to be done is undocumented, so I've not done it, yet. The "plain" stack switching I implemented in this PR was supposed to work only on Linux, but from what I can tell should work on x64 macOS, too.
@bjorn3:
Following what you said, I guess one approach would be to make the lowering of
stack_switch
more similar to that oftls_value
. I'd imagine this would look as follows:
- Create a
StackSwitchMode
enum withDefault
andUpdateTib
variants.- Encode a value of that type in
shared_settings::Flags
and also add it as a field to theStackSwitch
MInst
.- When emitting code for
StackSwitch
, check the value of theStackSwitchMode
flag and act accordingly.I'm also happy to have more than one
MInst
for stack switching, and lower thestack_switch
CLIF instruction to one of them, based on the value ofStackSwitchMode
. That would mirrortls_value
more closely, but I'm inclined to avoid that: The core stack switching code is always the same, we just sometimes need to emit some extra code on top of that.Finally, what I described above would be the medium-term solution once I've implemented Windows support. For the time being I would just add a
None
variant toStackSwitchMode
, use that on Windows, and panic if we see it when emitting code forStackSwitch
.@cfallin:
Can we check the triple in the Wasm translation (i.e., the wasmtime
FuncEnvironment
hook called bycranelift-wasm
, or wherever else this instruction is generated) and fail if on the wrong platform?Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?
In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the
stack_switch
during the Wasm translation?I fully agree that for things like stack growing there's a good chance that in the future, we'd need some more customization of the generated code (e.g., customize stack probing, function preludes, ...).
frank-emrich edited PR review comment.
cfallin submitted PR review.
cfallin created PR review comment:
Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?
Both, I think -- basically, any sort of difference of behavior on that level should be reified in the CLIF, rather than implicit; that's how we've handled things like TLS and other platform dependencies.
In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the
stack_switch
during the Wasm translation?One or the other, depending on what is actually needed? I don't have enough context to actually specify the full design; that's something we can discuss further; only that we should make it explicit somehow. If one platform requires a superset of the work that another platform does, factoring out the common bit as one instruction and adding more logic at the CLIF level seems reasonable. On the other hand it's totally reasonable to have
stack_switch_windows
andstack_switch_sysv
instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
On the other hand it's totally reasonable to have
stack_switch_windows
andstack_switch_sysv
instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.Ah, I think that's where my confusion came from: For TLS, there is a single CLIF instruction, which is then lowered one of several per-platform
MInst
s (but based on a flag in the backend, not some ad hoc OS check like I did). I'm happy to do it like that, which would be similar to what I described in my response to @bjorn3.I'd prefer that over moving the TIB update out of the
stack_switch
(CLIF) instruction itself.
frank-emrich requested alexcrichton for a review on PR #9078.
frank-emrich updated PR #9078.
frank-emrich requested wasmtime-core-reviewers for a review on PR #9078.
frank-emrich commented on PR #9078:
I've reworked the platform dependence logic based now, inspired by what happens for TLS:
There's a new enumStackSwitchModel
with valuesNone
,Basic
, andUpdateWindowsTib
. A value of that is saved in the backendFlags
.
This value is then used when lowering astack_switch
CLIF instruction: IfNone
, we cannot lower them. IfBasic
, we lower tostack_switch_basic
. In the future, it will be the case thatUpdateWindowsTib
causes lowering to a new dedicatedMInst
(likeStackSwitchUpdateWindowsTib
), but currently this will behave likeNone
(i.e., we fail to lower).
frank-emrich commented on PR #9078:
@alexcrichton For some reason the automated reviewer assignment was triggered again, maybe because this PR is now touching a non-Cranelift file.
alexcrichton commented on PR #9078:
Do others have more thoughts on this? This all looks reaosnable enough to me to land, but I'd want to be sure to run by others too.
fitzgen submitted PR review:
LGTM with one final nitpick below
fitzgen created PR review comment:
Rather than making this an extractor from a type, can we make it a partial constructor?
Then the use would look like
(lower (stack_switch store_context_ptr load_context_ptr in_payload) (if-let (stack_switch_model) (StackSwitchModel.Basic)) ...)
frank-emrich updated PR #9078.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
Done, but you meant
(if-let (StackSwitchModel.Basic) (stack_switch_model))
, right?Is there a particular reason for making it a partial constructor? Following what's happening for TLS I gave my
StackSwitchModel
enum aNone
variant to indicate that no model was set, but that means that thestack_switch_model
constructor can be total, or am I missing something?
fitzgen submitted PR review.
fitzgen created PR review comment:
I might be wrong, but I had thought that you can only use partial constructors in
if-let
s. Might be worth removing theNone
variant fromStackSwitchModel
, if so.
fitzgen submitted PR review.
fitzgen commented on PR #9078:
I think this is in a good enough place that we can land it and then continue with any further improvements in follow ups. Thanks @frank-emrich!
frank-emrich updated PR #9078.
frank-emrich submitted PR review.
frank-emrich created PR review comment:
I tried it, and turning
stack_switch_model
into a total constructor seems to just work.
I don't understand the magic mechanism for configuring what goes into the backendFlags
in detail, but from what I can tell it does not allow you to store something logically equivalent toOption<StackSwitchModel>
in there, so keeping theNone
variant inStackSwitchModel
itself seems like the way to go.
frank-emrich commented on PR #9078:
Oops, it looks like I responded to your comment and added a new commit in parallel to you approving things. But I think it should be uncontroversial, I kept
None
and madestack_switch_model
total.
fitzgen commented on PR #9078:
Yeah, LGTM
fitzgen merged PR #9078.
Last updated: Jan 24 2025 at 00:11 UTC