afonso360 opened PR #4747 from inline-stackprobing
to main
:
:wave: Hey,
This PR adds inline stack probing for the x64 backend. Opening as a draft since there are still some pending questions.
This is feature gated behind a
enable_inline_probestack
and also requires the user to enableenable_probestack
.Currently we have two implementations, If we only need to probe a few stack pages, we unroll the probes. If we need more we fall back to using a loop.
Closes: #2299
cc: @bjorn3 @cfallin
afonso360 submitted PR review.
afonso360 created PR review comment:
I'm not too happy about this, and did this mostly as a proof of concept to see if it fixes the issues we were having in https://github.com/bjorn3/rustc_codegen_cranelift/issues/1261#issuecomment-1219307630 .
I think the real solution here is allowing us to emit labels and use those with the regular jump instruction.
gen_prologue
curently does two things:
- Initializes part of the
Callee
struct (calculates stack sizes)- Generates the prologue instructions
These instructions aren't emitted immediatley, instead we call this early on in vcode emission to initialize the
Callee
struct, and the only later on emit the previously generated instructions.I think we can change this slightly to allow label emission in
gen_prologue
by splitting it into two functions.One that precomputes the stack sizes and initializes the
Callee
struct, that is called early on (where the currentgen_prologue
currently is).And a second one that directly emits the prologue instructions. This way we can emit them and at the same time pass a function to generate a label at the current emit point.
Are there issues with emitting labels so late in the pipeline? And does this sound like a reasonable refactor?
afonso360 submitted PR review.
afonso360 created PR review comment:
I'm not too happy about this, and did this mostly as a proof of concept to see if it fixes the issues we were having in https://github.com/bjorn3/rustc_codegen_cranelift/issues/1261#issuecomment-1219307630 .
I think the real solution here is allowing us to emit labels and use those with the regular jump instruction.
gen_prologue
curently does two things:
- Initializes part of the
Callee
struct (calculates stack sizes)- Generates the prologue instructions
These instructions aren't emitted immediatley, instead we call this early on in vcode emission to initialize the
Callee
struct, and the only later on emit the previously generated instructions.I think we can change this slightly to allow label emission in
gen_prologue
by splitting it into two functions.One that precomputes the stack sizes and initializes the
Callee
struct, that is called early on (where the currentgen_prologue
currently is).And a second one that directly emits the prologue instructions. This way we can emit them and at the same time pass a function to generate a label at the current emit point.
Are there issues with emitting labels so late in the pipeline? And does this sound like a reasonable refactor?
afonso360 deleted PR review comment.
bjorn3 submitted PR review.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
You can only have either outlined or inlined stack probes, not both at the same time. Maybe make it an enum?
afonso360 submitted PR review.
afonso360 created PR review comment:
Yeah, that sounds like a good idea.
afonso360 edited PR review comment.
cfallin submitted PR review.
cfallin created PR review comment:
@afonso360 in principle we could allow for control flow to be inserted this late in the pipeline, but in practice it's a fairly intrusive refactor as we have to somehow communicate out-of-band the label and reference to it (or hack it with an explicit offset as here, but this definitely is not the right answer IMHO as it bakes in a dependence on encoded instruction sizes determined elsewhere in the code, in a way that could lead to a CVE if it later breaks).
I think the better answer here is to actually have a psuedo-
MachInst
that represents "stack-probe loop" and emits whatever is appropriate. We already have control-flow-inside-an-instruction for cases likeTrapIf
; this works well enough and maintains the abstraction that an instruction is a single-in, single-out thing (ignoring CLIF-level branches for now), which the rest of the backend assumes. Does that make sense?
afonso360 created PR review comment:
Yeah that does sound reasonable, and a lot easier that performing that refactor. Thanks!
afonso360 submitted PR review.
afonso360 updated PR #4747 from inline-stackprobing
to main
.
afonso360 submitted PR review.
afonso360 created PR review comment:
I've added a
stackprobe_strategy
flag, and while right now its justinline
andoutline
I think in the future we can experiment with adding different strategies to mix the two, such as allowing both and inOptLevel::Size
emitting the unroll inline or the call, whichever is smallest.
bjorn3 created PR review comment:
Looks like you haven't pushed it yet.
bjorn3 submitted PR review.
afonso360 edited PR review comment.
afonso360 updated PR #4747 from inline-stackprobing
to main
.
afonso360 submitted PR review.
afonso360 created PR review comment:
Oops, fixed
afonso360 edited PR review comment.
afonso360 updated PR #4747 from inline-stackprobing
to main
.
afonso360 updated PR #4747 from inline-stackprobing
to main
.
afonso360 has marked PR #4747 as ready for review.
afonso360 updated PR #4747 from inline-stackprobing
to main
.
jameysharp submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
This comment is no longer true, is it? I think you can delete it.
jameysharp created PR review comment:
Minor edits I just noticed:
// The inline stack probe loop has 3 phases: // // We generate the "guard area" register which is essentially the frame_size aligned to // guard_size. We copy the stack pointer and subtract the guard area from it. This // gets us a register that we can use to compare when looping. // // After that we emit the loop. Essentially we just adjust the stack pointer one guard_size'd // distance at a time and then touch the stack by writing anything to it. We use the previously // created "guard area" register to know when to stop looping.
jameysharp created PR review comment:
Why do just a few runtests need stack-probing disabled?
afonso360 updated PR #4747 from inline-stackprobing
to main
.
afonso360 submitted PR review.
afonso360 created PR review comment:
These tests have stack slots above the threshold where we emit a stack probe. The default strategy is to outline the stack probing to a external function call, but since we never provide an implementation for that anywhere the JIT can't link these tests, so they fail.
I manually disabled stack probing for these which reflects the old behaviour of force disabling it everywhere.
afonso360 edited PR review comment.
jameysharp has enabled auto merge for PR #4747.
jameysharp merged PR #4747.
Last updated: Jan 24 2025 at 00:11 UTC