iximeow edited PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
As I mention in the commit message, I do not know having this part of the check causes an error. I tried copying the style in f.ex
probestack-disabled.clif
, and I swear this passed once, but now yields an error like> [RexOp1spaddr8_id#808d,%rcx] v11 = stack_addr.i64 ss0 ^~~~~~~~~~~~~~~~~~~~~~~~ Matched #7: \bv11 = stack_addr\.i64 ss0\b > [Op2fldDisp8#410,%xmm7] v13 = load.f64x2 notrap aligned v11+16 Missed #8: \[Op2fldDisp8\#410,%xmm7\] v13 = load\.f64x2 notrap aligned v11\+16\bnote that
got: [Op2fldDisp8#410,%xmm7] v13 = load.f64x2 notrap aligned v11+16 expected: [Op2fldDisp8#410,%xmm7] v13 = load.f64x2 notrap aligned v11+16these are the same aside from whitespace. I tested having the same amount of whitespace just in case, and that didn't influence passing or not. Am I misunderstanding the
nextln
directive?
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
This is brittle in the face of frame pointer elision. A more applicable register would be
r11
, but there are no rex encodings offstDisp8
orfldDisp8
, so usingr11
as a base violates constraints that the registers must have index <=7.
hrydgard submitted PR Review.
hrydgard created PR Review Comment:
csr_stack_size = (csr_stack_size + 15) & !0b1111;
would save extending when already aligned to 16 bytes (unless I'm misunderstanding something and that's desirable)
hrydgard edited PR Review Comment.
hrydgard edited PR Review Comment.
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
that'll teach me for trying to do math. Good catch!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
jaykrell submitted PR Review.
jaykrell created PR Review Comment:
You only need to preserve the lower 128 bits.
The upper 128 (or 384, whatever) are volatile.
The rationale is:
- You cannot make registers nonvolatile after the fact.
- The ABI cannot be revised in a compatible way, adding nonvolatile registers.
- There are no unwind codes to describe how to restore anything not in the v1 ABI.
You can save more if you want, but it buys little/nothing.
Except, you know, you can have private calling conventions among known callers/callees.
Callers can assume more volatiles (ymm, zmm) are preserved, and can allow for nonvolatiles to be trashed (that the caller preserves).
Calling conventions are largely?entirely about separate compilation.
jaykrell edited PR Review Comment.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
Speaking of unwind, can we update
unwind.rs
to emit the unwind codes for these CSRs as part of this PR? The original implementation made some assumptions as to what instructions were present in the prologue.
peterhuene edited PR Review Comment.
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
I've update Windows fastcall unwind generation to emit XmmSave128 and XmmSave128Far, though I'm suspect of the offset.
From reading https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019#unwind-operation-code it looks like the offset for SaveXmm counts a number of 16-byte stack slots from the base of the call frame, so storing to
[rbp - 0x10]
would have an offset of 1,[rbp - 0x20]
would be 2. In contrast, the callee-save code added here should be[rsp - csr_stack_slot_offset + FPR_offset]
whereFPR_offset
is 0x10, 0x20, ... as appropriate. The offsets recorded areFPR_offset
, and my suspicion is that they would be interpreted as offsets from the other side of callee-save space by Windows.This is well beyond my experience with Windows to actually verify. If this looks right to you, I'd be happy to hear it - threading the size of
csr_stack_slot
through toUnwindInfo
seems kind of annoying. I think it would best be a field onFunction
?
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
I also updated the comment to be more strongly worded on preserving the low 128 bits, thanks for the feedback, @jaykrell.
jaykrell submitted PR Review.
jaykrell created PR Review Comment:
To better understand, just look at C++ examples.
Here is a start. Change it up with alloca to get a frame pointer.
But ouch, keep in mind, compiler does some odd things that seem to violate the spec.
Like getting the rsp/rbp before adjustment, and recording offsets from after.
That does occur below unfortunately -- the use of rax.
That is an optimization and you don't have to do it. Please don't, really.
It is just trying to keep the offsets in the instructions to fit in 8 bits instead of 32 I think, to shrink instruction size, and looks unnecessary in this case, and wastes an instruction.C:\s>type 1.c __declspec(dllexport) float (*f1)(float, float); __declspec(dllexport) float f2(float a, float b) { float c = a + b; float d = a - b; float e = a * b; float f = a / b; return f1(a, b) + f1(c, d) + f1(c, d) - f1(e, f) + f1(e + f, e - f); } C:\s>cl /O2 /LD /Zi 1.c /link /noentry /incremental:no Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x64 Copyright (C) Microsoft Corporation. All rights reserved. 1.c Microsoft (R) Incremental Linker Version 14.21.27702.2 Copyright (C) Microsoft Corporation. All rights reserved. /out:1.dll /dll /implib:1.lib /debug /noentry /incremental:no 1.obj Creating library 1.lib and object 1.exp C:\s>link /dump /unwindinfo /disasm 1.dll Microsoft (R) COFF/PE Dumper Version 14.21.27702.2 Copyright (C) Microsoft Corporation. All rights reserved. Dump of file 1.dll File Type: DLL f2: ; This instruction is ok but imho an unnecessary and possibly failed optimization. 0000000180001000: 48 8B C4 mov rax,rsp 0000000180001003: 48 81 EC 98 00 00 sub rsp,98h 00 ; for example, this could/should be encoded as mov [rsp-80] 000000018000100A: 0F 29 70 E8 movaps xmmword ptr [rax-18h],xmm6 ; for example, this could/should be encoded as mov [rsp-70] 000000018000100E: 0F 29 78 D8 movaps xmmword ptr [rax-28h],xmm7 0000000180001012: 0F 28 F9 movaps xmm7,xmm1 0000000180001015: 44 0F 29 40 C8 movaps xmmword ptr [rax-38h],xmm8 000000018000101A: 44 0F 28 C0 movaps xmm8,xmm0 000000018000101E: 44 0F 29 48 B8 movaps xmmword ptr [rax-48h],xmm9 0000000180001023: 44 0F 28 C8 movaps xmm9,xmm0 0000000180001027: 44 0F 29 50 A8 movaps xmmword ptr [rax-58h],xmm10 000000018000102C: F3 44 0F 5C CF subss xmm9,xmm7 0000000180001031: 44 0F 29 58 98 movaps xmmword ptr [rax-68h],xmm11 0000000180001036: 44 0F 28 D0 movaps xmm10,xmm0 000000018000103A: 44 0F 29 60 88 movaps xmmword ptr [rax-78h],xmm12 000000018000103F: F3 44 0F 58 D7 addss xmm10,xmm7 0000000180001044: 44 0F 28 E0 movaps xmm12,xmm0 0000000180001048: 44 0F 28 D8 movaps xmm11,xmm0 000000018000104C: F3 44 0F 59 E7 mulss xmm12,xmm7 0000000180001051: 41 0F 28 C9 movaps xmm1,xmm9 0000000180001055: 41 0F 28 C2 movaps xmm0,xmm10 0000000180001059: F3 44 0F 5E DF divss xmm11,xmm7 000000018000105E: FF 15 A4 1F 00 00 call qword ptr [f1] 0000000180001064: 0F 28 F0 movaps xmm6,xmm0 0000000180001067: 0F 28 CF movaps xmm1,xmm7 000000018000106A: 41 0F 28 C0 movaps xmm0,xmm8 000000018000106E: FF 15 94 1F 00 00 call qword ptr [f1] 0000000180001074: F3 0F 58 F0 addss xmm6,xmm0 0000000180001078: 41 0F 28 C9 movaps xmm1,xmm9 000000018000107C: 41 0F 28 C2 movaps xmm0,xmm10 0000000180001080: FF 15 82 1F 00 00 call qword ptr [f1] 0000000180001086: 0F 28 F8 movaps xmm7,xmm0 0000000180001089: 41 0F 28 CB movaps xmm1,xmm11 000000018000108D: 41 0F 28 C4 movaps xmm0,xmm12 0000000180001091: F3 0F 58 FE addss xmm7,xmm6 0000000180001095: FF 15 6D 1F 00 00 call qword ptr [f1] 000000018000109B: 41 0F 28 CC movaps xmm1,xmm12 000000018000109F: F3 0F 5C F8 subss xmm7,xmm0 00000001800010A3: F3 41 0F 5C CB subss xmm1,xmm11 00000001800010A8: F3 45 0F 58 DC addss xmm11,xmm12 00000001800010AD: 41 0F 28 C3 movaps xmm0,xmm11 00000001800010B1: FF 15 51 1F 00 00 call qword ptr [f1] ; blech, see below 00000001800010B7: 4C 8D 9C 24 98 00 lea r11,[rsp+98h] 00 00 00000001800010BF: 41 0F 28 73 E8 movaps xmm6,xmmword ptr [r11-18h] 00000001800010C4: F3 0F 58 C7 addss xmm0,xmm7 00000001800010C8: 0F 28 7C 24 70 movaps xmm7,xmmword ptr [rsp+70h] 00000001800010CD: 45 0F 28 43 C8 movaps xmm8,xmmword ptr [r11-38h] 00000001800010D2: 45 0F 28 4B B8 movaps xmm9,xmmword ptr [r11-48h] 00000001800010D7: 45 0F 28 53 A8 movaps xmm10,xmmword ptr [r11-58h] 00000001800010DC: 45 0F 28 5B 98 movaps xmm11,xmmword ptr [r11-68h] 00000001800010E1: 45 0F 28 63 88 movaps xmm12,xmmword ptr [r11-78h] ; This violates the letter of the spec, alas. ; It is supposed to be add rsp,n, or lea rsp, n[frame] 00000001800010E6: 49 8B E3 mov rsp,r11 00000001800010E9: C3 ret Function Table (1) Begin End Info Function Name 00000000 00001000 000010EA 0000211C f2 Unwind version: 1 Unwind flags: None Size of prologue: 0x3F Count of codes: 16 Unwind codes: 3F: SAVE_XMM128, register=xmm12 offset=0x20 36: SAVE_XMM128, register=xmm11 offset=0x30 2C: SAVE_XMM128, register=xmm10 offset=0x40 23: SAVE_XMM128, register=xmm9 offset=0x50 1A: SAVE_XMM128, register=xmm8 offset=0x60 12: SAVE_XMM128, register=xmm7 offset=0x70 0E: SAVE_XMM128, register=xmm6 offset=0x80 0A: ALLOC_LARGE, size=0x98So, overall, mixed. The C++ examples help to understand, but they also confuse, because the C++ compiler violates the spec in ways that end up meaning the same thing at runtime, if you understand closely how unwinding works.
I would encourage you to follow the spec instead, though I suppose there might be measurable perf benefits otherwise.You can do the link /dump /unwindinfo on any binary -- no need for symbols.
jaykrell submitted PR Review.
jaykrell created PR Review Comment:
The NT unwinder is also open sourced, here:
https://github.com/dotnet/runtime/blob/master/src/coreclr/src/unwinder/amd64/unwinder_amd64.cpp#L895
at least an old maybe subsetted/altered form, but probably useful.
jaykrell edited PR Review Comment.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
The offsets in the generated unwind information in the test do seem odd, especially compared to the above output form VC++, but I think that's a result of adjusting the stack pointer for the frame after we're writing in the SIMD CSRs. In the test output, an offset of 0, 0x10, and 0x20 make sense given that the SP will be adjusted up during unwinding restores the XMM registers.
Perhaps we should adjust the SP first?
peterhuene edited PR Review Comment.
peterhuene edited PR Review Comment.
peterhuene edited PR Review Comment.
peterhuene edited PR Review Comment.
iximeow submitted PR Review.
iximeow created PR Review Comment:
If the generated unwind info looks okay, are you good with this change and revisiting in case it turns out not to be? I don't have anything resembling a useful Windows development environment, so I'm not even entirely sure what to look for being wrong in case the unwind information is wrong - I imagine "only" an incorrect value for a callee-save register in some parent frame? Given that overall this PR fixes an ABI error that results in dangerously wrong code, I _would_ like to get this figured out :)
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
I'm good with the change and open to revisiting it in the future. I think the order of operations in our prologues is just a little "strange" compared to VC++'s prologues, but I don't necessarily see that as a problem with respect to unwind.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
Sorry, I just caught this. I would expect a value of 4 originally, indicating a 40 byte SP adjustment: 8 bytes to realign the stack from the even number of GPR pushes and 32 bytes for the callee spill space we never elide (even for leaf functions). So there's a bug currently in how we're aligning the stack even when spilling just the GPRs.
With your changes, I would expect a value of 10, indicating 88 bytes: 8 bytes to realign the stack from the even number of GPR pushes, 48 bytes for the saved XMM registers, and 32 bytes for the callee spill space.
peterhuene submitted PR Review.
peterhuene submitted PR Review.
peterhuene submitted PR Review.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
For comparison, the VC++ example above allocates 152 bytes for its frame: 8 bytes for stack alignment (upon entry stack is always off by 8 bytes due to call instruction pushing return address), 112 bytes for saved XMM registers, and 32 bytes for callee spill space (the function does call).
peterhuene edited PR Review Comment.
peterhuene edited PR Review Comment.
iximeow created PR Review Comment:
Yep, there's definitely an error here. I hadn't checked the generated assembly on this case but
Disassembly of 356 bytes: [69/1840] 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 53 push rbx 5: 56 push rsi 6: 57 push rdi 7: 41 54 push r12 9: 41 55 push r13 b: 41 56 push r14 d: 41 57 push r15 f: 48 8d 84 24 10 00 00 00 lea rax, [rsp + 0x10] 17: f2 0f 11 30 movsd qword ptr [rax], xmm6 1b: f2 0f 11 78 10 movsd qword ptr [rax + 0x10], xmm7 20: f2 44 0f 11 40 20 movsd qword ptr [rax + 0x20], xmm8is definitely not the prologue I'd want to see. It looks like the callee-save for fprs begins at where the stack was when created, rather than being a non-overlapping region laid out somewhere later on. Which makes sense, since this is so late in compilation. urgh.
iximeow submitted PR Review.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
Was the unwind information "correct" in that RSP is adjusted by only 0x28 following saving the XMM registers?
iximeow submitted PR Review.
iximeow created PR Review Comment:
correct in that it was adjusted by 0x30 (5 * 8 == 40 == 0x28, but there's the implied extra 8 bytes bringing it to 0x30)
fitzgen submitted PR Review.
fitzgen created PR Review Comment:
File an issue for this todo and link to it here, so that it doesn't get lost, please.
fitzgen submitted PR Review.
fitzgen created PR Review Comment:
Wait -- shouldn't
fpr_offset
be initialized to the offset chosen for the stack slot earlier? Right now it is initialized to 0. (Or alternatively but equivalently use the stack slot offset as an immediate for thestack_addr
instruction?) Otherwise we are always reading xmm6-15 from stack frame offsets0..
. Is this the source of the overlapping stack slot bug?
fitzgen created PR Review Comment:
Is it too late in the pipeline to make
offset
beNone
here and then letlayout_stack
assign the offset? This might help with the overlapping stack slot issues you and Peter are discussing elsewhere in this thread (can't figure out how to comment inline there??). Or does the fastcall ABI require that they are saved at a particular location?
fitzgen created PR Review Comment:
ditto
iximeow submitted PR Review.
iximeow created PR Review Comment:
I was just looking at this for the same reason, I think you're on the right track. These offsets are up from the bottom of the stack slot so zero is on the opposite side from where
push
starts placing registers. My expectation was the callee-save region would look likesize | rbp size - 8 | rbx size - 16 | rsi ... other callee-save gprs 0x20 | xmm8 0x10 | xmm7 0x00 | xmm6which is close to what the actual layout looks like, but the base isn't what I'd expected.
iximeow submitted PR Review.
iximeow created PR Review Comment:
same as above, either this _is_ or is closely related to the overlapping stack slot issue. Either this will change or I'll update (probably with a comment describing CSR region layout) when it's fixed.
iximeow submitted PR Review.
iximeow created PR Review Comment:
(a nice benefit of picking the end of the CSR region is that it is defined to be aligned by ABI requirements, so we don't need to figure out if the last GPR save left the stack aligned or not)
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
Oh sorry, I meant 0x30 (the 5 in the unwind info indicating (8*5)+8; I seem to always forget the +8). Thanks for confirming.
peterhuene edited PR Review Comment.
iximeow edited PR Review Comment.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
Also, judging from the above disassembly, The offsets from RSP for the XMM registers are off by 1 in the resulting unwind information it seems. We're recording the offsets from RAX in the unwind info (scaled-by-16 values of 0, 1, and 2), but RAX itself is offset 0x10 from RSP.
peterhuene edited PR Review Comment.
peterhuene deleted PR Review Comment.
peterhuene submitted PR Review.
peterhuene created PR Review Comment:
I've deleted my last comment since it's probably just related to the overlapping stack slot issues. Once we have the proper prologue generated, I'll verify that the unwind information properly describes it.
iximeow submitted PR Review.
iximeow created PR Review Comment:
Some notes as it took me a surprising amount of time to figure out what _exactly_ was going on. I didn't really understand
StackSlot
too well.The gist of it is that
lea rax, [rsp + 0x10]
may look like an address of nothing in particular, but that's because it seemsStackSlot
used alongside adjusting the stack is a recipe for bad times. It's the instruction that would be correct to get the stack slot's start at the entry of the function without intervening changes. That is, the return address andrbp
are preserved on the stack by Cranelift, and 0x10 down the stack from the base address of the call frame would berbx
and the start of the rest of the csr region.rsp
isn't really "ready for use" until after the prologue though, sostack_addr
doesn't seem appropriate to use as it currently exists for this purpose.Subsequently, the xmm stores _would_ overwrite the return address,
rbp
,rbx
, ..., were the stack slot address right - I misunderstood which direction offsets count from too.I think what I can do is make a second stack slot that we can directly address for FPR save/lrestore, and requiring 16-byte alignment on that would keep offsets simple as well. I'll give it a try and we'll see how it goes :crossed_fingers:
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow updated PR #1378 from windows-fprs-preservation
to master
:
- [x] This has been discussed in issue #1366
- [x] A short description: Windows appears to have extended ABI constraints to require callees to save XMM6-XMM15 (see #1366 for links to reference material). This PR adds preserve/restore behavior for those registers.
- [x] This PR contains test cases. Ish. There is an additional test that would test the right behavior if this was totally ready to go.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so and/or ping
bnjbvr
. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!
iximeow submitted PR Review.
iximeow created PR Review Comment:
the missing stack adjustment in this test maybe should have been a giveaway that something was awry too. I'd read it as not necessary, assuming
ss0
would be some valid stack slot, but in reality this was testing that a store tov7
would clobber the start of the CSR region :(
ss0
is still the correct stack slot with this change because I add the FPR stack slot before the GPR stack slot. I've since updated this test to verify that.
iximeow submitted PR Review.
iximeow created PR Review Comment:
Surprise of surprises, the number is now not
3
,4
,5
, or10
, but12
! I can argue the correctness of 12, though. Forstack_slot
reasons I don't understand, the FPR preservation region is aligned more strictly than necessary. This grows what would be 48 bytes into 64. Plus eight bytes to realign the stack, and 32 bytes of shadow stack space, that makes for a 104-byte allocation. That would be 13 8-byte slots, minus one for the implied slot makes 12!A similar wider-than-necessary FPR region is also present in the
float_callee_save
test I added. I've tried to follow through what causes this inlayout_stack
and my best theory is that treating a stack slot as a collection of smaller units is not quite how it's intended to be used, and instead it tries to align the provided stack slots using their sizes as an alignment requirement. It looks like treating a stack slot as an aggregate of more permissible units might be a bit of a change, so I'm open to recommendations if taking ~16-48 extra bytes per call frame is a nonstarter.
iximeow submitted PR Review.
iximeow created PR Review Comment:
update: it's not too late, but
layout_stack
requires slots of typeIncomingArg
to have a defined offset. That offset is used for the base of the stack layout so aNone
ends up being a panic.I ended up making FPR save/restore go through a second stack slot of kind
SpillSlot
- it seems like either this orExplicitSlot
might make sense on account of stack store/load is a lot like a spill (if they weren't used, they wouldn't have to be "spilled" to the stack). I'm not entirely sure whatExplicitSlot
ought to be used for, but it also looks applicable?
fitzgen submitted PR Review.
fitzgen created PR Review Comment:
I'm not totally sure, but I think that
SpillSlot
s are intended to be reused/de-duped across spilled values with distinct live ranges. This should be possible to do in theory for calling convention related stack slots (same as with struct return) but I'm not sure that it works as of today nor that it wouldn't be buggy.I think (again not totally sure, as they're mostly used in tests and pretty much all slot kinds other than incoming args are treated the same way) that explicit slots are intended as catch all slots used for random purposes.
It seems that other csr slots are all
IncomingArg
tho, so maybe its best to keep that slot kind and figure out what the explicit offset needs to be, rather than rely onlayout_stack
.
iximeow submitted PR Review.
iximeow created PR Review Comment:
I think we'd need to make some changes to
stack_layout
andStackSlot
for "correct offset inStackSlotKind::IncomingArg
" to be a question we can really answer.stack_addr
and friends assume the prologue has already done its stack setup, and, seemingly, thatrsp
won't move through the body of the function. So for the address of anIncomingArg
slot to be used _in_ the prologue, we'd need to defer actual resolution of stack offsets even after the instructions are added.The other option is tracking the offset ourselves in the prologue, which I expect would look like tracking the amount we've moved SP with pushes/explicit adjustment, then acknowledging that the address from
stack_addr(IncomingArgSlot)
will be wrong (it'd be off by however much we've pushed) and correcting for that after the fact. Because floating point save/restore is past any GPR pushes, I think the offset will always be valid, just smaller than it "should" be.I'd be more suspicious that I'm misunderstanding something, but I don't think we have other instances of wanting to do stack loads/stores in prologues/epilogues, so maybe this is actually a rough spot? :(
iximeow closed without merge PR #1378.
jaykrell submitted PR Review.
jaykrell created PR Review Comment:
The VIsual C++ code above is weird.
I do not recommend emulating it.
It is surely correct, and it might be optimized, but do not emulate it.
Visual C++ definitely does some strange things as optimizations.
In particular, it stores to the stack before adjusting rsp, and then has to misstate frame pointer offsets. "Normal" code only pushes prior to storing to the stack, I believe.
Maybe try looking at unoptimized output instead.
Or link /dump /disasm /unwindinfo ntoskrnl.exe to find simple stuff written in assembly.
Or read and follow the spec -- Visual C++ kinda skirts the spec.
Last updated: Oct 23 2024 at 20:03 UTC