Stream: git-cranelift

Topic: cranelift / PR #1378 Preserve SIMD registers in Windows f...


view this post on Zulip GitHub (Feb 06 2020 at 22:37):

iximeow edited PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 06 2020 at 22:39):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 06 2020 at 22:46):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 06 2020 at 22:50):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 06 2020 at 22:50):

iximeow created PR Review Comment:

As I mention in the commit message, I do not know having this part of the check causes an error. I tried copying the style in f.ex probestack-disabled.clif, and I swear this passed once, but now yields an error like

> [RexOp1spaddr8_id#808d,%rcx]        v11 = stack_addr.i64 ss0
                                      ^~~~~~~~~~~~~~~~~~~~~~~~
Matched #7: \bv11 = stack_addr\.i64 ss0\b
> [Op2fldDisp8#410,%xmm7]             v13 = load.f64x2 notrap aligned v11+16
Missed #8: \[Op2fldDisp8\#410,%xmm7\] v13 = load\.f64x2 notrap aligned v11\+16\b

note that

got:      [Op2fldDisp8#410,%xmm7]             v13 = load.f64x2 notrap aligned v11+16
expected: [Op2fldDisp8#410,%xmm7] v13 = load.f64x2 notrap aligned v11+16

these are the same aside from whitespace. I tested having the same amount of whitespace just in case, and that didn't influence passing or not. Am I misunderstanding the nextln directive?

view this post on Zulip GitHub (Feb 06 2020 at 23:21):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 06 2020 at 23:28):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 06 2020 at 23:28):

iximeow created PR Review Comment:

This is brittle in the face of frame pointer elision. A more applicable register would be r11, but there are no rex encodings of fstDisp8 or fldDisp8, so using r11 as a base violates constraints that the registers must have index <=7.

view this post on Zulip GitHub (Feb 10 2020 at 22:13):

hrydgard submitted PR Review.

view this post on Zulip GitHub (Feb 10 2020 at 22:13):

hrydgard created PR Review Comment:

csr_stack_size = (csr_stack_size + 15) & !0b1111;

would save extending when already aligned to 16 bytes (unless I'm misunderstanding something and that's desirable)

view this post on Zulip GitHub (Feb 10 2020 at 22:14):

hrydgard edited PR Review Comment.

view this post on Zulip GitHub (Feb 10 2020 at 22:14):

hrydgard edited PR Review Comment.

view this post on Zulip GitHub (Feb 10 2020 at 22:45):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 10 2020 at 22:47):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 10 2020 at 22:47):

iximeow created PR Review Comment:

that'll teach me for trying to do math. Good catch!

view this post on Zulip GitHub (Feb 11 2020 at 01:31):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 11 2020 at 01:34):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 11 2020 at 01:35):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 12 2020 at 12:36):

jaykrell submitted PR Review.

view this post on Zulip GitHub (Feb 12 2020 at 12:36):

jaykrell created PR Review Comment:

You only need to preserve the lower 128 bits.
The upper 128 (or 384, whatever) are volatile.
The rationale is:

You can save more if you want, but it buys little/nothing.

Except, you know, you can have private calling conventions among known callers/callees.

Callers can assume more volatiles (ymm, zmm) are preserved, and can allow for nonvolatiles to be trashed (that the caller preserves).

Calling conventions are largely?entirely about separate compilation.

view this post on Zulip GitHub (Feb 12 2020 at 12:40):

jaykrell edited PR Review Comment.

view this post on Zulip GitHub (Feb 12 2020 at 18:53):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 12 2020 at 18:53):

peterhuene created PR Review Comment:

Speaking of unwind, can we update unwind.rs to emit the unwind codes for these CSRs as part of this PR? The original implementation made some assumptions as to what instructions were present in the prologue.

view this post on Zulip GitHub (Feb 12 2020 at 18:57):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 12 2020 at 23:38):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 12 2020 at 23:41):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 12 2020 at 23:45):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 12 2020 at 23:57):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 12 2020 at 23:57):

iximeow created PR Review Comment:

I've update Windows fastcall unwind generation to emit XmmSave128 and XmmSave128Far, though I'm suspect of the offset.

From reading https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019#unwind-operation-code it looks like the offset for SaveXmm counts a number of 16-byte stack slots from the base of the call frame, so storing to [rbp - 0x10] would have an offset of 1, [rbp - 0x20] would be 2. In contrast, the callee-save code added here should be [rsp - csr_stack_slot_offset + FPR_offset] where FPR_offset is 0x10, 0x20, ... as appropriate. The offsets recorded are FPR_offset, and my suspicion is that they would be interpreted as offsets from the other side of callee-save space by Windows.

This is well beyond my experience with Windows to actually verify. If this looks right to you, I'd be happy to hear it - threading the size of csr_stack_slot through to UnwindInfo seems kind of annoying. I think it would best be a field on Function?

view this post on Zulip GitHub (Feb 13 2020 at 00:01):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 13 2020 at 00:04):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 13 2020 at 00:04):

iximeow created PR Review Comment:

I also updated the comment to be more strongly worded on preserving the low 128 bits, thanks for the feedback, @jaykrell.

view this post on Zulip GitHub (Feb 13 2020 at 02:39):

jaykrell submitted PR Review.

view this post on Zulip GitHub (Feb 13 2020 at 02:39):

jaykrell created PR Review Comment:

To better understand, just look at C++ examples.
Here is a start. Change it up with alloca to get a frame pointer.
But ouch, keep in mind, compiler does some odd things that seem to violate the spec.
Like getting the rsp/rbp before adjustment, and recording offsets from after.
That does occur below unfortunately -- the use of rax.
That is an optimization and you don't have to do it. Please don't, really.
It is just trying to keep the offsets in the instructions to fit in 8 bits instead of 32 I think, to shrink instruction size, and looks unnecessary in this case, and wastes an instruction.

C:\s>type 1.c

__declspec(dllexport)
float (*f1)(float, float);

__declspec(dllexport)
float f2(float a, float b)
{
 float c = a + b;
 float d = a - b;
 float e = a * b;
 float f = a / b;

 return f1(a, b) + f1(c, d) + f1(c, d) - f1(e, f) + f1(e + f, e - f);
}

C:\s>cl /O2 /LD /Zi 1.c /link /noentry /incremental:no
Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

1.c
Microsoft (R) Incremental Linker Version 14.21.27702.2
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:1.dll
/dll
/implib:1.lib
/debug
/noentry
/incremental:no
1.obj
   Creating library 1.lib and object 1.exp

C:\s>link /dump /unwindinfo /disasm 1.dll
Microsoft (R) COFF/PE Dumper Version 14.21.27702.2
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file 1.dll

File Type: DLL

f2:
; This instruction is ok but imho an unnecessary and possibly failed optimization.

  0000000180001000: 48 8B C4           mov         rax,rsp

  0000000180001003: 48 81 EC 98 00 00  sub         rsp,98h
                    00
; for example, this could/should be encoded as mov [rsp-80]

  000000018000100A: 0F 29 70 E8        movaps      xmmword ptr [rax-18h],xmm6

; for example, this could/should be encoded as mov [rsp-70]

  000000018000100E: 0F 29 78 D8        movaps      xmmword ptr [rax-28h],xmm7
  0000000180001012: 0F 28 F9           movaps      xmm7,xmm1
  0000000180001015: 44 0F 29 40 C8     movaps      xmmword ptr [rax-38h],xmm8
  000000018000101A: 44 0F 28 C0        movaps      xmm8,xmm0
  000000018000101E: 44 0F 29 48 B8     movaps      xmmword ptr [rax-48h],xmm9
  0000000180001023: 44 0F 28 C8        movaps      xmm9,xmm0
  0000000180001027: 44 0F 29 50 A8     movaps      xmmword ptr [rax-58h],xmm10
  000000018000102C: F3 44 0F 5C CF     subss       xmm9,xmm7
  0000000180001031: 44 0F 29 58 98     movaps      xmmword ptr [rax-68h],xmm11
  0000000180001036: 44 0F 28 D0        movaps      xmm10,xmm0
  000000018000103A: 44 0F 29 60 88     movaps      xmmword ptr [rax-78h],xmm12
  000000018000103F: F3 44 0F 58 D7     addss       xmm10,xmm7
  0000000180001044: 44 0F 28 E0        movaps      xmm12,xmm0
  0000000180001048: 44 0F 28 D8        movaps      xmm11,xmm0
  000000018000104C: F3 44 0F 59 E7     mulss       xmm12,xmm7
  0000000180001051: 41 0F 28 C9        movaps      xmm1,xmm9
  0000000180001055: 41 0F 28 C2        movaps      xmm0,xmm10
  0000000180001059: F3 44 0F 5E DF     divss       xmm11,xmm7
  000000018000105E: FF 15 A4 1F 00 00  call        qword ptr [f1]
  0000000180001064: 0F 28 F0           movaps      xmm6,xmm0
  0000000180001067: 0F 28 CF           movaps      xmm1,xmm7
  000000018000106A: 41 0F 28 C0        movaps      xmm0,xmm8
  000000018000106E: FF 15 94 1F 00 00  call        qword ptr [f1]
  0000000180001074: F3 0F 58 F0        addss       xmm6,xmm0
  0000000180001078: 41 0F 28 C9        movaps      xmm1,xmm9
  000000018000107C: 41 0F 28 C2        movaps      xmm0,xmm10
  0000000180001080: FF 15 82 1F 00 00  call        qword ptr [f1]
  0000000180001086: 0F 28 F8           movaps      xmm7,xmm0
  0000000180001089: 41 0F 28 CB        movaps      xmm1,xmm11
  000000018000108D: 41 0F 28 C4        movaps      xmm0,xmm12
  0000000180001091: F3 0F 58 FE        addss       xmm7,xmm6
  0000000180001095: FF 15 6D 1F 00 00  call        qword ptr [f1]
  000000018000109B: 41 0F 28 CC        movaps      xmm1,xmm12
  000000018000109F: F3 0F 5C F8        subss       xmm7,xmm0
  00000001800010A3: F3 41 0F 5C CB     subss       xmm1,xmm11
  00000001800010A8: F3 45 0F 58 DC     addss       xmm11,xmm12
  00000001800010AD: 41 0F 28 C3        movaps      xmm0,xmm11
  00000001800010B1: FF 15 51 1F 00 00  call        qword ptr [f1]

 ; blech, see below
  00000001800010B7: 4C 8D 9C 24 98 00  lea         r11,[rsp+98h]
                    00 00
  00000001800010BF: 41 0F 28 73 E8     movaps      xmm6,xmmword ptr [r11-18h]
  00000001800010C4: F3 0F 58 C7        addss       xmm0,xmm7
  00000001800010C8: 0F 28 7C 24 70     movaps      xmm7,xmmword ptr [rsp+70h]
  00000001800010CD: 45 0F 28 43 C8     movaps      xmm8,xmmword ptr [r11-38h]
  00000001800010D2: 45 0F 28 4B B8     movaps      xmm9,xmmword ptr [r11-48h]
  00000001800010D7: 45 0F 28 53 A8     movaps      xmm10,xmmword ptr [r11-58h]
  00000001800010DC: 45 0F 28 5B 98     movaps      xmm11,xmmword ptr [r11-68h]
  00000001800010E1: 45 0F 28 63 88     movaps      xmm12,xmmword ptr [r11-78h]

; This violates the letter of the spec, alas.
; It is supposed to be add rsp,n, or lea rsp, n[frame]
  00000001800010E6: 49 8B E3           mov         rsp,r11
  00000001800010E9: C3                 ret

Function Table (1)

           Begin    End      Info      Function Name

  00000000 00001000 000010EA 0000211C  f2
    Unwind version: 1
    Unwind flags: None
    Size of prologue: 0x3F
    Count of codes: 16
    Unwind codes:
      3F: SAVE_XMM128, register=xmm12 offset=0x20
      36: SAVE_XMM128, register=xmm11 offset=0x30
      2C: SAVE_XMM128, register=xmm10 offset=0x40
      23: SAVE_XMM128, register=xmm9 offset=0x50
      1A: SAVE_XMM128, register=xmm8 offset=0x60
      12: SAVE_XMM128, register=xmm7 offset=0x70
      0E: SAVE_XMM128, register=xmm6 offset=0x80
      0A: ALLOC_LARGE, size=0x98

So, overall, mixed. The C++ examples help to understand, but they also confuse, because the C++ compiler violates the spec in ways that end up meaning the same thing at runtime, if you understand closely how unwinding works.
I would encourage you to follow the spec instead, though I suppose there might be measurable perf benefits otherwise.

You can do the link /dump /unwindinfo on any binary -- no need for symbols.

view this post on Zulip GitHub (Feb 13 2020 at 06:22):

jaykrell submitted PR Review.

view this post on Zulip GitHub (Feb 13 2020 at 06:22):

jaykrell created PR Review Comment:

The NT unwinder is also open sourced, here:

https://github.com/dotnet/runtime/blob/master/src/coreclr/src/unwinder/amd64/unwinder_amd64.cpp#L895

at least an old maybe subsetted/altered form, but probably useful.

view this post on Zulip GitHub (Feb 13 2020 at 06:23):

jaykrell edited PR Review Comment.

view this post on Zulip GitHub (Feb 13 2020 at 17:32):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 13 2020 at 17:32):

peterhuene created PR Review Comment:

The offsets in the generated unwind information in the test do seem odd, especially compared to the above output form VC++, but I think that's a result of adjusting the stack pointer for the frame after we're writing in the SIMD CSRs. In the test output, an offset of 0, 0x10, and 0x20 make sense given that the SP will be adjusted up during unwinding restores the XMM registers.

Perhaps we should adjust the SP first?

view this post on Zulip GitHub (Feb 13 2020 at 17:32):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 13 2020 at 17:33):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 13 2020 at 17:40):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 13 2020 at 17:40):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 18:52):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 18:52):

iximeow created PR Review Comment:

If the generated unwind info looks okay, are you good with this change and revisiting in case it turns out not to be? I don't have anything resembling a useful Windows development environment, so I'm not even entirely sure what to look for being wrong in case the unwind information is wrong - I imagine "only" an incorrect value for a callee-save register in some parent frame? Given that overall this PR fixes an ABI error that results in dangerously wrong code, I _would_ like to get this figured out :)

view this post on Zulip GitHub (Feb 20 2020 at 19:15):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 19:15):

peterhuene created PR Review Comment:

I'm good with the change and open to revisiting it in the future. I think the order of operations in our prologues is just a little "strange" compared to VC++'s prologues, but I don't necessarily see that as a problem with respect to unwind.

view this post on Zulip GitHub (Feb 20 2020 at 19:27):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 19:45):

peterhuene created PR Review Comment:

Sorry, I just caught this. I would expect a value of 4 originally, indicating a 40 byte SP adjustment: 8 bytes to realign the stack from the even number of GPR pushes and 32 bytes for the callee spill space we never elide (even for leaf functions). So there's a bug currently in how we're aligning the stack even when spilling just the GPRs.

With your changes, I would expect a value of 10, indicating 88 bytes: 8 bytes to realign the stack from the even number of GPR pushes, 48 bytes for the saved XMM registers, and 32 bytes for the callee spill space.

view this post on Zulip GitHub (Feb 20 2020 at 19:45):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 19:54):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 19:54):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 20:03):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 20:03):

peterhuene created PR Review Comment:

For comparison, the VC++ example above allocates 152 bytes for its frame: 8 bytes for stack alignment (upon entry stack is always off by 8 bytes due to call instruction pushing return address), 112 bytes for saved XMM registers, and 32 bytes for callee spill space (the function does call).

view this post on Zulip GitHub (Feb 20 2020 at 20:06):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 20:11):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 21:33):

iximeow created PR Review Comment:

Yep, there's definitely an error here. I hadn't checked the generated assembly on this case but

Disassembly of 356 bytes:                                                                                                                                                               [69/1840]
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   53                      push    rbx
   5:   56                      push    rsi
   6:   57                      push    rdi
   7:   41 54                   push    r12
   9:   41 55                   push    r13
   b:   41 56                   push    r14
   d:   41 57                   push    r15
   f:   48 8d 84 24 10 00 00 00 lea     rax, [rsp + 0x10]
  17:   f2 0f 11 30             movsd   qword ptr [rax], xmm6
  1b:   f2 0f 11 78 10          movsd   qword ptr [rax + 0x10], xmm7
  20:   f2 44 0f 11 40 20       movsd   qword ptr [rax + 0x20], xmm8

is definitely not the prologue I'd want to see. It looks like the callee-save for fprs begins at where the stack was when created, rather than being a non-overlapping region laid out somewhere later on. Which makes sense, since this is so late in compilation. urgh.

view this post on Zulip GitHub (Feb 20 2020 at 21:33):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:10):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:10):

peterhuene created PR Review Comment:

Was the unwind information "correct" in that RSP is adjusted by only 0x28 following saving the XMM registers?

view this post on Zulip GitHub (Feb 20 2020 at 22:12):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:12):

iximeow created PR Review Comment:

correct in that it was adjusted by 0x30 (5 * 8 == 40 == 0x28, but there's the implied extra 8 bytes bringing it to 0x30)

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen created PR Review Comment:

File an issue for this todo and link to it here, so that it doesn't get lost, please.

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen created PR Review Comment:

Wait -- shouldn't fpr_offset be initialized to the offset chosen for the stack slot earlier? Right now it is initialized to 0. (Or alternatively but equivalently use the stack slot offset as an immediate for the stack_addr instruction?) Otherwise we are always reading xmm6-15 from stack frame offsets 0... Is this the source of the overlapping stack slot bug?

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen created PR Review Comment:

Is it too late in the pipeline to make offset be None here and then let layout_stack assign the offset? This might help with the overlapping stack slot issues you and Peter are discussing elsewhere in this thread (can't figure out how to comment inline there??). Or does the fastcall ABI require that they are saved at a particular location?

view this post on Zulip GitHub (Feb 20 2020 at 22:19):

fitzgen created PR Review Comment:

ditto

view this post on Zulip GitHub (Feb 20 2020 at 22:34):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:34):

iximeow created PR Review Comment:

I was just looking at this for the same reason, I think you're on the right track. These offsets are up from the bottom of the stack slot so zero is on the opposite side from where push starts placing registers. My expectation was the callee-save region would look like

size      | rbp
size - 8  | rbx
size - 16 | rsi
... other callee-save gprs
0x20      | xmm8
0x10      | xmm7
0x00      | xmm6

which is close to what the actual layout looks like, but the base isn't what I'd expected.

view this post on Zulip GitHub (Feb 20 2020 at 22:35):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:35):

iximeow created PR Review Comment:

same as above, either this _is_ or is closely related to the overlapping stack slot issue. Either this will change or I'll update (probably with a comment describing CSR region layout) when it's fixed.

view this post on Zulip GitHub (Feb 20 2020 at 22:39):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:39):

iximeow created PR Review Comment:

(a nice benefit of picking the end of the CSR region is that it is defined to be aligned by ABI requirements, so we don't need to figure out if the last GPR save left the stack aligned or not)

view this post on Zulip GitHub (Feb 20 2020 at 22:40):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:40):

peterhuene created PR Review Comment:

Oh sorry, I meant 0x30 (the 5 in the unwind info indicating (8*5)+8; I seem to always forget the +8). Thanks for confirming.

view this post on Zulip GitHub (Feb 20 2020 at 22:41):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 22:44):

iximeow edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 22:53):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 22:53):

peterhuene created PR Review Comment:

Also, judging from the above disassembly, The offsets from RSP for the XMM registers are off by 1 in the resulting unwind information it seems. We're recording the offsets from RAX in the unwind info (scaled-by-16 values of 0, 1, and 2), but RAX itself is offset 0x10 from RSP.

view this post on Zulip GitHub (Feb 20 2020 at 22:56):

peterhuene edited PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 22:59):

peterhuene deleted PR Review Comment.

view this post on Zulip GitHub (Feb 20 2020 at 23:00):

peterhuene submitted PR Review.

view this post on Zulip GitHub (Feb 20 2020 at 23:00):

peterhuene created PR Review Comment:

I've deleted my last comment since it's probably just related to the overlapping stack slot issues. Once we have the proper prologue generated, I'll verify that the unwind information properly describes it.

view this post on Zulip GitHub (Feb 21 2020 at 01:43):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 01:43):

iximeow created PR Review Comment:

Some notes as it took me a surprising amount of time to figure out what _exactly_ was going on. I didn't really understand StackSlot too well.

The gist of it is that lea rax, [rsp + 0x10] may look like an address of nothing in particular, but that's because it seems StackSlot used alongside adjusting the stack is a recipe for bad times. It's the instruction that would be correct to get the stack slot's start at the entry of the function without intervening changes. That is, the return address and rbp are preserved on the stack by Cranelift, and 0x10 down the stack from the base address of the call frame would be rbx and the start of the rest of the csr region. rsp isn't really "ready for use" until after the prologue though, so stack_addr doesn't seem appropriate to use as it currently exists for this purpose.

Subsequently, the xmm stores _would_ overwrite the return address, rbp, rbx, ..., were the stack slot address right - I misunderstood which direction offsets count from too.

I think what I can do is make a second stack slot that we can directly address for FPR save/lrestore, and requiring 16-byte alignment on that would keep offsets simple as well. I'll give it a try and we'll see how it goes :crossed_fingers:

view this post on Zulip GitHub (Feb 21 2020 at 05:20):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 21 2020 at 05:21):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 21 2020 at 05:29):

iximeow updated PR #1378 from windows-fprs-preservation to master:

This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.

@sstangl if you get a chance, I think you might know the right things for the questions to follow!

view this post on Zulip GitHub (Feb 21 2020 at 05:29):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 05:29):

iximeow created PR Review Comment:

the missing stack adjustment in this test maybe should have been a giveaway that something was awry too. I'd read it as not necessary, assuming ss0 would be some valid stack slot, but in reality this was testing that a store to v7 would clobber the start of the CSR region :(

ss0 is still the correct stack slot with this change because I add the FPR stack slot before the GPR stack slot. I've since updated this test to verify that.

view this post on Zulip GitHub (Feb 21 2020 at 05:45):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 05:45):

iximeow created PR Review Comment:

Surprise of surprises, the number is now not 3, 4, 5, or 10, but 12! I can argue the correctness of 12, though. For stack_slot reasons I don't understand, the FPR preservation region is aligned more strictly than necessary. This grows what would be 48 bytes into 64. Plus eight bytes to realign the stack, and 32 bytes of shadow stack space, that makes for a 104-byte allocation. That would be 13 8-byte slots, minus one for the implied slot makes 12!

A similar wider-than-necessary FPR region is also present in the float_callee_save test I added. I've tried to follow through what causes this in layout_stack and my best theory is that treating a stack slot as a collection of smaller units is not quite how it's intended to be used, and instead it tries to align the provided stack slots using their sizes as an alignment requirement. It looks like treating a stack slot as an aggregate of more permissible units might be a bit of a change, so I'm open to recommendations if taking ~16-48 extra bytes per call frame is a nonstarter.

view this post on Zulip GitHub (Feb 21 2020 at 05:51):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 05:51):

iximeow created PR Review Comment:

update: it's not too late, but layout_stack requires slots of type IncomingArg to have a defined offset. That offset is used for the base of the stack layout so a None ends up being a panic.

I ended up making FPR save/restore go through a second stack slot of kind SpillSlot - it seems like either this or ExplicitSlot might make sense on account of stack store/load is a lot like a spill (if they weren't used, they wouldn't have to be "spilled" to the stack). I'm not entirely sure what ExplicitSlot ought to be used for, but it also looks applicable?

view this post on Zulip GitHub (Feb 21 2020 at 18:34):

fitzgen submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 18:34):

fitzgen created PR Review Comment:

I'm not totally sure, but I think that SpillSlots are intended to be reused/de-duped across spilled values with distinct live ranges. This should be possible to do in theory for calling convention related stack slots (same as with struct return) but I'm not sure that it works as of today nor that it wouldn't be buggy.

I think (again not totally sure, as they're mostly used in tests and pretty much all slot kinds other than incoming args are treated the same way) that explicit slots are intended as catch all slots used for random purposes.

It seems that other csr slots are all IncomingArg tho, so maybe its best to keep that slot kind and figure out what the explicit offset needs to be, rather than rely on layout_stack.

view this post on Zulip GitHub (Feb 21 2020 at 19:15):

iximeow submitted PR Review.

view this post on Zulip GitHub (Feb 21 2020 at 19:15):

iximeow created PR Review Comment:

I think we'd need to make some changes to stack_layout and StackSlot for "correct offset in StackSlotKind::IncomingArg" to be a question we can really answer. stack_addr and friends assume the prologue has already done its stack setup, and, seemingly, that rsp won't move through the body of the function. So for the address of an IncomingArg slot to be used _in_ the prologue, we'd need to defer actual resolution of stack offsets even after the instructions are added.

The other option is tracking the offset ourselves in the prologue, which I expect would look like tracking the amount we've moved SP with pushes/explicit adjustment, then acknowledging that the address from stack_addr(IncomingArgSlot) will be wrong (it'd be off by however much we've pushed) and correcting for that after the fact. Because floating point save/restore is past any GPR pushes, I think the offset will always be valid, just smaller than it "should" be.

I'd be more suspicious that I'm misunderstanding something, but I don't think we have other instances of wanting to do stack loads/stores in prologues/epilogues, so maybe this is actually a rough spot? :(

view this post on Zulip GitHub (Mar 03 2020 at 19:13):

iximeow closed without merge PR #1378.

view this post on Zulip GitHub (Mar 04 2020 at 00:47):

jaykrell submitted PR Review.

view this post on Zulip GitHub (Mar 04 2020 at 00:47):

jaykrell created PR Review Comment:

The VIsual C++ code above is weird.
I do not recommend emulating it.
It is surely correct, and it might be optimized, but do not emulate it.
Visual C++ definitely does some strange things as optimizations.
In particular, it stores to the stack before adjusting rsp, and then has to misstate frame pointer offsets. "Normal" code only pushes prior to storing to the stack, I believe.
Maybe try looking at unoptimized output instead.
Or link /dump /disasm /unwindinfo ntoskrnl.exe to find simple stuff written in assembly.
Or read and follow the spec -- Visual C++ kinda skirts the spec.


Last updated: Oct 23 2024 at 20:03 UTC