bjorn3 commented on Issue #2678:
I got a panic at cranelift/codegen/src/isa/x64/abi.rs:508 when trying this branch out. There is an
unimplemented!()
there under the assumption that no XMM registers are callee-save, which is an incorrect assumption for WindowsFastcall.
cfallin commented on Issue #2678:
Ah, I had missed that -- fixed now and test added.
bjorn3 commented on Issue #2678:
The sysroot compiled successfully. The only missing things for full Windows support in cg_clif are TLS support and a few small things on the cg_clif side.
cfallin commented on Issue #2678:
One thing I didn't see (and maybe I missed) is how "args above four" are pushed in a right-to-left fashion to the stack
I went back and looked to be sure -- this is somewhat confusing phrasing, but "pushed right-to-left" is functionally the same as "args in order are at ascending stack addresses", which is what this PR does. This quick experiment on Compiler Explorer shows arg 5 is at
[rsp+40]
and arg 6 is at[rsp+48]
(40 because return address plus 32-byte shadow space). So I think we're OK here, and e.g.%f6
in the tests verifies this order (returns first of two on-stack args).[CI test]
I'll look into adding this! We may need to temporarily disable some tests to make it work, since unwind info for Windows is not implemented yet.
An interesting question is: how do we actually validate that we match what other compilers provide and expect? Even for System V, we don't (I think) have an ABI test that explicitly links Cranelift-produced code with other native code. I suppose we implicitly get that when we run tests that make calls into the Wasmtime runtime; perhaps we should add a few "test weird ABI corners" VM calls just for this purpose.
cfallin commented on Issue #2678:
An update on this: I've been pulling on a thread the last few days in order to implement fully correct "non-Rust" i128 support and I'm starting to wonder if this is the right path.
This has sort of turned into a refactor of the entire ABI machinery in order to support by-ref values and cross-XMM/integer movement, and it is maddeningly complex. Basically, because we don't have a legalization-based approach wherein we can "reduce to a previously solved problem", we end up handling an "arg within an arg" (the by-ref pointer) which contorts the code quite a bit. By-ref support adds another stack area on the caller side whose layout we have to compute, and tweaks a bunch of the ABI traits because we need some temp registers in various places. It's asymmetric wrt the return case: fastcall passes 128-bit return values, even integers (
__m128
), in XMM registers; so our two-I64-reg integer handling that is consistent throughout the rest of the backend needs special-cased split/concat code for return values only.To recap, this is necessary only for
extern "C"
calls from cg_clif withi128
values. Everything with cg_clif + Rust ABI works correctly; this is only for the hypothetical case that one links with a C library and passes 128-bit integers to it. I'm currently at a 1K-line diff and it's ~50% done.This is the complexity I was worried about when I had suggested "just panicing on i128 in the non-Rust case" above; @bjorn3, your pushback then was that this is not Fastcall-compliant, and indeed that is true! However, I am wondering how much this corner case really matters at this point in time, versus the disproportionate cost of building it (and the opportunity cost of not working on other things); and perhaps if it is very important later, we can talk about properly allocating time for implementing this.
Thoughts?
bjorn3 commented on Issue #2678:
As far as I know the following pass modes are used for SystemV and Fastcall with cg_llvm:
what sysv pass mode sysv abi fastcall pass mode fastcall abi argument i128 Rust abi cast -> i128 two integer registers cast -> i128 two integer registers i128 C abi cast -> i128 two integer registers indirect pass pointer in register __m128 Rust abi indirect pass pointer in register indirect pass pointer in register __m128 C abi vector i64x2 xmm register indirect pass pointer in register 128bit struct Rust abi cast -> i128 pass in two registers cast -> i128 pass in two registers 128bit struct C abi cast -> i128 pass in two registers indirect pass pointer in register return value i128 Rust abi cast -> i128 two integer registers cast -> i128 two integer registers i128 C abi cast -> i128 two integer registers indirect pass return pointer in register __m128 Rust abi indirect pass return pointer in register indirect pass return pointer in register __m128 C abi vector i64x2 xmm register vector i64x2 xmm register 128bit struct Rust abi cast -> i128 pass in two registers cast -> i128 pass in two registers 128bit struct C abi cast -> i128 pass in two registers indirect pass return pointer in register The pass mode here is the
PassMode
computed by rustc. Whencast -> i128
is used, the value is always passed in two integer registers, even when the calling convention theoretically doesn't support it. If the calling convention doesn't support it, for the C abi, rustc will use an indirect pass mode.I think Cranelift should follow LLVM and always pass i128 in two integer registers. If it needs to be passed as pointer or on the stack, the abi adjustement code of the user of Cranelift can do this. It is impossible to fully handle the abi in Cranelift anyway as it doesn't support structs. Cranelift should have the building blocks based on which the calling convention can be followed.
cfallin commented on Issue #2678:
Thank you for the detailed comparison table, @bjorn3!
So, yes, it seems we are in agreement: the by-ref argument passing can exist at a layer above the CLIF as finally seen by the codegen backend. This seems to be a better factorization of complexity.
Given this plan, I'll fix up the PR to address all of the other issues, then document that the i128-in-two-registers scheme is an LLVM/rustc-inspired extension. The open question is whether we want to do this by default (when the user just asks for
WindowsFastcall
) or only when explicitly indicated by a variant ABI (WindowsFastcallRust
orWindowsFastcallLLVM
for example) or by an ISA flag (fastcall_i128_in_integer_reg_pair
for example). I think I lean toward the latter, to minimize confusion later -- it forces the Cranelift user to think about and understand these subtleties of Fastcall -- and we can link to this PR discussion as a fairly detailed exploration of the space :-)
bjorn3 commented on Issue #2678:
Maybe it could be called
enable_llvm_abi_extensions
?
cfallin commented on Issue #2678:
In the last few commits I fell down a bit of a CI rabbithole but now this PR also allows the Cranelift backend variant to be set via environment variable; I couldn't get a separate action/runner to work for Windows CI so this lets us add it to the main testing matrix instead. Let's see if this works...
cfallin commented on Issue #2678:
Hmm, actually, scratch that -- that won't work because our filetests need to be aware of the dynamic choice of backend as well (to enable the right ones).
This seems to be another case where a simple request ("run it on Windows CI") has turned into a deep rabbithole of maddening complexity. Given that we are planning to switch the default Wasmtime backend soon (at which point all of our tests will provide coverage for this), given the increasingly uncomfortable maze of features and environment-variable switches, and given that filetests are covering fastcall with golden-output matching for now, I would strongly prefer not to dive further down this rabbithole, instead just waiting until we switch over to have actual run-test coverage -- @abrown, what do you think?
cfallin commented on Issue #2678:
Updated again after consulting with alexcrichton and abrown on Zulip --
runs-on: windows
andruns-on: windows-latest
are entirely different things apparently :-) Let's see if this works better...@sunfishcode, see the wasi-related test exclusion in the latest commit -- any thoughts on that? I was seeing failures of some symlink-related WASI tests on my local Windows machine (MinGW64 / MSYS2, Win10, possible complication: space in username/path (?)) but the shell script passes cleanly when excluding these.
cfallin commented on Issue #2678:
The segfault in CI is odd -- I can reproduce something similar locally but it seems to be a nondeterministic failure; it is always in
__report_gsfailure
in the Visual Studio runtime library, but I'm not yet able to trace how we're getting there (there is norr
on Windows sadly) -- though execution is throughlongjmp
and for some very odd reason the disassembly also shows a call to__telemetry_main_return_trigger()
, which appears to be telemetry (!) that MSVC inserts via the standard library. It's entirely possible that all of this is a red herring and the real issue is an ABI mismatch somewhere else; will continue to dig.
cfallin commented on Issue #2678:
More info: the segfault happens whenever a library call raises a trap (e.g. try running the table_fill Wasm spec tests); it results in the unit-test process exiting via abort in
__report_gsfailure()
. This is a failure of the "GS" stack-check mechanism in MSVC, which stores and checks stack cookies. Unfortunately it does not reproduce undergdb
consistently (also MinGW gdb does not load symbols from the .exe, and MinGW LLDB hangs, so I'm debugging somewhat blind) -- I can get it to appear or not appear by causing the frame that callssetjmp
to be a tail cail or not (with a debugging println). If not tail call, then failure still happens except under GDB.I'm trying to learn more about the GS stack-cookie mechanism; it doesn't seem that it would have any sort of action-at-a-distance (check other stack frames, etc) behavior, and the frames on either side of the setjmp/longjmp are in rustc-compiled native code. It's possible that the fastcall impl is going way out of bounds, of course. Does anyone know more about this mechanism (@sunfishcode or @peterhuene maybe)?
cfallin commented on Issue #2678:
Hmm, actually, things go awry after an
rdsspq
instruction and mismatch in__report_gsfailure
; this appears to be Intel CET (Control-flow Enforcement Technology?) which is a shadow stack thing. I'm wondering if we need to do something special in our generated code to keep this mechanism happy...
cfallin commented on Issue #2678:
I think I've worked out what's going on:
longjmp()
on windows-msvc, on a recent-enough machine with CET instructions, tracks the shadow stack depth and if it does not match a value stored in the longjmp block (i.e. we're skipping up the stack), invokesRtlUnwind()
, a Win32 function, to unwind frames.- We aren't generating/registering Windows unwind info yet, so this causes the program to abort. (This is the proximate issue we're seeing.)
- I see this issue now and not earlier, and GitHub CI sees it, I think because of presence of CET instructions: I get crashes on my Win10 VM on my 2020 desktop but not Skylake-era mini PC. Presumably GitHub's Windows runners are also running on recent cores (?).
- It's possible my speculation about ISA differences is wrong and something else is causing the runtime to try to unwind frames; in any case, it's trying to rely on unwind info that isn't there.
So our possible avenues forward seem to be:
- Land this without Windows CI. We don't actually support a Wasmtime-with-new-BE execution on Windows, because longjmp might break without unwind info. But this isn't any worse than now, where we don't support said execution because we don't do fastcall.
- Implement unwind info as well and land one PR with both.
- Roll our own setjmp that somehow circumvents Windows' structured exception handling. I'm skeptical of interfacing with complex mechanisms that require correct metadata, for security and correctness reasons; the most correct metadata is the metadata you don't have to generate at all. But there may (?) be consequences to that, or it may just not be possible; can someone more knowledgeable about Win32 chime in?
If the third option is not possible, I'd favor #1, so this PR doesn't grow unboundedly; we can then include Windows CI when we add unwind (next) and get coverage of everything at once. Thoughts?
cfallin commented on Issue #2678:
Sounds good -- thanks for the review efforts!
Last updated: Jan 24 2025 at 00:11 UTC