bnjbvr opened PR #2068 from x64
to main
:
This mixed bag of commits implements enough support so that Cranelift-in-Spidermonkey runs all the wasm test cases correctly, including multi-value, reference types, explicit bounds checks, etc. So with respect to Spidermonkey, this brings the x64 level support to the same as aarch64's (on Linux at least; Win64 fastcall support is still to be implemented); wasmtime might require additional work.
Next step is setting up CI so this is actually tested in automation.
bnjbvr requested cfallin, julian-seward1 and jlb6740 for a review on PR #2068.
bnjbvr requested cfallin, julian-seward1 and jlb6740 for a review on PR #2068.
bnjbvr requested cfallin, julian-seward1 and jlb6740 for a review on PR #2068.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
{max,min}{ss,sd}
?
cfallin created PR Review Comment:
This could use some explanatory comments (it's very clever though!). It looks like it's emulating a rounding mode by converting input/2, except rounding up by OR'ing in the original LSB?
cfallin created PR Review Comment:
Can we pull this out as a top-level helper function?
cfallin created PR Review Comment:
(Ah, I see comments now below for this. Perhaps just add a forward-reference here, as in "see below for explanation".)
cfallin created PR Review Comment:
Would it make sense to merge
CvtFloatToSintSeq
andCvtFloatToUintSeq
with a flagis_signed: bool
?
cfallin created PR Review Comment:
Perhaps instead change
add_stackmap
to take a parameterStackmapExtent
, which is an enum with variantsStackmapExtent::UpcomingBytes(len)
andStackmapExtent::PriorBytes(len)
or something like that?
cfallin created PR Review Comment:
Should this be
0xffff_ffff_ffff_ffff
(i.e.u64::MAX
)? Or is theRegMemImm::imm
sign-extended? (Comment if so?)
cfallin created PR Review Comment:
Looking at the sequences above for these conversions, it appears that
src
is sometimes accessed after the temps. A def is allowed to reuse the source's register if the source is dead after this instruction, so we might inadvertently overwrite the source when we initialize the temp. Could you verify that we copy the source out to a temp as the first action in each of the sequences above?
cfallin created PR Review Comment:
Extract this sequence to a helper
emit_fcmp
, analogous toemit_cmp
above?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
nit: could you add a comment here, explaining how the Movd/Movq cases are distinguished? Presumably via REX.W ? Because the two match- arms here are identical, and it's confusing to look at at first (kinda looks like a bug :)
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
ditto
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
This is (indeed as Chris says) very clever. Just a note that it won't work unchanged if we ever come to support X86_32, since
cvtsi2s{d,s}
will only take a 32 bit input in that case.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Please, for clarity, specify whether the said "unsigned int" is 32 or 64 bits.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
It seems like the two comments are inconsistent:
/// Converts an unsigned int64 to a float64. /// Is the target a 64-bits or 32-bits register?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
We (collectively) should discuss register conventions, regarding
get_regs
and how it interacts with temp regs in these sequences, since I think we are in danger of falling into a nasty correctness hole otherwise. Not just here; the in-progress aarch64 atomics work has the same problem. Once we have a clear common understanding, we can go back and audit this stuff. I'm just noting it here; doesn't need to block landing right now.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
See also my comment above about need for clear analysis/policy/audit of temp register uses.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Since
Andps
already exists, is there any point in addingAndpd
? Similar question forOrpd
. They are functionally identical, so the only reason for doing this is if we believe we'll get some kind of domain-crossing penalty by applying (eg)Andps
to the output of some insn that generatesF64x2
rather thanF32x4
. Reading the opt guide sec 3.5.2.3 "Bypass between execution domains" leads me to believe thatAndpd/Andps/Orpd/Orps
would all "live" in the "integer domain".
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Or maybe all in the FP domain (who knows). Point is, I can't see that the implied 32x4 vs 64x2 differentiation would make any difference.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Similar comments to the
Andps/pd
/Orps/pd
duality.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Would it be safer to panic on any cases that we don't expect to ever get -- are there any? Specifically, for equality, signfulness is meaningless, and for overflows -- aren't there both signed and unsigned overflows? Certainly on x64 there are (
rflags.C
vsrflags.O
).
julian-seward1 edited PR Review Comment.
julian-seward1 submitted PR Review.
bnjbvr created PR Review Comment:
@julian-seward1 Indeed, fixed the above comment for
CvtUint64ToFloatSeq
so it saysto a float32/float64
, thanks!@cfallin I had thought of this; it just makes the emit code very long, and it was nice to have one axis to cut them. I hesitated between signed vs unsigned, or saturated vs not saturated, but the sat instructions emit very similar code, so splitting signed and unsigned made more sense. We could merge them back at some point, maybe using an enum to avoid manipulating the two booleans.
bnjbvr submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Good catch for
CvtUint64ToFloatSeq
, scary that testing didn't stumble upon it. Agreed we'll need a better policy in this kind of situation; in the meantime, I'll mark it asmod
instead ofuse
, and will make sure to add a copy of the real src to a temporary fed to this Inst's src; this is sufficient to ensure the src and tmps will get different registers (otherwise, they'd be conflicting writes). Thanks!Other sequences use
mod
for the src operand for this same reason.[and several minutes later] Even scarier, the checked div or rem sequence had the same potential bug. Changed it to mark the
divisor
operand as modified too.
bnjbvr created PR Review Comment:
It's sign-extended, i'll add a comment!
bnjbvr submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
I wasn't sure about this, to be honest: there aren't many opportunities for reuse, since the actual Fcmp lowering does things quite differently. Ok on principle, though.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Ah, good catch! At first, when the cvt instructions were not handled here, this would also set the RexFlags. Now we could unify movd and movq into a single enum variant, since they effectively have the same effect here. Will consider for later; it's nice to have the proper name in the enum list...
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Noted!
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Good question, I like having the properly typed names, but since the F64 variants may be one byte longer (because of 66-prefix)... That's a question for our Intel experts :-)
@abrown @jlb6740 Is there a preference to use e.g.
andps
overandpd
, if we know the inputs are actually doubles? (See above reference to Intel's manual that Julian quoted.)
bnjbvr edited PR Review Comment.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Hmm, reverted this change; this was code motion from the aarch64 code, but your comment made me realize the returned values likely only apply to aarch64, and eventually x64 didn't use it.
bnjbvr updated PR #2068 from x64
to main
:
This mixed bag of commits implements enough support so that Cranelift-in-Spidermonkey runs all the wasm test cases correctly, including multi-value, reference types, explicit bounds checks, etc. So with respect to Spidermonkey, this brings the x64 level support to the same as aarch64's (on Linux at least; Win64 fastcall support is still to be implemented); wasmtime might require additional work.
Next step is setting up CI so this is actually tested in automation.
abrown created PR Review Comment:
leads me to believe that Andpd/Andps/Orpd/Orps would all "live" in the "integer domain".
@julian-seward1, reading that section, I would rather think that
Andpd/Andps/Orpd/Orps
all live in the FP domain, not in the SIMD (a.k.a. "integer") domain. The example there switches betweenADDPS
(FP) andPAND
(SIMD) and back toADDPS
(FP). So I would expect that to keep everything in the same domain, to follow the manual's example, we would needADDPS; ANDPS; ADDPS
. Or am I missing something?Since Andps already exists, is there any point in adding Andpd?
The question still stands, though, and since I don't know of any penalty like the one mentioned (and I believe packed single floats and packed double floats to be in the same FP domain), then the assembler could avoid knowing about another instruction. But what is the benefit of that? We can always choose
ANDPS
in the lowering if we are concerned about code size? I would leave it in so that lowering users have the option to use it if it feels more natural.I think where I'm coming from (and I'm still developing this thought, so bear with me) is that at the assembler level, which is what we are talking about here, the best naming convention in the assembler code is one that matches the ISA naming convention 1:1. Instead of writing
Inst::XmmAbcDef(SseOpcode::Andps, ...)
(or its helper function) I am suggesting the abstraction should look more like...::Andps(...)
so that we more directly think in the ISA namespace than in Cranelift's custom namespace. From that viewpoint, it makes sense to me to have more instruction variants available (e.g.ANDPD
) to reduce friction during lowering. (I understand that's still not strongest argument for keepingANDPD
but I'm really thinking more about the ISA naming point... it was a bit frustrating to have to think in terms of an ad hoc convention instead when I was addding SIMD FP arithmetic support).
abrown submitted PR Review.
bnjbvr updated PR #2068 from x64
to main
:
This mixed bag of commits implements enough support so that Cranelift-in-Spidermonkey runs all the wasm test cases correctly, including multi-value, reference types, explicit bounds checks, etc. So with respect to Spidermonkey, this brings the x64 level support to the same as aarch64's (on Linux at least; Win64 fastcall support is still to be implemented); wasmtime might require additional work.
Next step is setting up CI so this is actually tested in automation.
bnjbvr merged PR #2068.
Last updated: Nov 22 2024 at 17:03 UTC