afonso360 opened PR #2928 from aarch64-i128-ops
to main
:
This PR implements some basic operations (add, sub, mul) for i128 types on AArch64
cfallin created PR review comment:
Could we include this test in a separate PR for the iconst issue? Or alternately just update the iconst implementation here (it should only be a few lines of code, I think!).
cfallin submitted PR review.
cfallin created PR review comment:
A few more test vectors would be good to have here, I think -- e.g. the upper 64 bits of the first operand are always zero here. Want to make sure that all four of the half-product quadrants are computed properly :-)
cfallin submitted PR review.
cfallin created PR review comment:
I think we could get away with not using a
tmp
here, no? We couldumulh
directly intodst_hi
and then usemadd
to add the other terms; that's slightly better from an allocate-fewer-vregs (compiler performance) perspective.
akirilov-arm created PR review comment:
This should be the second instruction in the sequence, since it can execute in parallel with
UMULH
.
akirilov-arm submitted PR review.
afonso360 submitted PR review.
afonso360 created PR review comment:
Yeah, I think I prefer including this in the PR for the iconst.
I think changing it here involves changing a bit more than just the emission for Iconst.
get_constant
only returns a u64 andget_immediate
doesn't support 128bit values yet.
cfallin submitted PR review.
cfallin created PR review comment:
Ah, yes, you're right, we need more infrastructure for that; sorry! Happy to have this whole test deferred then.
afonso360 updated PR #2928 from aarch64-i128-ops
to main
.
afonso360 created PR review comment:
I added a few more tests to all three functions, but if you have any particular test case in mind, let me know.
afonso360 submitted PR review.
afonso360 submitted PR review.
afonso360 created PR review comment:
It looks like when we do this, we do get a few extra mov's in the emitted code.
When placing the
mul
as the first or second instruction in the sequence we get the following code:umulh x5, x0, x2 madd x4, x0, x2, xzr madd x0, x0, x3, x5 madd x0, x1, x2, x0 mov x1, x0 mov x0, x4
Which makes sense, since we need x0 (lhs_lo) for the 2nd
madd
. I don't know if this is preferred to the 4 instruction codegen? It may be faster, but is longer?Adding it after the second
madd
(as the third instruction in the sequence) gets us back to a 4 instruction result. But I'm not sure ifmadd
executes in parallel withmul
, I had a quick look at the Software Optimization guide for a few different cores and it doesn't look like it.umulh x4, x0, x2 madd x3, x0, x3, x4 madd x0, x0, x2, xzr madd x1, x1, x2, x3
Let me know which one you prefer
afonso360 edited PR review comment.
afonso360 requested cfallin for a review on PR #2928.
cfallin submitted PR review.
cfallin created PR review comment:
the
C0FFEE
->DECAF
test vector is excellent; thanks :-)
cfallin submitted PR review.
cfallin created PR review comment:
IMHO we should prefer code that keeps the regalloc happy and reduces moves (those can have a measurable impact as code-fetch bandwidth can be a limiting factor in some cases) to small reorderings, at least on modern out-of-order cores that can reason about dependencies and issue the
mul
as soon as possible. I'll defer to @akirilov-arm as to whether it makes sense to try harder here for older in-order microarchitectures though!
cfallin submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Are you testing with a simple function that just does the multiplication and returns the product? If yes, I wouldn't be too bothered by the additional moves because that's an overly simplified test case (and yes, the longer version might still be faster on an in-order core like the Cortex-A55 that I have in mind with my comments), but you can also try another experiment - reverse the order of the last 2 instructions and see what you get.
afonso360 submitted PR review.
afonso360 created PR review comment:
Yeah, these are all for the function that is in tests:
function %mul_i128(i128, i128) -> i128 { block0(v0: i128, v1: i128): v2 = imul v0, v1 return v2 }
but you can also try another experiment - reverse the order of the last 2 instructions and see what you get.
Do you mean the last
mul
andmadd
?For the version with
mul
as the last instruction we get:umulh x4, x0, x2 madd x3, x0, x3, x4 madd x1, x1, x2, x3 madd x0, x0, x2, xzr
For the version with the last
madd
as the last instruction, we get this:umulh x4, x0, x2 madd x3, x0, x3, x4 madd x0, x0, x2, xzr madd x1, x1, x2, x3
akirilov-arm edited PR review comment.
afonso360 edited PR review comment.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Yes, I meant precisely your last experiment and the result is disappointing (the first move is unnecessary at the very least). Anyway, let's go with the current version, i.e.
MUL
in the last position (which still makes it possible to use the dedicated forwarding path between the pair ofMADD
s and to dual-issue), and I (or someone else) can check later what the real performance impact is and if there is some way to avoid the moves.
cfallin merged PR #2928.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
@cfallin Just a quick note - now that the announcement is public, I can actually say that not only current generation, but also some of the next generation cores (i.e. Cortex-A510) are also going to be in-order.
Last updated: Jan 24 2025 at 00:11 UTC