wasmtime / PR #2928 Implement iadd,isub,imul for i128 in ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #2928 Implement iadd,isub,imul for i128 in ...

Wasmtime GitHub notifications bot (May 23 2021 at 17:51):

afonso360 opened PR #2928 from aarch64-i128-ops to main:

This PR implements some basic operations (add, sub, mul) for i128 types on AArch64

Wasmtime GitHub notifications bot (May 23 2021 at 23:15):

Could we include this test in a separate PR for the iconst issue? Or alternately just update the iconst implementation here (it should only be a few lines of code, I think!).

Wasmtime GitHub notifications bot (May 23 2021 at 23:15):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 23 2021 at 23:15):

cfallin created PR review comment:

A few more test vectors would be good to have here, I think -- e.g. the upper 64 bits of the first operand are always zero here. Want to make sure that all four of the half-product quadrants are computed properly :-)

Wasmtime GitHub notifications bot (May 23 2021 at 23:15):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 23 2021 at 23:15):

cfallin created PR review comment:

I think we could get away with not using a tmp here, no? We could umulh directly into dst_hi and then use madd to add the other terms; that's slightly better from an allocate-fewer-vregs (compiler performance) perspective.

Wasmtime GitHub notifications bot (May 24 2021 at 13:33):

akirilov-arm created PR review comment:

This should be the second instruction in the sequence, since it can execute in parallel with UMULH.

Wasmtime GitHub notifications bot (May 24 2021 at 13:33):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 15:34):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 15:34):

afonso360 created PR review comment:

Yeah, I think I prefer including this in the PR for the iconst.

I think changing it here involves changing a bit more than just the emission for Iconst. get_constant only returns a u64 and get_immediate doesn't support 128bit values yet.

Wasmtime GitHub notifications bot (May 24 2021 at 16:10):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 16:10):

cfallin created PR review comment:

Ah, yes, you're right, we need more infrastructure for that; sorry! Happy to have this whole test deferred then.

Wasmtime GitHub notifications bot (May 24 2021 at 17:25):

afonso360 updated PR #2928 from aarch64-i128-ops to main.

Wasmtime GitHub notifications bot (May 24 2021 at 17:27):

afonso360 created PR review comment:

I added a few more tests to all three functions, but if you have any particular test case in mind, let me know.

Wasmtime GitHub notifications bot (May 24 2021 at 17:27):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 17:36):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 17:36):

afonso360 created PR review comment:

It looks like when we do this, we do get a few extra mov's in the emitted code.

When placing the mul as the first or second instruction in the sequence we get the following code:
umulh x5, x0, x2
madd x4, x0, x2, xzr
madd x0, x0, x3, x5
madd x0, x1, x2, x0
mov x1, x0
mov x0, x4
Which makes sense, since we need x0 (lhs_lo) for the 2nd madd. I don't know if this is preferred to the 4 instruction codegen? It may be faster, but is longer?

Adding it after the second madd (as the third instruction in the sequence) gets us back to a 4 instruction result. But I'm not sure if madd executes in parallel with mul, I had a quick look at the Software Optimization guide for a few different cores and it doesn't look like it.
umulh x4, x0, x2
madd x3, x0, x3, x4
madd x0, x0, x2, xzr
madd x1, x1, x2, x3
Let me know which one you prefer

Wasmtime GitHub notifications bot (May 24 2021 at 17:37):

afonso360 edited PR review comment.

Wasmtime GitHub notifications bot (May 24 2021 at 17:40):

afonso360 requested cfallin for a review on PR #2928.

Wasmtime GitHub notifications bot (May 24 2021 at 17:50):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 17:50):

cfallin created PR review comment:

the C0FFEE -> DECAF test vector is excellent; thanks :-)

Wasmtime GitHub notifications bot (May 24 2021 at 17:54):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 17:54):

cfallin created PR review comment:

IMHO we should prefer code that keeps the regalloc happy and reduces moves (those can have a measurable impact as code-fetch bandwidth can be a limiting factor in some cases) to small reorderings, at least on modern out-of-order cores that can reason about dependencies and issue the mul as soon as possible. I'll defer to @akirilov-arm as to whether it makes sense to try harder here for older in-order microarchitectures though!

Wasmtime GitHub notifications bot (May 24 2021 at 17:55):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 18:05):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 18:05):

akirilov-arm created PR review comment:

Are you testing with a simple function that just does the multiplication and returns the product? If yes, I wouldn't be too bothered by the additional moves because that's an overly simplified test case (and yes, the longer version might still be faster on an in-order core like the Cortex-A55 that I have in mind with my comments), but you can also try another experiment - reverse the order of the last 2 instructions and see what you get.

Wasmtime GitHub notifications bot (May 24 2021 at 18:19):

afonso360 submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 18:19):

afonso360 created PR review comment:

Yeah, these are all for the function that is in tests:
function %mul_i128(i128, i128) -> i128 {
block0(v0: i128, v1: i128):
    v2 = imul v0, v1
    return v2
}
but you can also try another experiment - reverse the order of the last 2 instructions and see what you get.

Do you mean the last mul and madd?

For the version with mul as the last instruction we get:
umulh x4, x0, x2
madd x3, x0, x3, x4
madd x1, x1, x2, x3
madd x0, x0, x2, xzr
For the version with the last madd as the last instruction, we get this:
umulh x4, x0, x2
madd x3, x0, x3, x4
madd x0, x0, x2, xzr
madd x1, x1, x2, x3

Wasmtime GitHub notifications bot (May 24 2021 at 18:23):

akirilov-arm edited PR review comment.

Wasmtime GitHub notifications bot (May 24 2021 at 18:31):

afonso360 edited PR review comment.

Wasmtime GitHub notifications bot (May 24 2021 at 19:17):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 19:17):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (May 24 2021 at 19:17):

akirilov-arm created PR review comment:

Yes, I meant precisely your last experiment and the result is disappointing (the first move is unnecessary at the very least). Anyway, let's go with the current version, i.e. MUL in the last position (which still makes it possible to use the dedicated forwarding path between the pair of MADDs and to dual-issue), and I (or someone else) can check later what the real performance impact is and if there is some way to avoid the moves.

Wasmtime GitHub notifications bot (May 24 2021 at 20:27):

cfallin merged PR #2928.

Wasmtime GitHub notifications bot (May 25 2021 at 16:23):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (May 25 2021 at 16:23):

akirilov-arm created PR review comment:

@cfallin Just a quick note - now that the announcement is public, I can actually say that not only current generation, but also some of the next generation cores (i.e. Cortex-A510) are also going to be in-order.

Last updated: Apr 18 2025 at 10:03 UTC