Stream: git-wasmtime

Topic: wasmtime / PR #8653 cranelift: Optimize __multi3-style mu...


view this post on Zulip Wasmtime GitHub notifications bot (May 19 2024 at 21:17):

jameysharp opened PR #8653 from jameysharp:opt-multi3 to bytecodealliance:main:

LLVM's __multi3 function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply.

This is not yet done but illustrates how part of the optimization can work, at least.

Currently, the lower half of the result is optimized into a single imul instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later.

Fixes #4077

<!--
Please make sure you include the following information:

Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.html

Please ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->

view this post on Zulip Wasmtime GitHub notifications bot (May 19 2024 at 23:44):

github-actions[bot] commented on PR #8653:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 04:14):

scottmcm commented on PR #8653:

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 06:36):

jameysharp commented on PR #8653:

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

Yeah, this definitely felt similar to that, so I placed my new rules next to those mulhi rules. I don't see a way to simplify these rules in terms of those, though. The hard part is recognizing that four multiplies, two adds with overflow checks, and some other instructions, all together are equivalent to a single mulhi. I suspect that by the time we can recognize all those pieces, it's easier to just emit the umulhi instruction than it is to emit a couple uextends, an imul, and an ireduce. But if you can see a way to do it more simply I'd love that!

Thinking about this comment, though, led me to realize that once we can do that, ISLE is capable enough today to transform a sequence like this:

i = umulhi a_lo, b_lo
j = imul a_lo, b_hi
k = imul a_hi, b_lo
l = imul a_hi, b_hi
iadd i, j, k, l

into something like this:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = ushr x, width/2
ireduce y

If our mid-end rules supported producing instructions with multiple results, the ushr/ireduce pair would be better as isplit, since isplit and iconcat produce no instructions for 128-bit values.

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 08:21):

afonso360 commented on PR #8653:

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

It might not be an improvement by itself, but if we recognize some other i128 ops we can probably start applying some optimization rules at the i128 level which would be really neat. For example, we could now recognize (imul.i128 x (iconst 2)) and turn it into a (iadd.i128 x x) which would be hard to do with the other sequences.

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 15:32):

jameysharp commented on PR #8653:

Good point! Except we currently can't deal with 128-bit constants, so rules like imul x, 2 won't actually fire. We should figure out how to fix that too :sweat_smile:

The other thing we can't currently do is notice that this sequence has a redundant multiply:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = imul a_lo, b_lo

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 15:35):

scottmcm commented on PR #8653:

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with https://github.com/bytecodealliance/wasmtime/pull/7693/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01R92 and https://github.com/bytecodealliance/wasmtime/pull/7670/files#diff-bf1f1ea991da38588a1785fc7012072e7a685411ea6a262c1fab18379411ef50R94 hopefully it'll already work :fingers_crossed:

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 16:38):

scottmcm edited a comment on PR #8653:

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with https://github.com/bytecodealliance/wasmtime/pull/7693/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01R92 and https://github.com/bytecodealliance/wasmtime/pull/7670/files#diff-bf1f1ea991da38588a1785fc7012072e7a685411ea6a262c1fab18379411ef50R94 hopefully it actually already works :fingers_crossed:

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 18:00):

jameysharp commented on PR #8653:

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with ... hopefully it'll already work :fingers_crossed:

Ooh, I hadn't looked at those carefully. That's awesome that iconst_u special cases i128! However I see that it doesn't do that for the extractor, only the constructor. Right now it would need an external extractor to match an optional uextend, which is yet more motivation for something like or-patterns (#6128 or #8599).

view this post on Zulip Wasmtime GitHub notifications bot (May 22 2024 at 18:11):

scottmcm commented on PR #8653:

For extend we don't need to wait on or patterns, because it could use

https://github.com/bytecodealliance/wasmtime/blob/f1fe2afffa9430b4a7765459a462cb32a4d81de8/cranelift/codegen/src/prelude_opt.isle#L113-L116

which is a special extractor added in #7710 for exactly the "there might be an extend" problem :)

(Though I'd love to have or patterns to stop needing custom extractors for things like this!)

view this post on Zulip Wasmtime GitHub notifications bot (May 23 2024 at 15:17):

scottmcm edited a comment on PR #8653:

For extend we don't need to wait on or patterns, because it could use

https://github.com/bytecodealliance/wasmtime/blob/f1fe2afffa9430b4a7765459a462cb32a4d81de8/cranelift/codegen/src/prelude_opt.isle#L113-L116

which is a special extractor added in #7710 for exactly the "there might be an extend" problem :)

(Though I'd love to have or patterns to stop needing custom extractors for things like this!)

EDIT: landed in #8686


Last updated: Nov 22 2024 at 17:03 UTC