jameysharp opened PR #8653 from jameysharp:opt-multi3
to bytecodealliance:main
:
LLVM's
__multi3
function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply.This is not yet done but illustrates how part of the optimization can work, at least.
Currently, the lower half of the result is optimized into a single
imul
instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later.Fixes #4077
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
github-actions[bot] commented on PR #8653:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "isle"Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
scottmcm commented on PR #8653:
I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger
mul
s back to smallermul
s andmulhi
s. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.
jameysharp commented on PR #8653:
I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger
mul
s back to smallermul
s andmulhi
s. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.Yeah, this definitely felt similar to that, so I placed my new rules next to those
mulhi
rules. I don't see a way to simplify these rules in terms of those, though. The hard part is recognizing that four multiplies, two adds with overflow checks, and some other instructions, all together are equivalent to a singlemulhi
. I suspect that by the time we can recognize all those pieces, it's easier to just emit theumulhi
instruction than it is to emit a coupleuextend
s, animul
, and anireduce
. But if you can see a way to do it more simply I'd love that!Thinking about this comment, though, led me to realize that once we can do that, ISLE is capable enough today to transform a sequence like this:
i = umulhi a_lo, b_lo j = imul a_lo, b_hi k = imul a_hi, b_lo l = imul a_hi, b_hi iadd i, j, k, l
into something like this:
a = iconcat a_hi, a_lo b = iconcat b_hi, b_lo x = imul a, b y = ushr x, width/2 ireduce y
If our mid-end rules supported producing instructions with multiple results, the
ushr
/ireduce
pair would be better asisplit
, sinceisplit
andiconcat
produce no instructions for 128-bit values.On x86 at least, an
imul.i128
currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.
afonso360 commented on PR #8653:
On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.
It might not be an improvement by itself, but if we recognize some other i128 ops we can probably start applying some optimization rules at the i128 level which would be really neat. For example, we could now recognize
(imul.i128 x (iconst 2))
and turn it into a(iadd.i128 x x)
which would be hard to do with the other sequences.
jameysharp commented on PR #8653:
Good point! Except we currently can't deal with 128-bit constants, so rules like
imul x, 2
won't actually fire. We should figure out how to fix that too :sweat_smile:The other thing we can't currently do is notice that this sequence has a redundant multiply:
a = iconcat a_hi, a_lo b = iconcat b_hi, b_lo x = imul a, b y = imul a_lo, b_lo
scottmcm commented on PR #8653:
Since there's no 128-bit
iconst
it'simul.i128 x (uextend i128 (iconst 2))
, but with https://github.com/bytecodealliance/wasmtime/pull/7693/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01R92 and https://github.com/bytecodealliance/wasmtime/pull/7670/files#diff-bf1f1ea991da38588a1785fc7012072e7a685411ea6a262c1fab18379411ef50R94 hopefully it'll already work :fingers_crossed:
scottmcm edited a comment on PR #8653:
Since there's no 128-bit
iconst
it'simul.i128 x (uextend i128 (iconst 2))
, but with https://github.com/bytecodealliance/wasmtime/pull/7693/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01R92 and https://github.com/bytecodealliance/wasmtime/pull/7670/files#diff-bf1f1ea991da38588a1785fc7012072e7a685411ea6a262c1fab18379411ef50R94 hopefully it actually already works :fingers_crossed:
jameysharp commented on PR #8653:
Since there's no 128-bit
iconst
it'simul.i128 x (uextend i128 (iconst 2))
, but with ... hopefully it'll already work :fingers_crossed:Ooh, I hadn't looked at those carefully. That's awesome that
iconst_u
special cases i128! However I see that it doesn't do that for the extractor, only the constructor. Right now it would need an external extractor to match an optionaluextend
, which is yet more motivation for something like or-patterns (#6128 or #8599).
scottmcm commented on PR #8653:
For extend we don't need to wait on
or
patterns, because it could usewhich is a special extractor added in #7710 for exactly the "there might be an
extend
" problem :)(Though I'd love to have or patterns to stop needing custom extractors for things like this!)
scottmcm edited a comment on PR #8653:
For extend we don't need to wait on
or
patterns, because it could usewhich is a special extractor added in #7710 for exactly the "there might be an
extend
" problem :)(Though I'd love to have or patterns to stop needing custom extractors for things like this!)
EDIT: landed in #8686
Last updated: Nov 22 2024 at 17:03 UTC