shamatar commented on issue #4106:
Now it can match the full body of `__multi3` as a logical sequence (allowing operand-order variations in commutative opcodes) and replace it with an `imul` over the I128 type.
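For reference, here is a minimal Rust sketch (illustrative only; not the compiler-rt source, and the helper names are made up) of the computation `__multi3` performs over 64-bit words, i.e. the shape of the body the matcher has to recognize:

```rust
// 64x64 -> 128 widening multiply built from 32-bit partial products,
// since wasm has no widening multiply instruction.
fn mul_64x64_to_128(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a & 0xffff_ffff, a >> 32);
    let (b_lo, b_hi) = (b & 0xffff_ffff, b >> 32);

    let ll = a_lo * b_lo;
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;

    // Column at bit 32, including the carry out of the low word.
    let mid = (ll >> 32) + (lh & 0xffff_ffff) + (hl & 0xffff_ffff);
    let lo = (ll & 0xffff_ffff) | (mid << 32);
    let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    (lo, hi)
}

// __multi3 semantics over (lo, hi) pairs of 64-bit words, which is how a
// 128-bit value is represented when wasm has no native i128 type.
fn multi3(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (lo, hi) = mul_64x64_to_128(a.0, b.0);
    // Cross terms only affect the high word (mod 2^128).
    let hi = hi
        .wrapping_add(a.0.wrapping_mul(b.1))
        .wrapping_add(a.1.wrapping_mul(b.0));
    (lo, hi)
}
```

The rewrite replaces all of this with a single imul over I128, which the backend can then lower using the target's native 64-bit multiplies.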
cfallin commented on issue #4106:
@shamatar I'm going through old open PRs and found this; at this point we now have a new mid-end optimization framework that allows rewrites to be written in a more pattern-matchy way. Would you be willing to try reworking your pattern-matching using that? (No worries if not, we can close this if we're not actively pursuing the optimization any more, or have a tracking issue if we are)
shamatar commented on issue #4106:
Hey @cfallin
I can try switching to the new matcher, but to use this optimization effectively it's also necessary to be able to inline the function call after the internal rewrite. Has inlining been added since I opened the PR?
jameysharp commented on issue #4106:
No, we don't have inlining yet, and yes, that would help a lot. But we're finding there are a lot of opportunities for peephole optimizations. If we could reduce the body of `__multi3` to a single CLIF `imul.i128` instruction, that might still make a big difference, even with call overhead and missed opportunities from some inputs always being zero. If we can do that using more general peephole optimizations, then we can improve performance of other code besides `__multi3`, making the effort even more worthwhile.

I was recently investigating performance improvements for big-integer math. I added https://github.com/bytecodealliance/sightglass/tree/main/benchmarks/blind-sig as a small benchmark that we can study. And indeed, `__multi3` was a huge hotspot there. So I'd love to explore how people can get better performance from this stuff.

One thing I learned is that the `u64_digit` feature of the num-bigint-dig crate is quite bad for WebAssembly targets. That feature makes the "limbs" of a big integer 64 bits wide, which means it needs 128-bit integers for math on intermediate results, which leads to calling `__multi3`. Disabling that feature makes limbs 32-bit and intermediate products 64-bit, which means all values fit in native wasm types, and LLVM stops using `__multi3` to simulate wider types. I saw something like 8-15% performance improvements from that change alone in my little benchmark.

Other ways to eliminate this function specifically might be worth investigating. Inlining `__multi3` in the wasm binary could be an option, either by changing LLVM to not emit a libcall here, or by making some specialized wasm inliner using something like binaryen. In either case we wouldn't be doing the inlining in Cranelift, but rather encouraging developers to produce more efficient wasm in the first place.

In short: I think there's a lot we can improve even without making Cranelift support inlining, and this PR is a good start which I'd love to get finished up.
shamatar commented on issue #4106:
As practice shows (e.g. recent ZPrize work), integer math in WASM is still better done with u32 words. It would be nice to have an option to match our __mul subroutine, but inlining is very desirable.