Stream: git-wasmtime

Topic: wasmtime / issue #4106 Try to optimize out `__multi3`


view this post on Zulip Wasmtime GitHub notifications bot (May 07 2022 at 02:34):

shamatar commented on issue #4106:

Now it can match the full body of __multi3 by logical sequence (with allowed variations over parameters in symmetrical opcodes) and replaces it with "imul" over I128 type

view this post on Zulip Wasmtime GitHub notifications bot (Feb 09 2023 at 00:25):

cfallin commented on issue #4106:

@shamatar I'm going through old open PRs and found this; at this point we now have a new mid-end optimization framework that allows rewrites to be written in a more pattern-matchy way. Would you be willing to try reworking your pattern-matching using that? (No worries if not, we can close this if we're not actively pursuing the optimization any more, or have a tracking issue if we are)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 09 2023 at 13:34):

shamatar commented on issue #4106:

Hey @cfallin

I can try to change to another matcher, but to be able to efficiently use this optimization it's necessary to also be able to inline this function call after internal optimization. Was it added during the time I opened a PR?

view this post on Zulip Wasmtime GitHub notifications bot (Feb 09 2023 at 16:49):

jameysharp commented on issue #4106:

No, we don't have inlining yet, and yes that would help a lot. But we're finding there are a lot of opportunities for peephole optimizations. If we could reduce the body of __multi3 to a single CLIF imul.i128 instruction, that might still make a big difference, even with call overhead and missing opportunities from some inputs always being zero. If we can do that using more general peephole optimizations, then we can improve performance of other code besides __multi3, making the effort even more worth-while.

I was recently investigating performance improvements for big-integer math. I added https://github.com/bytecodealliance/sightglass/tree/main/benchmarks/blind-sig as a small benchmark that we can study. And indeed, __multi3 was a huge hotspot there. So I'd love to explore how people can get better performance from this stuff.

One thing I learned is that the u64_digit feature of the num-bigint-dig crate is quite bad for WebAssembly targets. That feature makes the "limbs" of a big integer 64-bits wide, which means it needs 128-bit integers for math on intermediate results, which leads to calling __multi3. Disabling that feature makes limbs 32-bit and intermediate products 64-bit, which means all values fit in native wasm types, and LLVM stops using __multi3 to simulate wider types. I saw something like 8-15% performance improvements from that change alone in my little benchmark.

Other ways to eliminate this function specifically might be worth investigating. Inlining __multi3 in the wasm binary could be an option, either by changing LLVM to not do a libcall here, or by making some specialized wasm inliner using something like binaryen. In either case we wouldn't be doing the inlining in Cranelift, but encouraging developers to produce more efficient wasm in the first place.

In short: I think there's a lot we can improve even without making Cranelift support inlining, and this PR is a good start which I'd love to get finished up.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 16 2023 at 21:47):

shamatar commented on issue #4106:

As practice shows integer math in WASM is still better to be u32 words based one (recent ZPrize works). It'll be nice to have an option to match our __mul subroutine, but inlining is very desirable


Last updated: Oct 23 2024 at 20:03 UTC