adambratschikaye opened PR #8313 from adambratschikaye:abk/no-branch-nan-canonicalization
to bytecodealliance:main
:
Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches:
8: be 00 00 c0 7f mov esi, 0x7fc00000 d: c5 f9 6e de vmovd xmm3, esi 11: 0f 2e c0 ucomiss xmm0, xmm0 14: 0f 8b 04 00 00 00 jnp 0x1e <wasm[0]::function[0]+0x1e> 1a: f2 0f 10 c3 movsd xmm0, xmm3 # xmm0 = xmm3[0],xmm0[1] 1e: 0f 84 04 00 00 00 je 0x28 <wasm[0]::function[0]+0x28> 24: f2 0f 10 c3 movsd xmm0, xmm3 # xmm0 = xmm3[0],xmm0[1]
With these changes, NaN-canonicalization becomes:
8: c5 e8 c2 da 03 vcmpunordps xmm3, xmm2, xmm2 d: be 00 00 c0 7f mov esi, 0x7fc00000 12: c5 f9 6e e6 vmovd xmm4, esi 16: c4 e3 69 4c c4 30 vpblendvb xmm0, xmm2, xmm4, xmm3
Running both versions against an small image classification inference benchmark here resulted in a ~50% improvement:
image_classification/opt-level=0 time: [728.16 ms 730.00 ms 732.05 ms] change: [-44.476% -44.251% -44.029%] (p = 0.00 < 0.05) Performance has improved. image_classification/opt-level=1 time: [593.90 ms 595.51 ms 597.34 ms] change: [-51.561% -51.396% -51.211%] (p = 0.00 < 0.05) Performance has improved.
As a side note, I didn't notice any sightglass benchmark that was performing mainly float arithmetic to test against. I'd be happy to add this image classification case if there's interest.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Without any of the other changes, just changing this comparison from
NotEqual
toUnordered
removes one of the two jumps which is a significant improvement. With the other changes I don't think there's a difference between usingNotEqual
orUnordered
, butUnordered
seemed more precise.
adambratschikaye created PR review comment:
The original pattern was never triggered because
fcmp
will result in either anI32X4
orI64X2
which always needs to be bitcast back toF32X4
orF64X2
before it can be passed tobitselect
.
adambratschikaye updated PR #8313.
adambratschikaye edited PR review comment.
adambratschikaye has marked PR #8313 as ready for review.
adambratschikaye requested wasmtime-compiler-reviewers for a review on PR #8313.
adambratschikaye requested abrown for a review on PR #8313.
adambratschikaye commented on PR #8313:
Also is there a way to enable NaN-canonicalization in a
clif
test to add a test for this?
afonso360 commented on PR #8313:
You should be able to add something along these lines, to test with nan canonicalization enabled:
test {run,compile,etc...} set enable_nan_canonicalization=true target x86_64
afonso360 submitted PR review.
afonso360 created PR review comment:
I'm not sure this works with the RISC-V backend, our vector support is gated on a feature flag which may not always be available.
One solution would be to keep the previous code around for RISC-V, but it feels wrong to select on targets at this level.
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Ah I see that now. What about choosing based on the targets and features (that is specifically
riscv64
whenhas_v=false
)? Maybe in the future there will be other target/feature combinations that don't support vectors?
afonso360 submitted PR review.
afonso360 created PR review comment:
I think that is what we need right now. I'm not sure if eventually we should add something to
TargetIsa
, but It might be worth reviewing later on if this situation ends up happening more often.
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Added the fall back to the old version for RiscV here.
afonso360 submitted PR review:
This LGTM! Thanks! I don't know if @abrown wants to review it as well.
abrown submitted PR review.
abrown merged PR #8313.
Last updated: Jan 24 2025 at 00:11 UTC