adambratschikaye opened PR #8313 from adambratschikaye:abk/no-branch-nan-canonicalization to bytecodealliance:main:
Modify the cranelift pass that performs NaN-canonicalization to avoid branches on x64. The current implementation uses two branches:
8: be 00 00 c0 7f mov esi, 0x7fc00000 d: c5 f9 6e de vmovd xmm3, esi 11: 0f 2e c0 ucomiss xmm0, xmm0 14: 0f 8b 04 00 00 00 jnp 0x1e <wasm[0]::function[0]+0x1e> 1a: f2 0f 10 c3 movsd xmm0, xmm3 # xmm0 = xmm3[0],xmm0[1] 1e: 0f 84 04 00 00 00 je 0x28 <wasm[0]::function[0]+0x28> 24: f2 0f 10 c3 movsd xmm0, xmm3 # xmm0 = xmm3[0],xmm0[1]With these changes, NaN-canonicalization becomes:
8: c5 e8 c2 da 03 vcmpunordps xmm3, xmm2, xmm2 d: be 00 00 c0 7f mov esi, 0x7fc00000 12: c5 f9 6e e6 vmovd xmm4, esi 16: c4 e3 69 4c c4 30 vpblendvb xmm0, xmm2, xmm4, xmm3Running both versions against an small image classification inference benchmark here resulted in a ~50% improvement:
image_classification/opt-level=0 time: [728.16 ms 730.00 ms 732.05 ms] change: [-44.476% -44.251% -44.029%] (p = 0.00 < 0.05) Performance has improved. image_classification/opt-level=1 time: [593.90 ms 595.51 ms 597.34 ms] change: [-51.561% -51.396% -51.211%] (p = 0.00 < 0.05) Performance has improved.As a side note, I didn't notice any sightglass benchmark that was performing mainly float arithmetic to test against. I'd be happy to add this image classification case if there's interest.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Without any of the other changes, just changing this comparison from
NotEqualtoUnorderedremoves one of the two jumps which is a significant improvement. With the other changes I don't think there's a difference between usingNotEqualorUnordered, butUnorderedseemed more precise.
adambratschikaye created PR review comment:
The original pattern was never triggered because
fcmpwill result in either anI32X4orI64X2which always needs to be bitcast back toF32X4orF64X2before it can be passed tobitselect.
adambratschikaye updated PR #8313.
adambratschikaye edited PR review comment.
adambratschikaye has marked PR #8313 as ready for review.
adambratschikaye requested wasmtime-compiler-reviewers for a review on PR #8313.
adambratschikaye requested abrown for a review on PR #8313.
adambratschikaye commented on PR #8313:
Also is there a way to enable NaN-canonicalization in a
cliftest to add a test for this?
afonso360 commented on PR #8313:
You should be able to add something along these lines, to test with nan canonicalization enabled:
test {run,compile,etc...} set enable_nan_canonicalization=true target x86_64
afonso360 submitted PR review.
afonso360 created PR review comment:
I'm not sure this works with the RISC-V backend, our vector support is gated on a feature flag which may not always be available.
One solution would be to keep the previous code around for RISC-V, but it feels wrong to select on targets at this level.
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Ah I see that now. What about choosing based on the targets and features (that is specifically
riscv64whenhas_v=false)? Maybe in the future there will be other target/feature combinations that don't support vectors?
afonso360 submitted PR review.
afonso360 created PR review comment:
I think that is what we need right now. I'm not sure if eventually we should add something to
TargetIsa, but It might be worth reviewing later on if this situation ends up happening more often.
adambratschikaye updated PR #8313.
adambratschikaye submitted PR review.
adambratschikaye created PR review comment:
Added the fall back to the old version for RiscV here.
afonso360 submitted PR review:
This LGTM! Thanks! I don't know if @abrown wants to review it as well.
abrown submitted PR review.
abrown merged PR #8313.
This does the same lowering on arm64 and s390x. Does it actually benefit performance there?
adambratschikaye commented on PR #8313:
I'm not sure. At the moment I don't have a setup to test on arm or s390x.
Last updated: Dec 06 2025 at 06:05 UTC