jlb6740 opened PR #3517 from fix-issue-3327
to main
:
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
jlb6740 requested alexcrichton for a review on PR #3517.
jlb6740 requested cfallin for a review on PR #3517.
abrown submitted PR review.
abrown created PR review comment:
ctx.emit(Inst::equals(types::I32X4, RegMem::from(tmp), tmp));
This change would avoid the extra instruction and might fix the NaN comparison issue.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I believe this would fix the issue, yes, because this would avoid doing a floating-point-compare which runs the risk of doing NaN comparisons. I don't know about the performance implications of this, though, and whether it's better to do or not.
Otherwise though I think that this
xor
is probably worth both a comment and a test, or otherwise somehow sharing code with the other spot inf32x4.abs
which generates an all-ones mask.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@alexcrichton @abrown The xorps isn't a new idea. It's what we do in a couple of places. For example here is the sequence that prompted the bug:
(module (func (result v128) v128.const f32x4 0 0 0 0 f32x4.abs v128.not) (export "1" (func 0)) )
0000000000000000 <_wasm_function_0>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: f3 0f 6f 05 24 00 00 movdqu 0x24(%rip),%xmm0 # 30 <_wasm_function_0+0x30> b: 00 c: 0f 57 c9 xorps %xmm1,%xmm1 f: 0f c2 c9 00 cmpeqps %xmm1,%xmm1 13: 66 0f 72 d1 01 psrld $0x1,%xmm1 18: 0f 54 c1 andps %xmm1,%xmm0 1b: 0f c2 c9 00 cmpeqps %xmm1,%xmm1 1f: 0f 57 c1 xorps %xmm1,%xmm0 22: 48 89 ec mov %rbp,%rsp 25: 5d pop %rbp 26: c3 retq
The instruction at address "c" does an xorps instead of an integer compare and we do this because of the types. Is it really a good idea to override the types used? Should we override types for the abs as well?
jlb6740 edited PR review comment.
jlb6740 edited PR review comment.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Ah yeah I don't know myself which is better. I agree that both should do the same thing, but whether we do an integer compare then move that to floating-point-typed instructions or instead to xor/cmp both with a floating point type I dunno.
FWIW for
f32x4.abs
LLVM generates this code for x86_64:.LCPI0_0: .long 0x7fffffff # float NaN .long 0x7fffffff # float NaN .long 0x7fffffff # float NaN .long 0x7fffffff # float NaN foo: # @foo andps xmm0, xmmword ptr [rip + .LCPI0_0] ret
and the rough
v128.not
equiavlent for LLVM seems to be:foo: # @foo pcmpeqd xmm1, xmm1 pxor xmm0, xmm1 ret
Although this is in isolation where it's not doing other float ops around there so this isn't necessarily definitive I think.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
Thanks @alexcrichton So looks like LLVM is using integer compare too. I've made that update. I didn't touch abs because I think that is a separate patch but looks like that could be more efficient as well.
abrown submitted PR review.
abrown created PR review comment:
I agree that from a Cranelift perspective it is a bit odd to use the
types::I32X4
since it may have nothing to do with the types we are lowering here, but I think that is a refactoring problem. We actually don't want to be writingInst::equals(...)
here (and in other places)--we want to writeconstruct_all_ones(...)
since that is actually what we are trying to do.construct_all_ones
does not exist and I can't remember if there was some other helper that could do this type of thing (and the inverse, construct all zeroes, as well).Inside such a helper, I think the best x86 code to emit is probably
PCMPEQD
(the output ofInst::equals(types::I32X4, ...
) since that fixes the NaN issue and avoids an extra instruction. I checked to see how much the extra instruction would affect things and my conclusion is that 1) in tight loops the extra instruction could be a factor and 2) the int-to-FP domain switching is not an issue.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
I commented to @abrown offline .. but my concern was also about mixing integer and float instructions for SIMD. In real codes (not this fuzzing example) they'll likely be other floating point SIMD instructions operating on these registers and it's supposed to be a bad for performance to mix between integer and floating point operations. Not sure where in the manuals but it's mentioned here in this stack overflow on some older system: https://stackoverflow.com/questions/4996384/do-i-get-a-performance-penalty-when-mixing-sse-integer-float-simd-instructions . None-the-less I pushed the patch to hardcode the type.
jlb6740 edited PR review comment.
jlb6740 submitted PR review.
jlb6740 created PR review comment:
@abrown .. Yes construct all ones is exactly what we are inlining here.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
abrown submitted PR review.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
jlb6740 updated PR #3517 from fix-issue-3327
to main
.
jlb6740 merged PR #3517.
Last updated: Dec 23 2024 at 12:05 UTC