If you have a spare moment could you review some of the fixes I pushed as I was running the SIMD spec tests in wasmtime? The reviews should be rather short (<50 lines): https://github.com/bytecodealliance/cranelift/pulls/abrown.
If anyone is interested in AVX-512 support in Cranelift, I just pushed https://github.com/bytecodealliance/cranelift/pull/1370 to add the ability to encode instructions using the EVEX format.
Before @Benjamin Bouvier comes back online, I am giving anyone interested a heads up of a four PR sequence adding some remaining legalizations for Wasm SIMD instructions. The sequence is: #1765 -> #1820 -> #1821 -> #1822. The blocking issue for me to finish this PRs has been that the spec defines semantics that make it very difficult to lower efficiently on x86. To get around this in the short term, I've added some disabled-by-default flags, assert_no_nans
and assert_in_bounds
, that allow users to reassure Cranelift that it doesn't need to do the extra work. I'm interested in feedback so please take a look if interested!
Tagging some people per @Benjamin Bouvier's comment in https://github.com/bytecodealliance/wasmtime/pull/1820: @Dan Gohman, @Chris Fallin, @Julian Seward
If assert_no_nans is set and a NaN happens, it's undefined behavior?
Yes
So we won't be able to set that for wasm code
(I'm writing up a more full explanation... one sec)
The idea of adding assert_no_nans
(or any other flag that would make runtime guarantees) is a tricky one. With #1820 I added only a first step in that direction by allowing users to assert at runtime that their program will not have NaNs--this allows Cranelift to generate faster code. Of course, this is a hammer approach to a screwdriver problem because it affects, e.g., every f32x4.max in the program. This is bad: 1) the user has to apply the guarantee to the whole program, and 2) the user usually can't make this guarantee since they don't understand all the parts of the code they are running
I thought of this flag as an experimental thing and not the end solution. The end solution would have to involve programmer guarantees through, e.g., new C attributes like __attribute__((no_nans))
which would have to propagate through a Wasm custom section to Cranelift.
So, it could be set for Wasm code... initially for experimental benchmarking with the hammer flag and eventually through a more fine-grained mechanism
Does that explain the idea better?
Ok, so that would be something we'd coordinate with the wasm simd proposal?
Eventually... I would like to try out this experimental flag approach with some benchmarks and see if it is worth it to pursue further
Ah, ok.
If it is and there is support from other people, then we could standardize it; if not, we just remove the flags
So, looking at the code, why is there a band
with an all-ones operand?
Um... it's been a bit since I looked at it but I thought that had to do with quieting NaNs
(or it could be a mistake as I was looking at V8 at the time)
We run with signaling-NaN signaling disabled, so we can probably skip that
We = all Cranelift-using wasm engines
:+1: and can you point me to where that is happening?
No, I mean the FP configuration
...if any
oh, it's the default mxcsr setting
Also, I'm trying to figure out that bxor at the end. Should that be a band_not?
Here's what I referenced in V8: https://github.com/v8/v8/blob/master/src/compiler/backend/x64/code-generator-x64.cc#L3011-L3028
I documented some of the other instructions better than this one...
Need to drop but I'll check back in tomorrow; I'm interested in hearing more opinions on this experimental flag idea... and Dan if you just want to review that PR that would be great!!!
oh, I think the cmpeqps in there isn't generating an all-ones
@Andrew Brown can you say some more about the difficulties in implementing fcvt_to_sint_sat (f32x4 -> i32x4)
efficiently? What sequence do you have to generate?
I can't say I suffered through that sequence much since I followed what v8 does (with one misunderstanding about their use of CMPEQPS). I think the bigger problem is that this and many other instructions have inefficient lowerings on x86 compared to other architectures. I've raised the issues and measured benchmarks and what I have concluded is that to make up for the extra instructions the spec semantics require we need several optimizations (like this one) and something along the lines of the guarantees I proposed in #1820. As noted above the "experimental flag" approach is not the best long-term solution but it can reduce the number of instructions emitted for several of the most inefficient instructions.
Re: the flag, i was trying to think briefly in terms of security behaviors. The only case in which it's fine to set the flag is when you're running Cranelift on your own hardware; otherwise, it seems that setting the flag could open the door for correctness issues in the best case, security vulnerabilities in the worst one. So it might not be applicable to the browser use case, for one thing.
I hear the performance improvements it can bring, and from my memories of the wasm spec in general (around max/min), I understand there might be some nice wins from doing this. I would be tempted to have this if we had a clear understanding of what's the worst that can happen when the flag isn't respected, and what we can do for this.
Or considering alternatives; maybe we want some other low-level CLIR instructions that don't have these checks, and the emitter can choose to emit one or the other, or group the unsafe ones in a code sequence guarded by a check that the inputs are not NaNs. (Mostly equivalent to hoisting the NaN checks in one place)
Could the experiments be done without checking in all the code upstream (in wasmtime master)?
The only case in which it's fine to set the flag is when you're running Cranelift on your own hardware
I agree with what you said about experimental flags, but I have convinced myself that a per-instruction attribute (the better, long-term solution) set by the programmer (not knowing the runtime hardware) would also be just as safe--I mean, the user is already trusting the programmer to write correct code so they can also trust the programmer to make a per-instruction guarantee
I would be tempted to have this if we had a clear understanding of what's the worst that can happen when the flag isn't respected
I don't think I understand this concern: if the user doesn't apply the flag, the code runs slower; if the user applies the flag, they may have unexpected data (if they assumed a NaN was converted to a zero but then was not).
On the alternatives: hoisting doesn't work if the operands of the guaranteed instruction are created inside a loop. And I could just keep all my optimizations local and just test locally but I guess the point of discussing this is to figure out if you all agree that Cranelift should try out this approach, from a philosophical standpoint, first as an experimental flag and possibly later with a per-instruction attribute.
(looks like we are all online if and I would be open to discussing on Zoom for a few minutes if that is quicker)
I must be honest and say I am not in favour of assert_no_nans
. The effect of it is to make the meaning of CLIR dependent on some external context, where it wasn't before. "Under-the-table" semantics, I think of it as.
As Dan says, we can't use it for translating wasm. What would perhaps be better is to wait for v2 of the wasm simd spec, which Lars tells me might have some variants of existing insns that have cheaper to implement on Intel. Then we'd have a cheaper implementation available exactly for those insns which are currently expensive, without assert_no_nans
, and without creating the possibility of tagging arbitrary cranelift insns with a "cheap semantics please" flag in cases where that has no meaning. Specifically, we could add cheap-variant CLIR ops for exactly those which are currently problematic, and for no others.
I don't have much confidence in v2 of the spec... my experience so far has been that it is very difficult raise these types of lowering issues and get some type of resolution for x86 (hence looking at "under-the-table" semantics :slight_smile:). I agree that a solution at the spec level would be best but, in the absence of that, this approach seemed second-best and perhaps even a way to influence the spec, e.g. "if this flag is necessary to achieve near-native performance, why don't we just add some Wasm SIMD instructions?"
As Dan says, we can't use it for translating Wasm
I think you can in certain scenarios (e.g. performance experiments) but by default this flag would be off; users would have to know what they're doing before using something like this.
Ok, to resolve this discussion: I closed #1820 (which had guarantee flags) and re-opened it without flags as https://github.com/bytecodealliance/wasmtime/pull/1876.
I tagged @Julian Seward as the reviewer since you looked at the last one--thanks for the reviews!
Last updated: Jan 24 2025 at 00:11 UTC