Kmeakin commented on issue #6135:
Didn't realize when I made this branch, but this fixes https://github.com/bytecodealliance/wasmtime/issues/6129!
Kmeakin commented on issue #6135:
> fmin and fmax are indeed commutative, because we've specifically defined away the weird x86-specific behavior that privileges the first operand in some NaN cases. (This "canonicalization" is, incidentally, what some of the opcodes in the Relaxed-SIMD Wasm proposal are looking to, well, relax; but those will be different opcodes.)

Thanks. Are fmin_pseudo and fmax_pseudo commutative as well then?
Kmeakin commented on issue #6135:
> We plan to remove all the non-egraph parts sometime soon, so it isn't really worth making this change here. Instead it should happen somewhere in the vicinity of insert_pure_enode in cranelift/codegen/src/egraph.rs. The most similar place to what you've done here would be NewOrExistingInst::get_inst_key.

I'll add an egraph implementation in another commit ASAP.
cfallin commented on issue #6135:
> Thanks. Are fmin_pseudo and fmax_pseudo commutative as well then?

No, unfortunately not (or at least, not on all CPUs).
jameysharp commented on issue #6135:
I was curious whether fmin_pseudo and fmax_pseudo are commutative as well. I think the answer is a well-defined "no": if either argument is NaN, or both arguments are ±0, the result is defined to always be the first argument. So swapping the arguments can produce a different NaN or a differently-signed zero.

The WebAssembly spec for fmin/fmax nearly guarantees that the results will be the same if the arguments are swapped, except that if both arguments are NaN, it's not specified which one will be returned. But that unspecified case means we can choose whichever one we want while remaining conformant, so it's still okay to treat it as commutative. That is a little weird for our "no undefined behavior" policy in Cranelift, though…

Also, today I learned that signed integer min/max produces the WebAssembly-specified result for fmin/fmax as long as neither argument is NaN. Neat.
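The non-commutativity described above can be checked with a small standalone Rust program. This is illustrative only: the `fmin_pseudo` function below models the "first argument wins in the NaN and ±0 cases" behavior described in the comment, not the actual Cranelift implementation.

```rust
// Model of the pseudo-min semantics discussed above:
// return b if b < a, otherwise return a. Any comparison involving
// NaN is false, and -0.0 < 0.0 is false, so in both of those cases
// the first argument is returned unchanged.
fn fmin_pseudo(a: f64, b: f64) -> f64 {
    if b < a { b } else { a }
}

fn main() {
    // NaN case: swapping the arguments changes the result.
    assert!(fmin_pseudo(f64::NAN, 1.0).is_nan());
    assert_eq!(fmin_pseudo(1.0, f64::NAN), 1.0);

    // Signed-zero case: swapping the arguments flips the sign.
    assert!(fmin_pseudo(0.0, -0.0).is_sign_positive());
    assert!(fmin_pseudo(-0.0, 0.0).is_sign_negative());
}
```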
Kmeakin commented on issue #6135:
> However, I think it's better to rewrite the actual instructions in the data-flow graph

Could you point me to the correct place to make this change? The egraph code is quite complex.
jameysharp commented on issue #6135:
Yeah, there's a lot of subtlety to the egraph code!
I think the best option here is in insert_pure_enode. It's given a NewOrExistingInst, which defines the two cases we care about here:

- For Existing(inst), we want to normalize &mut self.func.dfg.insts[inst].
- For New(data, _), we want to normalize &mut data.

In both cases we want to do this before the first use of gvn_map in that function, so that it's normalized before we compute a hash key for it.
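The kind of normalization under discussion can be sketched in a few lines of plain Rust. This is a standalone illustration, not the Cranelift API: `Value`, `is_commutative`, and `normalize_in_place` here are stand-ins for the real entity references and opcode predicates.

```rust
// Stand-in for Cranelift's Value entity reference (a u32 index).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Value(u32);

// Hypothetical predicate: a few opcodes whose two operands commute.
fn is_commutative(opcode: &str) -> bool {
    matches!(opcode, "iadd" | "imul" | "band" | "bor" | "bxor" | "fmin" | "fmax")
}

// Canonicalize operand order for commutative ops (here: sort by value
// index) so that `iadd v2, v1` and `iadd v1, v2` produce identical
// GVN hash keys. Non-commutative ops are left untouched.
fn normalize_in_place(opcode: &str, args: &mut [Value; 2]) {
    if is_commutative(opcode) && args[0] > args[1] {
        args.swap(0, 1);
    }
}

fn main() {
    let mut a = [Value(2), Value(1)];
    let mut b = [Value(1), Value(2)];
    normalize_in_place("iadd", &mut a);
    normalize_in_place("iadd", &mut b);
    assert_eq!(a, b); // both orders now produce the same key

    let mut c = [Value(2), Value(1)];
    normalize_in_place("isub", &mut c); // non-commutative: unchanged
    assert_eq!(c, [Value(2), Value(1)]);
}
```

Doing this before the gvn_map lookup is what makes the two operand orders collide in the map, which is the whole point of the change.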
jameysharp commented on issue #6135:
Nice idea with the reflexive icmp/fcmp normalization. That one shouldn't change the results of an egraph pass, since we have rules that rewrite those cases to constants, but a hit in the GVN map means we can avoid calling simplify at all, so it's still worth doing.

I'm not entirely sold on the "Change formatting of normalize_in_place" commit—I think, especially after adding the reflexive cases, that it was more clear with a single match on IntCompare and a single match on FloatCompare—but I'm not inclined to hold up merging this PR for that.
jameysharp commented on issue #6135:
I remembered only after merging this that it might cause a performance regression, because it breaks these rules for moving constants. I mentioned this in the "Challenges" section of #6129 but then forgot about it.
The pulldown-cmark and spidermonkey benchmarks in Sightglass show small regressions when compared against the immediately preceding commit on main:

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 21305700.50 ± 9802702.99 (confidence = 99%)
  main-bf1aaba06.so is 1.01x to 1.04x faster than normalize-gvn-57e42d0c4.so!
  [798371867 803835104.60 841073014] main-bf1aaba06.so
  [814510194 825140805.10 848435012] normalize-gvn-57e42d0c4.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 16370993.55 ± 111369.32 (confidence = 99%)
  main-bf1aaba06.so is 1.01x to 1.01x faster than normalize-gvn-57e42d0c4.so!
  [1566424391 1566654680.05 1566986371] main-bf1aaba06.so
  [1582847533 1583025673.60 1583356398] normalize-gvn-57e42d0c4.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 1289.05 ± 201.02 (confidence = 99%)
  main-bf1aaba06.so is 1.00x to 1.00x faster than normalize-gvn-57e42d0c4.so!
  [21058403 21058589.05 21059116] main-bf1aaba06.so
  [21059745 21059878.10 21060457] normalize-gvn-57e42d0c4.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 414681539.40 ± 180320641.50 (confidence = 99%)
  main-bf1aaba06.so is 1.01x to 1.03x faster than normalize-gvn-57e42d0c4.so!
  [19755236415 20095956851.10 20596830121] main-bf1aaba06.so
  [20147733881 20510638390.50 20966561627] normalize-gvn-57e42d0c4.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 483524391.30 ± 2430599.58 (confidence = 99%)
  main-bf1aaba06.so is 1.01x to 1.01x faster than normalize-gvn-57e42d0c4.so!
  [37523394580 37531517182.20 37533623288] main-bf1aaba06.so
  [38006693589 38015041573.50 38016782123] normalize-gvn-57e42d0c4.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 4010130.85 ± 77289.63 (confidence = 99%)
  normalize-gvn-57e42d0c4.so is 1.00x to 1.00x faster than main-bf1aaba06.so!
  [2677225595 2677345556.30 2677547575] main-bf1aaba06.so
  [2673233921 2673335425.45 2673499424] normalize-gvn-57e42d0c4.so
The spidermonkey benchmark actually runs a tiny bit faster with this PR, showing that in this case merging more operations through GVN is a slightly bigger win than the loss due to the regression in constant-folding.
This PR is clearly a good idea but we should figure out how to fix constant-folding. We may decide to temporarily revert this PR while sorting that out.
cfallin commented on issue #6135:
Thanks for catching the regression, @jameysharp (and sorry for not noticing this either). And thanks for the PR in any case @Kmeakin as I think it's still generally a good idea to do something about commutativity!
IMHO with compilation regressions up to 3-4% we should probably revert and see if we can find a way to do better: such a hit is acceptable if we also see gains in runtime performance (maybe, with a 1:1 rule-of-thumb, 3-4% speedups) but otherwise the numbers as-is are not quite there.
jameysharp commented on issue #6135:
I'm confident we can sort out the regression, so we'll revert this PR temporarily, but we should be able to merge it again.
It's possible that there's a compile-time cost to the normalization itself, but I think that's unlikely. We're just adding a couple compares and a conditional swap per CLIF instruction, and that time should be dwarfed by the hash-map lookup that immediately follows it.
It's also possible that higher hit-rates in the GVN map make compilation slower, but I can't imagine how.
So I'm assuming that this regression is only due to a conflict with the existing ISLE rules.
A straightforward but tedious solution would be to find all mid-end rules that match a constant on the right, and if the matched operator is commutative, add the corresponding rule with the constant on the left. The associativity constant-folding rules need similar treatment.
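The mirrored-rule idea can be illustrated in plain Rust rather than ISLE. This is a hypothetical sketch (the `Operand` enum and `match_commutative_const` helper are invented for illustration): extracting the constant from whichever side it appears on lets one rule body cover both orderings of a commutative op.

```rust
// Hypothetical operand representation: either an SSA value (by index)
// or a materialized integer constant.
#[derive(Debug)]
enum Operand {
    Value(u32),
    Const(i64),
}

// For a commutative op, return (value, constant) if exactly one
// operand is a constant, regardless of which side it appears on.
// This is the effect the mirrored ISLE rules would have.
fn match_commutative_const(a: Operand, b: Operand) -> Option<(u32, i64)> {
    match (a, b) {
        (Operand::Value(x), Operand::Const(k)) => Some((x, k)),
        (Operand::Const(k), Operand::Value(x)) => Some((x, k)),
        _ => None,
    }
}

fn main() {
    // Constant on the right: the form existing rules already match.
    assert_eq!(
        match_commutative_const(Operand::Value(3), Operand::Const(7)),
        Some((3, 7))
    );
    // Constant on the left: the mirrored form that normalization can
    // now produce, and that the missing rules would need to cover.
    assert_eq!(
        match_commutative_const(Operand::Const(7), Operand::Value(3)),
        Some((3, 7))
    );
}
```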
The more satisfying fix would be to deal with #6128 and then write rules with the appropriate commutative matches baked in.
Last updated: Dec 23 2024 at 12:05 UTC