elliottt edited PR #4619 from trevor/x64-isle-select
to main
:
- Lower selectif and selectif_spectre_guard
- Remove unused code
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
elliottt edited PR #4619 from trevor/x64-isle-select
to main
:
Lower
selectif
andselectif_spectre_guard
in ISLE, which allows a lot of code related toemit_cmp
to be removed from the x64 lower.rs module.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
elliottt has marked PR #4619 as ready for review.
elliottt submitted PR review.
elliottt created PR review comment:
This might not be the most reusable signature, should I rework this to take the type as an argument and
ValueRegs
arguments instead ofValue
?
cfallin submitted PR review.
cfallin created PR review comment:
That might be slightly more idiomatic for the "inst helpers" layer, though I don't think it's too important either way; strong types mean we can always refactor easily and relatively safely...
cfallin submitted PR review.
cfallin created PR review comment:
Here is I think a good reason to force the inputs into registers: otherwise the heap legalization is going to have a load in the hotpath.
Although, after saying that, I am a bit curious whether the reduced register pressure (no
rdx
needed here anymore) has an effect in the opposite direction... if it's not too much trouble, would you be willing to run a quick set of Sightglass benchmarks (on at least saybz2
,spidermonkey
,pulldown-cmark
)?
elliottt updated PR #4619 from trevor/x64-isle-select
to main
.
elliottt created PR review comment:
Sure! I have a fix that forces the value to registers now so this PR shouldn't change performance, but I'll run the sightglass benchmarks without that fix just to see what the difference is.
elliottt submitted PR review.
elliottt submitted PR review.
elliottt created PR review comment:
Here are the results,
branch.so
is loading a0
from memory,main.so
is withxor r, r, r
: https://gist.github.com/elliottt/67be59da8bbdc3a9f946837efc81dedfOverall, it looks like using
xor r, r, r
is still a win, so I'll merge the version of this PR that preserves that behavior :+1:
elliottt edited PR review comment.
elliottt created PR review comment:
My naming of the two outputs was bothering me, so I ran another set of benchmarks with the sha of the commit as the name for the shared object. Here are the results: https://gist.github.com/elliottt/cd2ad5c97d53c57a315b76b4b674af46.
25c436f4480b5e20f832eab4d165734692c298a0 is the head of this branch, loading 0 via
xor
643f53789890f2a776839fe4f5a662d0ec7f891b is the previous commit that would load 0 via memory
elliottt submitted PR review.
elliottt edited PR review comment.
elliottt merged PR #4619.
cfallin created PR review comment:
A bit late but I realized this morning that we probably aren't actually benchmarking the thing with the default Sightglass config: it should be using static memories with a large enough guard region (4+2GiB by default I think?) that bounds checks don't actually happen. We would need to configure it to use dynamic bounds to see a delta; we may just be reading into noise here.
If you're up for it,
--static-memory-maximum-size 0 --static-memory-guard-size 0
will I think give a completely dynamic (bounds-checked) configuration; running the two different wasmtimes and timing them (with e.g.hyperfine
, running a cwasm with--allow-precompiled
) is probably easier than hacking the default config to make Sightglass use it.I still remain a little skeptical that loads of 0 from memory are faster, but I'd love to be proven wrong...
cfallin submitted PR review.
elliottt submitted PR review.
elliottt created PR review comment:
Happy to run some more tests in the background -- I've still got the shared objects hanging around. I figured it was better to merge the version that didn't introduce the loads, as it would be pretty easy to revert the last commit on this PR if we decided that it was worthwhile :+1:
elliottt submitted PR review.
elliottt created PR review comment:
The results for bz2 are pretty interesting: initialization really takes a hit from the load.
instantiation :: nanoseconds :: benchmarks/bz2/benchmark.wasm Δ = 26233.52 ± 5394.98 (confidence = 99%) 25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so! 643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so! [43591 48321.01 90806] 25c436f4480b5e20f832eab4d165734692c298a0.so [43783 74554.53 122088] 643f53789890f2a776839fe4f5a662d0ec7f891b.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 94444.02 ± 19422.58 (confidence = 99%) 25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so! 643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so! [156936 173963.32 326914] 25c436f4480b5e20f832eab4d165734692c298a0.so [157626 268407.34 439534] 643f53789890f2a776839fe4f5a662d0ec7f891b.so execution :: nanoseconds :: benchmarks/bz2/benchmark.wasm Δ = 269493.01 ± 178023.21 (confidence = 99%) 25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so! 643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so! [30236428 30873377.89 31490866] 25c436f4480b5e20f832eab4d165734692c298a0.so [30184192 30603884.88 31578144] 643f53789890f2a776839fe4f5a662d0ec7f891b.so execution :: cycles :: benchmarks/bz2/benchmark.wasm Δ = 970207.46 ± 640905.23 (confidence = 99%) 25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so! 643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so! [108854820 111147916.48 113370948] 25c436f4480b5e20f832eab4d165734692c298a0.so [108666764 110177709.02 113685162] 643f53789890f2a776839fe4f5a662d0ec7f891b.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [235950166 248161159.74 266311074] 25c436f4480b5e20f832eab4d165734692c298a0.so [237357286 247903675.98 301295484] 643f53789890f2a776839fe4f5a662d0ec7f891b.so compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm No difference in performance. [65539498 68931326.53 73972799] 25c436f4480b5e20f832eab4d165734692c298a0.so [65930352 68859805.71 83690362] 643f53789890f2a776839fe4f5a662d0ec7f891b.so
As a note about how I set these flags: I modified
crates/bench-api/src/lib.rs
to initialize both of the fields @cfallin mentioned to0
so that I could continue usingsightglass
for generating the benchmarks.
cfallin created PR review comment:
Interesting, so load-from-zero is still 0-1% faster, if I'm reading this right? Great if so; perhaps the register pressure is the deciding factor!
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Actually after discussing offline with @elliottt it seems that what's on
main
now is the xor-based approach; the gap is small enough and xor is otherwise the idiomatic way to zero a register (optimized as such by microarchitectures etc) that I think I'd prefer to keep that approach. Thanks for running the benchmarks here!
Last updated: Nov 22 2024 at 17:03 UTC