alexcrichton commented on Issue #2171:
Thanks for this! Do you have some wall-clock benchmarks as well which show the improvement?
MaxGraey commented on Issue #2171:
No, I just compared metrics from llvm-mca. By the way, it would be great to have an online benchmark tool like quick-bench.com that lets you benchmark on different architectures and compiler versions.
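For a rough wall-clock comparison on stable Rust, something like the following could be used. This is only a sketch: `time_per_iter` is a hypothetical helper, not part of wasmtime or of the gist discussed in this thread, and it makes no attempt at warm-up or statistical analysis.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` over every input `iters` times and
/// returns the average wall-clock time of one pass over `inputs`.
pub fn time_per_iter(f: impl Fn(f64) -> f64, inputs: &[f64], iters: u32) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        for &x in inputs {
            // black_box keeps the optimizer from deleting the call
            // or precomputing its result.
            black_box(f(black_box(x)));
        }
    }
    start.elapsed() / iters
}
```

Each candidate rounding function would be passed as `f` with the same `inputs`, and the returned durations compared.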
MaxGraey edited a comment on Issue #2171:
Here are benchmark results on a MacBook Pro (15-inch, 2019, 2.3 GHz 8-core i9):
test nearest_abs_copysign ... bench: 29,973 ns/iter (+/- 9,049)
test nearest_branch ... bench: 35,935 ns/iter (+/- 4,135)
test nearest_copysign ... bench: 33,115 ns/iter (+/- 1,961)
test nearest_original ... bench: 102,740 ns/iter (+/- 22,607)
`nearest_original` is the original function in wasmtime and `nearest_copysign` is the one proposed in the current PR. So the proposed approach appears to be about 3 times faster.
By the way, all of this was expected from the llvm-mca metrics. I chose `nearest_copysign` because it could potentially perform better on ARM.
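For readers without the gist at hand, a copysign-based round-to-nearest (ties to even) can be sketched roughly as below. Only the name `nearest_copysign` comes from the benchmark output; the body is an assumption based on the common "add and subtract 2^52" trick and is not necessarily the code in the PR. It relies on the default IEEE 754 rounding mode.

```rust
/// Sketch of round-to-nearest, ties-to-even for f64.
/// Assumption: illustrative body, not taken from the PR.
pub fn nearest_copysign(x: f64) -> f64 {
    // 2^52: from this magnitude upward every f64 is already an integer.
    const TOINT: f64 = 4503599627370496.0;

    // Zero, NaN, infinities and large values are returned unchanged.
    // The negated comparison is what catches NaN.
    if x == 0.0 || !(x.abs() < TOINT) {
        return x;
    }

    // Adding 2^52 (with x's sign) pushes the fraction bits out of the
    // mantissa, so the addition itself rounds to nearest, ties to even.
    let magic = TOINT.copysign(x);
    // The final copysign restores the sign when the result is zero,
    // e.g. -0.5 must round to -0.0.
    ((x + magic) - magic).copysign(x)
}
```

With this sketch, `nearest_copysign(2.5)` gives `2.0` and `nearest_copysign(1.5)` gives `2.0`, which is the ties-to-even behaviour.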
MaxGraey commented on Issue #2171:
@alexcrichton I'm wondering whether it would be possible to use the SSE4.1 `_mm_round_pd` (or `_mm_round_ps`) intrinsic for this in wasmtime, and fall back to the current polyfill on the rest of the architectures?
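The idea in this question, hardware rounding where available and a software polyfill elsewhere, might look roughly like this in plain Rust. The function names `nearest`, `nearest_sse41` and `nearest_soft` are hypothetical, the fallback body is an assumption, and this says nothing about how wasmtime itself is structured.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    _mm_cvtsd_f64, _mm_round_pd, _mm_set_sd, _MM_FROUND_NO_EXC, _MM_FROUND_TO_NEAREST_INT,
};

/// Hypothetical software fallback (add/subtract 2^52, ties to even).
fn nearest_soft(x: f64) -> f64 {
    const TOINT: f64 = 4503599627370496.0; // 2^52
    if x == 0.0 || !(x.abs() < TOINT) {
        return x; // zero, NaN, infinity, or already an integer
    }
    let magic = TOINT.copysign(x);
    ((x + magic) - magic).copysign(x)
}

/// Hardware path: `roundpd` in round-to-nearest-even mode.
/// Callers must make sure the CPU supports SSE4.1.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn nearest_sse41(x: f64) -> f64 {
    let v = _mm_set_sd(x); // x in the low lane, 0.0 in the high lane
    let r = _mm_round_pd::<{ _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC }>(v);
    _mm_cvtsd_f64(r) // extract the low lane
}

/// Picks the hardware path at run time when possible.
pub fn nearest(x: f64) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("sse4.1") {
            // Safe to call: the feature was just detected.
            return unsafe { nearest_sse41(x) };
        }
    }
    nearest_soft(x)
}
```

Run-time detection is used here so the same binary works on CPUs without SSE4.1; a compile-time `cfg` check would avoid the branch but needs the feature enabled when building.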
bjorn3 commented on Issue #2171:
The `NearestF32` and `NearestF64` libcalls are already a fallback for when there is no hardware instruction to do this.
The old-style x86 backend already uses `roundss` and `roundsd`: https://github.com/bytecodealliance/wasmtime/blob/5c5a30f76c35e15697fc150fb00c4b86be621d66/cranelift/codegen/meta/src/isa/x86/encodings.rs#L1341-L1345
The new-style x86_64 backend has a TODO for this: https://github.com/bytecodealliance/wasmtime/blob/8ac4bd1d0d8228b97a88b1841cfc0247e9ef4306/cranelift/codegen/src/isa/x64/lower.rs#L1749
MaxGraey commented on Issue #2171:
@bjorn3 Good to know. Thanks!
MaxGraey edited a comment on Issue #2171:
Also added the SSE4.1 intrinsic to the gist.
Updated results:
`cargo bench`:
test nearest_abs_copysign ... bench: 31,500 ns/iter (+/- 1,883)
test nearest_branch ... bench: 35,911 ns/iter (+/- 5,852)
test nearest_copysign ... bench: 32,282 ns/iter (+/- 10,079)
test nearest_original ... bench: 106,932 ns/iter (+/- 13,186)
test nearest_sse41 ... bench: 41,642 ns/iter (+/- 2,501) ;; falls back to the soft implementation
`RUSTFLAGS='-C target-cpu=native' cargo bench`:
test nearest_abs_copysign ... bench: 29,554 ns/iter (+/- 7,914)
test nearest_branch ... bench: 44,846 ns/iter (+/- 4,056)
test nearest_copysign ... bench: 33,609 ns/iter (+/- 3,196)
test nearest_original ... bench: 52,212 ns/iter (+/- 6,702)
test nearest_sse41 ... bench: 19,542 ns/iter (+/- 1,766) ;; real usage of `roundpd` on x86_64
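The difference between the two `nearest_sse41` rows is consistent with the SSE4.1 path being selected at compile time. SSE4.1 is not part of the default x86_64 target features, so a plain `cargo bench` would not enable it, while `-C target-cpu=native` does on a CPU that supports it. How the gist actually gates the intrinsic is an assumption here; the helper names below are hypothetical and only illustrate the compile-time check.

```rust
/// True only when the compiler was allowed to emit SSE4.1 instructions,
/// for example via `-C target-cpu=native` on a CPU that has them.
pub const fn sse41_enabled_at_compile_time() -> bool {
    cfg!(all(target_arch = "x86_64", target_feature = "sse4.1"))
}

/// Hypothetical helper: names the path a compile-time-gated build takes.
pub fn nearest_path() -> &'static str {
    if sse41_enabled_at_compile_time() {
        "roundpd"
    } else {
        "soft"
    }
}
```

Printing `nearest_path()` from a benchmark binary is a cheap way to confirm which variant the numbers refer to.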
MaxGraey commented on Issue #2171:
Squashed commits
sunfishcode commented on Issue #2171:
Great, thanks!
Last updated: Jan 24 2025 at 00:11 UTC