Stream: git-wasmtime

Topic: wasmtime / Issue #2171 runtime: new implementations for n...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 29 2020 at 17:54):

alexcrichton commented on Issue #2171:

Thanks for this! Do you have some wall-clock benchmarks as well which show the improvement?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 29 2020 at 17:58):

MaxGraey commented on Issue #2171:

No, I just compared metrics from llvm-mca. By the way, it would be great to have an online benchmarking tool like quick-bench.com that lets you benchmark on different architectures and compiler versions.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 29 2020 at 20:37):

MaxGraey commented on Issue #2171:

Here are the benchmark results:

benchmark code

test nearest_abs_copysign ... bench:      29,973 ns/iter (+/- 9,049)
test nearest_branch       ... bench:      35,935 ns/iter (+/- 4,135)
test nearest_copysign     ... bench:      33,115 ns/iter (+/- 1,961)
test nearest_original     ... bench:     102,740 ns/iter (+/- 22,607)

nearest_original is the original function in wasmtime.
nearest_copysign is the one proposed in the current PR.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 29 2020 at 20:53):

MaxGraey edited a comment on Issue #2171:

Here are the benchmark results on a MacBook Pro (15-inch, 2019, 2.3 GHz 8-core i9):

benchmark code

test nearest_abs_copysign ... bench:      29,973 ns/iter (+/- 9,049)
test nearest_branch       ... bench:      35,935 ns/iter (+/- 4,135)
test nearest_copysign     ... bench:      33,115 ns/iter (+/- 1,961)
test nearest_original     ... bench:     102,740 ns/iter (+/- 22,607)

nearest_original is the original function in wasmtime.
nearest_copysign is the one proposed in the current PR.

So the newly proposed approach appears to be about 3 times faster.

By the way, all of this was expected from the llvm-mca metrics. I chose nearest_copysign because it could potentially perform better on ARM.
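
The benchmark code is only linked in this thread, not quoted. As a rough illustration of what a copysign-based round-to-nearest (ties-to-even) can look like, here is a minimal sketch. The function name `nearest_f64_sketch` and the exact formulation are assumptions for illustration, not the code from the PR or the gist.

```rust
/// Hypothetical sketch: round an f64 to the nearest integer, ties to even.
/// Not the PR's implementation; it only illustrates the general technique.
pub fn nearest_f64_sketch(x: f64) -> f64 {
    // 2^52: at or above this magnitude every finite f64 is already an integer.
    const TOINT: f64 = 4503599627370496.0;

    let ax = x.abs();
    // `!(ax < TOINT)` is true for large values, infinities, and NaN,
    // all of which are returned unchanged.
    if !(ax < TOINT) {
        return x;
    }
    // Adding and then subtracting 2^52 discards the fractional bits.
    // The addition rounds in the default IEEE 754 mode (nearest, ties to even).
    let rounded = (ax + TOINT) - TOINT;
    // Restore the original sign, so that e.g. -0.2 becomes -0.0.
    rounded.copysign(x)
}
```

For example, `nearest_f64_sketch(2.5)` yields `2.0` and `nearest_f64_sketch(1.5)` yields `2.0`, since ties go to the even neighbour. The sketch assumes f64 arithmetic is evaluated in true double precision, as it is on x86_64 with SSE2.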

view this post on Zulip Wasmtime GitHub notifications bot (Aug 29 2020 at 22:37):

MaxGraey commented on Issue #2171:

@alexcrichton I'm wondering whether it's possible to use the SSE4.1 _mm_round_pd intrinsic for this in wasmtime, falling back to the current polyfill on other architectures?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 08:12):

bjorn3 commented on Issue #2171:

The NearestF32 and NearestF64 libcalls are already a fallback for when there is no hardware instruction to do this. The old style x86 backend already uses roundss and roundsd: https://github.com/bytecodealliance/wasmtime/blob/5c5a30f76c35e15697fc150fb00c4b86be621d66/cranelift/codegen/meta/src/isa/x86/encodings.rs#L1341-L1345 The new style x86_64 backend has a todo for this: https://github.com/bytecodealliance/wasmtime/blob/8ac4bd1d0d8228b97a88b1841cfc0247e9ef4306/cranelift/codegen/src/isa/x64/lower.rs#L1749

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 08:16):

MaxGraey commented on Issue #2171:

@bjorn3 Good to know. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 08:23):

MaxGraey edited a comment on Issue #2171:

@alexcrichton I'm wondering whether it's possible to use the SSE4.1 _mm_round_pd / _mm_round_ps intrinsics for this in wasmtime, falling back to the current polyfill on other architectures?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 08:58):

MaxGraey commented on Issue #2171:

I also added an SSE4.1 intrinsic version to the gist. Updated results (via RUSTFLAGS='-C target-cpu=native' cargo bench):

test nearest_abs_copysign ... bench:      29,344 ns/iter (+/- 2,691)
test nearest_branch       ... bench:      33,893 ns/iter (+/- 4,300)
test nearest_copysign     ... bench:      32,487 ns/iter (+/- 5,328)
test nearest_original     ... bench:      51,732 ns/iter (+/- 4,668)
test nearest_sse41        ... bench:      20,537 ns/iter (+/- 3,958)

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 09:59):

MaxGraey edited a comment on Issue #2171:

I also added an SSE4.1 intrinsic version to the gist. Updated results (via RUSTFLAGS='-C target-cpu=native' cargo bench):

test nearest_abs_copysign ... bench:      29,322 ns/iter (+/- 3,160)
test nearest_branch       ... bench:      33,783 ns/iter (+/- 8,887)
test nearest_copysign     ... bench:      32,049 ns/iter (+/- 4,080)
test nearest_original     ... bench:      52,751 ns/iter (+/- 6,244)
test nearest_sse41        ... bench:      19,452 ns/iter (+/- 3,967)

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 11:07):

MaxGraey edited a comment on Issue #2171:

I also added an SSE4.1 intrinsic version to the gist.

Update:
cargo bench:

test nearest_abs_copysign ... bench:      31,500 ns/iter (+/- 1,883)
test nearest_branch       ... bench:      35,911 ns/iter (+/- 5,852)
test nearest_copysign     ... bench:      32,282 ns/iter (+/- 10,079)
test nearest_original     ... bench:     106,932 ns/iter (+/- 13,186)
test nearest_sse41        ... bench:      41,642 ns/iter (+/- 2,501)  ;; falls back to the soft implementation

RUSTFLAGS='-C target-cpu=native' cargo bench:

test nearest_abs_copysign ... bench:      29,554 ns/iter (+/- 7,914)
test nearest_branch       ... bench:      44,846 ns/iter (+/- 4,056)
test nearest_copysign     ... bench:      33,609 ns/iter (+/- 3,196)
test nearest_original     ... bench:      52,212 ns/iter (+/- 6,702)
test nearest_sse41        ... bench:      19,542 ns/iter (+/- 1,766)  ;; real usage of `roundpd` on x86_64
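
The two runs differ for nearest_sse41 because, without `-C target-cpu=native`, SSE4.1 is not statically enabled on a default x86_64 build, so a compile-time check selects the software path. The gist itself is not quoted here, so the following is only a sketch of how such a compile-time selection might be written. The function name `nearest_f64_dispatch` is an assumption, and it uses the scalar `roundsd` form (`_mm_round_sd`) rather than the packed `roundpd` mentioned above.

```rust
/// Hypothetical sketch: hardware rounding when SSE4.1 is enabled at compile
/// time (for example via RUSTFLAGS='-C target-cpu=native').
#[cfg(all(target_arch = "x86_64", target_feature = "sse4.1"))]
pub fn nearest_f64_dispatch(x: f64) -> f64 {
    use std::arch::x86_64::{
        _mm_cvtsd_f64, _mm_round_sd, _mm_set_sd, _MM_FROUND_NO_EXC,
        _MM_FROUND_TO_NEAREST_INT,
    };
    // Round to nearest (ties to even) and suppress floating-point exceptions.
    const MODE: i32 = _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC;
    // SAFETY: this function is compiled only when SSE4.1 is statically enabled.
    unsafe {
        let v = _mm_set_sd(x);
        _mm_cvtsd_f64(_mm_round_sd::<MODE>(v, v))
    }
}

/// Software fallback for every other configuration (an assumed formulation,
/// not the polyfill from the PR).
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse4.1")))]
pub fn nearest_f64_dispatch(x: f64) -> f64 {
    // 2^52: at or above this magnitude every finite f64 is already an integer.
    const TOINT: f64 = 4503599627370496.0;
    let ax = x.abs();
    // Large values, infinities, and NaN are returned unchanged.
    if !(ax < TOINT) {
        return x;
    }
    // Add and subtract 2^52 to round in the default ties-to-even mode,
    // then restore the sign.
    ((ax + TOINT) - TOINT).copysign(x)
}
```

Both variants should agree on results such as `nearest_f64_dispatch(2.5) == 2.0`; only the generated code differs. A runtime check with `is_x86_feature_detected!("sse4.1")` would be an alternative when the binary must run on CPUs without SSE4.1.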

view this post on Zulip Wasmtime GitHub notifications bot (Aug 30 2020 at 19:00):

MaxGraey commented on Issue #2171:

Squashed commits

view this post on Zulip Wasmtime GitHub notifications bot (Aug 31 2020 at 16:39):

sunfishcode commented on Issue #2171:

Great, thanks!


Last updated: Jan 24 2025 at 00:11 UTC