Stream: git-wasmtime

Topic: wasmtime / PR #2171 runtime: new implementations for near...


Wasmtime GitHub notifications bot (Aug 28 2020 at 21:12):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Wasmtime GitHub notifications bot (Aug 28 2020 at 21:42):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

So I guess the second approach, with copysign, is preferable. wdyt?
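
As a rough illustration of what is being compared, the two variants might look like the sketch below. It assumes both follow musl's rint closely; the function names and the constant TOINT_64 are placeholders chosen here, not necessarily what the PR uses.

```rust
/// 2^52 (assumed, as in musl's `toint`). Adding and then subtracting it
/// rounds to an integer in the default ties-to-even rounding mode.
const TOINT_64: f64 = 1.0 / f64::EPSILON;

/// Variant 1: an explicit branch restores the sign of a zero result.
pub fn nearest_branch(x: f64) -> f64 {
    let bits = x.to_bits();
    let exponent = (bits >> 52) & 0x7ff;
    let negative = (bits >> 63) != 0;
    // |x| >= 2^52, infinity, or NaN: nothing to round, return the input.
    if exponent >= 0x3ff + 52 {
        return x;
    }
    let y = if negative {
        x - TOINT_64 + TOINT_64
    } else {
        x + TOINT_64 - TOINT_64
    };
    if y == 0.0 {
        if negative { -0.0 } else { 0.0 }
    } else {
        y
    }
}

/// Variant 2: the same rounding, but copysign restores the sign of zero.
pub fn nearest_copysign(x: f64) -> f64 {
    let bits = x.to_bits();
    let exponent = (bits >> 52) & 0x7ff;
    if exponent >= 0x3ff + 52 {
        return x;
    }
    let y = if (bits >> 63) != 0 {
        x - TOINT_64 + TOINT_64
    } else {
        x + TOINT_64 - TOINT_64
    };
    // The rounded value is either zero or has the sign of x already,
    // so copying the sign only changes +0.0 into -0.0 for negative inputs.
    y.copysign(x)
}
```

Both variants return the same values; they differ only in how the sign of a zero result is restored.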

Wasmtime GitHub notifications bot (Aug 28 2020 at 21:53):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 28 2020 at 21:54):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach.
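
The derived figures in the two summaries follow directly from the raw counts, which is a quick way to sanity-check them. The sketch below recomputes them; the struct and function names are made up for illustration.

```rust
/// Raw counters copied from the two summaries above.
pub struct Summary {
    pub instructions: f64,
    pub cycles: f64,
    pub uops: f64,
}

/// New approach with an if / else branch for -0.0.
pub const BRANCH: Summary = Summary { instructions: 1900.0, cycles: 1611.0, uops: 2900.0 };
/// New approach with copysign at the end.
pub const COPYSIGN: Summary = Summary { instructions: 1800.0, cycles: 1308.0, uops: 2200.0 };

impl Summary {
    /// Instructions per cycle.
    pub fn ipc(&self) -> f64 {
        self.instructions / self.cycles
    }

    /// Micro-ops per cycle.
    pub fn uops_per_cycle(&self) -> f64 {
        self.uops / self.cycles
    }
}

/// Fraction of cycles saved by `new` relative to `old`.
pub fn cycles_saved(old: &Summary, new: &Summary) -> f64 {
    (old.cycles - new.cycles) / old.cycles
}
```

With these inputs, the copysign variant needs about 19% fewer cycles for the same 100 iterations (1308 vs. 1611).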

Wasmtime GitHub notifications bot (Aug 28 2020 at 21:55):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 28 2020 at 22:14):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM.

Wasmtime GitHub notifications bot (Aug 28 2020 at 22:27):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Wasmtime GitHub notifications bot (Aug 28 2020 at 23:13):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Another possible approach:

pub extern "C" fn nearest_new_3(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
        return x;
    }
    (x.abs() + TOINT - TOINT).copysign(x)
}
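
The snippet above leaves TOINT undefined. Assuming it plays the same role as toint in musl's rint (that is, 1 / f64::EPSILON, or 2^52), a self-contained version might look like the sketch below. The name nearest_abs is hypothetical.

```rust
/// Assumption: 2^52, the same magic constant musl's rint uses.
const TOINT: f64 = 1.0 / f64::EPSILON;

pub fn nearest_abs(x: f64) -> f64 {
    let i = x.to_bits();
    // Biased exponent field of the IEEE 754 double.
    let e = (i >> 52) & 0x7ff_u64;
    // |x| >= 2^52, infinity, or NaN: return the input unchanged.
    if e >= 0x3ff_u64 + 52 {
        return x;
    }
    // Round the magnitude (ties to even), then put the original sign back.
    // This also turns a +0.0 result into -0.0 for negative inputs.
    (x.abs() + TOINT - TOINT).copysign(x)
}
```

Working on the magnitude removes the sign-dependent branch; only the early exit for large inputs remains.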

Wasmtime GitHub notifications bot (Aug 28 2020 at 23:13):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
        return x;
    }
    (x.abs() + TOINT - TOINT).copysign(x)
}

Wasmtime GitHub notifications bot (Aug 28 2020 at 23:13):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
        return x;
    }
    (x.abs() + TOINT_64 - TOINT_64).copysign(x)
}

Wasmtime GitHub notifications bot (Aug 28 2020 at 23:14):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

Wasmtime GitHub notifications bot (Aug 29 2020 at 08:38):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

But this approach has a lower IPC.

Wasmtime GitHub notifications bot (Aug 29 2020 at 08:39):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Upd 2
Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

But this approach has a lower IPC.

Wasmtime GitHub notifications bot (Aug 29 2020 at 21:01):

MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with the new approach, but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Benchmark results

Upd: So I chose the second approach. It is also branchless on ARM32.

Upd 2
Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

But this approach has a lower IPC.

Wasmtime GitHub notifications bot (Aug 30 2020 at 10:23):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 10:57):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 11:02):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 11:04):

MaxGraey edited PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 11:09):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:10):

sunfishcode submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:10):

sunfishcode submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:10):

sunfishcode created PR Review Comment:

With copysign here, you could also replace the if above with just x.abs() + TOINT_32 - TOINT_32, letting the copysign restore the sign bit, so that we don't get branch mispredicts if inputs have a mix of signs.
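
One way to read this suggestion for the f32 case is sketched below. It assumes TOINT_32 is 1 / f32::EPSILON (2^23), mirroring musl's rintf; nearest_f32 is a hypothetical name, not the function in the PR.

```rust
/// Assumption: 2^23, the f32 counterpart of musl's `toint`.
const TOINT_32: f32 = 1.0 / f32::EPSILON;

pub fn nearest_f32(x: f32) -> f32 {
    // Biased exponent field of the IEEE 754 single.
    let e = (x.to_bits() >> 23) & 0xff;
    // |x| >= 2^23, infinity, or NaN: return the input unchanged.
    if e >= 0x7f + 23 {
        return x;
    }
    // No branch on the sign: round the magnitude, then let copysign
    // restore the sign bit (including the sign of a zero result).
    (x.abs() + TOINT_32 - TOINT_32).copysign(x)
}
```

Because the sign is never tested, a mix of positive and negative inputs takes the same path through the function.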

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:10):

sunfishcode created PR Review Comment:

You could also check to see if it's faster to do the first if using abs() with a floating-point range check, instead of to_bits() with an integer range check.
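
A sketch of what the floating-point range check could look like, under the assumption that TOINT_64 is 2^52. The function name is made up here, and this is only one possible reading of the suggestion.

```rust
/// Assumption: 2^52, as in musl's rint.
const TOINT_64: f64 = 1.0 / f64::EPSILON;

pub fn nearest_float_check(x: f64) -> f64 {
    let a = x.abs();
    // Floating-point range check instead of inspecting the exponent bits.
    // The negated comparison is also true for NaN, so NaN exits early too.
    if !(a < TOINT_64) {
        return x;
    }
    // Round the magnitude (ties to even), then restore the original sign.
    (a + TOINT_64 - TOINT_64).copysign(x)
}
```

The abs() result is reused for both the range check and the rounding, so no integer view of the value is needed.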

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:16):

MaxGraey submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:16):

MaxGraey created PR Review Comment:

Yes, x.abs() + TOINT_32 - TOINT_32 is a little bit faster. This variant is already in the benchmark. But I'm not sure it will do well on ARM32: https://godbolt.org/z/jsMba8.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:16):

MaxGraey created PR Review Comment:

That makes sense. I'll add this case to the benchmark.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:16):

MaxGraey submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:30):

MaxGraey submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:30):

MaxGraey created PR Review Comment:

Unfortunately, it is slower:

test nearest_abs_copysign              ... bench:      35,993 ns/iter (+/- 7,475)
test nearest_abs_copysign_without_bits ... bench:      37,380 ns/iter (+/- 16,714)  ;;   <-- suggested
test nearest_branch                    ... bench:      37,300 ns/iter (+/- 7,593)
test nearest_copysign                  ... bench:      32,348 ns/iter (+/- 4,869)   ;; current
test nearest_original                  ... bench:      99,693 ns/iter (+/- 16,491)
test nearest_sse41                     ... bench:      40,587 ns/iter (+/- 3,854)

updated benchmark

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:30):

MaxGraey edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:31):

MaxGraey edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:44):

sunfishcode submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:44):

sunfishcode created PR Review Comment:

Ah, sorry I missed that you had benchmarked that already. I'm not very familiar with ARM32, but in that godbolt link, the only thing that sticks out to me as being slower is that the abs version doesn't have the early exit for inputs for which nearest is an identity operation. On other inputs, the abs version has fewer instructions.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:56):

MaxGraey submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:56):

MaxGraey created PR Review Comment:

For the second approach (with abs), ARM has many more ALU / VFP switches, which in theory will be slower. Unfortunately, llvm-mca doesn't work for ARM targets yet, so I can't benchmark this.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:58):

MaxGraey edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 30 2020 at 15:58):

MaxGraey edited PR Review Comment.

Wasmtime GitHub notifications bot (Aug 30 2020 at 16:38):

sunfishcode submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 16:38):

sunfishcode created PR Review Comment:

Are you referring to the vmovs that move between d and r registers? I see the same number in both versions.

Wasmtime GitHub notifications bot (Aug 30 2020 at 17:23):

MaxGraey submitted PR Review.

Wasmtime GitHub notifications bot (Aug 30 2020 at 17:23):

MaxGraey created PR Review Comment:

Alright, I'll use the abs + copysign approach. Thanks for the review, btw.

Wasmtime GitHub notifications bot (Aug 30 2020 at 17:25):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 17:44):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 30 2020 at 18:59):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 31 2020 at 16:02):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 31 2020 at 16:02):

MaxGraey updated PR #2171 from new-nearest-functions to main.

Wasmtime GitHub notifications bot (Aug 31 2020 at 16:39):

sunfishcode merged PR #2171.


Last updated: Oct 23 2024 at 20:03 UTC