MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest, based on musl's rint and rintf implementations.

New / old comparison: https://godbolt.org/z/Gxz3bP
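For readers unfamiliar with the musl technique mentioned above, here is a minimal sketch of its core idea, restricted to non-negative inputs. The function name `round_nonneg` is hypothetical, and the constant's value is an assumption taken from musl's `toint = 1 / DBL_EPSILON`; this is not the PR's code.

```rust
/// 2^52: at and above this magnitude every f64 is an integer.
/// Assumed value, mirroring musl's `toint = 1 / DBL_EPSILON`.
pub const TOINT_64: f64 = 1.0 / f64::EPSILON;

/// Rounds to the nearest integer, ties to even.
/// Only valid for 0.0 <= x < 2^52 under the default rounding mode.
pub fn round_nonneg(x: f64) -> f64 {
    // x + 2^52 lands in a range where adjacent f64 values are exactly 1.0
    // apart, so the addition itself performs the rounding. The subtraction
    // that follows is exact and recovers the rounded value.
    x + TOINT_64 - TOINT_64
}
```

Under these assumptions, `round_nonneg(1.5)` and `round_nonneg(2.5)` both give `2.0`, because ties go to the even neighbour.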
MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest, based on musl's rint and rintf implementations.

New / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

    Iterations:        100
    Instructions:      1900
    Total Cycles:      1611
    Total uOps:        2900
    Dispatch Width:    6
    uOps Per Cycle:    1.80
    IPC:               1.18
    Block RThroughput: 4.8

and for the new approach using copysign at the end to handle -0.0:

    Iterations:        100
    Instructions:      1800
    Total Cycles:      1308
    Total uOps:        2200
    Dispatch Width:    6
    uOps Per Cycle:    1.68
    IPC:               1.38
    Block RThroughput: 3.7

So I guess the second approach, with copysign, is preferable. wdyt?
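The two variants compared above can be sketched roughly as follows. The function names are borrowed from the benchmark output later in this thread, but the bodies are assumptions reconstructed from musl's rint, not the PR's actual code.

```rust
/// Assumed value (2^52), as in musl's `toint = 1 / DBL_EPSILON`.
const TOINT_64: f64 = 1.0 / f64::EPSILON;

/// Shared part: early exit, then the add/subtract trick chosen by sign.
/// Returns None when x is already integral, infinite, or NaN.
fn round_core(x: f64) -> Option<(f64, bool)> {
    let bits = x.to_bits();
    let e = (bits >> 52) & 0x7ff; // biased exponent
    if e >= 0x3ff + 52 {
        return None;
    }
    let neg = (bits >> 63) != 0;
    let y = if neg { x - TOINT_64 + TOINT_64 } else { x + TOINT_64 - TOINT_64 };
    Some((y, neg))
}

/// Variant 1: an explicit branch puts the sign back on a zero result.
pub fn nearest_branch(x: f64) -> f64 {
    match round_core(x) {
        None => x,
        Some((y, neg)) if y == 0.0 => if neg { -0.0 } else { 0.0 },
        Some((y, _)) => y,
    }
}

/// Variant 2: copysign restores the sign, so -0.0 needs no extra branch.
pub fn nearest_copysign(x: f64) -> f64 {
    match round_core(x) {
        None => x,
        Some((y, _)) => y.copysign(x),
    }
}
```

Both sketches should agree bit for bit; the only difference is whether the sign of a zero result is fixed with a branch or with `copysign`.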
MaxGraey edited PR #2171 from new-nearest-functions to main:

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest, based on musl's rint and rintf implementations.

New / old comparison: https://godbolt.org/z/Gxz3bP

Also, instruction metrics for the new approach with an if / else branch for handling -0.0:

    Iterations:        100
    Instructions:      1900
    Total Cycles:      1611
    Total uOps:        2900
    Dispatch Width:    6
    uOps Per Cycle:    1.80
    IPC:               1.18
    Block RThroughput: 4.8

and for the new approach using copysign at the end to handle -0.0:

    Iterations:        100
    Instructions:      1800
    Total Cycles:      1308
    Total uOps:        2200
    Dispatch Width:    6
    uOps Per Cycle:    1.68
    IPC:               1.38
    Block RThroughput: 3.7

Upd: So I chose the second approach. It is also branchless on ARM32.

Upd 2: Another possible approach:

    // TOINT_64 is not defined in this snippet; presumably 1.0 / f64::EPSILON (2^52), as in musl.
    pub extern "C" fn nearest(x: f64) -> f64 {
        let i = x.to_bits();
        // Biased exponent: bits 52..62.
        let e = i >> 52 & 0x7ff_u64;
        if e >= 0x3ff_u64 + 52 {
            // |x| >= 2^52, infinity, or NaN: already integral, return unchanged.
            x
        } else {
            (x.abs() + TOINT_64 - TOINT_64).copysign(x)
        }
    }

But this approach has lower IPC.
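The early-exit test `e >= 0x3ff + 52` in the snippet above may be easier to follow in isolation. The helpers below are hypothetical names introduced only for illustration; they show which inputs the check treats as already integral.

```rust
/// Extracts the 11-bit biased exponent of an f64 (bits 52..62).
pub fn biased_exponent(x: f64) -> u64 {
    (x.to_bits() >> 52) & 0x7ff
}

/// True when |x| >= 2^52, or x is infinite or NaN.
/// Every finite f64 of that magnitude is already an integer, and
/// infinities and NaNs (exponent 0x7ff) should pass through unchanged,
/// so rounding can return x as is.
pub fn is_already_integral(x: f64) -> bool {
    // 0x3ff is the exponent bias, so 0x3ff + 52 encodes 2^52.
    biased_exponent(x) >= 0x3ff + 52
}
```

Because the sign bit is shifted out of the masked range, the same check covers negative inputs without a separate `abs()`.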
sunfishcode submitted PR Review.
sunfishcode submitted PR Review.
sunfishcode created PR Review Comment:
With copysign here, you could also replace the if above with just x.abs() + TOINT_32 - TOINT_32, letting the copysign restore the sign bit, so that we don't get branch mispredicts if inputs have a mix of signs.
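A minimal f32 sketch of this suggestion, assuming `TOINT_32` is `1.0 / f32::EPSILON` (2^23) by analogy with musl's rintf; the function name `nearest_f32` is hypothetical and this is not the code that was merged.

```rust
/// Assumed value (2^23): at and above this magnitude every f32 is an integer.
const TOINT_32: f32 = 1.0 / f32::EPSILON;

pub fn nearest_f32(x: f32) -> f32 {
    // Biased exponent: bits 23..30 of the f32 representation.
    let e = (x.to_bits() >> 23) & 0xff;
    if e >= 0x7f + 23 {
        // |x| >= 2^23, infinity, or NaN: nothing to round.
        x
    } else {
        // Round the magnitude, then let copysign restore the sign bit.
        // No branch on the sign of x, and -0.0 results come out right.
        (x.abs() + TOINT_32 - TOINT_32).copysign(x)
    }
}
```

With this shape the only remaining branch is the early exit, which does not depend on the sign of the input.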
sunfishcode created PR Review Comment:
You could also check to see if it's faster to do the first if using abs() with a floating-point range check, instead of to_bits() with an integer range check.
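One way to read this suggestion is sketched below. The name `nearest_float_check` and the exact form of the comparison are assumptions; the thread does not show the code that was benchmarked.

```rust
/// Assumed value (2^52), as in musl's `toint = 1 / DBL_EPSILON`.
const TOINT_64: f64 = 1.0 / f64::EPSILON;

pub fn nearest_float_check(x: f64) -> f64 {
    // Floating-point range check instead of inspecting the exponent bits.
    // Any comparison with NaN is false, so NaN takes the else branch
    // and is returned unchanged, as are infinities and |x| >= 2^52.
    if x.abs() < TOINT_64 {
        (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    } else {
        x
    }
}
```

Writing the test as `x.abs() < TOINT_64` rather than `x.abs() >= TOINT_64` keeps NaN on the pass-through path without a separate check.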
MaxGraey submitted PR Review.
MaxGraey created PR Review Comment:
Yes, x.abs() + TOINT_32 - TOINT_32 is a little bit faster. This variant is already in the benchmark. But I'm not sure it will be great on ARM32: https://godbolt.org/z/jsMba8.
MaxGraey created PR Review Comment:
That makes sense. I will add this case to the benchmark.
MaxGraey submitted PR Review.
MaxGraey submitted PR Review.
MaxGraey created PR Review Comment:
Unfortunately it will be slower:
test nearest_abs_copysign              ... bench:      35,993 ns/iter (+/- 7,475)
test nearest_abs_copysign_without_bits ... bench:      37,380 ns/iter (+/- 16,714) ;; <-- suggested
test nearest_branch                    ... bench:      37,300 ns/iter (+/- 7,593)
test nearest_copysign                  ... bench:      32,348 ns/iter (+/- 4,869) ;; current
test nearest_original                  ... bench:      99,693 ns/iter (+/- 16,491)
test nearest_sse41                     ... bench:      40,587 ns/iter (+/- 3,854)
MaxGraey edited PR Review Comment.
MaxGraey edited PR Review Comment.
sunfishcode submitted PR Review.
sunfishcode created PR Review Comment:
Ah, sorry I missed that you had benchmarked that already. I'm not very familiar with ARM32, but in that godbolt link, the only thing that sticks out to me as being slower is that the abs version doesn't have the early exit for inputs for which nearest is an identity operation. On other inputs, the abs version has fewer instructions.
MaxGraey submitted PR Review.
MaxGraey created PR Review Comment:
For the second approach (with abs), ARM has many more ALU / VFP switches, which in theory will be slower. Unfortunately, llvm-mca doesn't work for ARM targets yet, and I can't benchmark this.
MaxGraey edited PR Review Comment.
MaxGraey edited PR Review Comment.
sunfishcode submitted PR Review.
sunfishcode created PR Review Comment:
Are you referring to the vmovs that move between d and r registers? I see the same number in both versions.
MaxGraey submitted PR Review.
MaxGraey created PR Review Comment:
Alright, I'll use the abs + copysign approach. Thanks for the review, btw.
MaxGraey updated PR #2171 from new-nearest-functions to main.
sunfishcode merged PR #2171.
Last updated: Nov 22 2024 at 16:03 UTC