Stream: git-wasmtime

Topic: wasmtime / Issue #2439 [proposal] fast path for general i...


view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 20:51):

MaxGraey opened Issue #2439:

Feature

Latency and RCP for division depends on register sizes. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x84/x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register size. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 20:52):

MaxGraey edited Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x84/x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register size. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 21:05):

MaxGraey edited Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x84/x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register size. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 21:05):

MaxGraey edited Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register size. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 22:58):

cfallin commented on Issue #2439:

Hi @MaxGraey, thanks for bringing this up!

From an instruction-selector perspective, this would certainly be possible, though in general I want to be careful about adding complexity without evidence that it will bring benefit in some way. (It's perhaps possible for one case, but if the isel starts growing conditionals for microarchitecture and for optimized cases everywhere, it soon becomes unmaintainable if we don't have a framework for reasoning about machine costs in a more principled way.) Have you seen workloads where division latency in particular became a bottleneck?

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 23:22):

MaxGraey commented on Issue #2439:

Have you seen workloads where division latency in particular became a bottleneck?

Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 23:22):

MaxGraey edited a comment on Issue #2439:

Have you seen workloads where division latency in particular became a bottleneck?

Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex in term of implementation.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 22 2020 at 23:54):

cfallin commented on Issue #2439:

Indeed, I agree that division is an expensive instruction! What I'm wondering is whether workloads exist where (i) the presence of divide instructions is a bottleneck, i.e., the latency is not hidden by other operations and is a visible portion of total runtime, and (ii) the workload uses 64-bit divides but values always have zeroed top-32-bits.

That data would show the benefit; the cost, in turn is (i) an additional comparison (logical-or, right-shift, branch-if-zero, so three instructions) in every divide's main code path, (ii) additional code size due to the comparison and the fastpath, (iii) complexity in isel, which imposes maintenance burden and a small compile time increase.

So, basically, I have a better handle on what the cost would be and I'm curious if we have any data points on the potential benefit. I'm hesitant to include something like this, because of the above downsides, but evidence of real upside could convince me :-)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 01:54):

MaxGraey commented on Issue #2439:

Oh, so you ask about real scenarios which could potentially benefit from this?

I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow (x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 01:54):

MaxGraey edited a comment on Issue #2439:

Oh, so you ask about real scenarios which could potentially benefit from this?

I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow (x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers which also use modular arithmetic

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:33):

cfallin commented on Issue #2439:

Fair enough -- I think that we could investigate further. I'm still worried about the impact of the three additional instructions for every divide; so I'd want us to collect data on various numeric benchmarks (compression and media codecs come to mind) to make sure this impact isn't an issue. The benchmarking situation is looking to improve soon (bytecodealliance/rfcs#4) so we should be able to study this fairly easily!

@MaxGraey would you be willing to prototype it?

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:40):

MaxGraey commented on Issue #2439:

historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys at LLVM have learned pretty well the specifics of this BypassSlowDivision transsformation.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:40):

MaxGraey edited a comment on Issue #2439:

historically, this implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transsformation.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:43):

MaxGraey edited a comment on Issue #2439:

historically, this implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transsformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky case for me)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:44):

MaxGraey edited a comment on Issue #2439:

historically, this implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transsformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:44):

MaxGraey edited a comment on Issue #2439:

historically, this implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 02:52):

cfallin commented on Issue #2439:

Thanks for the link to the LLVM transform!

Re: overhead, one option would be to add a flag to our CPU-specific backend flags to enable this; then the embedder could enable on platforms where this helps.

I'll see if I can get to this at some point (lots of things on my plate at the moment) but anyone feel free to try this in the meantime!

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 06:44):

bjorn3 commented on Issue #2439:

As an example where division jad disproportionate effects: https://github.com/gimli-rs/gimli/pull/476 made line debuginfo generation twice as fast by avoiding a division in the common case.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 06:45):

bjorn3 edited a comment on Issue #2439:

As an example where division had disproportionate effects: https://github.com/gimli-rs/gimli/pull/476 made line debuginfo generation twice as fast by avoiding a division in the common case.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 12:52):

MaxGraey edited a comment on Issue #2439:

historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 12:59):

MaxGraey edited Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register width. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 23 2020 at 12:59):

MaxGraey edited Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2021 at 16:08):

bnjbvr labeled Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2021 at 16:08):

bnjbvr labeled Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 27 2021 at 16:08):

bnjbvr labeled Issue #2439:

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
<img width="525" alt="comparision" src="https://user-images.githubusercontent.com/1301959/99916476-0cf5dc80-2d13-11eb-9ffe-7dcfd04eab8e.png">

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.


Last updated: Jan 24 2025 at 00:11 UTC