Stream: git-wasmtime

Topic: wasmtime / issue #6759 Cranelift: Get the `tail` calling ...


Wasmtime GitHub notifications bot (Jul 21 2023 at 18:28):

fitzgen opened issue #6759:

That is, regular calls with the tail calling convention should be as fast as regular calls with the fast calling convention.

https://github.com/bytecodealliance/wasmtime/issues/1065#issuecomment-1624395771

So @jameysharp and I did a little profiling/investigation of switching the internal Wasm calling convention over to tail on our sightglass benchmarks. I was really expecting this to have no measurable change, but unfortunately it looks like it has a ~7% overhead on bz2 and spidermonkey.wasm and ~1% overhead on pulldown-cmark. This is surprising! We think this means that we fairly frequently call functions that don't have enough register pressure to clobber all callee-save registers, and since tail has only caller-save registers and zero callee-save registers, we are doing more spills than we used to, enough that the difference is clearly measurable.
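The spill argument above can be made concrete with a toy cost model. This is only a sketch under stated assumptions, not how Cranelift's register allocator accounts for anything: the function name `save_restore_ops` is hypothetical, every save or restore is counted as one memory operation, and the figure of 5 callee-save registers is an assumption chosen to match the maximum in the histograms.

```python
def save_restore_ops(live_across_call: int, callee_clobbers: int,
                     callee_saved_regs: int) -> int:
    """Count stores + loads around one call under a toy cost model.

    live_across_call:  values the caller needs again after the call
    callee_clobbers:   callee-save registers the callee wants to overwrite
    callee_saved_regs: callee-save registers the convention provides
                       (0 models the original `tail` convention)
    """
    # Values the caller can park in callee-save registers at no cost to itself.
    kept_in_regs = min(live_across_call, callee_saved_regs)
    # Everything else is spilled before the call and reloaded afterwards.
    caller_ops = 2 * (live_across_call - kept_in_regs)
    # The callee saves and restores only the callee-save registers it overwrites.
    callee_ops = 2 * min(callee_clobbers, callee_saved_regs)
    return caller_ops + callee_ops


# Hypothetical scenario: 3 values live across a call to a small function
# that overwrites no callee-save registers.
with_callee_saves = save_restore_ops(3, 0, callee_saved_regs=5)
without_callee_saves = save_restore_ops(3, 0, callee_saved_regs=0)
print(with_callee_saves, without_callee_saves)
```

In this model the low-pressure callee costs nothing when callee-save registers exist, while the all-caller-save convention pays for every live value. The picture flips only when the callee clobbers every callee-save register, which is the case the profiling suggests is less common than expected.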

Here are the histograms of number of clobbered callee-save registers in a function for some of our benchmarks:

<details>

pulldown-cmark

# Number of samples = 757
# Min = 0
# Max = 5
#
# Mean = 1.9682959048877162
# Standard deviation = 2.4428038716280174
# Variance = 5.967290755240832
#
# Each ∎ is a count of 9
#
 0 ..  1 [ 459 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1 ..  2 [   0 ]:
 2 ..  3 [   0 ]:
 3 ..  4 [   0 ]:
 4 ..  5 [   0 ]:
 5 ..  6 [ 298 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 6 ..  7 [   0 ]:
 7 ..  8 [   0 ]:
 8 ..  9 [   0 ]:
 9 .. 10 [   0 ]:

spidermonkey

# Number of samples = 18279
# Min = 0
# Max = 5
#
# Mean = 1.8119153126538674
# Standard deviation = 2.4034432514706436
# Variance = 5.77653946303978
#
# Each ∎ is a count of 233
#
 0 ..  1 [ 11655 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1 ..  2 [     0 ]:
 2 ..  3 [     0 ]:
 3 ..  4 [     0 ]:
 4 ..  5 [     0 ]:
 5 ..  6 [  6624 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 6 ..  7 [     0 ]:
 7 ..  8 [     0 ]:
 8 ..  9 [     0 ]:
 9 .. 10 [     0 ]:

bz2

# Number of samples = 127
# Min = 0
# Max = 5
#
# Mean = 0.5511811023622047
# Standard deviation = 1.5659198268780583
# Variance = 2.452104904209808
#
# Each ∎ is a count of 2
#
 0 ..  1 [ 113 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1 ..  2 [   0 ]:
 2 ..  3 [   0 ]:
 3 ..  4 [   0 ]:
 4 ..  5 [   0 ]:
 5 ..  6 [  14 ]: ∎∎∎∎∎∎∎
 6 ..  7 [   0 ]:
 7 ..  8 [   0 ]:
 8 ..  9 [   0 ]:
 9 .. 10 [   0 ]:

</details>
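As a quick cross-check of the histograms, the summary statistics can be recomputed from the bucket counts alone. The sketch below assumes that every sample in the `0 .. 1` bucket is exactly 0 and every sample in the `5 .. 6` bucket is exactly 5 (register counts are integers), and the names `HISTOGRAMS` and `summarize` are made up for illustration.

```python
# Bucket counts copied from the histograms above:
# {clobbered callee-save registers: number of functions}.
HISTOGRAMS = {
    "pulldown-cmark": {0: 459, 5: 298},
    "spidermonkey": {0: 11655, 5: 6624},
    "bz2": {0: 113, 5: 14},
}


def summarize(buckets):
    """Return (samples, mean, population variance, fraction clobbering zero)."""
    total = sum(buckets.values())
    mean = sum(value * count for value, count in buckets.items()) / total
    variance = sum(count * (value - mean) ** 2
                   for value, count in buckets.items()) / total
    zero_fraction = buckets.get(0, 0) / total
    return total, mean, variance, zero_fraction


for name, buckets in HISTOGRAMS.items():
    total, mean, variance, zero_fraction = summarize(buckets)
    print(f"{name}: n={total} mean={mean:.4f} "
          f"var={variance:.4f} zero={zero_fraction:.1%}")
```

The recomputed means agree with the reported ones, and the share of functions that clobber no callee-save registers comes out at roughly 61% for pulldown-cmark, 64% for spidermonkey, and 89% for bz2.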

I think we just need to support callee-save registers in the tail calling convention. For simplicity, we can probably just match sys-v / the default native calling convention. This is a little unfortunate, as it means that chains of tail calls will save and restore callee-save registers that the next function isn't going to use (they won't be used again until the chain completes), but we definitely can't pessimize regular calls for the sake of tail call chains.
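The cost to tail call chains mentioned above can be sketched with the same kind of toy accounting. This is an illustrative model, not a measurement: `chain_save_restore_ops` is a hypothetical helper, each save or restore counts as one memory operation, and each function in the chain is assumed to restore its callee-save registers before handing off to the next one.

```python
def chain_save_restore_ops(clobbers_per_function, callee_saved_regs):
    """Total saves + restores across a chain of tail calls (toy model).

    clobbers_per_function: for each function in the chain, how many
        callee-save registers it would like to overwrite
    callee_saved_regs: callee-save registers the convention provides
    """
    ops = 0
    for clobbered in clobbers_per_function:
        # Saved in the prologue, restored before the tail call (or the
        # final return), even though the next function never reads them.
        ops += 2 * min(clobbered, callee_saved_regs)
    return ops


# Hypothetical chain of four functions, each under high register pressure.
chain = [5, 5, 5, 5]
print(chain_save_restore_ops(chain, callee_saved_regs=5),
      chain_save_restore_ops(chain, callee_saved_regs=0))
```

Under these assumptions the chain pays for every link when callee-save registers exist and pays nothing when they do not, which is the trade-off accepted here in favor of regular calls.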

Wasmtime GitHub notifications bot (Jul 21 2023 at 18:28):

fitzgen added the cranelift label to Issue #6759.

Wasmtime GitHub notifications bot (Jul 21 2023 at 18:28):

fitzgen added the cranelift:goal:optimize-speed label to Issue #6759.

Wasmtime GitHub notifications bot (Jul 21 2023 at 18:45):

fitzgen edited issue #6759, appending an empty `[tasklist]` block titled "Tasks"; the rest of the body is identical to the original post above.

Wasmtime GitHub notifications bot (Jul 21 2023 at 18:46):

fitzgen edited issue #6759 (body identical to the original post above).

Wasmtime GitHub notifications bot (May 10 2024 at 02:15):

alexcrichton closed issue #6759.

Wasmtime GitHub notifications bot (May 10 2024 at 02:15):

alexcrichton commented on issue #6759:

I believe that this is done now, so closing.


Last updated: Jan 24 2025 at 00:11 UTC