fitzgen opened issue #6759:
That is, regular calls with the `tail` calling convention should be as fast as regular calls with the `fast` calling convention.
https://github.com/bytecodealliance/wasmtime/issues/1065#issuecomment-1624395771
So @jameysharp and I did a little profiling/investigation of switching the internal Wasm calling convention over to tail on our sightglass benchmarks. I was really expecting this to have no measurable change, but unfortunately it looks like it has a ~7% overhead on bz2 and spidermonkey.wasm and ~1% overhead on pulldown-cmark. This is surprising! We think this means that we ~frequently call functions that don't have enough register pressure to clobber all callee-save registers, and since tail only has caller-save registers and zero callee-save registers, we are doing more spills than we used to. Enough more that it is really measurable.
Here are the histograms of number of clobbered callee-save registers in a function for some of our benchmarks:
<details>
pulldown-cmark
# Number of samples = 757
# Min = 0
# Max = 5
#
# Mean = 1.9682959048877162
# Standard deviation = 2.4428038716280174
# Variance = 5.967290755240832
#
# Each ∎ is a count of 9
#
0 .. 1 [ 459 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1 .. 2 [ 0 ]:
2 .. 3 [ 0 ]:
3 .. 4 [ 0 ]:
4 .. 5 [ 0 ]:
5 .. 6 [ 298 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
6 .. 7 [ 0 ]:
7 .. 8 [ 0 ]:
8 .. 9 [ 0 ]:
9 .. 10 [ 0 ]:
spidermonkey
# Number of samples = 18279
# Min = 0
# Max = 5
#
# Mean = 1.8119153126538674
# Standard deviation = 2.4034432514706436
# Variance = 5.77653946303978
#
# Each ∎ is a count of 233
#
0 .. 1 [ 11655 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1 .. 2 [ 0 ]:
2 .. 3 [ 0 ]:
3 .. 4 [ 0 ]:
4 .. 5 [ 0 ]:
5 .. 6 [ 6624 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
6 .. 7 [ 0 ]:
7 .. 8 [ 0 ]:
8 .. 9 [ 0 ]:
9 .. 10 [ 0 ]:
bz2
# Number of samples = 127
# Min = 0
# Max = 5
#
# Mean = 0.5511811023622047
# Standard deviation = 1.5659198268780583
# Variance = 2.452104904209808
#
# Each ∎ is a count of 2
#
0 .. 1 [ 113 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1 .. 2 [ 0 ]:
2 .. 3 [ 0 ]:
3 .. 4 [ 0 ]:
4 .. 5 [ 0 ]:
5 .. 6 [ 14 ]: ∎∎∎∎∎∎∎
6 .. 7 [ 0 ]:
7 .. 8 [ 0 ]:
8 .. 9 [ 0 ]:
9 .. 10 [ 0 ]:
</details>
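As a sanity check on the histograms above, the summary statistics can be recomputed from the two non-empty buckets of each benchmark. The sketch below is illustrative only: the function names `mean_and_variance` and `fraction_with_no_clobbers` are made up for this example and are not part of Cranelift or sightglass. It assumes the reported variance is the population variance, which is what reproduces the printed figures. By these counts, roughly 61% (pulldown-cmark), 64% (spidermonkey), and 89% (bz2) of functions clobber no callee-save registers at all.

```rust
/// Mean and population variance of a histogram given as (value, count) pairs.
/// Returns NaN for both if the histogram is empty.
fn mean_and_variance(buckets: &[(u32, u64)]) -> (f64, f64) {
    let total: f64 = buckets.iter().map(|&(_, c)| c as f64).sum();
    let mean = buckets
        .iter()
        .map(|&(v, c)| v as f64 * c as f64)
        .sum::<f64>()
        / total;
    let variance = buckets
        .iter()
        .map(|&(v, c)| (v as f64 - mean).powi(2) * c as f64)
        .sum::<f64>()
        / total;
    (mean, variance)
}

/// Fraction of functions that clobber zero callee-save registers.
fn fraction_with_no_clobbers(buckets: &[(u32, u64)]) -> f64 {
    let total: u64 = buckets.iter().map(|&(_, c)| c).sum();
    let zero: u64 = buckets
        .iter()
        .filter(|&&(v, _)| v == 0)
        .map(|&(_, c)| c)
        .sum();
    zero as f64 / total as f64
}

fn main() {
    // (clobbered callee-save registers, number of functions), copied from
    // the non-empty histogram buckets above.
    let benchmarks: [(&str, [(u32, u64); 2]); 3] = [
        ("pulldown-cmark", [(0, 459), (5, 298)]),
        ("spidermonkey", [(0, 11655), (5, 6624)]),
        ("bz2", [(0, 113), (5, 14)]),
    ];
    for (name, buckets) in benchmarks.iter() {
        let (mean, variance) = mean_and_variance(buckets);
        let none = fraction_with_no_clobbers(buckets);
        println!(
            "{}: mean = {:.4}, variance = {:.4}, no clobbers = {:.1}%",
            name,
            mean,
            variance,
            none * 100.0
        );
    }
}
```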
I think we just need to support callee-save registers in the `tail` calling convention. For simplicity, we can probably just match sys-v / the default native calling convention. A little unfortunate, as it means that chains of tail calls will be saving and restoring callee-save registers that the next function isn't going to use (won't be used again till the chain completes) but we definitely can't pessimize regular calls for the sake of tail call chains.
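To make the trade-off concrete, here is a deliberately simplified cost model, not Cranelift's register allocator. Everything in it is an assumption made for illustration: the `CallSite` struct, the `save_restore_ops` function, and the rule that each saved value costs one store plus one load. The value 5 for the callee-save count is taken from the maximum in the histograms above.

```rust
/// Hypothetical description of one call site (illustration only).
#[derive(Clone, Copy)]
struct CallSite {
    /// Caller values that must survive the call.
    live_across_call: u32,
    /// Callee-save-eligible registers the callee actually writes.
    callee_clobbers: u32,
}

/// Stores plus loads needed for one call under a convention that has
/// `callee_saved` callee-save registers (0 models the all-caller-save case).
fn save_restore_ops(site: CallSite, callee_saved: u32) -> u32 {
    // Values that fit in callee-save registers stay in registers.
    let kept_in_regs = site.live_across_call.min(callee_saved);
    // The rest are spilled by the caller: one store before, one load after.
    let caller_spills = site.live_across_call - kept_in_regs;
    // The callee saves and restores only the callee-save registers it writes.
    let callee_saves = site.callee_clobbers.min(callee_saved);
    2 * (caller_spills + callee_saves)
}

fn main() {
    // A callee with low register pressure: it clobbers nothing.
    let light = CallSite { live_across_call: 3, callee_clobbers: 0 };
    // A callee with high register pressure: it clobbers all five.
    let heavy = CallSite { live_across_call: 3, callee_clobbers: 5 };

    for (name, site) in [("light callee", light), ("heavy callee", heavy)].iter() {
        println!(
            "{}: zero callee-saves = {} ops, five callee-saves = {} ops",
            name,
            save_restore_ops(*site, 0),
            save_restore_ops(*site, 5)
        );
    }
}
```

Under this model the all-caller-save convention always pays for every live value, while the callee-save convention pays nothing when the callee leaves those registers alone. The heavy-callee case shows the opposite side: when the callee clobbers everything, callee-saves can cost more, which is the situation a chain of tail calls tends to hit.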
fitzgen added the cranelift label to Issue #6759.
fitzgen added the cranelift:goal:optimize-speed label to Issue #6759.
fitzgen edited issue #6759.
fitzgen edited issue #6759.
alexcrichton closed issue #6759.
alexcrichton commented on issue #6759:
I believe that this is done now, so closing.
Last updated: Jan 24 2025 at 00:11 UTC