EthanBlackburn opened issue #11974:
I upgraded wasmtime from 34.0.2 to 35.0.0 and noticed the average latency of our wasm components increased by ~20%. On-CPU profiles look slightly cheaper in 35, but off-CPU shows a large increase in time parked in finish_task_switch.isra.0. The issue persists even when upgrading to wasmtime 38.0.3.
34.0.2 vs 35.0.0 on-cpu
34.0.2 vs 35.0.0 off-cpu
34.0.2 vs 38.0.3 on-cpu
34.0.2 vs 38.0.3 off-cpu
Test Case
Compiled with 34.0.2
Compiled with 35.0.0
Compiled with 38.0.3
Steps to Reproduce
Benchmarking a wasm component compiled from this Go code.
Build on 34.0.2
    git clone https://github.com/telophasehq/tangent
    cd tangent && cargo build --bin tangent --release
    cd ~/tangent/examples/golang
    mkdir plugins
    make run
    (in a separate window) tangent bench --config tangent.yaml --seconds 30 --payload tests/input.json
This will print some stats like
    producer bytes (consumed): 6861.75 MiB → 228.66 MiB/s
    guest: bytes_in=6854.63 MiB, avg_latency=12.824 ms (over 29179 calls)
Build on 38.0.3
    git fetch origin wasmtime-38
    git checkout wasmtime-38
    make run
    (in a separate window) tangent bench --config tangent.yaml --seconds 30 --payload tests/input.json
This will print stats showing guest latency and throughput are lower
    producer bytes (consumed): 5855.87 MiB → 195.15 MiB/s
    guest: bytes_in=5848.09 MiB, avg_latency=15.845 ms (over 24617 calls)
Expected Results
I expected guest latency to stay the same
Actual Results
Average guest latency as reported by the benchmark is ~20% higher in wasmtime 35.0.0 and later.
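For reference, computed directly from the two quoted runs: 15.845 ms / 12.824 ms ≈ 1.24 (roughly 20-25% higher latency), and 195.15 MiB/s / 228.66 MiB/s ≈ 0.85 (roughly 15% lower throughput).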
Versions and Environment
Wasmtime version or commit: 34.0.2 (good), 35.0.0 (bad), 38.0.3 (bad)
Operating system: Darwin
Architecture: arm64
I see the same issue on Linux aarch64.
Extra Info
I feel like I've misconfigured something or am not using the async API correctly, but I'm not sure where to start looking.
Additionally, if y'all suspect there is an issue in wasmtime's async support, I'm happy to poke around myself if I can get some pointers.
EthanBlackburn added the bug label to Issue #11974.
EthanBlackburn edited issue #11974.
EthanBlackburn edited issue #11974.
EthanBlackburn edited issue #11974.
alexcrichton commented on issue #11974:
Bisection shows https://github.com/bytecodealliance/wasmtime/pull/10959 as the culprit; local testing points to this Arc::clone specifically.
I'm not sure how this factors into the on/off-CPU graphs you're showing, but I've also never gotten a perf comparison to work before either. Nevertheless this makes sense to me because it's a highly contended atomic inc/dec, which could plausibly cause a slowdown like this. I'll see if I can poke around tomorrow and see what comes up.
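As a rough, standalone illustration (not Wasmtime code) of why a contended refcount is a plausible cause: every Arc::clone/drop pair is an atomic increment plus decrement on the same cache line, so many threads doing that once per call end up serializing on it. A minimal sketch:

    // Standalone sketch: threads hammering one shared Arc refcount.
    use std::sync::Arc;
    use std::thread;
    use std::time::Instant;

    fn main() {
        let shared = Arc::new(0u64);
        let start = Instant::now();
        let handles: Vec<_> = (0..8)
            .map(|_| {
                let shared = Arc::clone(&shared);
                thread::spawn(move || {
                    for _ in 0..1_000_000 {
                        // Hot path: one contended atomic increment...
                        let c = Arc::clone(&shared);
                        // ...and one contended atomic decrement on drop.
                        drop(c);
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        println!("8 threads x 1M clone/drop pairs took {:?}", start.elapsed());
    }

Comparing against a single-threaded run of the same loop should show the per-clone cost growing substantially under contention.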
alexcrichton added the performance label to Issue #11974.
alexcrichton commented on issue #11974:
Also, I should mention, thank you for the thorough report!
EthanBlackburn commented on issue #11974:
You got it. Thank you for the fast response! I'm also seeing the problem start on that commit.
I'll test the Arc::clone theory tomorrow as well and I'm interested to see the result. My understanding is that the clone would show up as on-CPU time instead of off-CPU time, but I'm out of my depth here. Will report back.
EthanBlackburn commented on issue #11974:
You were right!
https://github.com/bytecodealliance/wasmtime/pull/11979
Before change
    producer bytes (consumed): 2798.62 MiB → 93.28 MiB/s
    guest: bytes_in=2794.94 MiB, avg_latency=19.490 ms (over 12169 calls)
After change
    producer bytes (consumed): 3731.03 MiB → 124.36 MiB/s
    guest: bytes_in=3728.28 MiB, avg_latency=13.473 ms (over 16252 calls)
alexcrichton commented on issue #11974:
Ok, https://github.com/bytecodealliance/wasmtime/pull/11987 is what I came up with for this. It still uses unsafe, but it's needed in a number of locations other than this one and is something I've wanted to do for a while as well. My hope is that it's encapsulated well and the external interface is a safe one; it's "just" an unsafe implementation.
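For illustration only (this is not the actual PR, and HotRef is a made-up name rather than a Wasmtime API), the general shape of a "safe interface over an unsafe core" that avoids per-call refcount traffic might look like:

    // Sketch: keep one owning Arc and hand out a raw pointer for hot-path
    // access, so each use doesn't touch the atomic refcount.
    use std::ptr::NonNull;
    use std::sync::Arc;

    pub struct HotRef<T> {
        // The owning Arc keeps the allocation alive as long as this struct.
        _owner: Arc<T>,
        // Raw pointer used on the hot path; no atomic inc/dec per access.
        ptr: NonNull<T>,
    }

    impl<T> HotRef<T> {
        pub fn new(owner: Arc<T>) -> Self {
            let ptr = NonNull::from(&*owner);
            HotRef { _owner: owner, ptr }
        }

        // The external interface stays safe; the invariant (the owner
        // outlives the pointer) is private to the implementation.
        pub fn get(&self) -> &T {
            // SAFETY: `_owner` keeps the pointee alive while `self` exists.
            unsafe { self.ptr.as_ref() }
        }
    }

Whatever the PR actually does, the appeal of this style is exactly what's described above: the unsafety is confined to one audited spot and callers only ever see a safe API.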
EthanBlackburn commented on issue #11974:
Let's gooo! Thanks for the fast turnaround.
BTW, since you mentioned you hadn't gotten a perf comparison to work before, here's what I did
On-CPU
    (run server with wasmtime 34.0.2)
    perf record -F 999 -g -- tangent run --config examples/golang/tangent.yaml
    perf script | ~/FlameGraph/stackcollapse-perf.pl > 34.folded
    (run server with wasmtime 38.0.3)
    perf record -F 999 -g -- tangent run --config examples/golang/tangent.yaml
    perf script | ~/FlameGraph/stackcollapse-perf.pl > 38.folded
    ~/FlameGraph/difffolded.pl 34.folded 38.folded | ~/FlameGraph/flamegraph.pl --negate --title "Wasmtime 38 vs 34 (red=regression)" > diff-34-38.svg
Off-CPU
    (while our server was running with wasmtime 34.0.2)
    sudo offcputime -f -p $(pgrep -n tangent) 30 > offcpu34.folded
    (while our server was running with wasmtime 38.0.3)
    sudo offcputime -f -p $(pgrep -n tangent) 30 > offcpu38.folded
    ~/FlameGraph/difffolded.pl offcpu34.folded offcpu38.folded | ~/FlameGraph/flamegraph.pl --negate --title "Off-CPU diff: 38 vs 34 (red = more wait)" > offcpu-diff-34-38.svg
Shoutout to Brendan Gregg for FlameGraph and offcputime
alexcrichton closed issue #11974.