Hi, as some of you may know I'm working on stack switching in Wasm(time).
I just wanted to share a few insights I've had recently while working on moving the actual stack switching from being done via libcalls (using a customized version of wasmtime-fiber under the hood) to Cranelift-generated code.
This may be interesting for the people I talked to at the Wasm CG meeting in Pittsburgh, because I realized since then that the performance improvements have a different underlying cause than I previously thought.
Summary:
I've been able to confirm across more benchmarks that moving the stack switching to Cranelift-generated code drastically improves performance, shaving away up to 85% of the overall runtime in micro-benchmarks.
However, my new insight is that this drastic change isn't simply due to removing the general overhead induced by libcalls for crossing from generated code into the runtime.
Instead, the performance improvements mainly come from the fact that stack switching messes up the CPU's return address branch prediction (RABP), and our previous use of libcalls for switching was accidentally amplifying that effect.
In our old implementation (using libcalls), performing a stack switch while executing Wasm involved a series of nested function calls until reaching the code that does the actual stack switching.
On the switched-to stack, we would then return from a similar series of slightly different calls to continue executing Wasm. But the stack switch would cause the RABP to consistently mispredict the return addresses when returning from these nested calls, leading to a series of 4 consecutive mispredicts on every stack switch.
In the new design, where the stack switching is effectively inlined, these mispredicts per stack switch are avoided.
Additional details for the morbidly curious:
The micro-benchmark where we see the 85% runtime reduction is the simplest one testing just stack switching performance:
The Wasm function $f resumes a stack running $g, which immediately suspends/yields back to the stack running $f. This happens in a tight loop, meaning that we constantly switch between the stacks running $f and $g.
Here's what the stacks look like at the point when $g is about to suspend itself to go back to $f (most recent frame on top):
+---------------------------+    +---------------------------+
| wasmtime_fiber_switch     |    | wasmtime_fiber_switch     | <- PC here
+---------------------------+    +---------------------------+
| libcalls::resume          |    | libcalls::suspend         |
+---------------------------+    +---------------------------+
| catch_unwind_and_longjmp0 |    | catch_unwind_and_longjmp1 |
+---------------------------+    +---------------------------+
| libcalls::raw::resume     |    | libcalls::raw::suspend    |
+---------------------------+    +---------------------------+
| $f                        |    | $g                        |
+---------------------------+    +---------------------------+
| ....                      |    | ....                      |
+---------------------------+    +---------------------------+
      (inactive stack)                  (active stack)
I'm taking some liberty here regarding the names of the functions on the stack, but the bottom line is:
The left, currently inactive stack has frames on top from the last time $f switched over to the other stack by executing a resume:
Some frames come from the libcall infrastructure (libcalls::*, catch_unwind_and_longjmp, ...), followed by the actual implementation using (a modified version of) wasmtime_fiber (wasmtime_fiber_switch).
The right stack, currently running, is in the middle of a similar sequence of nested function calls for a suspend and is just about to perform the actual switch back to the left stack.
The issue now is as follows: At the point in the picture, the CPU's internal return address stack (RAS) currently ends with entries reflecting that the most recent calls came from $g, libcalls::raw::suspend, catch_unwind_and_longjmp1, and libcalls::suspend.
After switching stacks, we continue inside the frame for wasmtime_fiber_switch on the left, and return from the 4 functions on top back into $f. However, the RAS's information is now invalid, resulting in each of these 4 returns being mispredicted.
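To make the mechanics concrete, here's a tiny toy model of a return address stack, written as plain Rust (this is not Wasmtime code; the strings are just labels mirroring the frames in the picture above):

```rust
// Toy model of a CPU's return address stack (RAS): calls push their return
// site, returns pop a prediction and compare it with where control really goes.
struct Ras {
    stack: Vec<&'static str>,
    mispredicts: usize,
}

impl Ras {
    fn new() -> Self {
        Ras { stack: Vec::new(), mispredicts: 0 }
    }
    fn call(&mut self, return_site: &'static str) {
        self.stack.push(return_site);
    }
    fn ret(&mut self, actual_return_site: &'static str) {
        if self.stack.pop() != Some(actual_return_site) {
            self.mispredicts += 1;
        }
    }
}

fn main() {
    let mut ras = Ras::new();

    // Suspend side (right stack in the picture): $g calls down into
    // wasmtime_fiber_switch, pushing four return sites onto the RAS.
    ras.call("$g");
    ras.call("libcalls::raw::suspend");
    ras.call("catch_unwind_and_longjmp1");
    ras.call("libcalls::suspend");

    // wasmtime_fiber_switch now swaps stack pointers; the RAS knows nothing
    // about that. We then return through the resume-side frames left behind
    // on the other stack, back into $f.
    ras.ret("libcalls::resume");
    ras.ret("catch_unwind_and_longjmp0");
    ras.ret("libcalls::raw::resume");
    ras.ret("$f");

    // Prints 4: every return after the switch is mispredicted.
    println!("mispredicted returns for this switch: {}", ras.mispredicts);
}
```

The model obviously ignores everything else a real predictor does, but it captures why all 4 returns after the switch go wrong.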
I'm doing all my testing on a Tiger Lake CPU, but I'm seeing the same on Zen 2. From what I can tell, they seem to rely exclusively on the RAS for return branch prediction: no matter the number of iterations, I always end up with exactly 4 mispredicts per suspend or resume, meaning that the predictor never "learns" that something funky is happening at these returns and that it should predict them without the RAS.
Whoa this is a fascinating insight! Nice catch!
I'm not sure if it's possible, but would things perhaps work out better by structuring the switching like wasmtime_fiber_switch, where it's the same function for suspending and resuming? That way everything would be "mirrored" in the sense that the only mispredict would be the return location back to the destination stack. If even 1 mispredict hurts performance that much, though, it seems like inlining without call/ret is the only way to go?
Yes, you are right that having only a single libcall/function would have led to a less severe performance impact: If there was only a single libcall (let's call it switch), then the four topmost frames on each stack in my picture above would be the same. You would then indeed only mispredict a single return per switch, from the lowest of these four frames back into the Wasm frame.
This difference between calling one vs. two different functions to do the switching goes away if, instead of using libcalls, we directly called leaf functions from Wasm that do all of the switching themselves, rather than calling further functions:
We could have functions fn_resume and fn_suspend, both of which do all of the switching inline without further calls, and call them directly from Wasm, without using the normal libcall machinery.
In other words, you call fn_suspend on one stack and return from a matching fn_resume call in the other stack. That would also reduce the number of mispredicts to 1 per stack switch.
So maybe my insight can be summarized as follows:
If you want to do the stack switching by calling some external function, you should either make sure that it's always the same function at all stack switch sites, or, if there are multiple switching functions, they should be leaf functions. In each of these cases you will pay by having one guaranteed mispredict.
And you're right that if you really want to avoid that last mispredict, you have to fully inline the switching.
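To illustrate those two options with the same toy RAS model as above (repeated here so the snippet stands on its own; the switch-related names are made up, just mirroring the existing libcall chain):

```rust
// Same toy RAS as in the earlier sketch.
struct Ras { stack: Vec<&'static str>, mispredicts: usize }

impl Ras {
    fn new() -> Self { Ras { stack: Vec::new(), mispredicts: 0 } }
    fn call(&mut self, return_site: &'static str) { self.stack.push(return_site); }
    fn ret(&mut self, actual_return_site: &'static str) {
        if self.stack.pop() != Some(actual_return_site) { self.mispredicts += 1; }
    }
}

fn main() {
    // Option 1: a single shared `switch` libcall. Both stacks go through the
    // identical call chain, so the return sites inside that chain match and
    // only the final return into the Wasm frame ($f instead of $g) mispredicts.
    let mut ras = Ras::new();
    ras.call("$g");
    ras.call("libcalls::raw::switch");
    ras.call("catch_unwind_and_longjmp");
    ras.call("libcalls::switch");
    // ... the actual stack switch happens here ...
    ras.ret("libcalls::switch");
    ras.ret("catch_unwind_and_longjmp");
    ras.ret("libcalls::raw::switch");
    ras.ret("$f");
    println!("single switch libcall: {} mispredict(s)", ras.mispredicts); // 1

    // Option 2: leaf functions. $g calls fn_suspend, which switches stacks
    // without calling anything else, and we return from the matching
    // fn_resume call site directly into $f: again a single mispredict.
    let mut ras = Ras::new();
    ras.call("$g"); // $g's call to fn_suspend
    // ... fn_suspend swaps stack pointers inline ...
    ras.ret("$f"); // return from the fn_resume call site in $f
    println!("leaf fn_suspend/fn_resume: {} mispredict(s)", ras.mispredicts); // 1
}
```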
Of course, depending on where you do the stack switching, the fact that the switching effectively invalidates the RAS may result in some mispredicts down the line anyway, which can't be avoided by inlining the switching itself:
If your code tends to do stack switches while on a deep stack (for example, if there are a lot of frames hidden in the "...." part below $f and $g in my picture), then you'll end up getting mispredicts when popping those frames anyway.
That's a good point yeah about guaranteed mispredicts as stacks are switched. That may also mean that the performance of this microbenchmark may not be too too realistic to optimize for?
In that if 90% of the "slowdown" is just the return address mispredicts, and "real world" applications are expected to also hit branch mispredicts, then the microbenchmark is exploiting a non-real-world property where the benchmark doesn't hit the mispredicts on its own.
Yes, the benchmark I mentioned is mostly intended to measure raw switching performance when everything is in cache, mainly to establish an upper bound on the performance improvement that may be gained by moving to Cranelift-generated stack switching code. I didn't mean to give the impression that I'm expecting this to translate to an 85% or so performance improvement in a real-world program.
I'd guess that there are few programs where the raw performance of stack switching dominates the runtime. I'd guess that generators would come closest to that (if you really want to implement them using stack switching). We have a benchmark in our suite that's similar to a simple generator, and are seeing something like a 40-50% improvement there.
But you're also right that trying to measure "raw switching performance" is muddied a bit by the realization that a stack switch may cause subsequent mispredictions when returning further down the stack, after the switch itself finished.
I guess it's safe to say that the amount of follow-up mispredictions heavily depends on the workload. I could imagine that there are workloads where you frequently switch between stacks and then do additional work on top of the frames on the existing stacks between switches (the generator probably fits into that box). For something like async-await and coroutines I'm much less sure...
In general, we really want to create more realistic benchmarks to get an idea of how these changes affect real-world workloads. This message here was mostly just intended to provide a bit of a follow-up to my talk at the CG meeting, back when I had much less of an understanding of where the surprisingly large performance difference was coming from (even for a micro-benchmark). In particular, I wanted to clarify that it's not just the usual benefits of inlining plus avoiding the overhead introduced by libcalls, but that there was something more subtle going on, directly related to the fact that we are switching stacks.
So in any case, my sentence "If you want to do the stack switching by calling some external function, you should ..." sounded like a guideline when in reality it probably makes no meaningful difference :smile:
Oh sorry I also don't mean to dissuade benchmarking efforts either! This is a fascinating realization that I had no idea would be so impactful. To me this is basically justification for the asterisk on this benchmark and interpreting its results, but regardless it's still important to measure!
No worries, I didn't take it that way! I think the main takeaway message here is that something interesting was affecting our previous implementation, but quantifying the actual impact requires some more work.