SpiderMonkey emits a membarrier call while finalizing wasm code for execution. Right now we only do this if we compile a second tier of code, and I've been trying to understand why we don't do this for all module compilations (as modules can migrate between threads).
I found this previous discussion and issue for why wasmtime uses membarrier, which has some nice context. Those discussions also refer to earlier discussions that led to SM using membarrier. This led me to think that we should likely use membarrier for all compilations to match wasmtime's behavior.
However, trying to use membarrier for all compilations runs into the issue that the sync-core variant of the syscall is only available on Linux >= 4.16, which is problematic for a significant number of Android devices we support below that kernel version.
I looked at other VM's (V8/Dart/dotnetcore/mono) to see what they do, and could not find them using membarrier at all. They would just perform the icache synchronization step, but nothing beyond AFAICT. I can provide links if useful. V8 for sure shares the code between threads, not sure on the others.
@Anton Kirilov do you know if there's some other trick that VM's use for broadcasting context synchronization that doesn't use membarrier? Or is it likely they're depending on some hidden behavior to make this not an issue in practice?
Ryan Hunt said:
I looked at other VM's (V8/Dart/dotnetcore/mono) to see what they do, and could not find them using membarrier at all. They would just perform the icache synchronization step, but nothing beyond AFAICT. I can provide links if useful. V8 for sure shares the code between threads, not sure on the others.
Unfortunately some of those engines may simply not be compliant with the Arm memory model. As usual with issues of that nature, things may appear to work most of the time, with occasional problems that are hard to diagnose, so no immediate incentive to use membarrier(). V8 is an interesting case because some of my teammates work on it actively, so they should be aware of this issue - I will check with them.
In principle rolling your own inter-thread messaging system (which the application may already implement for other purposes) and then defining a message that causes the receiver thread to execute an ISB should be enough. In fact, on Unix-like platforms sending a signal should work, since it is going to result in at least one transition from user- to kernelspace. Of course, enumerating all running threads is a bit of a challenge.
Thank you. If it helps, here are some links for V8's wasm compiler generating code.
It's possible there is some extra synchronization in V8 somewhere, the codebase is quite large. But grepping for membarrier doesn't find it. And I've not seen any references in wasm-code-manager.cpp to this problem. There is reference to isb in their tiering code, with a justification for why their solution is safe. But nothing for their general multi-threaded compilation scheme.
Oh, another point about what other engines are doing - they may have determined that currently they already have an ISB in the right place in their JIT flow. However, that's IMHO quite fragile - complex codebases tend to evolve constantly and any change might break that guarantee.
The problem is that this is a bit of a complex implicit dependency/coupling that is difficult to track and affects only one platform (well, 2 if you distinguish between AArch32 and AArch64).
Good point re: inter-thread messaging. Outside of tiering, it's well defined when a thread receives a new module and we can execute an isb at that point.
The membarrier() mechanism is quite robust because it keeps the solution localized and independent of what the surrounding code is doing (including with respect to recycling code buffers, etc.).
Is there guidance on the cost of an isb? I'm wondering how much effort it's worth to avoid redundant ones if we're placing them manually when receiving a module (which may have been compiled on the same thread, and already ran an isb).
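To make the "avoid redundant ones" idea concrete, here is a minimal sketch in C of one way it could be done, under stated assumptions: a global epoch counter is bumped whenever new code is published, and each thread remembers the last epoch it synchronized with, so it only executes an ISB when something has changed. All names here (`g_code_epoch`, `t_synced_epoch`, `publish_code`, `sync_before_execute`) are hypothetical and not taken from SpiderMonkey or Wasmtime, and the sketch assumes the publisher has already written the code and completed the usual cache maintenance before calling `publish_code`.

```c
#include <stdatomic.h>

/* Hypothetical global counter, bumped every time new code is published. */
static atomic_uint g_code_epoch;

/* Hypothetical per-thread record of the last epoch this thread synced with. */
static _Thread_local unsigned t_synced_epoch;

/* Context synchronization on the calling thread only. */
static inline void context_sync(void) {
#if defined(__aarch64__) || (defined(__arm__) && __ARM_ARCH >= 7)
    __asm__ volatile("isb" ::: "memory");
#else
    /* Placeholder so the sketch builds on other targets; it is only a
     * compiler barrier and says nothing about their architectural rules. */
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Publisher side: call after the code has been written and the cache
 * maintenance for it has completed. */
static void publish_code(void) {
    atomic_fetch_add_explicit(&g_code_epoch, 1, memory_order_release);
}

/* Receiver side: call when a thread receives a module, before running it.
 * Returns 1 if a context synchronization was executed, 0 if it was skipped
 * because this thread has already synchronized with the current epoch. */
static int sync_before_execute(void) {
    unsigned now = atomic_load_explicit(&g_code_epoch, memory_order_acquire);
    if (now == t_synced_epoch) {
        return 0;
    }
    context_sync();
    t_synced_epoch = now;
    return 1;
}
```

The check is deliberately conservative: any publication anywhere in the process makes every thread synchronize once more on its next hand-off. Whether the bookkeeping is worth it compared with an unconditional ISB at each hand-off is something only profiling would settle.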
Ryan Hunt said:
There is reference to isb in their tiering code, with a justification for why their solution is safe.
That's probably the answer you are looking for. What we need in Wasmtime (or any JIT runtime for that matter) is an ISB instruction that is broadcast to and executed by all processors. Such a functionality does not exist in the Arm architecture, so the membarrier() system call with the particular arguments used by Wasmtime simulates it.
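For reference, a minimal sketch in C of what such a call could look like on Linux, assuming the sync-core flavour of the command (`MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE`, which the kernel requires a process to register for before first use). The helper names (`membarrier_raw`, `jit_sync_core_supported`, `jit_register_sync_core`, `jit_broadcast_sync_core`) are hypothetical and this is not a copy of Wasmtime's or SpiderMonkey's code. It needs kernel headers recent enough to define the sync-core constants.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h> /* MEMBARRIER_CMD_* constants */
#include <sys/syscall.h>      /* __NR_membarrier */
#include <unistd.h>           /* syscall() */

/* Thin wrapper that invokes the system call through syscall(). */
static int membarrier_raw(int cmd, unsigned int flags) {
    return (int)syscall(__NR_membarrier, cmd, flags, 0);
}

/* Returns 1 if the kernel advertises both sync-core commands, 0 otherwise
 * (including when the system call itself is missing or blocked). */
static int jit_sync_core_supported(void) {
    int mask = membarrier_raw(MEMBARRIER_CMD_QUERY, 0);
    int need = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE |
               MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE;
    return mask >= 0 && (mask & need) == need;
}

/* Call once per process before the first broadcast. Returns 0 on success. */
static int jit_register_sync_core(void) {
    return membarrier_raw(
        MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/* Call after new code has been written and the icache maintenance is done.
 * Returns 0 on success, -1 with errno set otherwise. */
static int jit_broadcast_sync_core(void) {
    return membarrier_raw(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}
```

A runtime would typically query support once at startup and fall back to some other strategy (such as a per-thread ISB at hand-off points) when `jit_sync_core_supported()` returns 0, which is the situation on the older kernels mentioned earlier in the thread.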
Ryan Hunt said:
Is there guidance on the cost of an isb?
It's a bit of a special instruction, but it is essentially equivalent to a pipeline flush, so it is comparable to a branch mispredict (e.g. on Arm Neoverse V1 an 11 cycle pause until execution resumes normally). It definitely doesn't have any place in the middle of a hot loop (e.g. a matrix multiplication), but personally I wouldn't be too worried about it as part of a JIT code workflow (in the absence of profiling data, of course), hence its usage in Wasmtime.
Okay, thank you. This was very helpful!
I also got some comments from my colleagues - you might be able to avoid this requirement if you don't recycle code buffers.
The architecture allows you to do a limited amount of code editing without the ISB requirement.
For example, you can change a direct branch to one code buffer into another direct branch to a different code buffer.
The architecture has prefetch speculation protection that guarantees that if the updated branch is visible, then the code buffer is visible as well.
However, there are caveats and I think I am simplifying the explanation.
(e.g. the bit about not recycling code buffers is important in this case)
Anton Kirilov said:
I also got some comments from my colleagues - you might be able to avoid this requirement if you don't recycle code buffers.
What does 'recycle code buffers' mean? We may re-use virtual addresses of previous code, but will always decommit and recommit the pages using mmap when creating a new buffer in the virtual address space a previous buffer may have existed in.
Not a kernel expert, but that may not be enough because the kernel is not required to do much more than updating the page tables and flushing TLBs.
Anton Kirilov said:
The architecture has prefetch speculation protection that guarantees that if the updated branch is visible, then the code buffer is visible as well.
Interesting. The case we'd have isn't a direct branch being re-written, but an indirect branch to call into the JIT'ed wasm. The operand to the indirect branch is a code pointer stored in our function objects.
I think this argument is the justification V8 uses for how they patch their jump tables when tiering though.
Last updated: Dec 23 2024 at 13:07 UTC