Stream: wasmtime

Topic: Using membarrier when finalizing code


Ryan Hunt (Jul 19 2022 at 14:29):

SpiderMonkey emits a membarrier call while finalizing wasm code for execution. Right now we only do this if we compile a second tier of code, and I've been trying to understand why we don't do this for all module compilations (as modules can migrate between threads).

I found this previous discussion and issue for why wasmtime uses membarrier which has some nice context. Those discussions also refer to earlier discussions that led to SM using membarrier. This led to me thinking that we should likely use membarrier for all compilations to match wasmtime's behavior.

However, trying to use membarrier for all compilations runs into the issue that the syscall (at least with the command we need) is only available on Linux >= 4.16, which is problematic for the significant number of Android devices we support that run older kernels.
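
For reference, a minimal C sketch of the membarrier pattern under discussion: query kernel support, register once per process, then broadcast after code has been finalized. The helper names (`sync_core_supported`, `sync_core_register`, `sync_core_broadcast`) are hypothetical, the command values are copied from `<linux/membarrier.h>`, and this is not SpiderMonkey's or Wasmtime's actual code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Command values as defined in the Linux UAPI header <linux/membarrier.h>,
   repeated here so the sketch also builds against older headers. */
#define MB_CMD_QUERY                                0
#define MB_CMD_PRIVATE_EXPEDITED_SYNC_CORE          (1 << 5)
#define MB_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE (1 << 6)

/* Thin wrapper; reports ENOSYS where the syscall number is unknown. */
static long membarrier_call(int cmd) {
#if defined(__linux__) && defined(SYS_membarrier)
    return syscall(SYS_membarrier, cmd, 0, 0);
#else
    (void)cmd;
    errno = ENOSYS;
    return -1;
#endif
}

/* Returns 1 if the running kernel advertises the sync-core commands
   (added in Linux 4.16), 0 otherwise. */
int sync_core_supported(void) {
    long mask = membarrier_call(MB_CMD_QUERY);
    if (mask < 0)
        return 0;
    return (mask & MB_CMD_PRIVATE_EXPEDITED_SYNC_CORE) &&
           (mask & MB_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE);
}

/* Call once per process before the first broadcast. Returns 0 or -1. */
int sync_core_register(void) {
    if (!sync_core_supported())
        return -1;
    return membarrier_call(MB_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE) == 0 ? 0 : -1;
}

/* Call after the new code has been written and the icache synchronized.
   Returns 0 on success, -1 if the kernel refused (e.g. not registered). */
int sync_core_broadcast(void) {
    return membarrier_call(MB_CMD_PRIVATE_EXPEDITED_SYNC_CORE) == 0 ? 0 : -1;
}
```

On a kernel older than 4.16, `sync_core_register` returns -1 and the caller has to fall back to some other strategy, which is the problem described in this message.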

I looked at other VMs (V8/Dart/dotnetcore/mono) to see what they do, and could not find them using membarrier at all. They would just perform the icache synchronization step, but nothing beyond that AFAICT. I can provide links if useful. V8 for sure shares the code between threads, not sure about the others.

@Anton Kirilov do you know if there's some other trick that VMs use for broadcasting context synchronization that doesn't use membarrier? Or is it likely they're depending on some hidden behavior to make this not an issue in practice?

This is the first part of a fix to issue #3310. Unfortunately, there are more calls than necessary to rsix::process::membarrier(rsix::process::MembarrierCommand::RegisterPrivateExpeditedSyncCore) (...

Anton Kirilov (Jul 19 2022 at 14:45):

Ryan Hunt said:

I looked at other VMs (V8/Dart/dotnetcore/mono) to see what they do, and could not find them using membarrier at all. They would just perform the icache synchronization step, but nothing beyond that AFAICT. I can provide links if useful. V8 for sure shares the code between threads, not sure about the others.

Unfortunately some of those engines may simply not be compliant with the Arm memory model. As usual with issues of that nature, things may appear to work most of the time, with occasional problems that are hard to diagnose, so no immediate incentive to use membarrier(). V8 is an interesting case because some of my teammates work on it actively, so they should be aware of this issue - I will check with them.

Anton Kirilov (Jul 19 2022 at 14:51):

In principle rolling your own inter-thread messaging system (which the application may already implement for other purposes) and then defining a message that causes the receiver thread to execute an ISB should be enough. In fact, on Unix-like platforms sending a signal should work, since it is going to result in at least one transition from user- to kernelspace. Of course, enumerating all running threads is a bit of a challenge.

Ryan Hunt (Jul 19 2022 at 14:58):

Thank you. If it helps, here are some links for V8's wasm compiler generating code.

  1. module-compiler.cpp ExecuteCompilationUnits (run on any thread)
  2. wasm-code-manager.cpp AddCompiledCode, called by above
  3. wasm-code-manager.cpp AddCodeWithCodeSpace, called by above
  4. FlushInstructionCache on ARM64
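
For readers unfamiliar with the icache synchronization step that item 4 refers to, here is a rough C sketch using the GCC/Clang builtin `__builtin___clear_cache`. The function name `publish_code` is hypothetical and this is not V8's implementation. On AArch64 the builtin typically performs the data and instruction cache maintenance for the range and ends with an ISB, but that ISB only takes effect on the calling thread's core, which is the gap this thread is about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy freshly generated machine code into `dst` and run the
   instruction cache synchronization step over exactly that range.
   This does NOT make other cores execute a context synchronization
   event; it only covers the cache maintenance part. */
void publish_code(uint8_t *dst, const uint8_t *code, size_t len) {
    memcpy(dst, code, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```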

It's possible there is some extra synchronization in V8 somewhere, the codebase is quite large. But grepping for membarrier doesn't find it. And I've not seen any references in wasm-code-manager.cpp to this problem. There is reference to isb in their tiering code, with a justification for why their solution is safe. But nothing for their general multi-threaded compilation scheme.

Anton Kirilov (Jul 19 2022 at 14:58):

Oh, another point about what other engines are doing - they may have determined that currently they already have an ISB in the right place in their JIT flow. However, that's IMHO quite fragile - complex codebases tend to evolve constantly and any change might break that guarantee.

Anton Kirilov (Jul 19 2022 at 15:00):

The problem is that this is a bit of a complex implicit dependency/coupling that is difficult to track and affects only one platform (well, 2 if you distinguish between AArch32 and AArch64).

Ryan Hunt (Jul 19 2022 at 15:00):

Good point re: inter-thread messaging. Outside of tiering, it's well defined when a thread receives a new module and we can execute an isb at that point.

Anton Kirilov (Jul 19 2022 at 15:03):

The membarrier() mechanism is quite robust because it keeps the solution localized and independent of what the surrounding code is doing (including with respect to recycling code buffers, etc.).

Ryan Hunt (Jul 19 2022 at 15:03):

Is there guidance on the cost of an isb? I'm wondering how much effort it's worth to avoid redundant ones if we're placing them manually when receiving a module (which may have been compiled on the same thread, and already ran an isb).
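
One way to picture the manual placement being discussed, with redundant barriers skipped, is a per-thread epoch check. This is only a sketch under assumptions: the names `code_epoch`, `seen_epoch`, `finalize_code`, and `receive_module` are hypothetical, the module hand-off itself is assumed to be properly synchronized, tiering is out of scope, and the scheme has not been checked against the Arm memory model.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global counter, bumped each time code is finalized. */
static _Atomic uint64_t code_epoch = 0;
/* Per thread: newest epoch this thread has context-synchronized with. */
static _Thread_local uint64_t seen_epoch = 0;

/* ISB on AArch64; elsewhere only a compiler barrier so the sketch builds. */
static inline void context_sync(void) {
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/* Compiling thread: call after the code is written and the icache
   synchronized. Returns the epoch to store alongside the module. */
uint64_t finalize_code(void) {
    uint64_t epoch =
        atomic_fetch_add_explicit(&code_epoch, 1, memory_order_acq_rel) + 1;
    context_sync();      /* executed after the bump, so it covers all epochs <= epoch */
    seen_epoch = epoch;
    return epoch;
}

/* Receiving thread: call when a module arrives, before running its code.
   Returns 1 if an ISB was executed, 0 if it was skipped as redundant. */
int receive_module(uint64_t module_epoch) {
    if (module_epoch <= seen_epoch)
        return 0;
    uint64_t now = atomic_load_explicit(&code_epoch, memory_order_acquire);
    context_sync();
    seen_epoch = now;    /* everything finalized up to `now` is covered */
    return 1;
}
```

A thread that compiled the module itself, or that already synchronized after a newer module was finalized, skips the ISB; every other receiver pays for exactly one.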

Anton Kirilov (Jul 19 2022 at 15:06):

Ryan Hunt said:

There is reference to isb in their tiering code, with a justification for why their solution is safe.

That's probably the answer you are looking for. What we need in Wasmtime (or any JIT runtime for that matter) is an ISB instruction that is broadcast to and executed by all processors. Such functionality does not exist in the Arm architecture, so the membarrier() system call with the particular arguments used by Wasmtime simulates it.

Anton Kirilov (Jul 19 2022 at 15:24):

Ryan Hunt said:

Is there guidance on the cost of an isb?

It's a bit of a special instruction, but it is essentially equivalent to a pipeline flush, so it is comparable to a branch mispredict (e.g. on Arm Neoverse V1 an 11 cycle pause until execution resumes normally). It definitely doesn't have any place in the middle of a hot loop (e.g. a matrix multiplication), but personally I wouldn't be too worried about it as part of a JIT code workflow (in the absence of profiling data, of course), hence its usage in Wasmtime.

Ryan Hunt (Jul 19 2022 at 15:32):

Okay, thank you. This was very helpful!

Anton Kirilov (Jul 19 2022 at 15:34):

I also got some comments from my colleagues - you might be able to avoid this requirement if you don't recycle code buffers.

Anton Kirilov (Jul 19 2022 at 15:35):

The architecture allows you to do a limited amount of code editing without the ISB requirement.

Anton Kirilov (Jul 19 2022 at 15:36):

For example, you can change a direct branch to one code buffer into another direct branch to a different code buffer.

Anton Kirilov (Jul 19 2022 at 15:39):

The architecture has prefetch speculation protection that guarantees that if the updated branch is visible, then the code buffer is visible as well.
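
To make the branch-retargeting example concrete, here is a small C sketch that encodes an AArch64 `B` instruction and swaps it in with a single aligned 32-bit store. The function names `encode_b` and `patch_branch` and the separate `slot_addr` parameter are hypothetical, chosen so the sketch can be exercised off-target. It illustrates the mechanics only and does not cover the caveats (such as not recycling code buffers) on which the architectural guarantee depends.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Encode `B target` for an instruction located at address `pc`.
   Returns 0 (never a valid B encoding) if the target is misaligned
   or outside the +/-128 MiB range of the 26-bit word offset. */
uint32_t encode_b(uint64_t pc, uint64_t target) {
    int64_t delta = (int64_t)(target - pc);
    if ((delta & 3) != 0 ||
        delta < -(INT64_C(1) << 27) || delta >= (INT64_C(1) << 27))
        return 0;
    uint32_t imm26 = (uint32_t)((uint64_t)delta >> 2) & 0x03FFFFFFu;
    return 0x14000000u | imm26;
}

/* Retarget the branch stored at `slot`, which executes at `slot_addr`.
   The target code buffer must already be fully written and cache
   synchronized before this is called. Returns 0 on success, -1 if the
   new target cannot be encoded. */
int patch_branch(_Atomic uint32_t *slot, uint64_t slot_addr,
                 uint64_t new_target) {
    uint32_t insn = encode_b(slot_addr, new_target);
    if (insn == 0)
        return -1;
    /* One aligned 32-bit store: a concurrent fetch sees either the old
       or the new branch, never a mix of the two. */
    atomic_store_explicit(slot, insn, memory_order_release);
    /* Cache maintenance so the patched instruction becomes visible. */
    __builtin___clear_cache((char *)slot, (char *)(slot + 1));
    return 0;
}
```

For instance, `encode_b(0x1000, 0x1008)` yields `0x14000002`, a branch two instructions forward.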

Anton Kirilov (Jul 19 2022 at 15:39):

However, there are caveats and I think I am simplifying the explanation.

Anton Kirilov (Jul 19 2022 at 15:55):

(e.g. the bit about not recycling code buffers is important in this case)

Ryan Hunt (Jul 19 2022 at 15:55):

Anton Kirilov said:

I also got some comments from my colleagues - you might be able to avoid this requirement if you don't recycle code buffers.

What does 'recycle code buffers' mean? We may re-use virtual addresses of previous code, but will always decommit and recommit the pages using mmap when creating a new buffer in the virtual address space a previous buffer may have existed in.

Anton Kirilov (Jul 19 2022 at 15:57):

Not a kernel expert, but that may not be enough because the kernel is not required to do much more than updating the page tables and flushing TLBs.

Ryan Hunt (Jul 19 2022 at 16:05):

Anton Kirilov said:

The architecture has prefetch speculation protection that guarantees that if the updated branch is visible, then the code buffer is visible as well.

Interesting. The case we'd have isn't a direct branch being re-written, but an indirect branch to call into the JIT'ed wasm. The operand to the indirect branch is a code pointer stored in our function objects.

I think this argument is the justification V8 uses for how they patch their jump tables when tiering though.


Last updated: Dec 23 2024 at 13:07 UTC