Code generation and CPU cache maintenance · wasmtime

I am trying to determine if any CPU cache maintenance is performed after code generation finishes and before execution starts, and currently it looks like only the memory region is changed from readable and writable to readable and executable. Am I missing something?

bytecodealliance/wasmtime

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

Chris Fallin (Jul 02 2021 at 17:10):

I think that is indeed all we do. On x86 at least the coherence model handles self-modifying code in the most programmer-friendly way (i.e. a write to code memory is immediately seen by fetch, architecturally, causing great pain and suffering in the icache logic) but I imagine this isn't the case on other architectures; do we need to do something special here for e.g. aarch64?

Anton Kirilov (Jul 02 2021 at 17:23):

Yes, we do. The 64-bit Arm architecture requires two actions - cache maintenance and context synchronization (usually an ISB instruction). The full sequence can be seen in GCC's implementation of __builtin___clear_cache, for example. Alternatively, section B2.4.4 of the Arm Architecture Reference Manual provides the appropriate instruction sequence.

Anton Kirilov (Jul 02 2021 at 17:26):

The Arm Neoverse cores actually have coherent data and instruction caches, so cache maintenance is not necessary in that case (and the GCC code does the appropriate thing), but that's an optional feature of the architecture.

Anton Kirilov (Jul 02 2021 at 17:32):

Context synchronization is still necessary and is indeed the trickier bit because it must be performed by all processors (in practice that would mean threads) that may execute the modified code. I remember that @Benjamin Bouvier had some fun implementing that in Spidermonkey :smile:.

Chris Fallin (Jul 02 2021 at 17:34):

And yes, I was just about to ask about the multicore case, especially as we compile on a thread pool. Is it enough to flush dcache out (write back dirty lines, which broadly speaking looks like the first half of that code sequence) at the end of compile, and flush icache (second half of sequence) on all cores?

Chris Fallin (Jul 02 2021 at 17:34):

I can imagine a lazy design for the latter where we track "code generation number" and latest generation seen on each core, and flush only when needed

Chris Fallin (Jul 02 2021 at 17:35):

I guess I should go read SpiderMonkey's code to see how @Benjamin Bouvier solved this :-)

Anton Kirilov (Jul 02 2021 at 17:37):

The instruction cache invalidation is not an issue because it broadcasts in the whole coherency domain.

Chris Fallin (Jul 02 2021 at 17:39):

ah, cool, so we don't need the equivalent of the linux kernel's "run this code on all cores" primitive

Chris Fallin (Jul 02 2021 at 17:39):

Anton Kirilov (Jul 02 2021 at 17:40):

Well, we don't need it for cache maintenance, but we need it for context synchronization.

Anton Kirilov (Jul 02 2021 at 17:43):

Chris Fallin (Jul 02 2021 at 17:44):

but I haven't fully internalized this and in any case you would know best what is actually needed -- so please do send a PR if you've got an idea of a reasonable fix :-)

Chris Fallin (Jul 02 2021 at 17:45):

(fwiw we don't update code pointers with atomic stores because we don't do the racy swap-out-new-code thing that FF does between tiers, but we could easily add a fence in the right place)

bjorn3 (Jul 02 2021 at 17:46):

Anton Kirilov (Jul 02 2021 at 17:51):

Right, I've read through the same discussion before when discussing with @Benjamin Bouvier, so I'll just repeat my points - I don't necessarily disagree with what my colleague Jacob Bramley has said, but my feeling is that this is relying on implementation details, which is IMHO fragile.

Anton Kirilov (Jul 02 2021 at 17:53):

Basically the idea is that if the only way for a core to see the updated code is via an update to a function pointer, then context synchronization might not be necessary.

Anton Kirilov (Jul 02 2021 at 17:54):

The key idea being that an update to a function pointer is a data update, not a code update.

Anton Kirilov (Jul 02 2021 at 17:55):

This assumes that the buffer holding the updated code is "fresh", i.e. no one has ever executed code from it before.

Chris Fallin (Jul 02 2021 at 17:57):

Point of clarification: what do you mean by "context synchronization" exactly? (I understand cache coherence and weak memory models but am not familiar with that exact term)

Chris Fallin (Jul 02 2021 at 17:58):

and, yes, I agree, relying on the incidental synchronization edges that occur because of the data-structure updates is more fragile than I would prefer as well

bjorn3 (Jul 02 2021 at 17:59):

Anton Kirilov (Jul 02 2021 at 18:00):

For application programmers "context synchronization" almost always means "execution of an ISB instruction". That's a good point, I'll just say that from now on to make things clearer - I am obviously used to the Arm architecture terminology :smile:.

Chris Fallin (Jul 02 2021 at 18:01):

Ah, OK, so this just means "happens-before edge exists according to the memory model"

Anton Kirilov (Jul 02 2021 at 18:02):

@bjorn3 Do you mean mprotect()? Either way, the answer is no, and part of the reason is that TLB invalidations also broadcast (so there is no need for inter-processor interrupts).

Anton Kirilov (Jul 02 2021 at 18:07):

@Chris Fallin ISB is not so much about the memory model (which is covered by cache maintenance), but about speculative execution and the associated buffers/caches.

Chris Fallin (Jul 02 2021 at 18:08):

right, specifically in-flight instructions in the pipeline that may have already been fetched from stale cache lines are an issue, and data in store buffers, and ... (I worked on hardware details related to this in a past life :-) )

Chris Fallin (Jul 02 2021 at 18:08):

Anton Kirilov (Jul 02 2021 at 18:09):

Anton Kirilov (Jul 02 2021 at 18:12):

Basically, what people need is a broadcast ISB, which doesn't exist in the architecture.

Anton Kirilov (Jul 02 2021 at 18:13):

(and I don't think that's an accidental omission - pretty sure it has been discussed before)

Anton Kirilov (Jul 02 2021 at 18:14):

On Linux at least there is a straightforward equivalent system call - membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0).

Chris Fallin (Jul 02 2021 at 18:16):

OK, great. Happy to take a PR to add that. We should look into whether there is an equivalent on macOS/aarch64 as well (sadly I don't have hardware to test this; friendly ping to @Benjamin Bouvier if interested)

Anton Kirilov (Jul 02 2021 at 18:20):

Anton Kirilov (Jul 02 2021 at 18:22):

I think I can look into adding the cache maintenance, but the ISB part is more complicated because membarrier() doesn't just interrupt all threads in the process.

Anton Kirilov (Jul 02 2021 at 18:23):

Actually, threads that are interested in the notification have to register first and they are the ones that are interrupted.

Anton Kirilov (Jul 02 2021 at 18:25):

So, in Wasmtime, for example, threads that just compile functions don't need to execute ISB - I am not sure, do we distinguish between them and the threads that might run generated code?

Chris Fallin (Jul 02 2021 at 18:27):

I think that the thread pool management happens under the hood in Rayon (library)

Chris Fallin (Jul 02 2021 at 18:27):

so potentially any compilation task or async wasm function invocation could happen on any thread

Chris Fallin (Jul 02 2021 at 18:29):

I'm kind of thinking again about the lazy-ISB / "code generation number" approach. Could we just have a conditional and ISB in the wasm-entry thunk that triggers the first time we execute wasm after someone else generates some code?

Chris Fallin (Jul 02 2021 at 18:29):

that's a cost for sure but if code almost never changes then the branch should be predictable

Anton Kirilov (Jul 02 2021 at 18:30):

First thought - you'd need to switch between RX and RW permissions in the middle of execution, though.

Anton Kirilov (Jul 02 2021 at 18:31):

Chris Fallin (Jul 02 2021 at 18:31):

Anton Kirilov (Jul 02 2021 at 18:31):

Chris Fallin (Jul 02 2021 at 18:32):

Anton Kirilov (Jul 02 2021 at 18:33):

Well, the architecture also allows a very limited form of instruction patching that doesn't require any synchronization.

Anton Kirilov (Jul 02 2021 at 18:34):

Anton Kirilov (Jul 02 2021 at 18:36):

Still, I think the membarrier() solution is cleaner and simpler, so I'll just have to read up on Rayon a bit.

Chris Fallin (Jul 02 2021 at 18:39):

Ah, I see, then the nop-ification of the ISB lazily propagates into reality, and it's ok if we get a spurious ISB before it is fully synchronized. makes sense!

Chris Fallin (Jul 02 2021 at 18:40):

I would actually lean a bit toward the generation-and-ISB approach because it avoids a syscall, and is more portable (any aarch64 OS)

Chris Fallin (Jul 02 2021 at 18:40):

but if you want to do the membarrier thing first that's fine, I think, as it is simpler

Anton Kirilov (Jul 02 2021 at 18:54):

Hm, I am reading the documentation again and I might be misremembering - the opt-in done by the membarrier() call might be process-wide, so there would be no need for individual threads to opt in, which makes things even easier.

Benjamin Bouvier (Jul 05 2021 at 15:14):

Oh yes, I do remember this stuff :sweat_smile: I do remember that the membarrier was process-wide, and past-me seemed to confirm this in the changeset message:

Benjamin Bouvier (Jul 05 2021 at 15:15):

Anton Kirilov (Jul 05 2021 at 18:50):

Anton Kirilov (Sep 07 2021 at 18:15):

Missing cache maintenance and context synchronization operations after JIT compilation is completed on AArch64 · Issue #3310 · bytecodealliance/wasmtime

Currently (as of commit 164835e), after code generation is finished and before execution from the newly generated code starts on AArch64, both cranelift-jit and wasmtime-jit do not perform any acti...

Anton Kirilov (Sep 07 2021 at 18:40):

I believe that in order to have a clean solution some additions are necessary to the compiler-builtins and the libc crates (I really don't think that hardcoding the assembly sequence for cache maintenance in Wasmtime is the best approach), which is going to take a while, so it is useful to have a tracking issue.

Chris Fallin (Sep 07 2021 at 18:43):

Having builtins would be ideal, but fwiw, I don't think it would be too much of a problem to just write the assembly in the meantime. The wasmtime-fiber crate for example has an assembly file per architecture

Anton Kirilov (Sep 07 2021 at 18:57):

I considered adding helpers to the crates/runtime/src/helpers.c file (one helper for the cache maintenance operations and another for membarrier(); in fact, in the former case we can just call the C function, I suppose), but it seemed wrong to make cranelift-jit use that crate, so I decided to pursue what is in any case the proper solution (which would benefit other JIT runtimes implemented in Rust).

Chris Fallin (Sep 07 2021 at 19:15):

Yeah, that's true, it should be factored out somewhere. If getting things into libc or compiler-builtins becomes a bottleneck, I could see the case for a separate jit-icache-coherence crate (either as part of wasmtime or maybe even as an independent thing on crates.io)

Stream: wasmtime

Topic: Code generation and CPU cache maintenance

Anton Kirilov (Jul 02 2021 at 17:06):

Chris Fallin (Jul 02 2021 at 17:10):

Anton Kirilov (Jul 02 2021 at 17:23):

Anton Kirilov (Jul 02 2021 at 17:26):

Anton Kirilov (Jul 02 2021 at 17:32):

Chris Fallin (Jul 02 2021 at 17:34):

Chris Fallin (Jul 02 2021 at 17:34):

Chris Fallin (Jul 02 2021 at 17:35):

Anton Kirilov (Jul 02 2021 at 17:37):

Chris Fallin (Jul 02 2021 at 17:39):

Chris Fallin (Jul 02 2021 at 17:39):

Anton Kirilov (Jul 02 2021 at 17:40):

Anton Kirilov (Jul 02 2021 at 17:43):

Chris Fallin (Jul 02 2021 at 17:44):

Chris Fallin (Jul 02 2021 at 17:44):

Chris Fallin (Jul 02 2021 at 17:45):

bjorn3 (Jul 02 2021 at 17:46):

Anton Kirilov (Jul 02 2021 at 17:51):

Anton Kirilov (Jul 02 2021 at 17:53):

Anton Kirilov (Jul 02 2021 at 17:54):

Anton Kirilov (Jul 02 2021 at 17:55):

Chris Fallin (Jul 02 2021 at 17:57):

Chris Fallin (Jul 02 2021 at 17:58):

bjorn3 (Jul 02 2021 at 17:59):

Anton Kirilov (Jul 02 2021 at 18:00):

Chris Fallin (Jul 02 2021 at 18:01):

Anton Kirilov (Jul 02 2021 at 18:02):

Anton Kirilov (Jul 02 2021 at 18:07):

Chris Fallin (Jul 02 2021 at 18:08):

Chris Fallin (Jul 02 2021 at 18:08):

Anton Kirilov (Jul 02 2021 at 18:09):

Anton Kirilov (Jul 02 2021 at 18:12):

Anton Kirilov (Jul 02 2021 at 18:13):

Anton Kirilov (Jul 02 2021 at 18:14):

Chris Fallin (Jul 02 2021 at 18:16):

Anton Kirilov (Jul 02 2021 at 18:20):

Anton Kirilov (Jul 02 2021 at 18:22):

Anton Kirilov (Jul 02 2021 at 18:23):

Anton Kirilov (Jul 02 2021 at 18:25):

Chris Fallin (Jul 02 2021 at 18:27):

Chris Fallin (Jul 02 2021 at 18:27):

Chris Fallin (Jul 02 2021 at 18:29):

Chris Fallin (Jul 02 2021 at 18:29):

Anton Kirilov (Jul 02 2021 at 18:30):

Anton Kirilov (Jul 02 2021 at 18:31):

Chris Fallin (Jul 02 2021 at 18:31):

Anton Kirilov (Jul 02 2021 at 18:31):

Chris Fallin (Jul 02 2021 at 18:32):

Anton Kirilov (Jul 02 2021 at 18:33):

Anton Kirilov (Jul 02 2021 at 18:34):

Anton Kirilov (Jul 02 2021 at 18:36):

Chris Fallin (Jul 02 2021 at 18:39):

Chris Fallin (Jul 02 2021 at 18:40):

Chris Fallin (Jul 02 2021 at 18:40):

Anton Kirilov (Jul 02 2021 at 18:54):

Benjamin Bouvier (Jul 05 2021 at 15:14):

Benjamin Bouvier (Jul 05 2021 at 15:15):

Anton Kirilov (Jul 05 2021 at 18:50):

Anton Kirilov (Sep 07 2021 at 18:15):

Anton Kirilov (Sep 07 2021 at 18:40):

Chris Fallin (Sep 07 2021 at 18:43):

Anton Kirilov (Sep 07 2021 at 18:57):

Chris Fallin (Sep 07 2021 at 19:15):

Chris Fallin (Sep 07 2021 at 19:15):