Stream: wasmtime

Topic: Code generation and CPU cache maintenance


view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:06):

I am trying to determine if any CPU cache maintenance is performed after code generation finishes and before execution starts, and currently it looks like only the memory region is changed from readable and writable to readable and executable. Am I missing something?

Standalone JIT-style runtime for WebAssembly, using Cranelift - bytecodealliance/wasmtime

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:10):

I think that is indeed all we do. On x86 at least the coherence model handles self-modifying code in the most programmer-friendly way (i.e. a write to code memory is immediately seen by fetch, architecturally, causing great pain and suffering in the icache logic) but I imagine this isn't the case on other architectures; do we need to do something special here for e.g. aarch64?

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:23):

Yes, we do. The 64-bit Arm architecture requires two actions - cache maintenance and context synchronization (usually an ISB instruction). The full sequence can be seen in GCC's implementation of __builtin___clear_cache, for example. Alternatively, section B2.4.4 of the Arm Architecture Reference Manual provides the appropriate instruction sequence.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:26):

The Arm Neoverse cores actually have coherent data and instruction caches, so cache maintenance is not necessary in that case (and the GCC code does the appropriate thing), but that's an optional feature of the architecture.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:32):

Context synchronization is still necessary and is indeed the trickier bit because it must be performed by all processors (in practice that would mean threads) that may execute the modified code. I remember that @Benjamin Bouvier had some fun implementing that in Spidermonkey :smile:.

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:34):

Happy to take a PR for this, and thanks for raising the issue!

And yes, I was just about to ask about the multicore case, especially as we compile on a thread pool. Is it enough to flush dcache out (write back dirty lines, which broadly speaking looks like the first half of that code sequence) at the end of compile, and flush icache (second half of sequence) on all cores?

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:34):

I can imagine a lazy design for the latter where we track "code generation number" and latest generation seen on each core, and flush only when needed

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:35):

I guess I should go read SpiderMonkey's code to see how @Benjamin Bouvier solved this :-)

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:37):

The instruction cache invalidation is not an issue because it broadcasts in the whole coherency domain.

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:39):

ah, cool, so we don't need the equivalent of the linux kernel's "run this code on all cores" primitive

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:39):

this looks like the right bit?: https://searchfox.org/mozilla-central/rev/60c185194386e8e44bbeb09f5547ad5994d4c774/js/src/jit/ProcessExecutableMemory.cpp#785-791

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:40):

Well, we don't need it for cache maintenance, but we need it for context synchronization.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:43):

The code snippet that you linked to doesn't look like what I had in mind.

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:44):

So there is some interesting discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1529933 about how Firefox's scheme is safe due to ordering around an atomic code-pointer update

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:44):

but I haven't fully internalized this and in any case you would know best what is actually needed -- so please do send a PR if you've got an idea of a reasonable fix :-)

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:45):

(fwiw we don't update code pointers with atomic stores because we don't do the racy swap-out-new-code thing that FF does between tiers, but we could easily add a fence in the right place)

view this post on Zulip bjorn3 (Jul 02 2021 at 17:46):

cranelift-jit also doesn't do cache maintenance or context synchronization.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:51):

Right, I've read through the same discussion before when discussing with @Benjamin Bouvier, so I'll just repeat my points - I don't necessarily disagree with what my colleague Jacob Bramley has said, but my feeling is that this is relying on implementation details, which is IMHO fragile.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:53):

Basically the idea is that if the only way for a core to see the updated code is via an update to a function pointer, then context synchronization might not be necessary.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:54):

The key idea being that an update to a function pointer is a data update, not a code update.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 17:55):

This assumes that the buffer holding the updated code is "fresh", i.e. no one has ever executed code from it before.

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:57):

Point of clarification: what do you mean by "context synchronization" exactly? (I understand cache coherence and weak memory models but am not familiar with that exact term)

view this post on Zulip Chris Fallin (Jul 02 2021 at 17:58):

and, yes, I agree, relying on the incidental synchronization edges that occur because of the data-structure updates is more fragile than I would prefer as well

view this post on Zulip bjorn3 (Jul 02 2021 at 17:59):

Anton Kirilov said:

This assumes that the buffer holding the updated code is "fresh", i.e. no one has ever executed code from it before.

Is mmap(PROT_EXEC) enough to make it fresh?

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:00):

For application programmers "context synchronization" almost always means "execution of an ISB instruction". That's a good point, I'll just say that from now on to make things clearer - I am obviously used to the Arm architecture terminology :smile:.

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:01):

Ah, OK, so this just means "happens-before edge exists according to the memory model"

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:02):

@bjorn3 Do you mean mprotect()? Either way, the answer is no, and part of the reason is that TLB invalidations also broadcast (so there is no need for inter-processor interrupts).

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:07):

@Chris Fallin ISB is not so much about the memory model (which is covered by cache maintenance), but about speculative execution and the associated buffers/caches.

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:08):

right, specifically in-flight instructions in the pipeline that may have already been fetched from stale cache lines are an issue, and data in store buffers, and ... (I worked on hardware details related to this in a past life :-) )

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:08):

ISB is just a full pipeline synchronization/fence?

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:09):

Yes, that's right

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:12):

Basically, what people need is a broadcast ISB, which doesn't exist in the architecture.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:13):

(and I don't think that's an accidental omission - pretty sure it has been discussed before)

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:14):

On Linux at least there is a straightforward equivalent system call - membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0).

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:16):

OK, great. Happy to take a PR to add that. We should look into whether there is an equivalent on macOS/aarch64 as well (sadly I don't have hardware to test this; friendly ping to @Benjamin Bouvier if interested)

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:20):

BTW bug 1661016 covers the work by @Benjamin Bouvier I mentioned.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:22):

I think I can look into adding the cache maintenance, but the ISB part is more complicated because membarrier() doesn't just interrupt all threads in the process.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:23):

Actually, threads that are interested in the notification have to register first and they are the ones that are interrupted.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:25):

So, in Wasmtime, for example, threads that just compile functions don't need to execute ISB - I am not sure, do we distinguish between them and the threads that might run generated code?

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:27):

I think that the thread pool management happens under the hood in Rayon (library)

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:27):

so potentially any compilation task or async wasm function invocation could happen on any thread

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:29):

I'm kind of thinking again about the lazy-ISB / "code generation number" approach. Could we just have a conditional and ISB in the wasm-entry thunk that triggers the first time we execute wasm after someone else generates some code?

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:29):

that's a cost for sure but if code almost never changes then the branch should be predictable

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:30):

First thought - you'd need to switch between RX and RW permissions in the middle of execution, though.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:31):

Oh, no, sorry, you have a conditional branch.

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:31):

right, code itself is static, no patching in an ISB or anything

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:31):

I was thinking of optimizing it :smile:.

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:32):

two different thunks? or some sort of dynamic patching?

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:33):

Well, the architecture also allows a very limited form of instruction patching that doesn't require any synchronization.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:34):

One particular case is turning an ISB into a NOP.

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:36):

Still, I think the membarrier() solution is cleaner and simpler, so I'll just have to read up on Rayon a bit.

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:39):

Ah, I see, then the nop-ification of the ISB lazily propagates into reality, and it's ok if we get a spurious ISB before it is fully synchronized. makes sense!

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:40):

I would actually lean a bit toward the generation-and-ISB approach because it avoids a syscall, and is more portable (any aarch64 OS)

view this post on Zulip Chris Fallin (Jul 02 2021 at 18:40):

but if you want to do the membarrier thing first that's fine, I think, as it is simpler

view this post on Zulip Anton Kirilov (Jul 02 2021 at 18:54):

Hm, I am reading the documentation again and I might be misremembering - the opt-in done by the membarrier() call might be process-wide, so there would be no need for individual threads to opt in, which makes things even easier.

view this post on Zulip Benjamin Bouvier (Jul 05 2021 at 15:14):

Oh yes, I do remember this stuff :sweat_smile: I do remember that the membarrier was process-wide, and past-me seemed to confirm this in the changeset message:

On real hardware, when a background thread finishes compilation, it must signal
to the other executing threads that they need to reload a new version of the
code. Ideally, each executing thread would run an ISB instruction to do so. We
hereby use a system call membarrier that interrupts every other running thread,
and will cause the same effect as a local ISB would. It is heavyweight, so we
make sure to only run it in the case where we're on a background thread.

view this post on Zulip Benjamin Bouvier (Jul 05 2021 at 15:15):

That being said, it is not clear what needs to be done on MacOS aarch64.

view this post on Zulip Anton Kirilov (Jul 05 2021 at 18:50):

Here is Apple's developer documentation on porting JIT compilers.

view this post on Zulip Anton Kirilov (Sep 07 2021 at 18:15):

I opened issue #3310 to track this. We can also continue the discussion there.

Currently (as of commit 164835e), after code generation is finished and before execution from the newly generated code starts on AArch64, both cranelift-jit and wasmtime-jit do not perform any acti...

view this post on Zulip Anton Kirilov (Sep 07 2021 at 18:40):

I believe that in order to have a clean solution some additions are necessary to the compiler-builtins and the libc crates (I really don't think that hardcoding the assembly sequence for cache maintenance in Wasmtime is the best approach), which is going to take a while, so it is useful to have a tracking issue.

view this post on Zulip Chris Fallin (Sep 07 2021 at 18:43):

Having builtins would be ideal, but fwiw, I don't think it would be too much of a problem to just write the assembly in the meantime. The wasmtime-fiber crate for example has an assembly file per architecture

view this post on Zulip Anton Kirilov (Sep 07 2021 at 18:57):

I considered adding helpers to the crates/runtime/src/helpers.c file (one helper for the cache maintenance operations and another for membarrier(); in fact, in the former case we can just call the C function, I suppose), but it seemed wrong to make cranelift-jit use that crate, so I decided to pursue what is in any case the proper solution (which would benefit other JIT runtimes implemented in Rust).

view this post on Zulip Chris Fallin (Sep 07 2021 at 19:15):

Yeah, that's true, it should be factored out somewhere. If getting things into libc or compiler-builtins becomes a bottleneck, I could see the case for a separate jit-icache-coherence crate (either as part of wasmtime or maybe even as an independent thing on crates.io)

view this post on Zulip Chris Fallin (Sep 07 2021 at 19:15):

or possibly it could belong in region-rs


Last updated: Oct 23 2024 at 20:03 UTC