I am trying to determine if any CPU cache maintenance is performed after code generation finishes and before execution starts, and currently it looks like only the memory region is changed from readable and writable to readable and executable. Am I missing something?
I think that is indeed all we do. On x86 at least the coherence model handles self-modifying code in the most programmer-friendly way (i.e. a write to code memory is immediately seen by fetch, architecturally, causing great pain and suffering in the icache logic) but I imagine this isn't the case on other architectures; do we need to do something special here for e.g. aarch64?
Yes, we do. The 64-bit Arm architecture requires two actions - cache maintenance and context synchronization (usually an ISB
instruction). The full sequence can be seen in GCC's implementation of __builtin___clear_cache
, for example. Alternatively, section B2.4.4 of the Arm Architecture Reference Manual provides the appropriate instruction sequence.
The Arm Neoverse cores actually have coherent data and instruction caches, so cache maintenance is not necessary in that case (and the GCC code does the appropriate thing), but that's an optional feature of the architecture.
Context synchronization is still necessary and is indeed the trickier bit because it must be performed by all processors (in practice that would mean threads) that may execute the modified code. I remember that @Benjamin Bouvier had some fun implementing that in Spidermonkey :smile:.
Happy to take a PR for this, and thanks for raising the issue!
And yes, I was just about to ask about the multicore case, especially as we compile on a thread pool. Is it enough to flush dcache out (write back dirty lines, which broadly speaking looks like the first half of that code sequence) at the end of compile, and flush icache (second half of sequence) on all cores?
I can imagine a lazy design for the latter where we track "code generation number" and latest generation seen on each core, and flush only when needed
I guess I should go read SpiderMonkey's code to see how @Benjamin Bouvier solved this :-)
The instruction cache invalidation is not an issue because it broadcasts in the whole coherency domain.
ah, cool, so we don't need the equivalent of the linux kernel's "run this code on all cores" primitive
this looks like the right bit?: https://searchfox.org/mozilla-central/rev/60c185194386e8e44bbeb09f5547ad5994d4c774/js/src/jit/ProcessExecutableMemory.cpp#785-791
Well, we don't need it for cache maintenance, but we need it for context synchronization.
The code snippet that you linked to doesn't look like what I had in mind.
So there is some interesting discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1529933 about how Firefox's scheme is safe due to ordering around an atomic code-pointer update
but I haven't fully internalized this and in any case you would know best what is actually needed -- so please do send a PR if you've got an idea of a reasonable fix :-)
(fwiw we don't update code pointers with atomic stores because we don't do the racy swap-out-new-code thing that FF does between tiers, but we could easily add a fence in the right place)
cranelift-jit also doesn't do cache maintenance or context synchronization.
Right, I've read through the same discussion before when discussing with @Benjamin Bouvier, so I'll just repeat my points - I don't necessarily disagree with what my colleague Jacob Bramley has said, but my feeling is that this is relying on implementation details, which is IMHO fragile.
Basically the idea is that if the only way for a core to see the updated code is via an update to a function pointer, then context synchronization might not be necessary.
The key idea being that an update to a function pointer is a data update, not a code update.
This assumes that the buffer holding the updated code is "fresh", i.e. no one has ever executed code from it before.
Point of clarification: what do you mean by "context synchronization" exactly? (I understand cache coherence and weak memory models but am not familiar with that exact term)
and, yes, I agree, relying on the incidental synchronization edges that occur because of the data-structure updates is more fragile than I would prefer as well
Anton Kirilov said:
This assumes that the buffer holding the updated code is "fresh", i.e. no one has ever executed code from it before.
Is mmap(PROT_EXEC)
enough to make it fresh?
For application programmers "context synchronization" almost always means "execution of an ISB
instruction". That's a good point, I'll just say that from now on to make things clearer - I am obviously used to the Arm architecture terminology :smile:.
Ah, OK, so this just means "happens-before edge exists according to the memory model"
@bjorn3 Do you mean mprotect()
? Either way, the answer is no, and part of the reason is that TLB invalidations also broadcast (so there is no need for inter-processor interrupts).
@Chris Fallin ISB
is not so much about the memory model (which is covered by cache maintenance), but about speculative execution and the associated buffers/caches.
right, specifically in-flight instructions in the pipeline that may have already been fetched from stale cache lines are an issue, and data in store buffers, and ... (I worked on hardware details related to this in a past life :-) )
ISB
is just a full pipeline synchronization/fence?
Yes, that's right
Basically, what people need is a broadcast ISB
, which doesn't exist in the architecture.
(and I don't think that's an accidental omission - pretty sure it has been discussed before)
On Linux at least there is a straightforward equivalent system call - membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0)
.
OK, great. Happy to take a PR to add that. We should look into whether there is an equivalent on macOS/aarch64 as well (sadly I don't have hardware to test this; friendly ping to @Benjamin Bouvier if interested)
BTW bug 1661016 covers the work by @Benjamin Bouvier I mentioned.
I think I can look into adding the cache maintenance, but the ISB
part is more complicated because membarrier()
doesn't just interrupt all threads in the process.
Actually, threads that are interested in the notification have to register first and they are the ones that are interrupted.
So, in Wasmtime, for example, threads that just compile functions don't need to execute ISB
- I am not sure, do we distinguish between them and the threads that might run generated code?
I think that the thread pool management happens under the hood in Rayon (library)
so potentially any compilation task or async wasm function invocation could happen on any thread
I'm kind of thinking again about the lazy-ISB / "code generation number" approach. Could we just have a conditional and ISB in the wasm-entry thunk that triggers the first time we execute wasm after someone else generates some code?
that's a cost for sure but if code almost never changes then the branch should be predictable
First thought - you'd need to switch between RX and RW permissions in the middle of execution, though.
Oh, no, sorry, you have a conditional branch.
right, code itself is static, no patching in an ISB or anything
I was thinking of optimizing it :smile:.
two different thunks? or some sort of dynamic patching?
Well, the architecture also allows a very limited form of instruction patching that doesn't require any synchronization.
One particular case is turning an ISB
into a NOP
.
Still, I think the membarrier()
solution is cleaner and simpler, so I'll just have to read up on Rayon a bit.
Ah, I see, then the nop-ification of the ISB lazily propagates into reality, and it's ok if we get a spurious ISB before it is fully synchronized. makes sense!
I would actually lean a bit toward the generation-and-ISB approach because it avoids a syscall, and is more portable (any aarch64 OS)
but if you want to do the membarrier thing first that's fine, I think, as it is simpler
Hm, I am reading the documentation again and I might be misremembering - the opt-in done by the membarrier()
call might be process-wide, so there would be no need for individual threads to opt in, which makes things even easier.
Oh yes, I do remember this stuff :sweat_smile: I do remember that the membarrier
was process-wide, and past-me seemed to confirm this in the changeset message:
On real hardware, when a background thread finishes compilation, it must signal
to the other executing threads that they need to reload a new version of the
code. Ideally, each executing thread would run an ISB instruction to do so. We
hereby use a system call membarrier that interrupts every other running thread,
and will cause the same effect as a local ISB would. It is heavyweight, so we
make sure to only run it in the case where we're on a background thread.
That being said, it is not clear what needs to be done on MacOS aarch64.
Here is Apple's developer documentation on porting JIT compilers.
I opened issue #3310 to track this. We can also continue the discussion there.
I believe that in order to have a clean solution some additions are necessary to the compiler-builtins
and the libc
crates (I really don't think that hardcoding the assembly sequence for cache maintenance in Wasmtime is the best approach), which is going to take a while, so it is useful to have a tracking issue.
Having builtins would be ideal, but fwiw, I don't think it would be too much of a problem to just write the assembly in the meantime. The wasmtime-fiber
crate for example has an assembly file per architecture
I considered adding helpers to the crates/runtime/src/helpers.c
file (one helper for the cache maintenance operations and another for membarrier()
; in fact, in the former case we can just call the C function, I suppose), but it seemed wrong to make cranelift-jit
use that crate, so I decided to pursue what is in any case the proper solution (which would benefit other JIT runtimes implemented in Rust).
Yeah, that's true, it should be factored out somewhere. If getting things into libc or compiler-builtins becomes a bottleneck, I could see the case for a separate jit-icache-coherence
crate (either as part of wasmtime or maybe even as an independent thing on crates.io)
or possibly it could belong in region-rs
Last updated: Dec 23 2024 at 14:03 UTC