alexcrichton commented on issue #3426:
I'm trying to read up a bit more on the AArch64 requirements here to understand better why these are required. I'm currently reaching the conclusion though of wondering if we need this due to the safety of Rust, but you know much more than me so I'm hoping that you can educate me!
In the ARM manual 2.2.5 that talks about self-modifying code and 2.4.4 talks about cache coherency, but at least for us JIT code isn't self-modifying and for cache coherency I figured that Rust safety guarantees would save us from having to execute these barriers. For example the
Module
, when created, is only visible to the calling thread. For any other thread to actually get access to theModule
there'd have to be some form of external synchronization (e.g. a mutex, a channel, etc). Do the typical memory-barrier instructions, though, not cover the instruction cache?One other question I'd have is that when I was benchmarking loading precompiled modules I found that the
mprotect
call on AArch64 took a very long time in the kernel performing icache synchronization. Does this system call add more overhead on top of that? Or is w/e the kernel doing in there also required on top of this synchronization?I suppose my main question is, in the context of safe Rust programs where we're guaranteed that making
Module
visible to other threads guarantees a correct memory barrier, is this still required? This sort of reminds me ofArc::clone
in the standard library where increasing the reference count actually uses aRelaxed
memory barrier because actually sending the new clone of theArc
to another thread requires some form of external synchronization, so there's no need to double it up.
cfallin commented on issue #3426:
@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.
For example, the sequence of instructions in the manul B2.2.4 (page B2-131 in my PDF) does a
DC CVAU
to "clean the data cache", then aDSB ISH
to "ensure visibility of the data cleaned from cache", then aIC IVAU
to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.
The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the
membarrier()
that hits every core.Thoughts?
cfallin commented on issue #3426:
(Clarification on above: the
IC IVAU
is broadcast across the coherence domain apparently, but theISB
("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).
cfallin edited a comment on issue #3426:
(Clarification on above: the
IC IVAU
is broadcast across the coherence domain apparently, but theISB
("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).)
cfallin edited a comment on issue #3426:
@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.
For example, the sequence of instructions in the manual B2.2.4 (page B2-131 in my PDF) does a
DC CVAU
to "clean the data cache", then aDSB ISH
to "ensure visibility of the data cleaned from cache", then aIC IVAU
to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.
The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the
membarrier()
that hits every core.Thoughts?
cfallin commented on issue #3426:
Ah, and one more clarification I wanted to mention re: other comments above and the manual is that "self-modifying code" in this context is actually applicable, because although our JIT code doesn't literally modify itself, to the core (and to a microarchitect) it's all the same -- we are executing data that we've written to memory so we have to worry about when that data becomes visible to instruction-fetch.
akirilov-arm commented on issue #3426:
@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!
I just want to add a couple of details - in essence this pull request is about adding a "broadcast"
ISB
, which is not part of the Arm architecture and which themembarrier()
system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed,ISB
is a pipeline serialization or "context synchronization event" in the architecture's parlance.My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the
compiler-builtins
crate is filled in. Cache flushing is in fact the easy part of the problem - precisely becauseIC IVAU
is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...
@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?
akirilov-arm edited a comment on issue #3426:
@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!
I just want to add a couple of details - in essence this pull request is about adding a "broadcast"
ISB
, which is not part of the Arm architecture and which themembarrier()
system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed,ISB
is a pipeline serialization or "context synchronization event" in the architecture's parlance and a recency tracking mechanism could be beneficial because we do not need the serialization after generating every single bit of code - e.g. if several functions are compiled in parallel by different worker threads, we could wait till all compilations complete before executing anISB
instruction on the current processor, as long as it does not try to run any of the functions in the meantime.My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the
compiler-builtins
crate is filled in. Cache flushing is in fact the easy part of the problem - precisely becauseIC IVAU
is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...
@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?
alexcrichton commented on issue #3426:
Oh wow this is tricky, which I suppose should be expected of anything caching related.
So to make sure I understand, the first step recommended by the manual is that the data cache is flushed out (since we just wrote functions into memory) and then all instruction caches are brought up to date with the
IC IVAU
instruction. Then there's also the requirement of runningISB
on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?In any case it sounds like it definitely doesn't suffice to have data-race-free code here since that basically only guarantees data-cache coherency, not instruction-cache coherency. Otherwise if the mapping of JIT code is reused from some previous mapping then an icache may have the old mapping still cached.
Also to confirm, this PR is just the run-
isb
-on-all-cores piece, and thecompiler-builtins
bits will do theIC IVAU
stuff?@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?
When I run this program on an aarch64 machine the example output I get is:
mprotect with no touched pages 960ns mprotect with touched pages 43.838609ms
According to
perf
most of the time is spent in__flush_icache_range
in the kernel. Given that this takes so long it seems likemprotect
is doing some sort of synchronization in the kernel, I just don't know what kind. Is that entirely unrelated to the synchronization happening here though withISB
and such?
alexcrichton commented on issue #3426:
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new
mmap
for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping? If that's the case then we may be able to skip synchronization since we know caches are empty for this range and no one will otherwise try to use it while we're writing to it.
cfallin commented on issue #3426:
@alexcrichton I bet the
membarrier()
taking 43ms is due to IPIs and latency of entering the kernel on every other core; especially on the big 128-core machine we have to play with, that will take a while! (This is viasmp_call_function_many()
invoked in kernel/sched/membarrier.c, and that helper seems to broadcast the IPI and let them all go at once, but there is still the long tail if any core has disabled preemption temporarily.)Re: fresh addresses ameliorating this, iirc at least some cores have PIPT (physically-indexed, physically-tagged) caches, so it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)
alexcrichton commented on issue #3426:
@cfallin but in the example program I'm not calling
membarrier
, justmprotect
, so does that mean thatmprotect
is doing the same thing?
cfallin commented on issue #3426:
Oh, I misread that, sorry; that's a good question! Maybe that is "just" TLB shootdown overhead then, hard to say without more detailed profiling...
cfallin commented on issue #3426:
(Actually, not TLB shootdown if all the time is spent in
__flush_icache_range
as you say. I need more caffeine in this tea, sorry.)This does make me think: if the
mprotect()
is doing an IPI to every core anyway, that certainly implies a full pipeline synchronization when the core handles the interrupt, if nothing else; maybe in practice this already covers us? Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.
bjorn3 commented on issue #3426:
Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.
I guess you could ask a maintainer if this is guaranteed? And if not maybe Wasmtime depending on it will make it guaranteed due to "don't break the userspace"? https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE
akirilov-arm commented on issue #3426:
@alexcrichton
... then all instruction caches are brought up to date with the
IC IVAU
instruction.Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.
Then there's also the requirement of running
ISB
on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!
Also to confirm, this PR is just the run-
isb
-on-all-cores piece, and thecompiler-builtins
bits will do theIC IVAU
stuff?Correct.
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new
mmap
for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% guaranteed); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace rule", especially since we are not talking about a clear-cut API and/or ABI issue here.
Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, architecturally exception entries (e.g. in response to interrupts) are context synchronization events, that is equivalent to
ISB
.@cfallin
... at least some cores have PIPT (physically-indexed, physically-tagged) caches...
It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.
... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)
Maybe just a bit :smile:, but IMHO it is a possibility especially under memory pressure because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.
As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to
mprotect()
(which, if guaranteed to occur on all processors, would indeed makemembarrier()
redundant).
akirilov-arm edited a comment on issue #3426:
@alexcrichton
... then all instruction caches are brought up to date with the
IC IVAU
instruction.Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.
Then there's also the requirement of running
ISB
on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!
Also to confirm, this PR is just the run-
isb
-on-all-cores piece, and thecompiler-builtins
bits will do theIC IVAU
stuff?Correct.
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new
mmap
for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.
Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to
ISB
. In particular, this means that the thread generating the code does not need to executeISB
becausemprotect()
has the same effect (it's a bit more complicated than that, but no need for those details).@cfallin
... at least some cores have PIPT (physically-indexed, physically-tagged) caches...
It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.
... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)
Maybe just a bit :smile:, but IMHO it is a possibility especially under memory pressure because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.
As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to
mprotect()
(which, if guaranteed to occur on all processors, would indeed makemembarrier()
redundant).
akirilov-arm edited a comment on issue #3426:
@alexcrichton
... then all instruction caches are brought up to date with the
IC IVAU
instruction.Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.
Then there's also the requirement of running
ISB
on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!
Also to confirm, this PR is just the run-
isb
-on-all-cores piece, and thecompiler-builtins
bits will do theIC IVAU
stuff?Correct.
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new
mmap
for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.
Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to
ISB
. In particular, this means that the thread generating the code does not need to executeISB
becausemprotect()
has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).@cfallin
... at least some cores have PIPT (physically-indexed, physically-tagged) caches...
It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.
... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)
Maybe just a bit :smile:, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.
As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to
mprotect()
(which, if guaranteed to occur on all processors, would indeed make themembarrier()
call redundant).
akirilov-arm edited a comment on issue #3426:
@alexcrichton
... then all instruction caches are brought up to date with the
IC IVAU
instruction.Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.
Then there's also the requirement of running
ISB
on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!
Also to confirm, this PR is just the run-
isb
-on-all-cores piece, and thecompiler-builtins
bits will do theIC IVAU
stuff?Correct.
I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new
mmap
for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.
Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to
ISB
. In particular, this means that the thread generating the code does not need to executeISB
becausemprotect()
has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).@cfallin
... at least some cores have PIPT (physically-indexed, physically-tagged) caches...
It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is indeed PIPT. For instance, the Technical Reference Manual (TRM) for Neoverse N1 states in section A6.1 that L1 caches behave as PIPT.
... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)
Maybe just a bit :smile:, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.
As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to
mprotect()
(which, if guaranteed to occur on all processors, would indeed make themembarrier()
call redundant).
akirilov-arm commented on issue #3426:
Yes, I am taking a look.
To go on a bit of a tangent, some recent research has uncovered a vulnerability, Speculative Code Store Bypass (SCSB, tracked by CVE-2021-0089 and CVE-2021-26313 on Intel and AMD processors respectively), that demonstrates the limitations of the approach on x86. The suggested mitigation seems to be pretty much equivalent to part of what the Arm architecture requires.
akirilov-arm commented on issue #3426:
It turns out that when the Linux kernel makes a memory mapping executable, it also performs the necessary cache flushing on AArch64. Setting aside the discussion of whether we should rely on an implementation detail like that or not, the real point is that this behaviour still does not solve the issue that this PR is taking care of, namely the requirement to run
ISB
on all processors that might execute the new code. In fact, the implementation guarantees that there will be no inter-processor interrupts.
Last updated: Jan 24 2025 at 00:11 UTC