wasmtime / issue #3426 Call membarrier() after making JIT... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #3426 Call membarrier() after making JIT...

Wasmtime GitHub notifications bot (Oct 11 2021 at 14:23):

I'm trying to read up a bit more on the AArch64 requirements here to understand better why these are required. I'm currently reaching the conclusion though of wondering if we need this due to the safety of Rust, but you know much more than me so I'm hoping that you can educate me!

In the ARM manual 2.2.5 that talks about self-modifying code and 2.4.4 talks about cache coherency, but at least for us JIT code isn't self-modifying and for cache coherency I figured that Rust safety guarantees would save us from having to execute these barriers. For example the Module, when created, is only visible to the calling thread. For any other thread to actually get access to the Module there'd have to be some form of external synchronization (e.g. a mutex, a channel, etc). Do the typical memory-barrier instructions, though, not cover the instruction cache?

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel performing icache synchronization. Does this system call add more overhead on top of that? Or is w/e the kernel doing in there also required on top of this synchronization?

I suppose my main question is, in the context of safe Rust programs where we're guaranteed that making Module visible to other threads guarantees a correct memory barrier, is this still required? This sort of reminds me of Arc::clone in the standard library where increasing the reference count actually uses a Relaxed memory barrier because actually sending the new clone of the Arc to another thread requires some form of external synchronization, so there's no need to double it up.

Wasmtime GitHub notifications bot (Oct 14 2021 at 19:39):

cfallin commented on issue #3426:

@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.

For example, the sequence of instructions in the manul B2.2.4 (page B2-131 in my PDF) does a DC CVAU to "clean the data cache", then a DSB ISH to "ensure visibility of the data cleaned from cache", then a IC IVAU to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)

In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.

The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the membarrier() that hits every core.

Thoughts?

Wasmtime GitHub notifications bot (Oct 14 2021 at 19:47):

cfallin commented on issue #3426:

(Clarification on above: the IC IVAU is broadcast across the coherence domain apparently, but the ISB ("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).

Wasmtime GitHub notifications bot (Oct 14 2021 at 19:47):

cfallin edited a comment on issue #3426:

(Clarification on above: the IC IVAU is broadcast across the coherence domain apparently, but the ISB ("instruction barrier" which I assume is a pipeline serialization) is necessary; that's the bit that we need to track recency of last flush on cores for, unless we serialize the pipeline every time we invoke a Wasm function pointer from the host (which IMHO we shouldn't do!).)

Wasmtime GitHub notifications bot (Oct 14 2021 at 19:48):

cfallin edited a comment on issue #3426:

@alexcrichton The issue here I think is that the implicit happens-before edge that we get from synchronization in data-race-free programs is not sufficient, because the dcache and icache are not automatically coherent, if I'm understanding the manual right.

For example, the sequence of instructions in the manual B2.2.4 (page B2-131 in my PDF) does a DC CVAU to "clean the data cache", then a DSB ISH to "ensure visibility of the data cleaned from cache", then a IC IVAU to "invalidate instruction cache". That to me indicates that the usual MESI stuff that ensures a modifed line in dcache will be seen by icache does not occur automatically. (In contrast on x86 there is (i) full MESI coherence between icache and dcache, and (ii) a special set of bits that track when lines have been fetched, causing self-modifying-code pipeline flushes when modified, so everything is automatic; but that's expensive hardware and we don't have that luxury on other platforms...)

In theory we could do something better than a syscall that runs a little "sync the caches" handler on every core, though (presumably this is doing some sort of IPI?). We could run the "sync the caches" bit in two halves -- flush the dcache line when we write new code, then flush the icache line before jumping to code if we know we need to. The latter bit is tricky: we don't want to do it on every function call, obviously, or else we'd tank performance (all icache misses all the time). So we want to somehow track if this core has done an icache flush recently enough to pick up whatever changes.

The part I haven't quite worked out about the above, though, is that there's a TOCTOU issue: we could do the check, decide we're on a core with a fresh icache, then the kernel could migrate our thread to another core just before we do the call. Short of playing tricks with affinity, maybe we can't get around that and do need the membarrier() that hits every core.

Thoughts?

Wasmtime GitHub notifications bot (Oct 14 2021 at 19:51):

cfallin commented on issue #3426:

Ah, and one more clarification I wanted to mention re: other comments above and the manual is that "self-modifying code" in this context is actually applicable, because although our JIT code doesn't literally modify itself, to the core (and to a microarchitect) it's all the same -- we are executing data that we've written to memory so we have to worry about when that data becomes visible to instruction-fetch.

Wasmtime GitHub notifications bot (Oct 14 2021 at 20:59):

akirilov-arm commented on issue #3426:

@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!

I just want to add a couple of details - in essence this pull request is about adding a "broadcast" ISB, which is not part of the Arm architecture and which the membarrier() system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed, ISB is a pipeline serialization or "context synchronization event" in the architecture's parlance.

My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the compiler-builtins crate is filled in. Cache flushing is in fact the easy part of the problem - precisely because IC IVAU is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...
@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:05):

akirilov-arm edited a comment on issue #3426:

@cfallin you are essentially correct about everything, so thanks a lot for these replies - they save me a lot of writing!

I just want to add a couple of details - in essence this pull request is about adding a "broadcast" ISB, which is not part of the Arm architecture and which the membarrier() system call (with these particular parameters) provides an equivalent to (and yes, the system call name is a bit deceptive because here we are not dealing with the familiar barriers that prevent data races). Indeed, ISB is a pipeline serialization or "context synchronization event" in the architecture's parlance and a recency tracking mechanism could be beneficial because we do not need the serialization after generating every single bit of code - e.g. if several functions are compiled in parallel by different worker threads, we could wait till all compilations complete before executing an ISB instruction on the current processor, as long as it does not try to run any of the functions in the meantime.

My changes are incomplete (hence I said that they were the first part of a fix) because the code that is dealing with the actual cache flushing is still missing - I will add that bit after the corresponding gap in the compiler-builtins crate is filled in. Cache flushing is in fact the easy part of the problem - precisely because IC IVAU is broadcast in the whole coherence domain, as Chris mentioned, so no system calls would be necessary.

One other question I'd have is that when I was benchmarking loading precompiled modules I found that the mprotect call on AArch64 took a very long time in the kernel...

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:36):

alexcrichton commented on issue #3426:

Oh wow this is tricky, which I suppose should be expected of anything caching related.

So to make sure I understand, the first step recommended by the manual is that the data cache is flushed out (since we just wrote functions into memory) and then all instruction caches are brought up to date with the IC IVAU instruction. Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

In any case it sounds like it definitely doesn't suffice to have data-race-free code here since that basically only guarantees data-cache coherency, not instruction-cache coherency. Otherwise if the mapping of JIT code is reused from some previous mapping then an icache may have the old mapping still cached.

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

@alexcrichton That sounds a bit like the problem #2890 is trying to solve - or is it something different?

When I run this program on an aarch64 machine the example output I get is:
mprotect with no touched pages 960ns
mprotect with touched pages 43.838609ms
According to perf most of the time is spent in __flush_icache_range in the kernel. Given that this takes so long it seems like mprotect is doing some sort of synchronization in the kernel, I just don't know what kind. Is that entirely unrelated to the synchronization happening here though with ISB and such?

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:39):

alexcrichton commented on issue #3426:

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping? If that's the case then we may be able to skip synchronization since we know caches are empty for this range and no one will otherwise try to use it while we're writing to it.

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:49):

cfallin commented on issue #3426:

@alexcrichton I bet the membarrier() taking 43ms is due to IPIs and latency of entering the kernel on every other core; especially on the big 128-core machine we have to play with, that will take a while! (This is via smp_call_function_many() invoked in kernel/sched/membarrier.c, and that helper seems to broadcast the IPI and let them all go at once, but there is still the long tail if any core has disabled preemption temporarily.)

Re: fresh addresses ameliorating this, iirc at least some cores have PIPT (physically-indexed, physically-tagged) caches, so it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:54):

alexcrichton commented on issue #3426:

@cfallin but in the example program I'm not calling membarrier, just mprotect, so does that mean that mprotect is doing the same thing?

Wasmtime GitHub notifications bot (Oct 14 2021 at 21:56):

cfallin commented on issue #3426:

Oh, I misread that, sorry; that's a good question! Maybe that is "just" TLB shootdown overhead then, hard to say without more detailed profiling...

Wasmtime GitHub notifications bot (Oct 14 2021 at 22:00):

cfallin commented on issue #3426:

(Actually, not TLB shootdown if all the time is spent in __flush_icache_range as you say. I need more caffeine in this tea, sorry.)

This does make me think: if the mprotect() is doing an IPI to every core anyway, that certainly implies a full pipeline synchronization when the core handles the interrupt, if nothing else; maybe in practice this already covers us? Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

Wasmtime GitHub notifications bot (Oct 15 2021 at 08:18):

bjorn3 commented on issue #3426:

Although it feels a little brittle to depend on a side-effect of the kernel's implementation, especially if it's doing a smart on-demand thing and does different things if pages are touched vs. not.

I guess you could ask a maintainer if this is guaranteed? And if not maybe Wasmtime depending on it will make it guaranteed due to "don't break the userspace"? https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE

Wasmtime GitHub notifications bot (Oct 15 2021 at 22:01):

akirilov-arm commented on issue #3426:

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% guaranteed); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace rule", especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, architecturally exception entries (e.g. in response to interrupts) are context synchronization events, that is equivalent to ISB.

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit :smile:, but IMHO it is a possibility especially under memory pressure because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make membarrier() redundant).

Wasmtime GitHub notifications bot (Oct 15 2021 at 22:14):

akirilov-arm edited a comment on issue #3426:

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to ISB. In particular, this means that the thread generating the code does not need to execute ISB because mprotect() has the same effect (it's a bit more complicated than that, but no need for those details).

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit :smile:, but IMHO it is a possibility especially under memory pressure because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make membarrier() redundant).

Wasmtime GitHub notifications bot (Oct 15 2021 at 22:18):

akirilov-arm edited a comment on issue #3426:

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to ISB. In particular, this means that the thread generating the code does not need to execute ISB because mprotect() has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is PIPT, indeed.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit :smile:, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make the membarrier() call redundant).

Wasmtime GitHub notifications bot (Oct 15 2021 at 22:45):

akirilov-arm edited a comment on issue #3426:

@alexcrichton

... then all instruction caches are brought up to date with the IC IVAU instruction.

Technically the cache contents are discarded (IVAU = Invalidate by Virtual Address to the point of Unification), but that has an equivalent effect.

Then there's also the requirement of running ISB on all cores to flush out the instruction pipelines because otherwise some instruction in the pipeline may no longer be correct. Is that right?

Yes, that's right. Note that while the number of instructions in the pipeline may not sound like a lot (especially if we ignore the instruction window associated with out-of-order execution), Cortex-A77, for example, has a 1500 entry macro-op cache, whose contents jump straight from the fetch to the rename stage of the pipeline. In other words, the pipeline may contain thousands of instructions, not all of them speculative!

Also to confirm, this PR is just the run-isb-on-all-cores piece, and the compiler-builtins bits will do the IC IVAU stuff?

Correct.

I suppose another question I could ask is whether kernel-level guarantees factor in here at all. We're always creating a new mmap for allocated code so that means that the memory was previously unmapped. Does the kernel guarantee that when a mapping is unmapped that all cores are updated automatically to flush all related caches for that mapping?

When I asked our kernel team (which includes several maintainers, one half of the general AArch64 maintainers in particular) about that, the answer was that the only guarantee that the kernel could provide is that a TLB flush would occur (and even that is not 100% certain); no data and instruction cache effects should be assumed. I'd take a more nuanced view towards the "don't break the userspace" rule, especially since we are not talking about a clear-cut API and/or ABI issue here.

Another quirk of the architecture is that TLB invalidations are broadcast in the coherence domain as well, which means that in principle an IPI is not necessary. Speaking of which, exception entries (e.g. in response to interrupts, system calls, etc.) are context synchronization events, that is architecturally equivalent to ISB. In particular, this means that the thread generating the code does not need to execute ISB because mprotect() has the same effect by virtue of being a system call (it's a bit more complicated than that, but no need for those details).

@cfallin

... at least some cores have PIPT (physically-indexed, physically-tagged) caches...

It is actually an architectural requirement that data and unified caches behave as PIPT (section D5.11 in the manual), while there are several options for instruction caches, one of which is indeed PIPT. For instance, the Technical Reference Manual (TRM) for Neoverse N1 states in section A6.1 that L1 caches behave as PIPT.

... it's possible a physical address (a particular pageframe) could be reused even if the virtual address is new, and there could be a stale cacheline, I think; but @akirilov-arm can say if that's too paranoid :-)

Maybe just a bit :smile:, but IMHO it is a possibility, especially under memory pressure, because clean (i.e. unmodified) file-backed memory pages are prime candidates to be reclaimed for allocation requests (since they could always be restored from persistent storage on demand). E.g. one of the executable pages of the Wasmtime binary that has been used in the past (so parts of it could be resident in an instruction cache) could be reclaimed to satisfy the allocation request for a code buffer used by the JIT compiler.

As for the behaviour of Alex' example program, I'll have a look too because, again, in general I don't expect an IPI in response to mprotect() (which, if guaranteed to occur on all processors, would indeed make the membarrier() call redundant).

Wasmtime GitHub notifications bot (Oct 18 2021 at 20:40):

akirilov-arm commented on issue #3426:

Yes, I am taking a look.

To go on a bit of a tangent, some recent research has uncovered a vulnerability, Speculative Code Store Bypass (SCSB, tracked by CVE-2021-0089 and CVE-2021-26313 on Intel and AMD processors respectively), that demonstrates the limitations of the approach on x86. The suggested mitigation seems to be pretty much equivalent to part of what the Arm architecture requires.

Wasmtime GitHub notifications bot (Oct 20 2021 at 14:33):

akirilov-arm commented on issue #3426:

It turns out that when the Linux kernel makes a memory mapping executable, it also performs the necessary cache flushing on AArch64. Setting aside the discussion of whether we should rely on an implementation detail like that or not, the real point is that this behaviour still does not solve the issue that this PR is taking care of, namely the requirement to run ISB on all processors that might execute the new code. In fact, the implementation guarantees that there will be no inter-processor interrupts.

Last updated: Apr 18 2025 at 15:03 UTC