bjorn3 commented on issue #4997:
BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in https://github.com/bytecodealliance/wasmtime/pull/3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.
The documentation for FlushInstructionCache says:
Applications should call FlushInstructionCache if they generate or modify code in memory. The CPU cannot detect the change, and may execute the old code it cached.
So it seems that FlushInstructionCache must always be called no matter if FlushProcessWriteBuffers is called or not.
akirilov-arm commented on issue #4997:
Yes, I am not claiming that
FlushProcessWriteBuffers()
replacesFlushInstructionCache()
- my point is that it (or something else with the necessary semantics) should be used after the latter, unlessFlushInstructionCache()
also performs an implicit broadcastISB
operation in all running threads of the process (which the documentation doesn't suggest, to me at least). Similarly, as I mentioned on Linuxmembarrier()
does not flush instruction caches.
akirilov-arm edited a comment on issue #4997:
Yes, I am not claiming that
FlushProcessWriteBuffers()
replacesFlushInstructionCache()
- my point is that it (or something else with the necessary semantics) should be used after the latter, unlessFlushInstructionCache()
also performs an implicit broadcastISB
operation in all running threads of the process (which the documentation doesn't suggest, to me at least). Similarly, as I mentioned on Linuxmembarrier()
does not flush instruction caches.P.S. The same considerations about single-threaded applications apply to
FlushProcessWriteBuffers()
as well, of course.
akirilov-arm edited a comment on issue #4997:
Yes, I am not claiming that
FlushProcessWriteBuffers()
replacesFlushInstructionCache()
- my point is that it (or something else with the necessary semantics) should be used after the latter, unlessFlushInstructionCache()
also performs an implicit broadcastISB
operation to all running threads of the process (which the documentation doesn't suggest, to me at least). Similarly, as I mentioned on Linuxmembarrier()
does not flush instruction caches.P.S. The same considerations about single-threaded applications apply to
FlushProcessWriteBuffers()
as well, of course.
afonso360 commented on issue #4997:
Thanks for reviewing this!
BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in https://github.com/bytecodealliance/wasmtime/pull/3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.
So, if I understand https://github.com/bytecodealliance/wasmtime/issues/3310 correctly we need to both flush the icache (
clear_cache
/FlushInstructionCache
) and then subsequently also flush the pipeline (membarrier
/FlushProcessWriteBuffers
). The flush only needs to happen in multi threaded environments but we cant easily detect that.Is that right? If that is the case I might as well just clean up the terminology and add a
clear_cache
call for linux and that also fixes #3310 right?
afonso360 edited a comment on issue #4997:
Thanks for reviewing this! :heart:️
BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in https://github.com/bytecodealliance/wasmtime/pull/3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.
So, if I understand https://github.com/bytecodealliance/wasmtime/issues/3310 correctly we need to both flush the icache (
clear_cache
/FlushInstructionCache
) and then subsequently also flush the pipeline (membarrier
/FlushProcessWriteBuffers
). The flush only needs to happen in multi threaded environments but we cant easily detect that.Is that right? If that is the case I might as well just clean up the terminology and add a
clear_cache
call for linux and that also fixes #3310 right?
afonso360 edited a comment on issue #4997:
Thanks for reviewing this! :heart:️
BTW the purpose of the membarrier() call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in https://github.com/bytecodealliance/wasmtime/pull/3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, FlushProcessWriteBuffers() seems to have the necessary semantics.
So, if I understand https://github.com/bytecodealliance/wasmtime/issues/3310 correctly we need to both flush the icache (
clear_cache
/FlushInstructionCache
) and then subsequently also flush the pipeline (membarrier
/FlushProcessWriteBuffers
). The pipeline flush only needs to happen in multi threaded environments but we cant easily detect that.Is that right? If that is the case I might as well just clean up the terminology and add a
clear_cache
call for linux and that also fixes #3310 right?
akirilov-arm commented on issue #4997:
Is that right?
Yes, your understanding is correct. Technically the pipeline flush is always mandatory, but all implementations of
__builtin___clear_cache()
that I am aware of end with anISB
instruction (and I'd guess thatFlushInstructionCache()
does too) and furthermore a system call (e.g. the subsequent change of the memory page permissions to executable, since we don't use RWX mappings) is a context synchronization operation as well.The ideal solution, which I had in mind when I opened #3310, would be to have a generic
clear_cache()
function (probably in the compiler-builtins crate to mirror what GCC and LLVM are doing; I am not aware of anything like that existing, but I would be happy to be proven wrong). Then we would call that function, followed bymembarrier()
/FlushProcessWriteBuffers()
, finishing with making the memory executable. The order of the last 2 operations does not matter (as long as no one tries to run the newly executable code without a pipeline flush before that), but I think that this sequence looks nicer.Note that it might appear that on Linux we don't flush instruction caches, so there might be a correctness issue. Actually we are relying on an implementation detail (as discussed in #3426), but I'd rather see an explicit operation, hence I have kept #3310 open. However, I would advise against implementing the cache flushing instruction sequence inside Cranelift and/or Wasmtime because it is a bit involved and a solution that is easily accessible to all Rust users on AArch64 is a much better option. In the meantime we can continue relying on an implementation detail; any other cleanup would be appreciated, of course.
bjorn3 commented on issue #4997:
probably in the compiler-builtins crate to mirror what GCC and LLVM are doing
It should rather be in libcore as compiler-builtins is an implementation detail of rustc to be used by the compiler backend and never directly called by the user.
afonso360 commented on issue #4997:
Alright I think I understand this a little bit more! Thank you for your patience dealing with this.
However, I would advise against implementing the cache flushing instruction sequence inside Cranelift and/or Wasmtime because it is a bit involved and a solution that is easily accessible to all Rust users on AArch64 is a much better option. In the meantime we can continue relying on an implementation detail; any other cleanup would be appreciated, of course.
Yeah I also doubt I could write anything like that correctly :laughing: So I'm not very inclined to go that route.
What I think we could do is something along the lines of what @cfallin suggested on zulip create a
jit-icache-coherence
crate with the interface that we want, use that on bothwasmtime-jit
andcranelift-jit
. This at least shares the windows implementation between both crates, we still need to duplicate the membarriers betweenrsix
andlibc
since we want to keep those separate. But at least it is a little bit more organized.For
__builtin___clear_cache()
can we for now link it in from a C file as you suggested? And later on use a version fromlibcore
if one gets upstreamed. Or if we don't want to do that on wasmtime (due to not usingrsix
, not sure if that is a requirement), we can ignore it for now and have essentially the same implementation but shared with cranelift.What do you guys think about this?
Note that it might appear that on Linux we don't flush instruction caches, so there might be a correctness issue. Actually we are relying on an implementation detail (as discussed in https://github.com/bytecodealliance/wasmtime/pull/3426), but I'd rather see an explicit operation, hence I have kept https://github.com/bytecodealliance/wasmtime/issues/3310 open.
Yeah that confused me for a while! This whole thing is not easy to understand. I'd really like to get this in a central place and write all of these details in comments around this code.
akirilov-arm commented on issue #4997:
For
__builtin___clear_cache()
can we for now link it in from a C file as you suggested?That is certainly an option, but my impression is that we are trying to move away from relying on C files as much as possible (refer to the refactoring that has been done in the
wasmtime-fiber
crate, for instance), and given that on Linux we are kind of OK at the moment, IMHO it might not be the best course of action.As for creating a crate to be used by both
cranelift-jit
andwasmtime-jit
, since the issue was thatcranelift-jit
should not depend onrustix
, I am not sure how much sharing would be possible, though we do have the option of just implementing thelibc
variant (and avoidingrustix
completely). That would also help with consolidating the locations that @cfallin needs to change in #4987.
afonso360 commented on issue #4997:
I've implemented a new crate
jit-icache-coherency
, that merges all of this code together so that we don't have to update stuff in multiple places, it also provides a common interface that bothwasmtime-jit
andcranelift-jit
can use.It does change our call's to
SYNC_CORE
membarriers, using the new method suggested by @akirilov-arm. We now executeSYNC_CORE
and if it fails we retry after registering it. This should only happen once per process, so I don't think its too much overhead.It also includes the changes made by @cfallin in #4987 since I thought it was just easier to do it here instead of having to deal with the ensuing merge conflicts.
I've tried to explain the whole cache coherency thing as best as I could in the documentation, hopefully I didn't miss anything!
akirilov-arm commented on issue #4997:
Forgot to post a general comment with my review - the overall code structure looks great and is definitely cleaner than what we had before; my remarks are mostly about improving the code comments. Also, as I have said before I am fine with just implementing the
libc
code path (and omitting therustix
one).
afonso360 commented on issue #4997:
I think I got all of them, let me know if I missed any of the changes that you requested. And thanks for you patience in dealing with this! I'm still very much learning about all of this stuff.
Also, as I have said before I am fine with just implementing the libc code path (and omitting the rustix one).
I'm okay with libc too, but would like an ack from someone on the wasmtime side if we want to do that.
afonso360 commented on issue #4997:
I had to add a flags arg to membarrier on libc since on my aarch64 machine it wasn't being properly added, somehow cranelift never triggered this. I've confirmed with
strace
and the membarrier is now being properly executed, and the register on fail trick is working.[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = -1 EPERM (Operation not permitted) [pid 615515] membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0 [pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0
afonso360 edited a comment on issue #4997:
I had to add a flags arg to membarrier on libc since on my aarch64 machine it wasn't being properly added, somehow cranelift never triggered this. I've confirmed with
strace
and the membarrier is now being properly executed, and the register on fail trick is working.[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = -1 EPERM (Operation not permitted) [pid 615515] membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0 [pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0
Is there a way to test the GLOBAL path on modern kernels?
afonso360 edited a comment on issue #4997:
I had to add a flags arg to membarrier on libc since on my aarch64 machine it wasn't being properly added, somehow cranelift never triggered this. I've confirmed with
strace
and the membarrier is now being properly executed, and the register on fail trick is working.[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = -1 EPERM (Operation not permitted) [pid 615515] membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0 [pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0
Is there a way to test the GLOBAL path on modern kernels? i.e. fail MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE with EINVAL?
afonso360 edited a comment on issue #4997:
I had to add a flags arg to membarrier on libc since on my aarch64 machine it wasn't being properly added, somehow cranelift never triggered this. I've confirmed with
strace
and the membarrier is now being properly executed, and the register on fail trick is working.[pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = -1 EPERM (Operation not permitted) [pid 615515] membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0 [pid 615515] membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) = 0
Is there a way to test the GLOBAL path on modern kernels? i.e. fail
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
withEINVAL
?
afonso360 commented on issue #4997:
Right, I just wanted to check if there was an easy way to add a test that forces us to use GLOBAL so that we at least have some CI coverage on that branch. But I tested that manually and it worked, so I guess that's ok.
afonso360 commented on issue #4997:
Hey, @alexcrichton @sunfishcode we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR as (in its current form) changes wasmtime to use libc on this particular
membarrier
call.As a summary: We got to this situation when I merged this particular piece of code from
wasmtime-jit
andcranelift-jit
, both those crates have to do the exact same thing when it comes to icache mainetenance, after merging we had two backends which do exactly the same thing, and to avoid having duplicated code we need to decide on one.wasmtime already has libc on its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc.
I'd like to know if this is okay with you guys.
afonso360 edited a comment on issue #4997:
Hey, @alexcrichton @sunfishcode we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR as (in its current form) it changes wasmtime to use libc on this particular
membarrier
call.As a summary: We got to this situation when I merged this particular piece of code from
wasmtime-jit
andcranelift-jit
, both those crates have to do the exact same thing when it comes to icache mainetenance, after merging we had two backends which do exactly the same thing, and to avoid having duplicated code we need to decide on one.wasmtime already has libc on its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc.
I'd like to know if this is okay with you guys.
afonso360 edited a comment on issue #4997:
Hey, @alexcrichton @sunfishcode we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR as (in its current form) it changes wasmtime to use libc on this particular
membarrier
call.As a summary: We got to this situation when I merged this particular piece of code from
wasmtime-jit
andcranelift-jit
, both of those crates have to do the exact same thing when it comes to icache mainetenance, after merging we had two backends which do exactly the same thing, and to avoid having duplicated code we need to decide on one.wasmtime already has libc on its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc.
I'd like to know if this is okay with you guys.
sunfishcode commented on issue #4997:
Yes, that makes sense!
afonso360 edited a comment on issue #4997:
Hey, @alexcrichton @sunfishcode we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR as (in its current form) it changes wasmtime to use libc on this particular
membarrier
call.As a summary: We got to this situation when I merged this particular piece of code from
wasmtime-jit
andcranelift-jit
, both of those crates have to do the exact same thing when it comes to icache maintenance. After merging we had two backends which do exactly the same thing, and to avoid having duplicated code we need to decide on one.wasmtime already has libc on its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc.
I'd like to know if this is okay with you guys.
alexcrichton commented on issue #4997:
Indeed seems reasonable to me!
Last updated: Jan 24 2025 at 00:11 UTC