@Julian Seward I couldn't review your AArch64 atomics changes when you posted them, but I could recently and I have a couple of comments.
Most of them are based on the Wasm memory model presented in the OOSPLA 2019 paper Weakening WebAssembly that has been included in the threading proposal repository.
The main point is that even though the proposal overview states that atomic memory accesses are sequentially consistent, what's actually meant according to my understanding is that they are sequentially consistent for data race-free programs (similar to the situation in C/C++), which is a weaker statement. As a result, most of the explicit memory barriers in the Cranelift implementation are unnecessary and should be covered by load-acquires and store-releases instead. In fact the paper suggests precisely that in the table of mappings in section 7.
Ignoring that issue for a moment, the code generated for atomic loads and particularly atomic stores seems problematic - either because the barriers are not in the right places or there isn't enough of them (cf. the table in the paper). For example, in the case of the atomic store the constituent store could be observed before another one preceding it in program order.
Of course, using load-acquires and store-releases will reduce the size of the generated code as well.
There are also a couple of minor issues - one of them is that the move in the middle of the loop generated for the Xchg
variant of Inst::AtomicRMW
is unnecessary.
Another, admittedly nitpick, concerns the printing of the new Inst
variants - most of the time we try to follow the disassembly specified by the architecture (though there are things that are out of its scope, e.g. literals). I understand that for Inst::AtomicRMW
and Inst::AtomicCAS
that's quite inconvenient, but at least in the case of Inst::AtomicLoad
and Inst::AtomicStore
it shouldn't be a problem, especially if we switch to the proper instructions.
I also have a question about the Inst::AtomicCAS
variant - based on the comments, it was a deliberate decision to do the masking inside the loop. Why? Furthermore, I can't see the reason for the move in the 64-bit case.
@Anton Kirilov thanks for the info. Sorry for delayed response; I was on vacation last week.
@Anton Kirilov ignoring for now the question of efficiency, you mention that "the code generated for atomic loads and particularly atomic stores seems problematic - either because the barriers are not in the right places or there isn't enough of them (cf. the table in the paper)."
I looked at the table (I assume you mean the table at the start of section 7) but fail to understand what you mean.
Can you say more precisely what the problem is, and how you think the barrier placement should be adjusted so as to fix it?
Hello, since the code is using explicit memory barriers right now, it is essentially equivalent to the ARMv7 entries in the table. So, the requirement for atomic stores is to have a double barrier, i.e. a barrier both before and after the store. Right now Cranelift generates only a barrier after.
@Julian Seward Just checking if you've seen my reply.
@Anton Kirilov yes, thanks. Doing at least a minimal fix is on my todo list.
FYI, I've now picked this up.
Last updated: Jan 24 2025 at 00:11 UTC