julian-seward1 opened PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 requested bnjbvr for a review on PR #2077.
julian-seward1 requested cfallin and bnjbvr for a review on PR #2077.
bjorn3 submitted PR Review.
bjorn3 submitted PR Review.
bjorn3 created PR Review Comment:
Mem
is defined at https://github.com/julian-seward1/wasmtime/blob/23b8640f2a016b7077f44f41501750a5d018a658/cranelift/codegen/meta/src/shared/instructions.rs#L791. You could create a new typevar to only allow i8..i64.
bjorn3 created PR Review Comment:
cg_clif also requires nand, max, min, umax, umin.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Slightly pedantic (sorry!) but let's write out "read-modify-write" here for clarity.
cfallin created PR Review Comment:
+1, let's limit to just the types that make sense.
cfallin created PR Review Comment:
nit: single backticks here for consistency with other descriptions.
cfallin created PR Review Comment:
For clarity: the old value is returned whether or not the CAS succeeds, I assume? (There's no bool success flag so this must be the case?)
cfallin created PR Review Comment:
Let's embed a
Type
instead.
cfallin created PR Review Comment:
Yes, I think that's reasonable; we try not to rely on the actual CLIF in the vcode, but there's no reason not to reuse its enum here.
cfallin created PR Review Comment:
Actually (after looking further down) I think it might be better to simply reuse
ALUOp
: in principle, anyAluRRR
instruction could be emitted as the op inside the r-m-w sequence, no?
cfallin created PR Review Comment:
We might want to say something about the memory ordering model here. In particular, do atomics follow sequential consistency, and do they create happens-before edges that order normal (non-atomic) loads/stores? I think this is the case based on the fences that are emitted (but we should double-check!).
cfallin created PR Review Comment:
No need for braces here
cfallin created PR Review Comment:
For consistency, I think it might be better to write an
enc_...() -> u32
helper that takes the size as a parameter.
cfallin created PR Review Comment:
I think the no-extend is safe here for the ops this patch supports, but if we add e.g. min/max later, we need to reconsider...
cfallin created PR Review Comment:
writable_xreg()
here as well
cfallin created PR Review Comment:
Use a
Type
here; or if not that, let's use or define an enum for this.
cfallin created PR Review Comment:
emit_safepoint
if this is a resumable trap; if it kills the Wasm program, then no need (everything becomes dead on termination). I think a misalignment trap is the latter case?
cfallin created PR Review Comment:
I like this API -- an elegant solution to the problem!
We might want to add a note in the doc comment about when it might be necessary: e.g., a warning about parallel moves and overlapping registers when insts that emit sequences with fixed real regs are given real-reg arguments.
cfallin created PR Review Comment:
I'd prefer to encapsulate
ldxr
(and the store-conditional below too) inenc_...
helpers as with most of the other cases; we haven't been perfect with this (and I should clean it up later) but I want to try to maintain the separation.
cfallin created PR Review Comment:
Use
writable_xreg()
?
cfallin created PR Review Comment:
Maybe add a comment here that the CAS sequence does its own masking, so we don't need to worry about zero-extending narrow (
I8
/I16
) values.
cfallin created PR Review Comment:
Do we want to
return Err(wasm_unsupported!("Unsupported atomic access type {:?}", access_ty))
instead? It seems we only call this with hardcoded types in the big match above (so any failure is our error, not a bad input) but it at least hooks into the existing error-reporting path.
cfallin created PR Review Comment:
I'd prefer not to have the redundant information here regarding ALU-op encodings; I'd prefer to construct an
AluRRR
(with the hardcoded registers and anALUOp
from this inst's fields) and emit it here. Or, if not that, at least invoke theenc_...
helper directly with the appropriate regs and op.Aside from avoiding the redundancy, this change also makes it much easier to modify later; the hex constants are difficult to maintain if, e.g., we ever want to change which registers are used.
cfallin created PR Review Comment:
Similar comments here as above regarding hardcoded constants vs. encoding functions and labels.
cfallin created PR Review Comment:
Potential optimization (not needed for correctness): there's no need to mask to 32 bits if we use a
cmp w27, w24
instead (andcbnz w24, ...
).
cfallin created PR Review Comment:
Let's use a label and fixup for the branch target; not doing so is potentially very error-prone in hard-to-debug ways if we ever change the length of the sequence above.
cfallin created PR Review Comment:
As hinted earlier -- I think the best way to go about this is to actually just reuse the
ALUOp
enum. I'm not sure of any reason to limit the allowed set of r-m-w opcodes; in principle, any two-source, one-dest operation might be desirable to perform in an atomic way.
cfallin created PR Review Comment:
close-paren after
i32
(and likewise below)?
julian-seward1 created PR Review Comment:
The Wasm atomics spec requires sequential consistency, so for each of the 5 new CLIF insns, I have added
This operation is sequentially consistent.
. For the happens-before relation to normal loads/stores, this exceeds my knowledge of such things. I'll make enquiries.
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
On investigation, I am inclined to leave this as-is (although rename it to
AtomicRMWOp
) for two reasons:
ALUOp
has 32- and 64-bit variants of each operation. ButInst::AtomicRMW
also has atype
field, and so that gives the possibility of creating nonsensical combinations (it violates the no-junk no-confusion rule). Eg,type
indicating 64-bit, but withop
beingALUOp::Orr32
.- per @bjorn3's comment above ("cg_clif also requires nand, max, min, umax, umin."),
ALUOp
doesn't have max/min/umax/umin.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
I can write this as an
enc_*
function, no problem, but I didn't understand the comment about the size. This is a fence; there is no size field. Did you perhaps mean to parameterise it on the sharability domain (theISH
== "inner sharable") part?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Per comment above, it might in hindsight be cleaner to retain
LLSCOp
but rename it toAtomicRMWOp
.
julian-seward1 created PR Review Comment:
When you say "doc comment", do you mean this comment here (on
ensure_in_vreg
), or somewhere else? (unclear)
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Yes, it returns the old value regardless of whether the CAS succeeds or fails. I updated the comment accordingly.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Regarding the happens-before edges, Lars writes: "Yes [they create h-b edges] ... normal loads and stores may be reordered among themselves but not across atomics, and atomics may not be reordered wrt each other".
So I think we're good (right?) The implementation fences everything with
dmb ish
, which IIUC prevents movement of both loads and stores either forwards or backwards across the fence. And FWIW, this is the same fencyness as SM/wasm/baseline; I just cloned the scheme that uses.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
They are non-resumable; no stackmap needed. Choosing plain
emit
therefore.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
.. but then, why does
Opcode::Trap
useemit_safepoint
?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
One thing that concerns me here is that it seems like we're conflating the semantics of wasm with the semantics of CLIF. Whether or not
Opcode::Trapif | Opcode::Trapff
produces a safepoint should surely be specified as part of CLIF's semantics, and not be dependent of the semantics of whatever that CLIF was translated from.
julian-seward1 created PR Review Comment:
Done.
julian-seward1 submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Cool, that's what I suspected (and that's what the fences enforce, as far as I can tell), so as long as we document that in the descriptions, we're good!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Yes, exactly, this comment here on the function declaration. Thanks!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, I think I missed some context previously -- I thought for some reason that your additions here only had to do with atomics-related misalignments, but of course if I scroll up just a bit I see this is for any
Opcode::Trap{if,ff}
. D'oh. Yes, it should always have a safepoint if it is a trap at the CLIF level; the only non-safepoint-adorned traps are the ones that are internal to e.g. divides. I suppose we should specify in the CLIF descriptions that any traps generated by instruction error conditions (rather than explicit trap instructions) will not have safepoints.Just to be sure -- @bnjbvr, could you confirm that we don't expect (or can get away with specifying) that there is no safepoint data for e.g. divide traps?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Changing to
emit_safepoint
.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
The emit code is complicated enough as it is; would you be greatly offended if I punted on this, at least for now?
julian-seward1 updated PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 updated PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
Just to be sure -- @bnjbvr, could you confirm that we don't expect (or can get away with specifying) that there is no safepoint data for e.g. divide traps?
Agreed, and i think that indeed it should be specified that no GC must happen during stack unwinding (that sounds very scary and unsafe anyways)(now I wonder if wasm exception handling + wasm GC types could make this case actually possible).
bnjbvr submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
If this is implemented as a callout in general: yes. All the callouts in the "dummy" wasm to clif environment are implemented this way; this is only used when testing internally wasm compilation with clif-util, that is, when no proper environment can be used.
bnjbvr created PR Review Comment:
nit: here and below, please remove the code commented out.
bnjbvr created PR Review Comment:
In terms of API cleanliness, would it make sense to have a single function
translate_atomic_wait
that takes an extra Type parameter, that can be either I32 or I64? Going one step further, would it make sense to have a singletranslate_atomic
method that would be passed an enum with variantsWaitI32/I64/Notify
?
bnjbvr submitted PR Review.
julian-seward1 updated PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Merging the
wait
andnotify
cases is messy because they have different params. However, the twowait
cases can be merged without even needing an extraType
parameter, since the 32-vs-64 bit difference can be determined by inspecting the type of theexpected
parameter. So that's what I did.
julian-seward1 updated PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 requested cfallin for a review on PR #2077.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Tiny nit, but this was very confusing to me for a moment; the difference between the CLIF's
AtomicRmwOp
and the aarch64Inst
'sAtomicRMWOp
is just capitalization. Perhapsuse crate::ir
and then writeir::AtomicRmwOp
here, for clarity?
cfallin submitted PR Review.
julian-seward1 updated PR #2077 from atomics-CL
to main
:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
for those that take addresses (all except fences), the address is contained
entirely in a singleValue
; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
per above comment, makes alignment checking simpler
reuse of existing loads and stores would require extension of
MemFlags
to indicate atomicity, which sounds semantically unclean. For example,
then any instruction carryingMemFlags
could be marked as atomic, even
in cases where it is meaningless or ambiguous.I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.The translation from Wasm to CLIF, in
code_translator.rs
is unremarkable.At the AArch64 level, there are also five new instructions, one for each
group. All of them except::Fence
contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow any extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, inget_regs
that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (::LLSC
and::CAS
) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method<LowerCtx for Lower>::ensure_in_vreg
.
Again, we rely on coalescing to turn them into no-ops in the common case.There is also a ridealong fix for the AArch64 lowering case for
Opcode::Trapif | Opcode::Trapff
, which removes a bug in which two trap insns
in a row were generated.In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Er, well spotted. I didn't even notice. Must have looked as if I'd simply written the identity function for atomic RMW ops :-)
julian-seward1 merged PR #2077.
Last updated: Jan 24 2025 at 00:11 UTC