julian-seward1 opened PR #2077 from atomics-CL to main:
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups:

- atomic read-modify-write
- atomic compare-and-swap
- atomic loads
- atomic stores
- fences

and the implementation mirrors that structure, at both the CLIF and AArch64
levels.

At the CLIF level, there are five new instructions, one for each group. Some
comments about these:

- For those that take addresses (all except fences), the address is contained
  entirely in a single `Value`; there is no offset field as there is with
  normal loads and stores. Wasm atomics require alignment checks, and
  removing the offset makes the implementation of those checks a bit simpler.
- Atomic loads and stores get their own instructions, rather than reusing the
  existing load and store instructions, for two reasons:
  - per the above comment, it makes alignment checking simpler;
  - reusing the existing loads and stores would require extending `MemFlags`
    to indicate atomicity, which sounds semantically unclean. Any instruction
    carrying `MemFlags` could then be marked as atomic, even in cases where
    that is meaningless or ambiguous.

I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.

The translation from Wasm to CLIF, in `code_translator.rs`, is unremarkable.
At the AArch64 level, there are also five new instructions, one for each
group. All of them except `::Fence` contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a
store followed by a fence, respectively. The amount of fencing may be
overkill, but it reflects exactly what the SM Wasm baseline compiler for
AArch64 does.
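As a rough sketch (the register choices here are hypothetical), an atomic
64-bit load and an atomic 64-bit store therefore come out approximately as:

```
    dmb  ish            // fence
    ldr  x21, [x20]     // atomic 64-bit load

    str  x22, [x20]     // atomic 64-bit store
    dmb  ish            // fence
```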
One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful about which
instructions we allow in between the load-linked and store-conditional. In
particular, we cannot allow any extra memory transactions in there, since --
particularly on low-end hardware -- that might cause the transaction to fail,
hence deadlocking the generated code. That implies that we can't present the
LL/SC loop to the register allocator as its constituent instructions, since
it might insert spills anywhere. Hence we must present it as a single
indivisible unit, as we do here. This also has the benefit of reducing the
total amount of work the RA has to do.
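For concreteness, here is a hedged sketch of what a single 64-bit atomic
r-m-w (an add, say) expands to at emission time; the fixed-register
assignments shown are illustrative only:

```
    dmb  ish                // leading fence
again:
    ldxr x27, [x25]         // load-linked: x27 := old value at [x25]
    add  x28, x27, x26      // apply the r-m-w op to the operand in x26
    stxr w24, x28, [x25]    // store-conditional: w24 == 0 iff it succeeded
    cbnz w24, again         // reservation lost; retry
    dmb  ish                // trailing fence; x27 (the old value) is the result
```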
The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code is that they both need a scratch register internally. Rather
than faking one up by claiming, in `get_regs`, that the instruction modifies
an extra scratch register, and then having to give it a dummy initialisation,
these new instructions (`::LLSC` and `::CAS`) simply use fixed registers in
the range x24-x28. We rely on the RA's ability to coalesce V<-->R copies to
make the cost of the resulting extra copies zero or almost zero. x24-x28 are
chosen so as to be call-clobbered, hence their use is less likely to
interfere with long live ranges that span calls.

One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copies from/to the arg/result
registers are done. In particular, it is not safe to simply emit the copies
in some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method, `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn these copies into no-ops in the common
case.
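To make the hazard concrete, consider a hypothetical two-argument case in
which the expansion wants arg0 in x25 and arg1 in x26, but the incoming
values happen to live in exactly the opposite registers:

```
    mov x25, x26    // arg0 moved into place...
    mov x26, x25    // ...but this copies arg0 again: arg1's value is lost
```

Routing each argument through a fresh virtual register first breaks such
copy cycles.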
There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap
insns in a row were generated.

In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
julian-seward1 requested bnjbvr for a review on PR #2077.
julian-seward1 requested cfallin and bnjbvr for a review on PR #2077.
bjorn3 submitted PR Review.
bjorn3 submitted PR Review.
bjorn3 created PR Review Comment:
`Mem` is defined at https://github.com/julian-seward1/wasmtime/blob/23b8640f2a016b7077f44f41501750a5d018a658/cranelift/codegen/meta/src/shared/instructions.rs#L791. You could create a new typevar to only allow i8..i64.
bjorn3 created PR Review Comment:
cg_clif also requires nand, max, min, umax, umin.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Slightly pedantic (sorry!) but let's write out "read-modify-write" here for clarity.
cfallin created PR Review Comment:
+1, let's limit to just the types that make sense.
cfallin created PR Review Comment:
nit: single backticks here for consistency with other descriptions.
cfallin created PR Review Comment:
For clarity: the old value is returned whether or not the CAS succeeds, I assume? (There's no bool success flag so this must be the case?)
cfallin created PR Review Comment:
Let's embed a `Type` instead.
cfallin created PR Review Comment:
Yes, I think that's reasonable; we try not to rely on the actual CLIF in the vcode, but there's no reason not to reuse its enum here.
cfallin created PR Review Comment:
Actually (after looking further down) I think it might be better to simply reuse `ALUOp`: in principle, any `AluRRR` instruction could be emitted as the op inside the r-m-w sequence, no?
cfallin created PR Review Comment:
We might want to say something about the memory ordering model here. In particular, do atomics follow sequential consistency, and do they create happens-before edges that order normal (non-atomic) loads/stores? I think this is the case based on the fences that are emitted (but we should double-check!).
cfallin created PR Review Comment:
No need for braces here
cfallin created PR Review Comment:
For consistency, I think it might be better to write an `enc_...() -> u32` helper that takes the size as a parameter.
cfallin created PR Review Comment:
I think the no-extend is safe here for the ops this patch supports, but if we add e.g. min/max later, we need to reconsider...
cfallin created PR Review Comment:
`writable_xreg()` here as well.
cfallin created PR Review Comment:
Use a `Type` here; or if not that, let's use or define an enum for this.
cfallin created PR Review Comment:
`emit_safepoint` if this is a resumable trap; if it kills the Wasm program, then no need (everything becomes dead on termination). I think a misalignment trap is the latter case?
cfallin created PR Review Comment:
I like this API -- an elegant solution to the problem!
We might want to add a note in the doc comment about when it might be necessary: e.g., a warning about parallel moves and overlapping registers when insts that emit sequences with fixed real regs are given real-reg arguments.
cfallin created PR Review Comment:
I'd prefer to encapsulate `ldxr` (and the store-conditional below too) in `enc_...` helpers as with most of the other cases; we haven't been perfect with this (and I should clean it up later) but I want to try to maintain the separation.
cfallin created PR Review Comment:
Use `writable_xreg()`?
cfallin created PR Review Comment:
Maybe add a comment here that the CAS sequence does its own masking, so we don't need to worry about zero-extending narrow (`I8`/`I16`) values.
cfallin created PR Review Comment:
Do we want to `return Err(wasm_unsupported!("Unsupported atomic access type {:?}", access_ty))` instead? It seems we only call this with hardcoded types in the big match above (so any failure is our error, not a bad input) but it at least hooks into the existing error-reporting path.
cfallin created PR Review Comment:
I'd prefer not to have the redundant information here regarding ALU-op encodings; I'd prefer to construct an `AluRRR` (with the hardcoded registers and an `ALUOp` from this inst's fields) and emit it here. Or, if not that, at least invoke the `enc_...` helper directly with the appropriate regs and op.

Aside from avoiding the redundancy, this change also makes it much easier to modify later; the hex constants are difficult to maintain if, e.g., we ever want to change which registers are used.
cfallin created PR Review Comment:
Similar comments here as above regarding hardcoded constants vs. encoding functions and labels.
cfallin created PR Review Comment:
Potential optimization (not needed for correctness): there's no need to mask to 32 bits if we use a `cmp w27, w24` instead (and `cbnz w24, ...`).
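That is (illustrative only -- the point is that a `W`-register compare ignores the upper 32 bits, so no explicit mask is needed):

```
    cmp  w27, w24   // 32-bit compare; upper halves ignored, no masking needed
```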
cfallin created PR Review Comment:
Let's use a label and fixup for the branch target; not doing so is potentially very error-prone in hard-to-debug ways if we ever change the length of the sequence above.
cfallin created PR Review Comment:
As hinted earlier -- I think the best way to go about this is to actually just reuse the `ALUOp` enum. I'm not sure of any reason to limit the allowed set of r-m-w opcodes; in principle, any two-source, one-dest operation might be desirable to perform in an atomic way.
cfallin created PR Review Comment:
close-paren after `i32` (and likewise below)?
julian-seward1 created PR Review Comment:
The Wasm atomics spec requires sequential consistency, so for each of the 5 new CLIF insns, I have added "This operation is sequentially consistent." For the happens-before relation to normal loads/stores, this exceeds my knowledge of such things. I'll make enquiries.
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
On investigation, I am inclined to leave this as-is (although rename it to `AtomicRMWOp`) for two reasons:

- `ALUOp` has 32- and 64-bit variants of each operation. But `Inst::AtomicRMW` also has a `type` field, and so that gives the possibility of creating nonsensical combinations (it violates the no-junk no-confusion rule), e.g. `type` indicating 64-bit, but with `op` being `ALUOp::Orr32`.
- per @bjorn3's comment above ("cg_clif also requires nand, max, min, umax, umin."), `ALUOp` doesn't have max/min/umax/umin.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
I can write this as an `enc_*` function, no problem, but I didn't understand the comment about the size. This is a fence; there is no size field. Did you perhaps mean to parameterise it on the shareability domain (the `ISH` == "inner shareable") part?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Per comment above, it might in hindsight be cleaner to retain `LLSCOp` but rename it to `AtomicRMWOp`.
julian-seward1 created PR Review Comment:
When you say "doc comment", do you mean this comment here (on `ensure_in_vreg`), or somewhere else? (unclear)
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Yes, it returns the old value regardless of whether the CAS succeeds or fails. I updated the comment accordingly.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Regarding the happens-before edges, Lars writes: "Yes [they create h-b edges] ... normal loads and stores may be reordered among themselves but not across atomics, and atomics may not be reordered wrt each other."

So I think we're good (right?). The implementation fences everything with `dmb ish`, which IIUC prevents movement of both loads and stores either forwards or backwards across the fence. And FWIW, this is the same fenciness as SM/wasm/baseline; I just cloned the scheme it uses.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
They are non-resumable; no stackmap needed. Choosing plain `emit`, therefore.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
...but then, why does `Opcode::Trap` use `emit_safepoint`?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
One thing that concerns me here is that it seems like we're conflating the semantics of wasm with the semantics of CLIF. Whether or not `Opcode::Trapif | Opcode::Trapff` produces a safepoint should surely be specified as part of CLIF's semantics, and not be dependent on the semantics of whatever that CLIF was translated from.
julian-seward1 created PR Review Comment:
Done.
julian-seward1 submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Cool, that's what I suspected (and that's what the fences enforce, as far as I can tell), so as long as we document that in the descriptions, we're good!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Yes, exactly, this comment here on the function declaration. Thanks!
cfallin submitted PR Review.
cfallin created PR Review Comment:
Hmm, I think I missed some context previously -- I thought for some reason that your additions here only had to do with atomics-related misalignments, but of course if I scroll up just a bit I see this is for any `Opcode::Trap{if,ff}`. D'oh. Yes, it should always have a safepoint if it is a trap at the CLIF level; the only non-safepoint-adorned traps are the ones that are internal to e.g. divides. I suppose we should specify in the CLIF descriptions that any traps generated by instruction error conditions (rather than explicit trap instructions) will not have safepoints.

Just to be sure -- @bnjbvr, could you confirm that we don't expect (or can get away with specifying) that there is no safepoint data for e.g. divide traps?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Changing to `emit_safepoint`.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
The emit code is complicated enough as it is; would you be greatly offended if I punted on this, at least for now?
julian-seward1 updated PR #2077 from atomics-CL to main.
julian-seward1 updated PR #2077 from atomics-CL to main.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
> Just to be sure -- @bnjbvr, could you confirm that we don't expect (or can get away with specifying) that there is no safepoint data for e.g. divide traps?

Agreed, and I think that indeed it should be specified that no GC must happen during stack unwinding (that sounds very scary and unsafe anyways). (Now I wonder if wasm exception handling + wasm GC types could make this case actually possible.)
bnjbvr submitted PR Review.
bnjbvr submitted PR Review.
bnjbvr created PR Review Comment:
If this is implemented as a callout in general: yes. All the callouts in the "dummy" wasm-to-clif environment are implemented this way; it is only used when internally testing wasm compilation with clif-util, that is, when no proper environment can be used.
bnjbvr created PR Review Comment:
nit: here and below, please remove the commented-out code.
bnjbvr created PR Review Comment:
In terms of API cleanliness, would it make sense to have a single function `translate_atomic_wait` that takes an extra `Type` parameter, which can be either I32 or I64? Going one step further, would it make sense to have a single `translate_atomic` method that would be passed an enum with variants `WaitI32/I64/Notify`?
bnjbvr submitted PR Review.
julian-seward1 updated PR #2077 from atomics-CL to main.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Merging the `wait` and `notify` cases is messy because they have different params. However, the two `wait` cases can be merged without even needing an extra `Type` parameter, since the 32-vs-64-bit difference can be determined by inspecting the type of the `expected` parameter. So that's what I did.
julian-seward1 updated PR #2077 from atomics-CL to main.
julian-seward1 requested cfallin for a review on PR #2077.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Tiny nit, but this was very confusing to me for a moment; the difference between the CLIF's `AtomicRmwOp` and the aarch64 `Inst`'s `AtomicRMWOp` is just capitalization. Perhaps `use crate::ir` and then write `ir::AtomicRmwOp` here, for clarity?
cfallin submitted PR Review.
julian-seward1 updated PR #2077 from atomics-CL to main.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Er, well spotted. I didn't even notice. Must have looked as if I'd simply written the identity function for atomic RMW ops :-)
julian-seward1 merged PR #2077.