uweigand opened PR #3746 from s390x-atomicops
to main
:
This adds support for all atomic operations that were unimplemented
so far in the s390x back end:
- atomic_rmw operations xchg, nand, smin, smax, umin, umax
- $I8 and $I16 versions of atomic_rmw and atomic_cas
- little endian versions of atomic_rmw and atomic_cas
All of these have to be implemented by a compare-and-swap loop;
and for the $I8 and $I16 versions the actual atomic instruction
needs to operate on the surrounding aligned 32-bit word.Since we cannot emit new control flow during ISLE instruction
selection, these compare-and-swap loops are emitted as a single
meta-instruction to be expanded at emit time.However, since there is a large number of different versions of
the loop required to implement all the above operations, I've
implemented a facility to allow specifying the loop bodies
from within ISLE after all, by creating a vector of MInst
structures that will be emitted as part of the meta-instruction.There are still restrictions, in particular instructions that
are part of the loop body may not modify any virtual register.
But even so, this approach looks preferable to doing everything
in emit.rs.A few instructions needed in those compare-and-swap loop bodies
were added as well, in particular the RxSBG family of instructions
as well as the LOAD REVERSED in-register byte-swap instructions.This patch also adds filetest runtests to verify the semantics
of all operations, in particular the subword and little-endian
variants (those are currently only executed on s390x).CC @cfallin @fitzgen
<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
fitzgen created PR review comment:
I think this kind of thing that is explicitly about side effects should be quarantined to
inst.isle
.
fitzgen created PR review comment:
Could we move
casloop_emit
andcasloop_result
toinst.isle
and then provide a combined version that passes the result ofcasloop_emit
tocaseloop_result
for you? This combined helper would present a little more of an expression-oriented facade, like we prefer to use inside oflower.isle
.Similar for
casloop_emit
followed bycasloop_rotate_result
.
fitzgen created PR review comment:
Is it possible to expand the loop bodies inside
isle.rs
, rather than creating a new meta instruction variant? If that's possible, I think this would allow regalloc to see the loop, and loosen the restrictions on loop bodies.
fitzgen submitted PR review.
fitzgen created PR review comment:
Does "iv" stand for "instruction vec"? For readability, what do you think of calling this
inst_builder_new
and similar for the otheriv_*
decls?
fitzgen submitted PR review.
uweigand updated PR #3746 from s390x-atomicops
to main
.
uweigand submitted PR review.
uweigand created PR review comment:
I was wondering myself where to put those helpers. I've now moved them to
inst.isle
.
uweigand submitted PR review.
uweigand created PR review comment:
Done.
uweigand submitted PR review.
uweigand created PR review comment:
Done.
uweigand submitted PR review.
uweigand created PR review comment:
I don't see how this would be possible. Emitting separate instructions instead of a single meta-instruction gets us back to the very problem the meta-instruction is intended to avoid: we'd need to create branches (and their target labels), which we cannot do since we cannot modify the CFG at this point. And even if could do that somehow, we'd also have to model the loop dataflow correctly: it's not just that we're writing to a register, it is that this register is a loop-carried dependency - so if we split out separate instructions, we'd have to create a new PHI node to correctly represent that ...
In any case, the restriction to hard registers works out fine for this use case on s390x (and I don't really foresee any other use cases): we need to have %r0 and %r1 as reserved registers anyway for various reasons, and using those two actually reduces register pressure in the CAS loop, which already needs quite a few of them.
cfallin submitted PR review.
cfallin created PR review comment:
@fitzgen indeed we have a restriction here that we cannot create new control-flow at the VCode level (the succs and preds and associated bits are already computed, and contiguous block IDs are already assigned). This practice of meta-instructions that contain mini-CFGs (always with a single-in single-out property, modulo traps) has been our solution so far, and we have some other examples in aarch64 atomics (CAS loops), divides (special-case checks and traps), br_tables (bounds checks), and maybe a few others. Maybe we could loosen this in the future if we think we might want opts after lowering that could make use of visibility into the instructions.
FWIW elsewhere we've mostly done this in a way that is specific to each use-case rather than a generic combinator (loop with "body" field) but if that's what works here, I'm not really opposed to it.
fitzgen submitted PR review.
fitzgen created PR review comment:
Gotcha. Makes sense, but seems like a little bit of an unfortunate restriction.
fitzgen submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
I agree, and we could potentially relax it if it turns out we need more general control flow expansion later; there is a bit of a performance tradeoff here though that a more general approach (reconstructing succs/preds and numbering blocks after lowering) would impose. As a ballpark number, when I reworked lowering to work in a single-pass, emit-blocks-in-their-final-order way, with no edits to the CFG afterward, it was IIRC about a 10% compile-time improvement.
uweigand updated PR #3746 from s390x-atomicops
to main
.
cfallin merged PR #3746.
Last updated: Nov 22 2024 at 17:03 UTC