cfallin opened PR #10502 from cfallin:new-retvals-same-great-flavor-now-with-less-instructions
to bytecodealliance:main
:
This PR addresses the issues described in #10488 in a more head-on way: it removes the use of separate "return-value instructions" that load return values from the stack, instead folding these loads into the semantics of the call VCode instruction.
This is a prerequisite for exception-handling: we need calls to be workable as terminators, meaning that we cannot require any other (VCode) instructions after the call to define the return values.
In principle, this PR starts simply enough: the return-locations list on the
CallInfo
that each backend uses to provide regalloc metadata is updated to support a notion of "register or stack address" as the source of each return value, and this list is now used for both kinds of returns, not just returns in registers. Shared code is defined inmachinst::abi
used by all backends to perform the requisite loads.In order to make this work with more defined values than fit in registers, however, this PR also had to add support for "any"-constrained registers to Cranelift, and handling allocations that may be spillslots. This has always been supported by RA2, but this is the first time that Cranelift uses them directly (previously they were used only internally in RA2 as lowerings from other kinds of constraints like safepoints). This requires encoding a spillslot index in our
Reg
type.There is a little bit of complexity around handling the loads/defs as well: if we have a return value on-stack, and we need to put it in a spillslot, we cannot do a memory-to-memory move directly, so we need a temporary register. Earlier versions of this PR allocated another temp as a vreg on the call, but this doesn't work with all calling conventions (too many clobbers). For simplicity I picked a particular register that is (i) clobbered by calls and (ii) not used for return values for each architecture (x86-64's tailcall needed to lose one return-in-register slot to make this work).
This removes retval insts from the shared ABI infra completely. s390x is different, still, because it handles callsite lowering from ISLE; we will need to address that separately for exception support there.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
cfallin requested abrown for a review on PR #10502.
cfallin requested wasmtime-compiler-reviewers for a review on PR #10502.
cfallin requested alexcrichton for a review on PR #10502.
cfallin requested wasmtime-core-reviewers for a review on PR #10502.
cfallin commented on PR #10502:
I'll note that this appears to pessimize codegen somewhat for many-return-values (more than fit in registers, so more than 8? on aarch64/x86-64), since this does a manual load-from-stack/store-to-spillslot rather than letting regalloc handle it; that seems acceptable given the complexity reduction and unblocking exception support, IMHO.
alexcrichton requested fitzgen for a review on PR #10502.
alexcrichton submitted PR review:
Overall this looks reasonable to me, and I have a passing thought, but I'm going to defer to @fitzgen for review rather than myself as I think he's been chatting with you more about this
alexcrichton created PR review comment:
Could this case assert that
vreg
was not allocated totemp
?
alexcrichton created PR review comment:
Also, is it possible to somewhat easily assert that
temp
is callee-saved? (or caller-save, I forget which would be appropriate here)
github-actions[bot] commented on PR #10502:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin updated PR #10502.
cfallin updated PR #10502.
cfallin submitted PR review.
cfallin created PR review comment:
Done (to assert), good idea. It's a little trickier to assert anything about whether this is a caller-saved (volatile) register we can use: this helper is called after we've thrown away all the ABI info except for what's in the
CallInfo
, so we'd have to somehow preserve that.
cfallin commented on PR #10502:
Just pushed a fix for aarch64 -- it turns out our optimization to avoid unnecessary clobber-saves for the lower half of float registers (
is_included_in_clobbers
) can no longer apply if we have other defs (the stack-carried values) that allocate into other registers; those other registers are indeed clobbered by our function body! The pessimization there is, again, only on very-many-returned-values cases; the happy path of returns-that-fit-in-regs is unaffected.
alexcrichton commented on PR #10502:
this helper is called after we've thrown away all the ABI info
Could the
temp
parameter toCallInfo::emit_retval_loads
turn into an associated const and/or function onABIMachineSpec
? That way it wouldn't have to be passed as a parameter and it'd also be available for assertions if necessary elsewhere? No need really to bend over backwards for this, but I do think it'd be good to have a debug assertion that it's in the right set of caller or callee
cfallin updated PR #10502.
cfallin updated PR #10502.
cfallin commented on PR #10502:
Ah, yes, that's a much better idea, thanks! Done.
uweigand commented on PR #10502:
Hi @cfallin I've had a quick look at this, and it should be relatively straightforward to implement this for s390x as well; I'd be happy to do that (either after this has landed, or else I could provide a patch ahead of time?).
However, I do have one comment about the new
has_non_abi_defs
mechanism. Not only is this further complicating theis_included_in_clobbers
mechanism, it also introduces performance regressions.The original reason for
is_included_in_clobbers
was that for partially callee-saved registers (e.g. on s390x, only the top half of some vector registers, which overlap the old floating-point registers, is callee-saved), in a sequence of A calling B calling C, we have to conservatively assume that the call B->C clobbers them, while the call A->B must preserve them, and therefore B will have to save/restore them all. Of course, as long as the A->B calling convention is the same as the B->C calling convention, we can avoid this, which is implemented viais_included_in_clobbers
.Now, assuming the call B->C needs another def as it has to load a return value into a callee-saved register. We have to consider this, which is what the new
has_non_abi_defs
mechanism achieves. However, that will then also have the side effect of once again forcing B to save/restore all floating-point registers - there's really no good reason for this.Quite a while ago, I made the suggestion that
is_included_in_clobbers
should really only apply to the clobber list, not the def list. If we were to make that change, then the extra defs for this case would just happen to work transparently without any newhas_non_abi_defs
mechanism, and it would not introduce the above regression.
cfallin commented on PR #10502:
@uweigand thanks, that is a great suggestion -- much better than what I did, and I will update as suggested!
cfallin updated PR #10502.
fitzgen submitted PR review:
LGTM!
fitzgen merged PR #10502.
alexcrichton commented on PR #10502:
A number of fuzz bugs have this in the regression range and I suspect might be related, would you be up for taking a look at these when you're back on Monday?
cfallin commented on PR #10502:
Yes, just bisected and commented on one of them -- will take a look!
cfallin commented on PR #10502:
Fuzzbug fix in RA2 (https://github.com/bytecodealliance/regalloc2/pull/214) -- this PR creates more challenging constraints and the bundle-splitting logic had to get a little bit more refined.
Last updated: Apr 17 2025 at 10:03 UTC