cfallin commented on issue #4521:
Happy to merge once merge conflicts are resolved.
akirilov-arm commented on issue #4521:
@cfallin I have also noticed that
x16
andx17
are reserved as spill temporaries, hence not allocatable - is that still relevant for regalloc2?
cfallin commented on issue #4521:
@akirilov-arm yes, unfortunately... regalloc2 actually doesn't need any temporaries anymore, but aarch64 itself does. The reason is that a spillslot may be at a greater offset from
sp
orfp
than we can reach with animm12
, so we need a sequence of instructions to synthesize the address of a spillslot before spilling or reloading. That sequence itself can't require spilling another register if all registers are full (as they are likely to be if we're spilling in the first place), so we need to set asidex16
for that.If I recall correctly,
x17
is used in stack-limit check sequences, but my memory is less clear; I'd have to go look at the use-cases again.
cfallin commented on issue #4521:
To follow up on that, I guess an alternative approach would be to reserve a small-offset slot to spill another victim to if we need to compute a spillslot address at a large distance away -- so we can bootstrap our way there with no registers initially free. That's a little more complexity but I wouldn't be averse to reviewing a PR if someone wants to attempt that.
akirilov-arm commented on issue #4521:
Another alternative is to reserve a vector register, which would give us the same space as 2 GPRs.
One further idea - if the
preserve_frame_pointers
flag introduced by PR #4469 is true, then we should be able to uselr
/x30
as a temporary register instead of making it unallocatable as it is right now.
cfallin commented on issue #4521:
One further idea - if the preserve_frame_pointers flag introduced by PR https://github.com/bytecodealliance/wasmtime/pull/4469 is true, then we should be able to use lr/x30 as a temporary register instead of making it unallocatable as it is right now.
Ah, that's a really interesting idea actually. I guess it's not needed for unwind (that starts from
fp
), and in the middle of this sequence we won't be doing any calls, and it's otherwise usable as a regular GPR... I like it.
akirilov-arm commented on issue #4521:
The optimal solution would be to adjust the set of allocatable registers on a per-function basis, so that we could do the same for non-leaf functions (or leaf functions that use the stack) when
preserve_frame_pointers
is false, but I don't think the backend plumbing is set to enable this.
akirilov-arm edited a comment on issue #4521:
The optimal solution would be to adjust the set of allocatable registers on a per-function basis, so that we could do the same for non-leaf functions (or leaf functions that use the stack) when
preserve_frame_pointers
is false, but I don't think the backend plumbing is set to enable this.P.S. Actually I didn't mean to use
x30
as a spill temporary, but as a generic temporary register, i.e. put it in the set of preferred registers for allocation; calls would be set up to clobber it. Restricting it to a spill temporary is also an interesting idea because we would be able to reclaim eitherx16
orx17
.
akirilov-arm edited a comment on issue #4521:
The optimal solution would be to adjust the set of allocatable registers on a per-function basis, so that we could do the same for non-leaf functions (or leaf functions that use the stack) when
preserve_frame_pointers
is false, but I don't think the backend plumbing is set to enable this.P.S. Actually I didn't mean to use
x30
as a spill temporary, but as a generic temporary register, i.e. put it in the set of preferred registers for allocation; calls would be set up to clobber it, so that regalloc would do the right thing. Restricting it to a spill temporary is also a viable idea because we would be able to reclaim eitherx16
orx17
; I suppose I am just ambituous and want all of them :smile:.
jameysharp commented on issue #4521:
I'm curious about the small-offset spillslot idea. I haven't looked at aarch64 details and don't know enough about Cranelift internals yet, but my assumptions are:
- The imm12 is a signed 12-bit immediate, so this challenge only happens if the stack frame is bigger than 2kB.
sp
andfp
point to opposite ends of the stack frame so if you pick whichever is closer as the base then you could avoid spilling for any frame smaller than 4kB.- So I imagine there's an opportunity for a lot of functions to gain an extra register.
- But we don't know how big the stack frame is until after register allocation because it depends on how many spills are inserted.
- Still, we could decide after regalloc whether to add a small-offset spillslot; if we need it, making the stack frame bigger doesn't change the fact that we need it.
Something along those lines?
cfallin commented on issue #4521:
@jameysharp yep, that's more or less it.
@akirilov-arm re:
I don't think the backend plumbing is set to enable this
we could definitely change that! The only thing I want to hold as a hard requirement is that we don't build it dynamically per-function (because there are lots of tiny functions and that would be a nontrivial cost); right now we build it once when the compiler backend is constructed. We could perhaps build a few versions of it though, and return the right one in the
regalloc2::Function
trait -- one for leaf functions and one without; and variations based on compiler flags.Anyway we're getting on quite a tangent here but if you're interested, please feel free to file an issue to "reclaim spilltmp registers on aarch64" as a future enhancement to track this!
Last updated: Dec 23 2024 at 12:05 UTC