akirilov-arm opened issue #4552:
Currently (as of commit 8137432e67c920b73a3bbcc4eb72ae5095d31f41) Cranelift's AArch64 backend excludes the following general-purpose registers (GPRs) unconditionally from the set of allocatable registers:
- X16
- X17
- X18
- X29 AKA FP
- X30 AKA LR
- X31 AKA XZR/SP
This list might be too conservative, except for the following cases:
- X29 - the Procedure Call Standard for the Arm® 64-bit Architecture (AAPCS64) specifies that it is the frame pointer and that it must have a valid value at all times
- X31 - in most contexts (e.g. non-memory operations) it is decoded as XZR, which has an architecturally fixed value (0), and in other cases as SP, so it is not usable in general
As for the rest:
X16 and X17
Currently reserved as spill temporaries, as explained by @cfallin:
regalloc2 actually doesn't need any temporaries anymore, but aarch64 itself does. The reason is that a spillslot may be at a greater offset from sp or fp than we can reach with an imm12, so we need a sequence of instructions to synthesize the address of a spillslot before spilling or reloading. That sequence itself can't require spilling another register if all registers are full (as they are likely to be if we're spilling in the first place), so we need to set aside x16 for that.
If I recall correctly, x17 is used in stack-limit check sequences...
From the same discussion, @cfallin's suggestion for an alternative approach:
... an alternative approach would be to reserve a small-offset slot to spill another victim to if we need to compute a spillslot address at a large distance away -- so we can bootstrap our way there with no registers initially free.
In particular, we might expand the stack area next to the frame record - instead of decrementing the stack pointer by 16 bytes to save FP and LR in a function prologue, we could decrement by 32 bytes, so that we would have a scratch area for 2 GPRs as well.
Another option is to reserve a vector register, which would give the same amount of space.
Note that the code generating branch veneers assumes that both registers are available, so it would need adjustments.
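To make the imm12 limit above concrete, here is a minimal Rust sketch of the check that decides whether a spill slot can be addressed directly or needs its address synthesized in a temporary. It assumes only the unsigned, scaled-offset form of LDR/STR (12-bit immediate multiplied by the access size) and ignores other addressing forms; the names fits_scaled_uimm12, SpillAddr and classify are hypothetical and are not Cranelift APIs.

```rust
/// How a spill slot at `offset` bytes from SP/FP could be addressed.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum SpillAddr {
    /// The offset fits in the scaled unsigned 12-bit immediate.
    Direct { offset: i64 },
    /// The address must first be computed into a temporary (e.g. x16).
    ViaTemp { offset: i64 },
}

/// True if `offset` is encodable in the unsigned-offset form of LDR/STR
/// for an access of `size` bytes: non-negative, a multiple of `size`,
/// and at most 4095 * `size`.
fn fits_scaled_uimm12(offset: i64, size: i64) -> bool {
    offset >= 0 && offset % size == 0 && offset / size <= 4095
}

/// Classify a spill slot access (assumption: no other addressing form is tried).
fn classify(offset: i64, size: i64) -> SpillAddr {
    if fits_scaled_uimm12(offset, size) {
        SpillAddr::Direct { offset }
    } else {
        SpillAddr::ViaTemp { offset }
    }
}
```

For 8-byte accesses this gives a direct reach of 32760 bytes; anything beyond that lands in the ViaTemp case, which is where a reserved register or the proposed small-offset scratch slot would come into play.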
X18
The AAPCS64 states that the platform can use it to carry inter-procedural state, which is assumed by Cranelift, but X18 could be used as a regular temporary register otherwise; perhaps we could revisit that assumption?
X30
PR #4469 introduced the preserve_frame_pointers flag, which when true guarantees that the LR register is saved in the function prologue and restored in the epilogue, thus making it usable as a temporary register in between; we just have to ensure that calls are set up to clobber it, so that regalloc does the right thing. A restricted version of this idea that is potentially easier to implement is to make X30 a spill temporary instead, and to turn either X16 or X17 into a regular temporary register.
In fact the same optimization is also applicable when the preserve_frame_pointers flag is false as long as we are compiling a function that creates a frame record on the stack, e.g. a non-leaf one, but currently the backend plumbing is not set up to make that decision on a per-function basis. As @cfallin stated, any backend changes to remedy that limitation are subject to the following constraint:
The only thing I want to hold as a hard requirement is that we don't build it dynamically per-function (because there are lots of tiny functions and that would be a nontrivial cost); right now we build it once when the compiler backend is constructed. We could perhaps build a few versions of it though, and return the right one in the regalloc2::Function trait -- one for leaf functions and one without; and variations based on compiler flags.
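The "build a few versions once, pick one per function" idea could look roughly like the following Rust sketch. Everything here is hypothetical: GprSet is a plain bitmask standing in for the real register environment, RegEnvs and select are invented names, and "non-leaf" is used as a stand-in for "creates a frame record". It is not how regalloc2 or Cranelift represent this today.

```rust
/// Hypothetical bitmask of allocatable GPRs: bit N set means XN is allocatable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GprSet(u32);

impl GprSet {
    fn without(self, reg: u8) -> Self {
        GprSet(self.0 & !(1u32 << reg))
    }
    fn with(self, reg: u8) -> Self {
        GprSet(self.0 | (1u32 << reg))
    }
    fn contains(self, reg: u8) -> bool {
        self.0 & (1u32 << reg) != 0
    }
}

/// Variants built once, when the backend is constructed.
struct RegEnvs {
    /// Excludes X16, X17, X18, X29, X30 and X31 (the current behaviour).
    conservative: GprSet,
    /// Same, but with X30 (LR) allocatable.
    lr_free: GprSet,
}

impl RegEnvs {
    fn new() -> Self {
        let mut conservative = GprSet(u32::MAX);
        for reg in [16u8, 17, 18, 29, 30, 31] {
            conservative = conservative.without(reg);
        }
        RegEnvs { conservative, lr_free: conservative.with(30) }
    }

    /// Per-function selection is only a branch; nothing is rebuilt.
    /// Assumption: LR is saved and restored whenever the flag is set
    /// or the function is not a leaf.
    fn select(&self, preserve_frame_pointers: bool, is_leaf: bool) -> &GprSet {
        if preserve_frame_pointers || !is_leaf {
            &self.lr_free
        } else {
            &self.conservative
        }
    }
}
```

The point of the design is that select returns a reference into data created at backend construction time, so compiling many tiny functions costs only a comparison per function.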
akirilov-arm labeled issue #4552:
Last updated: Nov 22 2024 at 17:03 UTC