Stream: git-wasmtime

Topic: wasmtime / issue #4552 Cranelift AArch64: Expand the set ...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 28 2022 at 18:54):

akirilov-arm opened issue #4552:

Currently (as of commit 8137432e67c920b73a3bbcc4eb72ae5095d31f41) Cranelift's AArch64 backend excludes the following general-purpose registers (GPRs) unconditionally from the set of allocatable registers:

This list might be too conservative, except for the following cases:

As for the rest:

X16 and X17 are currently reserved as spill temporaries, as explained by @cfallin:

regalloc2 actually doesn't need any temporaries anymore, but aarch64 itself does. The reason is that a spillslot may be at a greater offset from sp or fp than we can reach with an imm12, so we need a sequence of instructions to synthesize the address of a spillslot before spilling or reloading. That sequence itself can't require spilling another register if all registers are full (as they are likely to be if we're spilling in the first place), so we need to set aside x16 for that.

If I recall correctly, x17 is used in stack-limit check sequences...
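To make the imm12 limitation concrete, here is a minimal sketch of the decision described above. It assumes 64-bit spill slots addressed from SP with the scaled, unsigned 12-bit offset form of LDR/STR, which reaches at most 4095 * 8 bytes. The names `SpillAccess` and `plan_spill_access` are hypothetical and not taken from Cranelift; the unscaled LDUR/STUR forms are ignored for simplicity.

```rust
/// Largest byte offset encodable in the scaled, unsigned 12-bit immediate
/// of a 64-bit LDR/STR (the immediate counts 8-byte units).
const MAX_SCALED_IMM12_X: u64 = 4095 * 8;

/// Hypothetical description of how a 64-bit spill or reload at `offset`
/// bytes from SP could be emitted.
#[derive(Debug, PartialEq, Eq)]
enum SpillAccess {
    /// The offset fits the immediate form, e.g. `str xN, [sp, #offset]`.
    Direct { offset: u64 },
    /// The offset is out of range: the address has to be synthesized in a
    /// scratch register first, which is why one is set aside.
    ViaScratch { scratch: u8, offset: u64 },
}

fn plan_spill_access(offset: u64) -> SpillAccess {
    // Simplification: a misaligned offset is treated like an out-of-range one.
    if offset % 8 == 0 && offset <= MAX_SCALED_IMM12_X {
        SpillAccess::Direct { offset }
    } else {
        // Assumption for illustration: x16 is the reserved scratch register.
        SpillAccess::ViaScratch { scratch: 16, offset }
    }
}

fn main() {
    println!("{:?}", plan_spill_access(64));
    println!("{:?}", plan_spill_access(1 << 20));
}
```

The second case is the problematic one: if every allocatable register is live, the scratch register can only come from a reserved set.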

From the same discussion, @cfallin's suggestion for an alternative approach:

... an alternative approach would be to reserve a small-offset slot to spill another victim to if we need to compute a spillslot address at a large distance away -- so we can bootstrap our way there with no registers initially free.

In particular, we might expand the stack area next to the frame record - instead of decrementing the stack pointer by 16 bytes to save FP and LR in a function prologue, we could decrement by 32 bytes, so that we would have a scratch area for 2 GPRs as well.
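A rough sketch of the arithmetic behind that suggestion follows. It assumes 8-byte GPRs, 16-byte SP alignment, and that the scratch slots sit directly above the saved FP/LR pair; the issue only says "next to the frame record", so that placement and the names `FrameRecordLayout` and `frame_record_layout` are hypothetical.

```rust
/// Byte offsets from SP right after the prologue's stack decrement.
#[derive(Debug, PartialEq, Eq)]
struct FrameRecordLayout {
    /// Amount the prologue subtracts from SP.
    total_size: u32,
    /// Offset of the saved frame pointer.
    fp_offset: u32,
    /// Offset of the saved link register.
    lr_offset: u32,
    /// Offsets of the scratch slots (assumed placement, see lead-in).
    scratch_offsets: Vec<u32>,
}

fn frame_record_layout(scratch_gprs: u32) -> FrameRecordLayout {
    const GPR_SIZE: u32 = 8;
    const SP_ALIGN: u32 = 16;

    // FP and LR plus the requested scratch slots, rounded up to keep SP aligned.
    let raw = (2 + scratch_gprs) * GPR_SIZE;
    let total_size = (raw + SP_ALIGN - 1) / SP_ALIGN * SP_ALIGN;

    FrameRecordLayout {
        total_size,
        fp_offset: 0,
        lr_offset: GPR_SIZE,
        scratch_offsets: (0..scratch_gprs).map(|i| (2 + i) * GPR_SIZE).collect(),
    }
}

fn main() {
    println!("{:?}", frame_record_layout(0)); // today's 16-byte frame record
    println!("{:?}", frame_record_layout(2)); // the proposed 32-byte variant
}
```

Because the scratch slots stay within a few bytes of SP, they are always reachable with an immediate offset, which is what makes the bootstrap possible.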

Another option is to reserve a vector register, which would give the same amount of space.

Note that the code generating branch veneers assumes that both registers are available, so it would need adjustments.

The AAPCS64 states that the platform can use X18 to carry inter-procedural state, and Cranelift assumes that it does. Otherwise X18 could be used as a regular temporary register; perhaps we could revisit that assumption?

PR #4469 introduced the preserve_frame_pointers flag, which when true guarantees that the LR register (X30) is saved in the function prologue and restored in the epilogue, thus making it usable as a temporary register in between. We just have to ensure that calls are set up to clobber it, so that regalloc does the right thing. A restricted version of this idea that is potentially easier to implement is to make X30 a spill temporary instead, and to turn either X16 or X17 into a regular temporary register.

In fact the same optimization is also applicable when the preserve_frame_pointers flag is false as long as we are compiling a function that creates a frame record on the stack, e.g. a non-leaf one, but currently the backend plumbing is not set up to make that decision on a per-function basis. As @cfallin stated, any backend changes to remedy that limitation are subject to the following constraint:

The only thing I want to hold as a hard requirement is that we don't build it dynamically per-function (because there are lots of tiny functions and that would be a nontrivial cost); right now we build it once when the compiler backend is constructed. We could perhaps build a few versions of it though, and return the right one in the regalloc2::Function trait -- one for leaf functions and one without; and variations based on compiler flags.
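The constraint above could be met along these lines: build a small, fixed set of register environments once, then hand out a reference per function. This is only a sketch. `RegEnv`, `RegEnvs` and `select` are hypothetical stand-ins rather than the regalloc2 API, and the register numbers in the base set are illustrative, not copied from the backend.

```rust
/// Hypothetical stand-in for a register environment: just the list of
/// allocatable GPR numbers.
#[derive(Debug)]
struct RegEnv {
    allocatable_gprs: Vec<u8>,
}

/// All variants, built once when the backend is constructed.
struct RegEnvs {
    /// For functions without a frame record: LR still holds the return
    /// address, so X30 stays out of the allocatable set.
    leaf: RegEnv,
    /// For functions that save LR in a frame record: X30 is allocatable.
    with_frame: RegEnv,
}

impl RegEnvs {
    fn new() -> Self {
        // Illustrative base set: x0-x15 and x19-x28.
        let base: Vec<u8> = (0..=15).chain(19..=28).collect();
        let mut with_lr = base.clone();
        with_lr.push(30);
        RegEnvs {
            leaf: RegEnv { allocatable_gprs: base },
            with_frame: RegEnv { allocatable_gprs: with_lr },
        }
    }

    /// Per-function work is a single branch; nothing is allocated or rebuilt.
    fn select(&self, creates_frame_record: bool) -> &RegEnv {
        if creates_frame_record {
            &self.with_frame
        } else {
            &self.leaf
        }
    }
}

fn main() {
    let envs = RegEnvs::new();
    println!("leaf: {} GPRs", envs.select(false).allocatable_gprs.len());
    println!("with frame: {} GPRs", envs.select(true).allocatable_gprs.len());
}
```

Variations based on compiler flags would add more fields of the same kind, so the number of prebuilt environments stays small and fixed.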

view this post on Zulip Wasmtime GitHub notifications bot (Jul 28 2022 at 18:54):

akirilov-arm labeled issue #4552.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 28 2022 at 18:54):

akirilov-arm labeled issue #4552.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 28 2022 at 18:54):

akirilov-arm labeled issue #4552.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 28 2022 at 18:54):

akirilov-arm labeled issue #4552.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 08 2022 at 10:15):

akirilov-arm labeled issue #4552.


Last updated: Nov 22 2024 at 17:03 UTC