maekawatoshiki edited issue #6742:
Feature
Currently, on the aarch64 backend, the following sequence of CLIF instructions...
```
; Equivalent to: int64_t *v9; int64_t v10; v4 = v9[v10];
v1 = iconst.i64 3
v2 = ishl.i64 v10, v1  ; v1 = 3
v3 = iadd v9, v2
v4 = load.i64 v3
```
... will generate assembly like the following:
```
adrp x4, 0x780000
ldr x4, [x4]
lsl x5, x3, #3
ldr x4, [x4, x5]
```
However, the assembly can be converted into a more efficient form like this:
```
adrp x4, 0x780000
ldr x4, [x4]
ldr x4, [x4, x3, lsl #3]
```
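The two sequences compute the same effective address: the first pre-scales the index with a separate `lsl`, while the second folds the shift into the load's register-offset addressing mode. A minimal sketch of that equivalence in plain Rust (not Cranelift code; the function names are illustrative only):

```rust
// `base` plays the role of x4 after the adrp/ldr pair; `index` is x3.
// Separate-shift sequence: lsl x5, x3, #3 ; ldr x4, [x4, x5]
fn addr_separate_shift(base: u64, index: u64) -> u64 {
    let scaled = index << 3; // lsl x5, x3, #3
    base.wrapping_add(scaled) // address used by ldr x4, [x4, x5]
}

// Scaled-register addressing: ldr x4, [x4, x3, lsl #3]
fn addr_reg_scaled(base: u64, index: u64) -> u64 {
    // The shift happens inside the load's address generation.
    base.wrapping_add(index << 3)
}

fn main() {
    for (base, index) in [(0x780000u64, 0u64), (0x780000, 5), (0x1000, 1 << 20)] {
        assert_eq!(addr_separate_shift(base, index), addr_reg_scaled(base, index));
    }
    println!("equivalent");
}
```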
Benefit
The shorter instruction sequence helps improve performance.
In fact, I found this problem while diffing the assembly generated by Cranelift and LLVM, where LLVM was around 10% faster than Cranelift in my case.
Implementation
I've walked through the Cranelift codebase and figured out that this addressing mode seems to be represented as `AMode::RegScaled`, but I'm not sure how to teach the code generator to use `RegScaled` for `ldr`.
Should I edit ISLE rules, or something like that?
cfallin commented on issue #6742:
This would be great to have!
In fact, we even have a TODO in the code already (but unfortunately it looks like we didn't file an issue at the time, sorry!).
As can be seen at that link, we actually never translated the `lower_address` implementation to ISLE, so it's still done with manual pattern-matching in Rust. The ideal longer-term solution would be to rework it into ISLE, as we have on x64, which would make additional pattern-matching like this easier. I think we'd prefer that, but it's also a bit more work.
It might be possible to shoehorn it into the manual Rust code above, but it's a little tricky: the code works by collecting `addends32` and `addends64`, with the semantics that the address is the sum of all of those (with 32-bit values zero-extended). We'd have to have a separate `shifted` (with a type something like `Option<(Reg, u8)>` to keep a value-in-register and a scale), collect it with the addends, and incorporate it if possible. But that's rapidly getting more complex than the equivalent in ISLE, so I think it's probably not the best approach.
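The `shifted` component described above could look roughly like the sketch below. All names here are hypothetical simplifications, not the actual Cranelift API: `Reg` stands in for Cranelift's register type, and `choose_amode` stands in for the selection logic inside `lower_address`.

```rust
// Hypothetical sketch of combining plain addends with one optional
// (register, scale) pair recovered from an `ishl` by a constant.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Reg(u32);

#[derive(Debug)]
enum AMode {
    RegReg(Reg, Reg),        // [rn, rm]
    RegScaled(Reg, Reg, u8), // [rn, rm, lsl #shift]
}

/// Pick an addressing mode from collected 64-bit addends plus an optional
/// scaled term; return None to fall back to summing into a register.
fn choose_amode(addends64: &[Reg], shifted: Option<(Reg, u8)>) -> Option<AMode> {
    match (addends64, shifted) {
        // One plain addend plus a scaled one: use the scaled-register mode.
        ([base], Some((idx, shift))) => Some(AMode::RegScaled(*base, idx, shift)),
        // Two plain addends and no scaled term: register + register.
        ([a, b], None) => Some(AMode::RegReg(*a, *b)),
        _ => None,
    }
}

fn main() {
    // For the CLIF in the issue: base v9 plus v10 shifted left by 3.
    match choose_amode(&[Reg(9)], Some((Reg(10), 3))) {
        Some(AMode::RegScaled(base, idx, shift)) => {
            assert_eq!((base, idx, shift), (Reg(9), Reg(10), 3));
        }
        other => panic!("expected RegScaled, got {:?}", other),
    }
    println!("RegScaled selected");
}
```

The real code would also have to check that the scale matches the access size and that no other addends remain, but the shape of the bookkeeping is the point here.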
maekawatoshiki commented on issue #6742:
Thank you for your quick reply.
I agree that it gets more complicated if we change the current `lower_address` implementation, but translating `lower_address` entirely to ISLE is also hard work.
I'll try to translate a part of `lower_address` into ISLE, and hopefully make a PR.
maekawatoshiki edited a comment on issue #6742:
Thank you for your quick reply.
I agree that it gets more complicated if we change the current `lower_address` implementation, but translating `lower_address` entirely to ISLE is also hard work.
I'll try to translate a part of `lower_address` into ISLE, and hopefully make a PR.
I've just noticed that it seems to be impossible to partially translate `lower_address` into ISLE, due to `(extern constructor amode amode)` prohibiting defining overlapping rules...
alexcrichton closed issue #6742:
Last updated: Dec 23 2024 at 12:05 UTC