Stream: git-wasmtime

Topic: wasmtime / issue #6742 Cranelift: Generate load/store usi...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 04:39):

maekawatoshiki edited issue #6742:

Feature

Currently, on aarch64 backend, the following piece of CLIF instructions...

; Equivalent to: int64_t *v9; int64_t v10; v4 = v9[v10];
v1 = iconst.i64 3
v2 = ishl.i64 v10, v1  ; v1 = 3
v3 = iadd v9, v2
v4 = load.i64 v3

... will generate the assembly like below:

adrp    x4, 0x780000
ldr     x4, [x4]
lsl     x5, x3, #3
ldr     x4, [x4, x5]

However, the assembly can be converted into more efficient one like this:

adrp    x4, 0x780000
ldr     x4, [x4]
ldr     x4, [x4, x3, lsl #3]

Benefit

The shorter instruction sequence will help improve the performance.
In fact, this problem was found when I was diffing the assembly generated by cranelift and llvm, where llvm was around 20% faster than cranelift in my case.

Implementation

I've walked through the cranelift codebase and figured out that such addressing mode seems to be represented as AMode::RegScaled, but not sure how I can teach the code generator to use RegScaled for ldr.
Editing isle rules or something like that?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 05:04):

cfallin commented on issue #6742:

This would be great to have!

In fact, we even have a TODO in the code already (but unfortunately it looks like we didn't file an issue at the time, sorry!)

As can be seen at that link, we actually never translated the lower_address implementation to ISLE, so it's still done with manual pattern-matching in Rust. The ideal longer-term solution would be to rework it into ISLE, as we have on x64, which would make additional pattern-matching like this easier. I think we'd prefer that, but it's also a bit more work.

It might be possible to shoehorn it into the manual Rust code above, but it's a little tricky: the code works by collecting addends32 and addends64, with the semantics that the address is the sum of all of those (with 32-bit values zero-extended). We'd have to have a separate shifted (with a type something like Option<(Reg, u8)> to keep a value-in-register and a scale), collect it with the addends, and incorporate it if possible. But that's rapidly getting more complex than the equivalent in ISLE, so I think it's probably not the best approach.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 06:05):

maekawatoshiki commented on issue #6742:

Thank you for your quick reply.

I agree that it gets more complicated if we change the current lower_address implementation, but translating lower_address entirely to ISLE is also a hard work.

I'll try to translate a part of lower_address into ISLE, and hopefully make a PR.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 07:22):

maekawatoshiki edited a comment on issue #6742:

Thank you for your quick reply.

I agree that it gets more complicated if we change the current lower_address implementation, but translating lower_address entirely to ISLE is also a hard work.

I'll try to translate a part of lower_address into ISLE, and hopefully make a PR.
I've just noticed that it seems to be impossible to partially translate lower_address into ISLE, due to (extern constructor amode amode) ...

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 07:25):

maekawatoshiki edited a comment on issue #6742:

Thank you for your quick reply.

I agree that it gets more complicated if we change the current lower_address implementation, but translating lower_address entirely to ISLE is also a hard work.

I'll try to translate a part of lower_address into ISLE, and hopefully make a PR.
I've just noticed that it seems to be impossible to partially translate lower_address into ISLE, due to (extern constructor amode amode) prohibiting defining overlapping rules...

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 07:30):

maekawatoshiki edited a comment on issue #6742:

Thank you for your quick reply.

I agree that it gets more complicated if we change the current lower_address implementation, but translating lower_address entirely to ISLE is also a hard work.

I'll try to translate a part of lower_address into ISLE, and hopefully make a PR.
I've just noticed that it seems to be impossible to partially translate lower_address into ISLE, due to (extern constructor amode amode) ...

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2023 at 15:52):

maekawatoshiki edited issue #6742:

Feature

Currently, on aarch64 backend, the following piece of CLIF instructions...

; Equivalent to: int64_t *v9; int64_t v10; v4 = v9[v10];
v1 = iconst.i64 3
v2 = ishl.i64 v10, v1  ; v1 = 3
v3 = iadd v9, v2
v4 = load.i64 v3

... will generate the assembly like below:

adrp    x4, 0x780000
ldr     x4, [x4]
lsl     x5, x3, #3
ldr     x4, [x4, x5]

However, the assembly can be converted into more efficient one like this:

adrp    x4, 0x780000
ldr     x4, [x4]
ldr     x4, [x4, x3, lsl #3]

Benefit

The shorter instruction sequence will help improve the performance.
In fact, this problem was found when I was diffing the assembly generated by cranelift and llvm, where llvm was around 10% faster than cranelift in my case.

Implementation

I've walked through the cranelift codebase and figured out that such addressing mode seems to be represented as AMode::RegScaled, but not sure how I can teach the code generator to use RegScaled for ldr.
Editing isle rules or something like that?

view this post on Zulip Wasmtime GitHub notifications bot (Aug 31 2023 at 21:04):

alexcrichton closed issue #6742:

Feature

Currently, on aarch64 backend, the following piece of CLIF instructions...

; Equivalent to: int64_t *v9; int64_t v10; v4 = v9[v10];
v1 = iconst.i64 3
v2 = ishl.i64 v10, v1  ; v1 = 3
v3 = iadd v9, v2
v4 = load.i64 v3

... will generate the assembly like below:

adrp    x4, 0x780000
ldr     x4, [x4]
lsl     x5, x3, #3
ldr     x4, [x4, x5]

However, the assembly can be converted into more efficient one like this:

adrp    x4, 0x780000
ldr     x4, [x4]
ldr     x4, [x4, x3, lsl #3]

Benefit

The shorter instruction sequence will help improve the performance.
In fact, this problem was found when I was diffing the assembly generated by cranelift and llvm, where llvm was around 10% faster than cranelift in my case.

Implementation

I've walked through the cranelift codebase and figured out that such addressing mode seems to be represented as AMode::RegScaled, but not sure how I can teach the code generator to use RegScaled for ldr.
Editing isle rules or something like that?


Last updated: Oct 23 2024 at 20:03 UTC