afonso360 opened issue #5547:
:wave: Hey,
Feature
I had this idea when looking into #5508.
We can optimize
br_table
with 0 targets in the JumpTable into a single jump into the default block.
We can also optimize JumpTables with 1 target into abrz
+jump
which should still provide code size benefits.Benefit
Generated code size improvements.
Here's an example for the x86 backend.
<details>
<summary>br_table
with 1 target</summary>clif:
function %br_table_with_1_targets(i32) { jt0 = jump_table [block2] block0(v0: i32): br_table v0, block1, jt0 block1: return block2: return }
Assembly:
0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 83 ff 01 cmp edi, 1 7: 0f 83 20 00 00 00 jae 0x2d d: 8b d7 mov edx, edi f: b9 00 00 00 00 mov ecx, 0 14: 48 0f 43 d1 cmovae rdx, rcx 18: 48 8d 0d 0a 00 00 00 lea rcx, [rip + 0xa] 1f: 48 63 54 91 00 movsxd rdx, dword ptr [rcx + rdx*4] 24: 48 01 d1 add rcx, rdx 27: ff e1 jmp rcx 29: 09 00 or dword ptr [rax], eax 2b: 00 00 add byte ptr [rax], al 2d: 48 89 ec mov rsp, rbp 30: 5d pop rbp 31: c3 ret 32: 48 89 ec mov rsp, rbp 35: 5d pop rbp 36: c3 ret
</details>
<details>
<summary>Equivalentbrz
+jump
</summary>clif:
function %brz(i32) { block0(v0: i32): brz v0, block1 jump block2 block1: return block2: return }
Assembly:
Disassembly of 22 bytes: 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 85 ff test edi, edi 6: 0f 85 05 00 00 00 jne 0x11 c: 48 89 ec mov rsp, rbp f: 5d pop rbp 10: c3 ret 11: 48 89 ec mov rsp, rbp 14: 5d pop rbp 15: c3 ret
</details>
Here are the savings (in bytes) for the various backends:
- x86: 33 bytes
- aarch64: 28 bytes
- s390x: 30 bytes
- riscv: 32 bytes
Implementation
Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.
Alternatives
We can always not do this optimization, I'm not sure how common these two kinds of
br_tables
are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.
afonso360 edited issue #5547:
:wave: Hey,
Feature
I had this idea when looking into #5508.
We can optimize
br_table
with 0 targets in the JumpTable into a single jump into the default block.
We can also optimize JumpTables with 1 target into abrz
+jump
which should still provide code size benefits.Benefit
Generated code size improvements. Even a small
br_table
is huge in terms of code size.Here's an example for the x86 backend.
<details>
<summary>br_table
with 1 target</summary>clif:
function %br_table_with_1_targets(i32) { jt0 = jump_table [block2] block0(v0: i32): br_table v0, block1, jt0 block1: return block2: return }
Assembly:
0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 83 ff 01 cmp edi, 1 7: 0f 83 20 00 00 00 jae 0x2d d: 8b d7 mov edx, edi f: b9 00 00 00 00 mov ecx, 0 14: 48 0f 43 d1 cmovae rdx, rcx 18: 48 8d 0d 0a 00 00 00 lea rcx, [rip + 0xa] 1f: 48 63 54 91 00 movsxd rdx, dword ptr [rcx + rdx*4] 24: 48 01 d1 add rcx, rdx 27: ff e1 jmp rcx 29: 09 00 or dword ptr [rax], eax 2b: 00 00 add byte ptr [rax], al 2d: 48 89 ec mov rsp, rbp 30: 5d pop rbp 31: c3 ret 32: 48 89 ec mov rsp, rbp 35: 5d pop rbp 36: c3 ret
</details>
<details>
<summary>Equivalentbrz
+jump
</summary>clif:
function %brz(i32) { block0(v0: i32): brz v0, block1 jump block2 block1: return block2: return }
Assembly:
Disassembly of 22 bytes: 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 85 ff test edi, edi 6: 0f 85 05 00 00 00 jne 0x11 c: 48 89 ec mov rsp, rbp f: 5d pop rbp 10: c3 ret 11: 48 89 ec mov rsp, rbp 14: 5d pop rbp 15: c3 ret
</details>
Here are the savings (in bytes) for the various backends:
- x86: 33 bytes
- aarch64: 28 bytes
- s390x: 30 bytes
- riscv: 32 bytes
Implementation
Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.
Alternatives
We can always not do this optimization, I'm not sure how common these two kinds of
br_tables
are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.
afonso360 edited issue #5547:
:wave: Hey,
Feature
I had this idea when looking into #5508.
We can optimize a
br_table
with 0 targets in the JumpTable into a single jump into the default block.
We can also optimizebr_table
's with 1 target into abrz
+jump
which should still provide code size benefits.Benefit
Generated code size improvements. Even a small
br_table
is huge in terms of code size.Here's an example for the x86 backend.
<details>
<summary>br_table
with 1 target</summary>clif:
function %br_table_with_1_targets(i32) { jt0 = jump_table [block2] block0(v0: i32): br_table v0, block1, jt0 block1: return block2: return }
Assembly:
0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 83 ff 01 cmp edi, 1 7: 0f 83 20 00 00 00 jae 0x2d d: 8b d7 mov edx, edi f: b9 00 00 00 00 mov ecx, 0 14: 48 0f 43 d1 cmovae rdx, rcx 18: 48 8d 0d 0a 00 00 00 lea rcx, [rip + 0xa] 1f: 48 63 54 91 00 movsxd rdx, dword ptr [rcx + rdx*4] 24: 48 01 d1 add rcx, rdx 27: ff e1 jmp rcx 29: 09 00 or dword ptr [rax], eax 2b: 00 00 add byte ptr [rax], al 2d: 48 89 ec mov rsp, rbp 30: 5d pop rbp 31: c3 ret 32: 48 89 ec mov rsp, rbp 35: 5d pop rbp 36: c3 ret
</details>
<details>
<summary>Equivalentbrz
+jump
</summary>clif:
function %brz(i32) { block0(v0: i32): brz v0, block1 jump block2 block1: return block2: return }
Assembly:
Disassembly of 22 bytes: 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 85 ff test edi, edi 6: 0f 85 05 00 00 00 jne 0x11 c: 48 89 ec mov rsp, rbp f: 5d pop rbp 10: c3 ret 11: 48 89 ec mov rsp, rbp 14: 5d pop rbp 15: c3 ret
</details>
Here are the savings (in bytes) for the various backends:
- x86: 33 bytes
- aarch64: 28 bytes
- s390x: 30 bytes
- riscv: 32 bytes
Implementation
Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.
Alternatives
We can always not do this optimization, I'm not sure how common these two kinds of
br_tables
are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.
bjorn3 commented on issue #5547:
For cg_clif br_table with less than 3 targets doesn't happen. It uses
cranelift_frontend::Switch
, which uses brz/brnz in case of two or less targets (excluding the default target)
afonso360 closed issue #5547:
:wave: Hey,
Feature
I had this idea when looking into #5508.
We can optimize a
br_table
with 0 targets in the JumpTable into a single jump into the default block.
We can also optimizebr_table
's with 1 target into abrz
+jump
which should still provide code size benefits.Benefit
Generated code size improvements. Even a small
br_table
is huge in terms of code size.Here's an example for the x86 backend.
<details>
<summary>br_table
with 1 target</summary>clif:
function %br_table_with_1_targets(i32) { jt0 = jump_table [block2] block0(v0: i32): br_table v0, block1, jt0 block1: return block2: return }
Assembly:
0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 83 ff 01 cmp edi, 1 7: 0f 83 20 00 00 00 jae 0x2d d: 8b d7 mov edx, edi f: b9 00 00 00 00 mov ecx, 0 14: 48 0f 43 d1 cmovae rdx, rcx 18: 48 8d 0d 0a 00 00 00 lea rcx, [rip + 0xa] 1f: 48 63 54 91 00 movsxd rdx, dword ptr [rcx + rdx*4] 24: 48 01 d1 add rcx, rdx 27: ff e1 jmp rcx 29: 09 00 or dword ptr [rax], eax 2b: 00 00 add byte ptr [rax], al 2d: 48 89 ec mov rsp, rbp 30: 5d pop rbp 31: c3 ret 32: 48 89 ec mov rsp, rbp 35: 5d pop rbp 36: c3 ret
</details>
<details>
<summary>Equivalentbrz
+jump
</summary>clif:
function %brz(i32) { block0(v0: i32): brz v0, block1 jump block2 block1: return block2: return }
Assembly:
Disassembly of 22 bytes: 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 85 ff test edi, edi 6: 0f 85 05 00 00 00 jne 0x11 c: 48 89 ec mov rsp, rbp f: 5d pop rbp 10: c3 ret 11: 48 89 ec mov rsp, rbp 14: 5d pop rbp 15: c3 ret
</details>
Here are the savings (in bytes) for the various backends:
- x86: 33 bytes
- aarch64: 28 bytes
- s390x: 30 bytes
- riscv: 32 bytes
Implementation
Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.
Alternatives
We can always not do this optimization, I'm not sure how common these two kinds of
br_tables
are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.
afonso360 commented on issue #5547:
Right, we have the Switch interface that does a bunch of those optimizations. I've also disassembled a few of the sightglass benchmarks and none of them have a
br_table
with fewer than 3 targets, so its probably not worth pursuing this optimization.
afonso360 closed issue #5547:
:wave: Hey,
Feature
I had this idea when looking into #5508.
We can optimize a
br_table
with 0 targets in the JumpTable into a single jump into the default block.
We can also optimizebr_table
's with 1 target into abrz
+jump
which should still provide code size benefits.Benefit
Generated code size improvements. Even a small
br_table
is huge in terms of code size.Here's an example for the x86 backend.
<details>
<summary>br_table
with 1 target</summary>clif:
function %br_table_with_1_targets(i32) { jt0 = jump_table [block2] block0(v0: i32): br_table v0, block1, jt0 block1: return block2: return }
Assembly:
0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 83 ff 01 cmp edi, 1 7: 0f 83 20 00 00 00 jae 0x2d d: 8b d7 mov edx, edi f: b9 00 00 00 00 mov ecx, 0 14: 48 0f 43 d1 cmovae rdx, rcx 18: 48 8d 0d 0a 00 00 00 lea rcx, [rip + 0xa] 1f: 48 63 54 91 00 movsxd rdx, dword ptr [rcx + rdx*4] 24: 48 01 d1 add rcx, rdx 27: ff e1 jmp rcx 29: 09 00 or dword ptr [rax], eax 2b: 00 00 add byte ptr [rax], al 2d: 48 89 ec mov rsp, rbp 30: 5d pop rbp 31: c3 ret 32: 48 89 ec mov rsp, rbp 35: 5d pop rbp 36: c3 ret
</details>
<details>
<summary>Equivalentbrz
+jump
</summary>clif:
function %brz(i32) { block0(v0: i32): brz v0, block1 jump block2 block1: return block2: return }
Assembly:
Disassembly of 22 bytes: 0: 55 push rbp 1: 48 89 e5 mov rbp, rsp 4: 85 ff test edi, edi 6: 0f 85 05 00 00 00 jne 0x11 c: 48 89 ec mov rsp, rbp f: 5d pop rbp 10: c3 ret 11: 48 89 ec mov rsp, rbp 14: 5d pop rbp 15: c3 ret
</details>
Here are the savings (in bytes) for the various backends:
- x86: 33 bytes
- aarch64: 28 bytes
- s390x: 30 bytes
- riscv: 32 bytes
Implementation
Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.
Alternatives
We can always not do this optimization, I'm not sure how common these two kinds of
br_tables
are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.
Last updated: Nov 22 2024 at 17:03 UTC