Stream: git-wasmtime

Topic: wasmtime / issue #5547 cranelift: Optimize `br_table`'s w...


view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 15:31):

afonso360 opened issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

We can optimize a br_table with 0 targets in the JumpTable into a single jump to the default block.
We can also optimize br_tables with 1 target into a brz + jump, which should still provide code-size benefits.
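The two rewrites above can be sketched on a toy terminator type. This is not Cranelift's actual IR or API, just the shape of the transformation: a br_table jumps to `targets[index]` when the index is in range and to the default block otherwise.

```rust
// Toy model of the proposed rewrite (NOT Cranelift's real IR/API).
#[derive(Debug, PartialEq)]
enum Terminator {
    BrTable { index_val: u32, targets: Vec<u32>, default: u32 },
    Jump { target: u32 },
    // A brz (branch if the condition value is zero) followed by an
    // unconditional jump for the non-zero case.
    BrzThenJump { cond_val: u32, if_zero: u32, fallthrough: u32 },
}

fn simplify(t: Terminator) -> Terminator {
    match t {
        // 0 table entries: every index takes the default, so just jump.
        Terminator::BrTable { targets, default, .. } if targets.is_empty() => {
            Terminator::Jump { target: default }
        }
        // 1 table entry: index == 0 goes to targets[0], anything else to
        // the default -- exactly a brz plus a jump.
        Terminator::BrTable { index_val, targets, default } if targets.len() == 1 => {
            Terminator::BrzThenJump {
                cond_val: index_val,
                if_zero: targets[0],
                fallthrough: default,
            }
        }
        other => other,
    }
}
```

Note how the single-entry case lines up with the CLIF example below: the one table target becomes the brz destination and the br_table default becomes the fallthrough jump.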

Benefit

Generated code size improvements. Even a small br_table is huge in terms of code size.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

We can always choose not to do this optimization; I'm not sure how common these two kinds of br_table are. We could also perform these optimizations in each backend, but that seems to duplicate effort.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 15:32):

afonso360 edited issue #5547.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 15:33):

afonso360 edited issue #5547.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 15:34):

bjorn3 commented on issue #5547:

For cg_clif, a br_table with fewer than 3 targets doesn't happen: it uses cranelift_frontend::Switch, which emits brz/brnz when there are two or fewer targets (excluding the default target).
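The Switch lowering strategy bjorn3 describes can be illustrated with a toy model; this is not the real cranelift_frontend::Switch API. With two or fewer arms there is no jump table at all: each arm becomes an equality test plus a conditional branch (brz/brnz in CLIF terms), falling through to the default.

```rust
// Toy illustration of small-switch lowering (NOT the real
// cranelift_frontend::Switch API). Each comparison stands in for one
// conditional branch in the emitted code, instead of an indirect jump
// through a table.
fn lower_small_switch(
    value: u32,
    arms: &[(u32, &'static str)],
    default: &'static str,
) -> &'static str {
    assert!(arms.len() <= 2, "toy lowering: only handles small switches");
    for &(case, block) in arms {
        if value == case {
            return block;
        }
    }
    // No arm matched: fall through to the default block.
    default
}
```

With at most two comparisons this is already smaller than the table-based sequence shown in the issue, which is why a frontend that routes everything through Switch never produces the degenerate br_table cases.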

view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 16:07):

afonso360 closed issue #5547.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 07 2023 at 16:07):

afonso360 commented on issue #5547:

Right, we have the Switch interface that does a bunch of those optimizations. I've also disassembled a few of the sightglass benchmarks and none of them have a br_table with fewer than 3 targets, so it's probably not worth pursuing this optimization.


Last updated: Nov 22 2024 at 17:03 UTC