wasmtime / issue #5547 cranelift: Optimize `br_table`'s w... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #5547 cranelift: Optimize `br_table`'s w...

Wasmtime GitHub notifications bot (Jan 07 2023 at 15:31):

afonso360 opened issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

We can optimize br_table with 0 targets in the JumpTable into a single jump into the default block.
We can also optimize JumpTables with 1 target into a brz + jump which should still provide code size benefits.

Benefit

Generated code size improvements.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

x86: 33 bytes
aarch64: 28 bytes
s390x: 30 bytes
riscv: 32 bytes

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

We can always not do this optimization, I'm not sure how common these two kinds of br_tables are. We can also perform these optimizations in each backend, but that seems to be duplicating some effort.

Wasmtime GitHub notifications bot (Jan 07 2023 at 15:32):

afonso360 edited issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

Benefit

Generated code size improvements. Even a small br_table is huge in terms of code size.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

x86: 33 bytes
aarch64: 28 bytes
s390x: 30 bytes
riscv: 32 bytes

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

Wasmtime GitHub notifications bot (Jan 07 2023 at 15:33):

afonso360 edited issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

We can optimize a br_table with 0 targets in the JumpTable into a single jump into the default block.
We can also optimize br_table's with 1 target into a brz + jump which should still provide code size benefits.

Benefit

Generated code size improvements. Even a small br_table is huge in terms of code size.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

x86: 33 bytes
aarch64: 28 bytes
s390x: 30 bytes
riscv: 32 bytes

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

Wasmtime GitHub notifications bot (Jan 07 2023 at 15:34):

bjorn3 commented on issue #5547:

For cg_clif br_table with less than 3 targets doesn't happen. It uses cranelift_frontend::Switch, which uses brz/brnz in case of two or less targets (excluding the default target)

Wasmtime GitHub notifications bot (Jan 07 2023 at 16:07):

afonso360 closed issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

Benefit

Generated code size improvements. Even a small br_table is huge in terms of code size.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

x86: 33 bytes
aarch64: 28 bytes
s390x: 30 bytes
riscv: 32 bytes

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

Wasmtime GitHub notifications bot (Jan 07 2023 at 16:07):

afonso360 commented on issue #5547:

Right, we have the Switch interface that does a bunch of those optimizations. I've also disassembled a few of the sightglass benchmarks and none of them have a br_table with fewer than 3 targets, so its probably not worth pursuing this optimization.

Wasmtime GitHub notifications bot (Jan 07 2023 at 16:07):

afonso360 closed issue #5547:

:wave: Hey,

Feature

I had this idea when looking into #5508.

Benefit

Generated code size improvements. Even a small br_table is huge in terms of code size.

Here's an example for the x86 backend.

<details>
<summary>br_table with 1 target</summary>

clif:

function %br_table_with_1_targets(i32) {
    jt0 = jump_table [block2]

block0(v0: i32):
    br_table v0, block1, jt0

block1:
    return

block2:
    return
}

Assembly:

   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   83 ff 01                cmp     edi, 1
   7:   0f 83 20 00 00 00       jae     0x2d
   d:   8b d7                   mov     edx, edi
   f:   b9 00 00 00 00          mov     ecx, 0
  14:   48 0f 43 d1             cmovae  rdx, rcx
  18:   48 8d 0d 0a 00 00 00    lea     rcx, [rip + 0xa]
  1f:   48 63 54 91 00          movsxd  rdx, dword ptr [rcx + rdx*4]
  24:   48 01 d1                add     rcx, rdx
  27:   ff e1                   jmp     rcx
  29:   09 00                   or      dword ptr [rax], eax
  2b:   00 00                   add     byte ptr [rax], al
  2d:   48 89 ec                mov     rsp, rbp
  30:   5d                      pop     rbp
  31:   c3                      ret
  32:   48 89 ec                mov     rsp, rbp
  35:   5d                      pop     rbp
  36:   c3                      ret

</details>

<details>
<summary>Equivalent brz+jump</summary>

clif:

function %brz(i32) {
block0(v0: i32):
    brz v0, block1
    jump block2

block1:
    return

block2:
    return
}

Assembly:

Disassembly of 22 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   85 ff                   test    edi, edi
   6:   0f 85 05 00 00 00       jne     0x11
   c:   48 89 ec                mov     rsp, rbp
   f:   5d                      pop     rbp
  10:   c3                      ret
  11:   48 89 ec                mov     rsp, rbp
  14:   5d                      pop     rbp
  15:   c3                      ret

</details>

Here are the savings (in bytes) for the various backends:

x86: 33 bytes
aarch64: 28 bytes
s390x: 30 bytes
riscv: 32 bytes

Implementation

Can we write this as an egraph rule? Otherwise we probably need to use a custom pass.

Alternatives

Last updated: Apr 18 2025 at 19:03 UTC