mchesser opened issue #4635:
During lowering, Cranelift may generate additional blocks to handle moves that conceptually occur on edges. The block-ordering code treats these edge blocks the same as ordinary blocks, irrespective of whether they are associated with a cold block. This means that edges from a cold block can cause additional code to be inserted into the hot path.
For example, consider the following IL:
```
function %test_cold(i64, i64, i64, i64) -> i64 {
block0(v0: i64, v1: i64, v2: i64, v3: i64):
    brz v0, block2
    jump block1

block1:
    v10 = iadd.i64 v0, v2
    jump block3(v10)

block2 cold:
    v20 = iadd.i64 v1, v3
    brnz v20, block3(v20)
    jump block4(v3)

block3(v30: i64):
    v34 = iadd.i64 v30, v1
    jump block4(v34)

block4(v40: i64):
    return v40
}
```
Currently Cranelift emits the following (annotated x86-64 code):
```asm
.block0:
    push rbp
    mov rbp, rsp
    test rdi, rdi
    je 0x28 ; .block2
    jmp 0x1a ; .block1
.block2_to_block4_edge:
    mov rdi, rcx
    jmp 0x20 ; .block4
.block1:
    add rdi, rdx
.block3:
    add rdi, rsi
.block4:
    mov rax, rdi
    mov rsp, rbp
    pop rbp
    ret
.block2:
    mov rdi, rsi
    add rdi, rcx
    test rdi, rdi
    jne 0x1d ; .block3
    jmp 0x12 ; .block2_to_block4_edge
```
Modifying the lowering stage to mark edge blocks as cold if either the predecessor or successor block is cold (#4636) improves code generation:
```asm
.block0:
    push rbp
    mov rbp, rsp
    test rdi, rdi
    je 0x1b ; .block2
.block1:
    add rdi, rdx
.block3:
    add rdi, rsi
.block4:
    mov rax, rdi
    mov rsp, rbp
    pop rbp
    ret
.block2:
    mov rdi, rsi
    add rdi, rcx
    test rdi, rdi
    jne 0x10 ; .block3
.block2_to_block4_edge:
    mov rdi, rcx
    jmp 0x13 ; .block4
```
This provided a significant runtime speedup (~11%) for my code (which is severely frontend bound and heavily uses cold annotations).
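The fix reduces to a single rule in the block-ordering pass: a lowered edge block is cold if either of its endpoints is cold. Below is a minimal Rust sketch of that rule, using illustrative types and names rather than Cranelift's actual internals:

```rust
/// Identifies a lowered block: either an original CLIF block, or a
/// synthesized edge block carrying the moves for a critical edge.
/// (Illustrative only; not Cranelift's real data structures.)
enum LoweredBlock {
    Orig { block: u32 },
    CriticalEdge { pred: u32, succ: u32 },
}

impl LoweredBlock {
    /// `is_block_cold` stands in for a lookup of the per-block `cold`
    /// flag stored in the CLIF layout.
    fn is_cold(&self, is_block_cold: &dyn Fn(u32) -> bool) -> bool {
        match self {
            LoweredBlock::Orig { block } => is_block_cold(*block),
            // The fix: an edge block is cold if *either* endpoint is
            // cold, since it can only execute when control passes
            // through both the predecessor and the successor.
            LoweredBlock::CriticalEdge { pred, succ } => {
                is_block_cold(*pred) || is_block_cold(*succ)
            }
        }
    }
}

fn main() {
    // Toy layout mirroring the example: block2 is the only cold block.
    let cold = |b: u32| b == 2;

    // The block2->block4 edge block now inherits coldness, so block
    // ordering sinks it with the other cold blocks.
    let edge = LoweredBlock::CriticalEdge { pred: 2, succ: 4 };
    assert!(edge.is_cold(&cold));

    // Ordinary hot blocks are unaffected.
    let hot = LoweredBlock::Orig { block: 1 };
    assert!(!hot.is_cold(&cold));
}
```

With this rule in place, the block2-to-block4 edge block sorts after the hot-path blocks, matching the improved listing above.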
mchesser labeled issue #4635.
mchesser edited issue #4635.
cfallin closed issue #4635.
Last updated: Dec 23 2024 at 12:05 UTC