Stream: git-wasmtime

Topic: wasmtime / issue #4636 Treat edge blocks as cold if they ...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 00:42):

mchesser commented on issue #4636:

I had intended to write a follow up comment to discuss this issue - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers).

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 00:42):

mchesser edited a comment on issue #4636:

I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers).

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 00:47):

mchesser edited a comment on issue #4636:

I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 00:48):

mchesser edited a comment on issue #4636:

I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 00:49):

mchesser edited a comment on issue #4636:

I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 01:01):

mchesser edited a comment on issue #4636:

I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:

There are two main uses of cold blocks I can think of:

  1. Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.

  2. Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).

These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):

In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.

In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure the values are in the same place as the hot path.

Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.

An issue I observed (as a result of the existing cold-blocks test: isa::x64::test::test_cold_blocks), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.

e.g., if we have:

block0:
    brnz v1, block2
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

If block1 is shared, then Cranelift seems to generate something equivalent to the following:

block0:
    brnz v1, block2
    jump block0_to_block1_edge

block0_to_block1_edge:
    jump block1

block1 cold:
    ; ....

block2:
    ; ...

Normally the redundant block0_to_block1_edge jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:

.00:  push    rbp
.01:  mov     rbp, rsp
.04:  add     edi, 0x1234
.0a:  test    edi, edi
.0c:  je      0x2e

.12:  mov     r8, rdi
.15:  mov     rax, rdi
.18:  sub     eax, 0x1234
.1e:  sub     eax, r8d
.21:  test    edi, edi
.23:  jne     0x46

.29:  mov     rsp, rbp
.2c:  pop     rbp
.2d:  ret

.2e:  mov     r8, rdi
.31:  add     r8d, 0x1234
.38:  test    r8d, r8d
.3b:  jne     0x2e
.41:  jmp     0x15

.46:  jmp     0x2e

I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.

In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 04:18):

cfallin commented on issue #4636:

OK, I see; it would be helpful to see the actual CLIF that generates the disassembly (the edge block 0x46 being placed after the block it leads into at 0x2e is very surprising; RPO sort should place preds to the block before the block itself, absent any cycles among the cold blocks, and assuming the pred edge block and main cold block are both marked cold and sunk accordingly). If you want to dig into it further, in addition to the RPO sort you see in blockorder.rs, this code does the actual sinking at the last second, during emission (we do it this way because we need to generate vcode in postorder to see uses before defs).

In the meantime I'd be happy to merge this with the heuristic returned to "edge is cold if pred or succ is cold"; that seems pretty reasonable and robust to me.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 04:38):

mchesser commented on issue #4636:

I've reverted the commit back to the "edge is cold if pred or succ is cold" heuristic.


The assembly code came from the test here: https://github.com/bytecodealliance/wasmtime/blob/0b1f51f80427a79fbd7d77849f7e2301a1d128f5/cranelift/codegen/src/isa/x64/mod.rs#L219

I translated it to CLIF:

function %test_cold(i32) -> i32 {
block0(v10: i32):
    v0 = iconst.i32 0x1234
    v1 = iadd.i32 v10, v0
    brnz v1, block1(v1)
    jump block2

block1(v20: i32):
    v2 = isub.i32 v1, v0
    v3 = iadd.i32 v2, v20
    brnz v1, block2
    jump block3(v3)

block2 cold:
    v4 = iadd.i32 v1, v0
    brnz v4, block2
    jump block1(v4)

block3(v30: i32):
    return v30
}

Looks like there is a cycle.

view this post on Zulip Wasmtime GitHub notifications bot (Aug 09 2022 at 05:09):

cfallin commented on issue #4636:

@mchesser I fixed the redundant-jump issue over in #4652 just now -- thanks for noticing this!


Last updated: Nov 22 2024 at 16:03 UTC