mchesser commented on issue #4636:
I had intended to write a follow up comment to discuss this issue - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers).
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
mchesser edited a comment on issue #4636:
I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers).
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
mchesser edited a comment on issue #4636:
I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained for the incoming edge.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
mchesser edited a comment on issue #4636:
I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
mchesser edited a comment on issue #4636:
I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure that the values.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
mchesser edited a comment on issue #4636:
I had intended to write a follow up comment to discuss this issue (but forgot) - here it is now:
There are two main uses of cold blocks I can think of:
Cases where there is code that needs to be executed in rare circumstances (e.g., error handlers), but the results of which are effectively unused by the hot path.
Situations where a value is obtained via two (or more) different paths of code, one significantly rarer than the other (in my case, this corresponds to a fast path from a cache hit, and a slow path that needs to recompute the value).
These have different behaviour with respect to the edge blocks they generate (Note: overall Cranelift seems to do a decent job at avoiding extra edge blocks in many cases):
In 1., the cold block might be shared (e.g., a common return for all error cases), and so might require an edge block along the incoming edge. However, simple cases may never need to jump back to a shared block, so don’t require an edge-block on outgoing edges.
In 2., the cold block only as a single incoming edge so doesn’t require any moves to occur along the incoming edge (at least as well as I can understand). However, since the cold path rejoins the hot path, it frequently requires an edge block on the outgoing edge to ensure the values are in the same place as the hot path.
Admittedly, I was only thinking of 2. when I changed the heuristic (to attempt to solve the issue below), so my intuition was that there would never be an edge-block that contained any move operations for incoming edges to cold blocks.
An issue I observed (as a result of the existing cold-blocks test:
isa::x64::test::test_cold_blocks
), is that there is a subtle interaction of this change with trivial edges (i.e., edges without any moves) that occur when the cold path is reached as part of an untaken branch.e.g., if we have:
block0: brnz v1, block2 jump block1 block1 cold: ; .... block2: ; ...
If block1 is shared, then Cranelift seems to generate something equivalent to the following:
block0: brnz v1, block2 jump block0_to_block1_edge block0_to_block1_edge: jump block1 block1 cold: ; .... block2: ; ...
Normally the redundant
block0_to_block1_edge
jump is eliminated because it acts as a "fallthrough" to the next block, however if it gets marked as cold then it is no longer a fallthrough block and we end up with a redundant jump, e.g., see the branch at 0x23 below:.00: push rbp .01: mov rbp, rsp .04: add edi, 0x1234 .0a: test edi, edi .0c: je 0x2e .12: mov r8, rdi .15: mov rax, rdi .18: sub eax, 0x1234 .1e: sub eax, r8d .21: test edi, edi .23: jne 0x46 .29: mov rsp, rbp .2c: pop rbp .2d: ret .2e: mov r8, rdi .31: add r8d, 0x1234 .38: test r8d, r8d .3b: jne 0x2e .41: jmp 0x15 .46: jmp 0x2e
I originally slightly misunderstood what was going on and believe that removing the heuristic that propagated the cold annotation to incoming edges would fix the redundant branch (in combination with my assumption that all edges to cold blocks would not contain any moves). However, this is not really the case, and the same issue can crop up in the cold->hot edge case as well.
In summary: I think it makes sense to re-enable the heuristic for edges with cold blocks as successors, however in both cases we can end up with a redundant edge in some cases.
cfallin commented on issue #4636:
OK, I see; it would be helpful to see the actual CLIF that generates the disassembly (the edge block 0x46 being placed after the block it leads into at 0x2e is very surprising; RPO sort should place preds to the block before the block itself, absent any cycles among the cold blocks, and assuming the pred edge block and main cold block are both marked cold and sunk accordingly). If you want to dig into it further, in addition to the RPO sort you see in blockorder.rs, this code does the actual sinking at the last second, during emission (we do it this way because we need to generate vcode in postorder to see uses before defs).
In the meantime I'd be happy to merge this with the heuristic returned to "edge is cold if pred or succ is cold"; that seems pretty reasonable and robust to me.
mchesser commented on issue #4636:
I've reverted the commit back to the "edge is cold if pred or succ is cold" heuristic.
The assembly code came from the test here: https://github.com/bytecodealliance/wasmtime/blob/0b1f51f80427a79fbd7d77849f7e2301a1d128f5/cranelift/codegen/src/isa/x64/mod.rs#L219
I translated it to CLIF:
function %test_cold(i32) -> i32 { block0(v10: i32): v0 = iconst.i32 0x1234 v1 = iadd.i32 v10, v0 brnz v1, block1(v1) jump block2 block1(v20: i32): v2 = isub.i32 v1, v0 v3 = iadd.i32 v2, v20 brnz v1, block2 jump block3(v3) block2 cold: v4 = iadd.i32 v1, v0 brnz v4, block2 jump block1(v4) block3(v30: i32): return v30 }
Looks like there is a cycle.
cfallin commented on issue #4636:
@mchesser I fixed the redundant-jump issue over in #4652 just now -- thanks for noticing this!
Last updated: Dec 23 2024 at 12:05 UTC