afonso360 commented on issue #7047:
Looks like on the test
filetests/filetests/runtests/issue5569.clif
we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.
This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.
afonso360 edited a comment on issue #7047:
Looks like on the test
filetests/filetests/runtests/issue5569.clif
we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.
This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.
Is there an easy way around this, or should I keep using inline traps for now and fix deferred traps later?
afonso360 edited a comment on issue #7047:
Looks like on the test
filetests/filetests/runtests/issue5569.clif
we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.
This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.
Is there an easy way around this, or should I keep inline traps for now and fix deferred traps later?
alexcrichton commented on issue #7047:
Ah yes you're right that islands are emitted once per block which is based on
I::worst_case_size()
. Is this perhaps a case where theworse_case_size
for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.
This may not actually work well with Wasmtime today unfortunately. As I'm sure you've seen we aren't guaranteed to know the branch target at emission time meaning that we can't use the current scheme you have for compressed instructions. It's technically possible to do it sort of but it means that it could make things worse since a 256 byte jump would run the risk of often requiring a veneer to a larger jump which means that what could have been a single jump would be veneer'd to two jumps.
When I was reading over the RISC-V documentation and intro and such I got the impression that the intention of the compressed instructions was to do what you've design today which is to emit everything and then compress it all after-the fact. The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure
MachBuffer
is positioned to do today. (e.g.MachBuffer
isn't able to implement linker relaxation where all relocations are compressed to their smallest size). Not to say we couldn't ever implement it though, it's probably a chunk of work though.
cfallin commented on issue #7047:
Chiming in as well on branch ranges to say: two basic design tenets of MachBuffer's approach were:
- We only emit islands between blocks, where we don't need to munge control flow to jump around them (on some backends then, e.g. aarch64, we do it for inline
br_table
tables too, since they're unbounded-ish, but that's morally "between blocks" as well even though it's within a pseudoinst)- We never backpatch code in a way that can expand size, because this requires a fixpoint loop in the limit and that could degrade to be very slow (imagine expanding a compressed jump to jump pushes a bunch of other branches out of range)
The former should be possible to relax, since the do-we-need-an-island check is cheap (just a compare to a deadline). We'd need to generate a jump beforehand that jumps around the island (and that jump should be longer-range!) If it actually happens it could be pretty undesirable from a performance standpoint -- the extra jump and break in fetch bandwidth could hurt in an inner loop -- so we'd want to do it only rarely (i.e., not optimistically pick the 256-byte-range jumps).
That should handle the long-basic-blocks problem re: 4K-range branches. It seems like the right way to handle the compressed-jump question is indeed a relaxation/shrinkage approach (which can't push other branches of out range, so you can do it without a fixpoint, but you might want to because other branches might become shrinkable too), but probably as a post-pass on the machine code. We'd need to keep all label fixups around to enable that as well. Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).
afonso360 commented on issue #7047:
Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?
As far as I can tell
worst_case_size
is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.
I think I'm going to back out of the
deferred_trap
portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.
The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.
Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).
Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it.
afonso360 edited a comment on issue #7047:
Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?
As far as I can tell
worst_case_size
is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.
I think I'm going to back out of the
deferred_trap
portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.
The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.
Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).
Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it. We could probably get some good relaxation gains that way.
afonso360 edited a comment on issue #7047:
Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?
As far as I can tell
worst_case_size
is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.
I think I'm going to back out of the
defer_trap
portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.
The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.
Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).
Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it. We could probably get some good relaxation gains that way.
Last updated: Nov 22 2024 at 17:03 UTC