wasmtime / issue #7047 riscv64: Cleanup trap handling · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #7047 riscv64: Cleanup trap handling

Wasmtime GitHub notifications bot (Sep 16 2023 at 09:36):

Looks like on the test filetests/filetests/runtests/issue5569.clif we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.

I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.

This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.

Wasmtime GitHub notifications bot (Sep 18 2023 at 10:20):

afonso360 edited a comment on issue #7047:

Looks like on the test filetests/filetests/runtests/issue5569.clif we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.

I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.

This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.

Is there an easy way around this, or should I keep using inline traps for now and fix deferred traps later?

Wasmtime GitHub notifications bot (Sep 18 2023 at 10:20):

afonso360 edited a comment on issue #7047:

Looks like on the test filetests/filetests/runtests/issue5569.clif we have a single block that produces about 8KiB of code, and the branch that we use for these traps only has a range of +/-4KiB, so we run into this assert.

I thought we would emit islands in the middle of a block if we needed to, but that appears not to be the case. I'm also not sure what the best solution here is, checking if we need an island at every instruction seems quite heavy handed.

This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.

Is there an easy way around this, or should I keep inline traps for now and fix deferred traps later?

Wasmtime GitHub notifications bot (Sep 18 2023 at 14:57):

alexcrichton commented on issue #7047:

Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?

This is also going to get worse when I introduced compressed conditional branches that only have +/- 256 bytes of range.

This may not actually work well with Wasmtime today unfortunately. As I'm sure you've seen we aren't guaranteed to know the branch target at emission time meaning that we can't use the current scheme you have for compressed instructions. It's technically possible to do it sort of but it means that it could make things worse since a 256 byte jump would run the risk of often requiring a veneer to a larger jump which means that what could have been a single jump would be veneer'd to two jumps.

When I was reading over the RISC-V documentation and intro and such I got the impression that the intention of the compressed instructions was to do what you've design today which is to emit everything and then compress it all after-the fact. The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today. (e.g. MachBuffer isn't able to implement linker relaxation where all relocations are compressed to their smallest size). Not to say we couldn't ever implement it though, it's probably a chunk of work though.

Wasmtime GitHub notifications bot (Sep 18 2023 at 15:08):

cfallin commented on issue #7047:

Chiming in as well on branch ranges to say: two basic design tenets of MachBuffer's approach were:

We only emit islands between blocks, where we don't need to munge control flow to jump around them (on some backends then, e.g. aarch64, we do it for inline br_table tables too, since they're unbounded-ish, but that's morally "between blocks" as well even though it's within a pseudoinst)

We never backpatch code in a way that can expand size, because this requires a fixpoint loop in the limit and that could degrade to be very slow (imagine expanding a compressed jump to jump pushes a bunch of other branches out of range)

The former should be possible to relax, since the do-we-need-an-island check is cheap (just a compare to a deadline). We'd need to generate a jump beforehand that jumps around the island (and that jump should be longer-range!) If it actually happens it could be pretty undesirable from a performance standpoint -- the extra jump and break in fetch bandwidth could hurt in an inner loop -- so we'd want to do it only rarely (i.e., not optimistically pick the 256-byte-range jumps).

That should handle the long-basic-blocks problem re: 4K-range branches. It seems like the right way to handle the compressed-jump question is indeed a relaxation/shrinkage approach (which can't push other branches of out range, so you can do it without a fixpoint, but you might want to because other branches might become shrinkable too), but probably as a post-pass on the machine code. We'd need to keep all label fixups around to enable that as well. Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).

Wasmtime GitHub notifications bot (Sep 18 2023 at 18:52):

afonso360 commented on issue #7047:

Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?

As far as I can tell worst_case_size is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.

The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.

I think I'm going to back out of the deferred_trap portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.

The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.

Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).

Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it.

Wasmtime GitHub notifications bot (Sep 18 2023 at 18:53):

afonso360 edited a comment on issue #7047:

Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?

As far as I can tell worst_case_size is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.

The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.

I think I'm going to back out of the deferred_trap portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.

The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.

Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).

Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it. We could probably get some good relaxation gains that way.

Wasmtime GitHub notifications bot (Sep 18 2023 at 18:55):

afonso360 edited a comment on issue #7047:

Ah yes you're right that islands are emitted once per block which is based on I::worst_case_size(). Is this perhaps a case where the worse_case_size for RISC-V isn't accurate? Or perhaps the block is so large that a jump in the middle of the block can't jump out of the block?

As far as I can tell worst_case_size is fairly accurate, It's currently at 124 bytes. The test for it points to the maximum being 116, so it's not too far off.

The block being too large seems to me the likely cause. The logs show that the full block is 8KiB, and we have a TrapIf about 1KiB into the block. That means that the closest island is 7KiB away, and the CondBr Instruction can't reach that.

I think I'm going to back out of the defer_trap portion of this PR and land that separately, since it looks like we might need to make some changes to island checking for that to work properly with our limited branch range.

The linker, however, seemed like it would be the one responsible for compressing RISC-V jumps which I'm not sure MachBuffer is positioned to do today.

Maybe best to defer to a linker in e.g. Cranelift-as-AOT-rustc-backend cases, and omit this where compile times are more important (Cranelift-in-Wasmtime).

Yeah, I agree with that. I'm not sure I want to go full relaxation on MachBuffer yet, but a good middle ground would probably be expose all label usages as unresolved relocations, and let the external linker deal with it. We could probably get some good relaxation gains that way.

Last updated: Apr 17 2025 at 23:03 UTC