wasmtime / PR #3254 Use relative `call` instructions betw... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #3254 Use relative `call` instructions betw...

Wasmtime GitHub notifications bot (Aug 26 2021 at 22:12):

alexcrichton opened PR #3254 from call-relative to main:

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:

Function calls are always indirect which means they are more difficult
to predict, resulting in worse performance.

Generating a relocation-per-function call requires expensive
relocation resolution at module-load time, which can be a large
contributing factor to how long it takes to load a precompiled module.

To fix these issues, while also compromising on the previously simple
implementation technique, this commit switches wasm calls within a
module to using the colocated flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.

When switching the colocated flag to true this commit is also then
able to move much of the relocation resolution from wasmtime_jit::link
into wasmtime_cranelift::obj during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.

The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from the bl instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the MachBuffer type. When
generating a function relative jumps between basic blocks have similar
restrictions. At this time, however, I decided to not try to reuse
MachBuffer for inter-function relocations. The main reason for this is
that MachBuffer doesn't actually handle any of the cases that
inter-function relocations need to handle, namely the 26-bit relocation
and 32-bit relocations of AArch64 and x86_64. If a 26-bit relocation
isn't resolvable because a function gets too large then MachBuffer
will simply panic. I also opted to not use MachBuffer for now, though,
because it doesn't quite have the infrastructure already set up where
when inserting an island for a smaller jump sequence the result should
be a relocation to fill in later. Today it simply promotes a 19-bit
branch on AArch64 to a 26-bit branch, assuming the 26 bits is enough.
All-in-all I felt that at this time MachBuffer wasn't reusable enough
and would need enough work that it was probably more worthwhile to do
this in wasmtime-cranelift first and looking to unify the strategies
in the future.

For these reasons the wasmtime_cranelift::obj module has grown in
complexity quite a bit. This now entirely handles relative relocations
between functions, automatically inserting "jump veneers" to resolve
calls between functions that are too far away. The general strategy is
similar to MachBuffer, but different in the final tracking of
relocations. The current assumption is that each TargetIsa has the
ability to generate a fixed "jump veneer" which internally has an 8-byte
immediate value that is added to its own address to reach the
destination of an indirect call. This relative jump, even in the veneer,
means that when veneers are used we still don't need absolute
relocations since the code is still all implemented as relative jumps.
The veneer jumps, however, are larger in code size because they don't
fit into the native versions.

I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.

The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.

Wasmtime GitHub notifications bot (Aug 26 2021 at 22:12):

alexcrichton requested cfallin for a review on PR #3254.

Wasmtime GitHub notifications bot (Aug 26 2021 at 23:03):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (Aug 26 2021 at 23:03):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (Aug 26 2021 at 23:03):

akirilov-arm created PR review comment:

This should be a load literal instruction, i.e. ldr x16, 16, and you should schedule it before the adr (whose offset would also be adjusted).

Wasmtime GitHub notifications bot (Aug 26 2021 at 23:03):

akirilov-arm created PR review comment:

Typo - generates.

Wasmtime GitHub notifications bot (Aug 26 2021 at 23:03):

akirilov-arm created PR review comment:

x16 should be x17, shouldn't it?

Wasmtime GitHub notifications bot (Aug 27 2021 at 00:00):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 01:16):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 01:16):

akirilov-arm submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 01:16):

akirilov-arm created PR review comment:

Ditto.

Wasmtime GitHub notifications bot (Aug 27 2021 at 01:16):

akirilov-arm created PR review comment:

adr x17, 12

Wasmtime GitHub notifications bot (Aug 27 2021 at 01:16):

akirilov-arm created PR review comment:

Has this been added by mistake?

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:11):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:11):

bjorn3 created PR review comment:

Using a markdown code block like done here makes sense to me.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:12):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:12):

bjorn3 created PR review comment:

/// This generates:

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:19):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:19):

bjorn3 created PR review comment:

        if name == "wasmtime_linkopt_padding_between_functions" {
            self.linkopts.padding_between_functions = value.parse()?;
            return Ok(());
        }
        if name == "wasmtime_linkopt_force_jump_veneer" {

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:19):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:19):

bjorn3 created PR review comment:

            self.linkopts.force_jump_veneers = value.parse().unwrap();
to prevent silently ignoring invalid values.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:19):

bjorn3 edited PR review comment.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:21):

bjorn3 created PR review comment:

    /// Packed form of windows unwind tables which, if present, will get emited

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:21):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:27):

bjorn3 created PR review comment:

Maybe make this ud2 on x86?

Wasmtime GitHub notifications bot (Aug 27 2021 at 11:27):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 15:37):

alexcrichton edited PR #3254 from call-relative to main:

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:

Function calls are always indirect which means they are more difficult
to predict, resulting in worse performance.

Generating a relocation-per-function call requires expensive
relocation resolution at module-load time, which can be a large
contributing factor to how long it takes to load a precompiled module.

To fix these issues, while also compromising on the previously simple
implementation technique, this commit switches wasm calls within a
module to using the colocated flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.

When switching the colocated flag to true this commit is also then
able to move much of the relocation resolution from wasmtime_jit::link
into wasmtime_cranelift::obj during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.

The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from the bl instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the MachBuffer type. When
generating a function relative jumps between basic blocks have similar
restrictions. At this time, however, I decided to not try to reuse
MachBuffer for inter-function relocations. The main reason for this is
that MachBuffer doesn't actually handle any of the cases that
inter-function relocations need to handle, namely the 26-bit relocation
and 32-bit relocations of AArch64 and x86_64. If a 26-bit relocation
isn't resolvable because a function gets too large then MachBuffer
will simply panic. I also opted to not use MachBuffer for now, though,
because it doesn't quite have the infrastructure already set up where
when inserting an island for a smaller jump sequence the result should
be a relocation to fill in later. Today it simply promotes a 19-bit
branch on AArch64 to a 26-bit branch, assuming the 26 bits is enough.
All-in-all I felt that at this time MachBuffer wasn't reusable enough
and would need enough work that it was probably more worthwhile to do
this in wasmtime-cranelift first and looking to unify the strategies
in the future.

For these reasons the wasmtime_cranelift::obj module has grown in
complexity quite a bit. This now entirely handles relative relocations
between functions, automatically inserting "jump veneers" to resolve
calls between functions that are too far away. The general strategy is
similar to MachBuffer, but different in the final tracking of
relocations. The current assumption is that each TargetIsa has the
ability to generate a fixed "jump veneer" which internally has an 8-byte
immediate value that is added to its own address to reach the
destination of an indirect call. This relative jump, even in the veneer,
means that when veneers are used we still don't need absolute
relocations since the code is still all implemented as relative jumps.
The veneer jumps, however, are larger in code size because they don't
fit into the native versions.

I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.

The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.

cc #3230

Wasmtime GitHub notifications bot (Aug 27 2021 at 16:20):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 16:20):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Aug 27 2021 at 16:20):

alexcrichton created PR review comment:

Nah this is because rustdoc tries to test all code blocks as Rust code so I needed to add an annotation that it's not actually Rust code.

Wasmtime GitHub notifications bot (Aug 27 2021 at 17:21):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 18:58):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 19:01):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 19:12):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 20:28):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 21:21):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 21:41):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 27 2021 at 23:12):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 28 2021 at 00:59):

alexcrichton updated PR #3254 from call-relative to main.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

Two thoughts on this new addition to the backend trait:

I think there might be an opportunity to merge this per-backend behavior into the LabelUse trait already defined by each backend. The generated trampoline code is playing almost the same role -- extending the range of a branch and allowing a shorter-range reloc to be converted to a longer-range one -- except that, unlike intra-function veneers, it also assumes the use of various registers.

That last bit is a key difference. In the aarch64 case x16 and x17 are used internally to instruction-lowering sequences but never across basic blocks, so this is fine; but r10 and r11 on x86-64 will potentially be used by regalloc, so we wouldn't want to blindly insert this as another type of veneer in the existing trait. So we'd want to add some parameters to the supports_veneer, veneer_size and generate_veneer functions to indicate what kind of veneer ("next step up in range" or "absolute" maybe) and whether the use of regs as per ABI for inter-function transfers is allowed.

Whatever we do, it strikes me that the duplication here ("veneers" in two places, with similar APIs) is likely to confuse others so we should somehow merge them or distinguish them better. Furthermore if we're going to have low-level understanding of branches (e.g. embedded machine-code bits in a sequence to emit) we should have that in as few places as possible.

I am wondering if there is a better name than "veneer", if we don't merge; maybe "trampoline" or "linkage stub"?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

Do we still need this if we have the option below (force veneers) as well?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

For maps from one index space to another I like the X_to_Y naming idiom to make it clear what the index and value are; maybe text_offset_to_text_contents_index? Verbose I know but getting index spaces crossed is a potent footgun...

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

resolved before the...? (don't leave me hanging, I need to know how this story ends!)

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

Can you add a reference to the MachBuffer code that does this too? The cases are the same (known/in-bounds, known, out-of-bounds, unknown) and it'd be helpful to cross-reference where else this logic is used :-)

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

can you add a TODO here referencing one of the remove-old-backends issues/PRs (bytecodealliance/rfcs#12 maybe) so we can grep this when the time comes?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

Maybe at least a checked_add here, if not some clamping? A hypothetical machine with a 64-bit branch that can do a relative jump anywhere would overflow otherwise.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

"cache loads" is ambiguous here (my computer-architect mind thinks this means a normal memory load); maybe "to make loading compiled modules from disk more efficient"?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

I see here why you had to add methods to the TargetIsa to get at the veneer-generation implementation; above comments re: making use of LabelUse still apply and then maybe we could bounce through the TargetIsa to use those.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

Readability request: a little struct in place of this tuple, so we can write e.g. TextSegmentFunction { body_bytes, alignment } or something like that?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:07):

cfallin created PR review comment:

There is some code in the runtime that applies relocations as well; can we consolidate the two implementations?

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:39):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Aug 30 2021 at 18:39):

bjorn3 created PR review comment:

This function could be useful for cranelift-jit too. It also needs a version that loads the address from memory at a specific location like ELF PLT's. Currently it only has a x86_64 version for the latter function.

Wasmtime GitHub notifications bot (Aug 31 2021 at 20:56):

alexcrichton closed without merge PR #3254.

Wasmtime GitHub notifications bot (Aug 31 2021 at 21:06):

alexcrichton submitted PR review.

Wasmtime GitHub notifications bot (Aug 31 2021 at 21:06):

alexcrichton created PR review comment:

I'm going to leave this in for now because I think it's useful to exercise the veneers in real-world situations where they're actually needed. Otherwise I'd be worried that the logic for actually inserting veneers was only correct when the forcing was turned on. This way there's some exercising of the actual "ok yes we need that veneer" logic as well. (it's also relatively easy to support).

Wasmtime GitHub notifications bot (Aug 31 2021 at 21:42):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Aug 31 2021 at 21:42):

cfallin created PR review comment:

Makes sense!

Would it make sense to use it only in tests with the aarch64 backend (could be instantiated explicitly, doesn't need to run on aarch64 host), where the threshold for inserting an island is much lower, so we don't have the overhead of 2GiB object files in tests on x86?

Last updated: Apr 18 2025 at 18:04 UTC