alexcrichton opened PR #3254 from call-relative
to main
:
This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:
Function calls are always indirect which means they are more difficult
to predict, resulting in worse performance.Generating a relocation-per-function call requires expensive
relocation resolution at module-load time, which can be a large
contributing factor to how long it takes to load a precompiled module.To fix these issues, while also compromising on the previously simple
implementation technique, this commit switches wasm calls within a
module to using thecolocated
flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.When switching the
colocated
flag totrue
this commit is also then
able to move much of the relocation resolution fromwasmtime_jit::link
intowasmtime_cranelift::obj
during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from thebl
instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by theMachBuffer
type. When
generating a function relative jumps between basic blocks have similar
restrictions. At this time, however, I decided to not try to reuse
MachBuffer
for inter-function relocations. The main reason for this is
thatMachBuffer
doesn't actually handle any of the cases that
inter-function relocations need to handle, namely the 26-bit relocation
and 32-bit relocations of AArch64 and x86_64. If a 26-bit relocation
isn't resolvable because a function gets too large thenMachBuffer
will simply panic. I also opted to not useMachBuffer
for now, though,
because it doesn't quite have the infrastructure already set up where
when inserting an island for a smaller jump sequence the result should
be a relocation to fill in later. Today it simply promotes a 19-bit
branch on AArch64 to a 26-bit branch, assuming the 26 bits is enough.
All-in-all I felt that at this timeMachBuffer
wasn't reusable enough
and would need enough work that it was probably more worthwhile to do
this inwasmtime-cranelift
first and looking to unify the strategies
in the future.For these reasons the
wasmtime_cranelift::obj
module has grown in
complexity quite a bit. This now entirely handles relative relocations
between functions, automatically inserting "jump veneers" to resolve
calls between functions that are too far away. The general strategy is
similar toMachBuffer
, but different in the final tracking of
relocations. The current assumption is that eachTargetIsa
has the
ability to generate a fixed "jump veneer" which internally has an 8-byte
immediate value that is added to its own address to reach the
destination of an indirect call. This relative jump, even in the veneer,
means that when veneers are used we still don't need absolute
relocations since the code is still all implemented as relative jumps.
The veneer jumps, however, are larger in code size because they don't
fit into the native versions.I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
alexcrichton requested cfallin for a review on PR #3254.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
This should be a load literal instruction, i.e.
ldr x16, 16
, and you should schedule it before theadr
(whose offset would also be adjusted).
akirilov-arm created PR review comment:
Typo -
generates
.
akirilov-arm created PR review comment:
x16
should bex17
, shouldn't it?
alexcrichton updated PR #3254 from call-relative
to main
.
akirilov-arm submitted PR review.
akirilov-arm submitted PR review.
akirilov-arm created PR review comment:
Ditto.
akirilov-arm created PR review comment:
adr x17, 12
akirilov-arm created PR review comment:
Has this been added by mistake?
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Using a markdown code block like done here makes sense to me.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
/// This generates:
bjorn3 submitted PR review.
bjorn3 created PR review comment:
if name == "wasmtime_linkopt_padding_between_functions" { self.linkopts.padding_between_functions = value.parse()?; return Ok(()); } if name == "wasmtime_linkopt_force_jump_veneer" {
bjorn3 submitted PR review.
bjorn3 created PR review comment:
self.linkopts.force_jump_veneers = value.parse().unwrap();
to prevent silently ignoring invalid values.
bjorn3 edited PR review comment.
bjorn3 created PR review comment:
/// Packed form of windows unwind tables which, if present, will get emited
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Maybe make this
ud2
on x86?
bjorn3 submitted PR review.
alexcrichton edited PR #3254 from call-relative
to main
:
This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:
Function calls are always indirect which means they are more difficult
to predict, resulting in worse performance.Generating a relocation-per-function call requires expensive
relocation resolution at module-load time, which can be a large
contributing factor to how long it takes to load a precompiled module.To fix these issues, while also compromising on the previously simple
implementation technique, this commit switches wasm calls within a
module to using thecolocated
flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.When switching the
colocated
flag totrue
this commit is also then
able to move much of the relocation resolution fromwasmtime_jit::link
intowasmtime_cranelift::obj
during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from thebl
instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by theMachBuffer
type. When
generating a function relative jumps between basic blocks have similar
restrictions. At this time, however, I decided to not try to reuse
MachBuffer
for inter-function relocations. The main reason for this is
thatMachBuffer
doesn't actually handle any of the cases that
inter-function relocations need to handle, namely the 26-bit relocation
and 32-bit relocations of AArch64 and x86_64. If a 26-bit relocation
isn't resolvable because a function gets too large thenMachBuffer
will simply panic. I also opted to not useMachBuffer
for now, though,
because it doesn't quite have the infrastructure already set up where
when inserting an island for a smaller jump sequence the result should
be a relocation to fill in later. Today it simply promotes a 19-bit
branch on AArch64 to a 26-bit branch, assuming the 26 bits is enough.
All-in-all I felt that at this timeMachBuffer
wasn't reusable enough
and would need enough work that it was probably more worthwhile to do
this inwasmtime-cranelift
first and looking to unify the strategies
in the future.For these reasons the
wasmtime_cranelift::obj
module has grown in
complexity quite a bit. This now entirely handles relative relocations
between functions, automatically inserting "jump veneers" to resolve
calls between functions that are too far away. The general strategy is
similar toMachBuffer
, but different in the final tracking of
relocations. The current assumption is that eachTargetIsa
has the
ability to generate a fixed "jump veneer" which internally has an 8-byte
immediate value that is added to its own address to reach the
destination of an indirect call. This relative jump, even in the veneer,
means that when veneers are used we still don't need absolute
relocations since the code is still all implemented as relative jumps.
The veneer jumps, however, are larger in code size because they don't
fit into the native versions.I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.cc #3230
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Nah this is because rustdoc tries to test all code blocks as Rust code so I needed to add an annotation that it's not actually Rust code.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
alexcrichton updated PR #3254 from call-relative
to main
.
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Two thoughts on this new addition to the backend trait:
I think there might be an opportunity to merge this per-backend behavior into the
LabelUse
trait already defined by each backend. The generated trampoline code is playing almost the same role -- extending the range of a branch and allowing a shorter-range reloc to be converted to a longer-range one -- except that, unlike intra-function veneers, it also assumes the use of various registers.That last bit is a key difference. In the aarch64 case
x16
andx17
are used internally to instruction-lowering sequences but never across basic blocks, so this is fine; butr10
andr11
on x86-64 will potentially be used by regalloc, so we wouldn't want to blindly insert this as another type of veneer in the existing trait. So we'd want to add some parameters to thesupports_veneer
,veneer_size
andgenerate_veneer
functions to indicate what kind of veneer ("next step up in range" or "absolute" maybe) and whether the use of regs as per ABI for inter-function transfers is allowed.Whatever we do, it strikes me that the duplication here ("veneers" in two places, with similar APIs) is likely to confuse others so we should somehow merge them or distinguish them better. Furthermore if we're going to have low-level understanding of branches (e.g. embedded machine-code bits in a sequence to emit) we should have that in as few places as possible.
I am wondering if there is a better name than "veneer", if we don't merge; maybe "trampoline" or "linkage stub"?
cfallin created PR review comment:
Do we still need this if we have the option below (force veneers) as well?
cfallin created PR review comment:
For maps from one index space to another I like the
X_to_Y
naming idiom to make it clear what the index and value are; maybetext_offset_to_text_contents_index
? Verbose I know but getting index spaces crossed is a potent footgun...
cfallin created PR review comment:
resolved before the...? (don't leave me hanging, I need to know how this story ends!)
cfallin created PR review comment:
Can you add a reference to the
MachBuffer
code that does this too? The cases are the same (known/in-bounds, known, out-of-bounds, unknown) and it'd be helpful to cross-reference where else this logic is used :-)
cfallin created PR review comment:
can you add a TODO here referencing one of the remove-old-backends issues/PRs (bytecodealliance/rfcs#12 maybe) so we can grep this when the time comes?
cfallin created PR review comment:
Maybe at least a
checked_add
here, if not some clamping? A hypothetical machine with a 64-bit branch that can do a relative jump anywhere would overflow otherwise.
cfallin created PR review comment:
"cache loads" is ambiguous here (my computer-architect mind thinks this means a normal memory load); maybe "to make loading compiled modules from disk more efficient"?
cfallin created PR review comment:
I see here why you had to add methods to the
TargetIsa
to get at the veneer-generation implementation; above comments re: making use ofLabelUse
still apply and then maybe we could bounce through theTargetIsa
to use those.
cfallin created PR review comment:
Readability request: a little struct in place of this tuple, so we can write e.g.
TextSegmentFunction { body_bytes, alignment }
or something like that?
cfallin created PR review comment:
There is some code in the runtime that applies relocations as well; can we consolidate the two implementations?
bjorn3 submitted PR review.
bjorn3 created PR review comment:
This function could be useful for cranelift-jit too. It also needs a version that loads the address from memory at a specific location like ELF PLT's. Currently it only has a x86_64 version for the latter function.
alexcrichton closed without merge PR #3254.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I'm going to leave this in for now because I think it's useful to exercise the veneers in real-world situations where they're actually needed. Otherwise I'd be worried that the logic for actually inserting veneers was only correct when the forcing was turned on. This way there's some exercising of the actual "ok yes we need that veneer" logic as well. (it's also relatively easy to support).
cfallin submitted PR review.
cfallin created PR review comment:
Makes sense!
Would it make sense to use it only in tests with the aarch64 backend (could be instantiated explicitly, doesn't need to run on aarch64 host), where the threshold for inserting an island is much lower, so we don't have the overhead of 2GiB object files in tests on x86?
Last updated: Nov 22 2024 at 16:03 UTC