saulecabrera opened PR #28 from wasmtime-baseline-compilation
to main
:
This RFC proposes the addition of a new WebAssembly (Wasm) compiler to Wasmtime:
a single-pass or “baseline” compiler. A baseline compiler works to improve
overall compilation performance, yielding faster startup times, at the cost of
less optimized code.This RFC describes:
- The motivation for a baseline compiler
A high-level overview and description of the structure of the baseline
compilerA high-level overview of its integration with Wasmtime
cc @cfallin @fitzgen @alexcrichton @bnjbvr
lqd submitted PR review.
lqd created PR review comment:
Tiny linebreak suggestion:
let add = cranelift_asm::x64::AluRmiR { op: AluRmiR::Add, … }; let mut buf = MachBuffer::new(); add.emit(&mut buf, …);
lqd created PR review comment:
wasmtime run --compiler=<baseline|cranelift> file.wasm
lqd submitted PR review.
fitzgen submitted PR review.
fitzgen created PR review comment:
I'd like to echo @alexcrichton's suggestion about building out (at least a realistic kernel of) support for a second architecture during the initial implementation. Because Winch is, by design, skipping an architecture-agnostic mid-end and IR, it has higher risk of baking in architecture-specific assumptions into the core design. Building support for a second architecture should mitigate and nullify this risk.
fitzgen submitted PR review.
fitzgen created PR review comment:
I'm not convinced that making this a compile time feature will be overly complex, and even if it turns out that it is, I would strongly prefer that we start by attempting to do it that way, and only fall back to runtime-only if we determine it isn't practical.
fitzgen created PR review comment:
// request a general purpose register; // spill if none available. let imm = self.value_stack.pop(); let rd = self.gpr(); masm.add(rd, imm);
cfallin submitted PR review.
cfallin created PR review comment:
I see the appeal of compiled-out-by-default for minimizing impact, but I'll offer a few reasons from experience that might push in the other direction:
- Dealing with features in general is a speedbump for users. When developing the new x64 backend we had it cfg'd off behind a feature and it was a constant fight to get tools (LSP etc) to use the right flags, and meant that we couldn't have just one
cargo test
invocation to test the thing; we had to have a separate one with the feature enabled.- Differential testing across the two configurations becomes impossible. For the x64 backend, I eventually had to refactor to use a runtime setting and allow both configs to co-exist so that I could build a differential fuzz target. In this case I guess the feature would be strictly additive (include another compiler backend) but I could imagine tests at the Wasmtime level might still run into issues.
- In general, features make for more confusing code; see e.g. @alexcrichton 's push for us to use very coarse-grained features in memfd, and remove the cfg-logic from the inner implementations; and also the difficulties arising from the feature-controlled uffd backend.
On the other side, I guess the main concern is compile time? Or are you worried about e.g. users prematurely opting into using it and that opt-in being easier if it's a programmatic configuration vs. a feature flag in the dependency?
fitzgen submitted PR review.
fitzgen created PR review comment:
My concerns are (1) compile time and (2) factoring code such that it isn't entangled with the rest of Cranelift/Wasmtime.
Differential testing should definitely be possible. I'm _not_ saying that we should have a compile time config to either use Winch xor cranelift (like the
experimental_x64
cargo feature did). I'm saying we should have a compile time config which when enabled adds support for and exposes the Winch compilation strategy variant. So both Winch and Cranelift would be available for differential testing or what have you when the cargo feature was enabled.I don't think this is an onerous imposition on development, and we can point to the ongoing component model work as something that is using this compile-time-cargo-feature-to-make-available-a-runtime-option approach successfully.
cfallin created PR review comment:
Fair enough; I think much of my experience is with the "xor" variant of feature (so a bunch of
#[cfg(not(feature = ...))]
) and that may have given me a darker view of how this could go.I do like the idea of ensuring proper layering/separation by having a build with the feature off, as well.
So I guess this is fine; as you say we can always abandon the approach and go runtime-only if needed later.
cfallin submitted PR review.
saulecabrera created PR review comment:
I think I didn't do a good job expressing myself during the initial draft of the doc when I introduced the idea of/open questions around the compile time feature, which probably hinted the "xor" approach. I'm aligned with this though, my main goal is to allow the usage of both compilers, for ease of testing, but also I can definitely see cases in which an embedder might want to use Winch for fast startup and then compile with Cranelift in the background for subsequent requests; and according to the discussion here, such goal is still achievable with a compile time feature.
saulecabrera submitted PR review.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
I've bundled all the typo corrections in https://github.com/bytecodealliance/rfcs/pull/28/commits/d8d934ec8a16c6d0ae823c9d5ba499f472bfab99, thanks!
saulecabrera submitted PR review.
saulecabrera created PR review comment:
I've updated the doc to specify that changes will be introduced behind a compile-time feature.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
If
MacroAssembler
isn't used by Cranelift, I think it should not be put incranelift_asm
, but either in wince or a separate crate.
bjorn3 created PR review comment:
This is not valid rust.
bjorn3 submitted PR review.
bjorn3 edited PR review comment.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
Probably best left for the future, but it would be nice to support tiered compilation to compile with winch by default and with cranelift when a function turns out to be hot.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
I personally prefer Winch over Baseline for consistency. Or maybe BaselineWinch.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera created PR review comment:
It feels natural to have the MacroAssembler at this level in my opinion, given that
cranelift_asm
's objective is to provide the functionality and building blocks for code emission from different scenarios. That said, I don't think that this is a one-way door decision, if it's proven to cause any issues, this is easily reversible.
saulecabrera submitted PR review.
bjorn3 submitted PR review.
bjorn3 created PR review comment:
My main worry is about compilation time. Adding
MacroAssembler
tocranelift_asm
would require compiling more code even though Cranelift doesn't need it.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera created PR review comment:
Yes, tiering and its heuristics are intentionally out of scope for this RFC.
saulecabrera submitted PR review.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
That's a fair concern; on the other hand, I'm confident that we can control exposing the MacroAssembler with a compile time feature, making it opt-in to consumers.
saulecabrera edited PR review comment.
bnjbvr submitted PR review.
bnjbvr created PR review comment:
As a source of inspiration for design (edit: now I see the design principles are very close to what's in this blog post), as well as numbers showing the speed of a baseline compiler, I recommend this great post by Andy Wingo: https://wingolog.org/archives/2020/03/25/firefoxs-low-latency-webassembly-compiler
bnjbvr created PR review comment:
buffering the `MachInst` structs in memory, which we do not wish to do.
bnjbvr submitted PR review.
bnjbvr created PR review comment:
This is really mostly a nitpick from myself, but I think we're confusing two different levels of abstraction here. Provided the examples below, namely the higher-level API call, I think this is really an
Assembler
library. From my understanding of Spidermonkey's code base, aMacroAssembler
tries to hide some extra complexity by only having "operands" (and details about memory addresses/registers are hidden by the abstraction). Or imbuing some other specific semantics of operators that don't exist at the assembler layer; think for instance about having somewasm_div
method, that would trap if the divisor is zero; there's no assembly instructions to do so, but a macro-assembler would combine different assembler method calls to generate the code that implements this behavior. As a matter of fact, the macro-assembler is one layer of abstraction above an assembler, and what's described here rather looks like a plain assembler to me, unless I'm mistaken.
bnjbvr created PR review comment:
Will it spill _all_ the registers, or only the conflicting ones / enough that it can continue doing its work? Spidermonkey's wasm baseline compiler would spill all, if I remember correctly, as it's much simpler to reason about.
bnjbvr created PR review comment:
Wasmtime, similar to what is present in Wasm engines in Web browsers. This RFC
bnjbvr created PR review comment:
abstraction, per ISA, along with their availability. It will also hold
bnjbvr created PR review comment:
Should this be
winch
instead of baseline? (alternatively, should we change thecranelift
option tooptimizing
?)
bnjbvr created PR review comment:
let imm = self.value_stack.pop(); // request a general purpose register; // spill if none available let rd = self.gpr();
bnjbvr created PR review comment:
There are many variants of linear scan register allocation, and I'm afraid that referring to a textbook implementation may be confusing: here it's really allocating registers on the fly, and spilling aggressively whenever none is available, rather than doing any kind of textbook analysis to figure out liveness intervals and all of that.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
Indeed a great post! I've used it as a source of inspiration and as a guide to navigate SpiderMonkey's code.
cfallin created PR review comment:
Agree, this reference should probably be removed @saulecabrera ; what we describe here isn't true linear-scan but something more like "single-pass register allocation" (proper linear scan requires lookahead to know which register is used furthest in the future, to choose which to evict).
cfallin submitted PR review.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
Thanks for flagging this, I should've updated it in https://github.com/bytecodealliance/rfcs/pull/28/commits/141a3c6e3df27b1a9dc2d70580e021ab2ec58d00. For consistency I'd like to keep this as
winch
saulecabrera created PR review comment:
Yeah, the current intention is to spill all the registers.
saulecabrera submitted PR review.
saulecabrera submitted PR review.
saulecabrera created PR review comment:
Agreed, I've removed the reference. For my own knowledge: are there any papers that describe each of the linear scan variants? Or are variations purely a result of implementation differences?
cfallin submitted PR review.
cfallin created PR review comment:
I'm not aware of any that describe the SpiderMonkey-style single pass allocator, but I haven't looked too hard either; at the level we care about it tends to be more instructive to look at other implementations and academics tend not to study the fiddly details that matter here. But if I find one I'll let you know (and vice-versa) :-)
saulecabrera submitted PR review.
saulecabrera created PR review comment:
From my understanding of Spidermonkey's code base, a MacroAssembler tries to hide some extra complexity by only having "operands" (and details about memory addresses/registers are hidden by the abstraction).
My understanding is a bit different here. As far as I can tell, the MacroAssembler aside from operating on top of the concept of operands, also operates on top of other abstractions like Register, Imm, and Address; still encapsulating most of the architecture-specific details.
As far as encapsulating complexity, my mental model doesn't diverge from what you've outlined: emitting a set of assembly assembly instructions representing a given wasm instruction, with the difference that we're not strictly relying on/introducing the presence of an Assembler for lowering; but instead potentially relying on helper methods.
From a strict point of view, at least taking in to account SpiderMonkey's implementation, maybe MacroAssembler is not the right term, but eventually it could grow to be the right term. For the initial implementation introducing an Assembler abstraction seems a bit premature IMO. That said, I wouldn't be opposed to naming this abstraction
Assembler
instead.@cfallin do you have any further thoughts that you'd like to share according to your experience with SpiderMonkey?
saulecabrera edited PR review comment.
saulecabrera edited PR review comment.
saulecabrera edited PR review comment.
cfallin submitted PR review.
cfallin created PR review comment:
Yes, I agree generally with Ben here that what we want initially is not-so-macro; actually I think as a first approximation, one method call per
MachInst
is about right. At that level, we'll need different wasm-opcode-to-assembler-API code for each architecture. Details like addressing modes, and two-register (x86) vs three-register (aarch64) form, are best dealt with explicitly at this level, rather than being papered over; for example an alternative approach to the 2/3-register-form difference would be to emit a mov/add sequence for every add and emulate 3-reg form on x86, but that's a significant pessimization.But I also think there's a lot of room to evolve a general set of abstractions on top of this and share implementations where necessary, especially for the more complex lowerings like (for example) reference-type read/write barrier fastpaths, and other ops built from primitives. We could implement something like a
CommonAssembler
trait that provides load/store/move/add/sub/compare/branch/... and then write another layer that is generic over that.I think that starting point and evolutionary path is the important point and the names aren't so important; so
Assembler
rather thanMacroAssembler
is fine, if that's the way others prefer!
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
saulecabrera edited PR review comment.
saulecabrera updated PR #28 from wasmtime-baseline-compilation
to main
.
Last updated: Nov 22 2024 at 17:03 UTC