alexcrichton commented on issue #28:
Personally this all sounds great to me, and I'm excited to see this take shape! I think that reusing Cranelift backends is a great idea, and the integration strategy chosen here (no tiering yet) is the right answer for now.
I suspect that various abstractions in Wasmtime which baseline compilation would want to use are not as optimal as they could be as implemented today. We haven't focused much on compile-time performance, and Cranelift often dwarfs any cost in Wasmtime or other parts of the system, but I think having a baseline compiler push on these abstractions to optimize them is a great forcing function. Mostly just saying this as a word of caution that there may be some surprising bottlenecks come time for performance measurements, but it should all be easy enough to handle.
Lastly, one thing I might recommend, if possible, is getting AArch64 support up and running as soon as possible. In a context as architecture-specific as a baseline compiler, I find that adding a second architecture can often radically change existing designs, so getting that forcing function sooner rather than later may be helpful. That being said, I don't think it's necessary to push on it immediately, and if y'all would prefer to defer it to the second phase as you've mentioned here, I also think that's ok.
bjorn3 commented on issue #28:
This paper may be useful:
https://arxiv.org/abs/2011.13127: Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode
> Our WebAssembly compiler generates code 4.9X-6.5X faster than Liftoff, the WebAssembly baseline compiler in Google Chrome. The generated code also outperforms Liftoff's by 39%-63% on the Coremark and PolyBenchC WebAssembly benchmarks.
In the future, if an interpreter tier is added below the baseline compiler, the following is also interesting:
https://arxiv.org/abs/2205.01183: A fast in-place interpreter for WebAssembly
> Our evaluation shows that in-place interpretation of Wasm code is space-efficient and fast, achieving performance on-par with interpreting a custom-designed internal format.
cfallin commented on issue #28:
> This paper may be useful:
> https://arxiv.org/abs/2011.13127: Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode
> Our WebAssembly compiler generates code 4.9X-6.5X faster than Liftoff, the WebAssembly baseline compiler in Google Chrome. The generated code also outperforms Liftoff's by 39%-63% on the Coremark and PolyBenchC WebAssembly benchmarks.
The paper is indeed really interesting, but the approach it takes is not terribly compatible with our goals: in particular, we don't want to depend on a C++ compiler, template specialization, and relocation magic, and then glue together the machine code that comes out of that separate offline process to generate the actual bits of code. An equally important goal of the baseline compiler is to be straightforward to understand and modify; as such, a very simple "emit these three instructions for this Wasm opcode" mapping is ideal for maintainability.
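To make that concrete, here is a minimal sketch of the kind of opcode-to-instructions mapping meant here; the `Assembler` type, its methods, and the register handling are hypothetical stand-ins, not Wasmtime's actual API, and spilling is elided:

```rust
// Hypothetical sketch only: `Assembler`, its methods, and the register
// handling are illustrative stand-ins, not Wasmtime's actual API.

/// A tiny subset of Wasm opcodes, for illustration.
enum WasmOp {
    I32Const(i32),
    I32Add,
    LocalGet(u32),
}

/// Hypothetical assembler that appends encoded machine instructions.
struct Assembler {
    code: Vec<u8>,
}

impl Assembler {
    fn mov_imm_to_reg(&mut self, _imm: i32, _reg: u8) { /* emit bytes */ }
    fn add_reg_reg(&mut self, _dst: u8, _src: u8) { /* emit bytes */ }
    fn load_local(&mut self, _slot: u32, _reg: u8) { /* emit bytes */ }
}

/// One visible match arm per opcode: you can read off exactly which
/// instructions are emitted, with no intermediate representation. The
/// `stack` of registers shadows the Wasm operand stack; spilling when
/// `free` runs out is elided from this sketch.
fn emit(op: &WasmOp, asm: &mut Assembler, stack: &mut Vec<u8>, free: &mut Vec<u8>) {
    match op {
        WasmOp::I32Const(v) => {
            let reg = free.pop().expect("spilling elided");
            asm.mov_imm_to_reg(*v, reg);
            stack.push(reg);
        }
        WasmOp::I32Add => {
            let rhs = stack.pop().unwrap();
            let lhs = stack.pop().unwrap();
            asm.add_reg_reg(lhs, rhs);
            free.push(rhs);
            stack.push(lhs);
        }
        WasmOp::LocalGet(slot) => {
            let reg = free.pop().expect("spilling elided");
            asm.load_local(*slot, reg);
            stack.push(reg);
        }
    }
}
```

The point is that each match arm is the complete, visible lowering for its opcode; nothing stands between the source and the emitted bytes.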
saulecabrera commented on issue #28:
During today's wasmtime meeting, we discussed @fitzgen's and @alexcrichton's feedback around aarch64 support; we concluded that lowering the priority of Reference Types and bumping the priority of supporting at least a subset of aarch64 is probably the best path forward for the initial implementation. I've made the necessary changes to reflect this decision.
bjorn3 commented on issue #28:
> in particular, we don't want to depend on a C++ compiler, template specialization, and relocation magic, and then glue together the machine code that comes out of that separate offline process to generate the actual bits of code.
Would it be possible to replace C++ with Cranelift IR? It could then be compiled at build time and at runtime.
saulecabrera commented on issue #28:
> in particular, we don't want to depend on a C++ compiler, template specialization, and relocation magic, and then glue together the machine code that comes out of that separate offline process to generate the actual bits of code.

> Would it be possible to replace C++ with Cranelift IR? It could then be compiled at build time and at runtime.
Personally I think that even if possible, this approach would still not be compatible with the design principles that we've laid out, mainly around (i) simplicity and (ii) no intermediate representation. In general I think it's worth keeping the implementation as simple and straightforward as possible.
jameysharp commented on issue #28:
I want to expand on @bjorn3's suggestions. But, to be clear, I'm not going to argue that this alternative should replace the current RFC. If this RFC is the direction that you want to pursue, @saulecabrera, then I see no reason the rest of us shouldn't support you in that. That said, here's an alternative that I think has some advantages.
Imagine that for each WebAssembly instruction, we write down a sequence of Cranelift instructions to implement it, with holes for registers and labels. (To keep this short, I'm hand-waving things like control flow.) That's basically what cranelift/wasm/src/code_translator.rs is, if you squint. We currently get an optimizing WebAssembly compiler by filling those CLIF sequences with virtual registers from abstract interpretation, concatenating the snippets, and then handing the result off to Cranelift.
We could instead get a baseline compiler from those snippets of CLIF by compiling each snippet to native code in isolation, and then concatenating the compiled snippets. Compiling the snippets to native code can happen during the wasmtime build if desired. The number of variations we need to compile of each snippet is bounded by the number of physical registers used to shadow the wasm stack. This is essentially Copy-and-Patch, but using the compiler we already maintain instead of a C++ compiler.
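As a rough sketch of the runtime half of this idea, assuming a (hypothetical) build step has already run Cranelift over each CLIF snippet in isolation and stored the resulting machine code per (opcode, register assignment) key, all names here are made up:

```rust
use std::collections::HashMap;

/// Key: a Wasm opcode plus the physical registers assigned to its
/// operands and result. The variants per opcode are bounded by how many
/// registers shadow the Wasm stack.
#[derive(PartialEq, Eq, Hash)]
struct SnippetKey {
    opcode: u16,
    in_regs: [u8; 2],
    out_reg: u8,
}

/// Hypothetical table of machine-code bytes, one entry per key,
/// produced at wasmtime build time by compiling each CLIF snippet
/// in isolation.
fn snippet_table() -> HashMap<SnippetKey, Vec<u8>> {
    HashMap::new() // filled in by the (hypothetical) build step
}

/// Baseline compilation then reduces to lookups and concatenation:
/// no IR is materialized at runtime.
fn compile_function(
    ops: &[(u16, [u8; 2], u8)],
    table: &HashMap<SnippetKey, Vec<u8>>,
) -> Vec<u8> {
    let mut code = Vec::new();
    for &(opcode, in_regs, out_reg) in ops {
        let key = SnippetKey { opcode, in_regs, out_reg };
        code.extend_from_slice(&table[&key]);
    }
    code
}
```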
And we could also get an interpreter by inserting those CLIF snippets into a dispatch loop, also written in CLIF, with register holes filled in by loads and stores to an explicit value stack. The "fast in-place interpreter" paper implemented the interpreter in native assembly because they couldn't control the generated code well enough from a high-level language, but I believe Cranelift IR gives us everything we'd need while being able to compile the interpreter for multiple architectures. (Chris tells me this resembles SpiderMonkey's "baseline interpreter".)
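Here is a Rust model of what such a dispatch loop would compute, with the register holes filled by an explicit value stack. The real thing would be written in CLIF and compiled per target, and real Wasm immediates are LEB128-encoded rather than the fixed-width encoding used in this sketch:

```rust
/// Rust model of the proposed CLIF dispatch loop: the same per-opcode
/// snippets, with their register "holes" replaced by pushes and pops on
/// an explicit value stack. The immediate encoding here is simplified.
fn interpret(code: &[u8]) -> i32 {
    let mut stack: Vec<i32> = Vec::new();
    let mut pc = 0;
    loop {
        let op = code[pc];
        pc += 1;
        match op {
            // 0x41: i32.const, followed here by a 4-byte little-endian
            // immediate (real Wasm uses LEB128).
            0x41 => {
                let imm = i32::from_le_bytes(code[pc..pc + 4].try_into().unwrap());
                pc += 4;
                stack.push(imm);
            }
            // 0x6a: i32.add; operands come from the value stack rather
            // than from registers.
            0x6a => {
                let rhs = stack.pop().unwrap();
                let lhs = stack.pop().unwrap();
                stack.push(lhs.wrapping_add(rhs));
            }
            // 0x0b: end; return the top of stack.
            0x0b => return stack.pop().unwrap(),
            _ => panic!("unsupported opcode in this sketch"),
        }
    }
}
```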
I like this approach because we'd get three tiers, all supporting every target architecture, while only having to implement new instructions once. That appeals to me for giving us confidence that the tiers are all equally correct implementations.
fitzgen commented on issue #28:
> Compiling the snippets to native code can happen during the wasmtime build if desired.
I think this would be necessary to meet performance goals, since otherwise we would be materializing at runtime a bunch of IR that a handwritten baseline compiler wouldn't need at all.
It's not clear what kind of impact this would have on the Wasmtime crate's compile times, but that is something I would be concerned about.
> I like this approach because we'd get three tiers, all supporting every target architecture, while only having to implement new instructions once. That appeals to me for giving us confidence that the tiers are all equally correct implementations.
Yes, this is pretty nice. Tempting. I like using CLIF to sidestep using a C++ compiler and writing a super hacky post-processing linker that chops off prologues and epilogues and patches control flow together. I like having control over the shape of the code that we would paste together, so we can be sure we aren't messing up edge cases.
But also, this is a bit more work to get to an initial baseline compiler implementation than what the RFC proposes.
cfallin commented on issue #28:
I do like @jameysharp's idea here, from first principles: it would be a nice way to obtain a series of compilers from one source of truth. We should experiment with this in the future, I think.
The tradeoff space I see right now is, on the one hand, a handwritten, straightforward baseline compiler (this RFC) with the advantages of:
- Simple to verify by looking: one sees exactly the instructions that will be emitted for any given opcode, locked down in the source rather than derived indirectly from another toolchain or generation process;
- Simpler to add new opcodes to than Cranelift. One might assume that all features will be in Cranelift anyway, so leveraging this work is "free"; in practice, though, engines that have a separate baseline tier (like SpiderMonkey) often add features to the baseline compiler first, because it's an order of magnitude simpler. This could allow us to build out early support for some features and experiment more freely before we work out the details of a high-performance, optimizing-tier implementation.
- A separate implementation adds "verification diversity": we can differentially fuzz Cranelift against this new backend and possibly find issues.
On the other hand:
- A copy-and-patch-like approach that stitches together Cranelift-compiled fragments can much more flexibly grow its "library" into many op-combining possibilities. This would be a natural design point between an op-by-op baseline compiler / template JIT (this RFC) and a full optimizing tier. "Macro-op fusion" is still a single-pass approach, so it retains compile speed while possibly getting pretty decent instruction selection; see the sketch after this list.
- If we assume that the Cranelift implementation is a sunk cost, then this does give us baseline compiler support for new features as we add them for "free". We just need to pay for the complexity and effort to build the framework once.
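For illustration, a hypothetical sketch of the macro-op fusion idea from the first bullet: match short opcode sequences against fused templates before falling back to one snippet per op. The `Op` type is made up, and strings stand in for pre-compiled snippets:

```rust
/// Hypothetical macro-op fusion: before emitting snippets one opcode at
/// a time, try to match short opcode sequences against fused templates
/// that were compiled ahead of time.
#[derive(Debug)]
enum Op {
    LocalGet(u32),
    I32Add,
    LocalSet(u32),
}

/// One fused template: `local.get a; local.get b; i32.add; local.set c`
/// becomes a single load-add-store snippet with no shadow-stack traffic
/// in between. A string stands in for pre-compiled code here.
fn try_fuse(ops: &[Op]) -> Option<(usize, String)> {
    match ops {
        [Op::LocalGet(a), Op::LocalGet(b), Op::I32Add, Op::LocalSet(c), ..] => {
            Some((4, format!("fused_add(local{a}, local{b}) -> local{c}")))
        }
        _ => None,
    }
}

/// Still single-pass: greedily prefer fused templates, falling back to
/// one snippet per opcode.
fn select(ops: &[Op]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < ops.len() {
        if let Some((len, snippet)) = try_fuse(&ops[i..]) {
            out.push(snippet);
            i += len;
        } else {
            out.push(format!("{:?}", ops[i]));
            i += 1;
        }
    }
    out
}
```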
Also, to amplify and add a bit to @fitzgen's point here:
> It's not clear what kind of impact this would have on the Wasmtime crate's compile times, but that is something I would be concerned about.
I'd be concerned about build complexity as well: this implies a "bootstrap compiler" that is built during the compiler build process to compile another compiler (or data that is embedded into it); that's probably a nontrivial cost. And it implies that Cranelift itself is at least a dev-dependency of the baseline compiler, which may or may not be desirable.
On balance I think I still favor the approach in this RFC, but these ideas are intriguing for a possible later exploration (so thanks again for raising them!).
jameysharp commented on issue #28:
Thanks, Nick and Chris! I generally agree with both of you, although I'd quibble about the value of implementation diversity: we already have plenty of wasm implementations to use for differential testing. I have more thoughts on engineering details such as build time and build complexity, but I don't want to sidetrack discussion on this RFC further. So unless consensus swings towards implementing the copy-and-patch thing, I'll leave it at that.
saulecabrera commented on issue #28:
Thanks for expanding on this, @jameysharp; I find your thoughts interesting here, mostly from the perspective of maximizing reusability and correctness. I must admit that one of the biggest unknowns for me is the build-time complexity that this approach might introduce and what the ROI of that cost would be; on paper the ideas seem compelling, but I find them hard to quantify in the scope of Wasmtime without more investment/investigation in that direction.
Expanding on @cfallin's point above around the simplicity of adding new features:
One side-effect that I'm hoping to get from the current proposal (which is not explicit in the RFC, and perhaps I should add it) is to make it easier to add and maintain developer-facing features; the main feature I have in mind is debugging. I feel that with the Copy-and-Patch approach this would be more difficult to achieve (this is purely an assumption, so I might be wrong here).
On correctness: FWIW, we discussed (and the RFC states) integrating fuzzing in phase 1 to ensure a correctness-first approach; this doesn't negate the fact that having a single source of truth is potentially less error-prone.
In conclusion, my feeling is that switching the approach has the potential to bring a set of desirable benefits, but we'd need to deal with more unknowns/complexity. The approach that we have today offers a higher level of certainty in terms of maintenance, compilation, and runtime performance in the scope of Wasmtime. That said, if there's a way to leave the door open to incrementally investigate this down the road, on top of the current proposal, I think we should consider it.
jameysharp commented on issue #28:
> In conclusion, my feeling is that switching the approach has the potential to bring a set of desirable benefits, but we'd need to deal with more unknowns/complexity. The approach that we have today offers a higher level of certainty in terms of maintenance, compilation, and runtime performance in the scope of Wasmtime.
I agree. My guess is that copy-and-patch would let us maintain less total code in the end, but compared to the RFC, it's definitely not as clear how we get from here to there.
> That said, if there's a way to leave the door open to incrementally investigate this down the road, on top of the current proposal, I think we should consider it.
I don't think there's anything about the current RFC that makes it hard to do something like copy-and-patch later. As long as Wasmtime has a path that still uses Cranelift, we can try slicing it up any time. We absolutely don't have to choose one approach or the other.
On the other hand, I don't think the work outlined in this RFC helps much with implementing copy-and-patch. So, _if_ folks felt like that was something we want to do in the short term, then we could save some effort overall by going straight there. But since copy-and-patch has more uncertainty and engineering risk, I think it's reasonable to follow the RFC's plan first.
> One side-effect that I'm hoping to get from the current proposal (which is not explicit in the RFC, and perhaps I should add it) is to make it easier to add and maintain developer-facing features; the main feature I have in mind is debugging.
I think that would be a great note to add to the RFC, as well as these points that Chris made:
> Simple to verify by looking: one sees exactly the instructions that will be emitted for any given opcode, locked down in the source rather than derived indirectly from another toolchain or generation process;
> Simpler to add new opcodes to than Cranelift. [...] in practice, though, engines that have a separate baseline tier (like SpiderMonkey) often add features to the baseline compiler _first_, because it's an order of magnitude simpler. This could allow us to build out early support for some features and experiment more freely before we work out the details of a high-performance, optimizing-tier implementation.