Birbe commented on issue #2566:
I'm working on a Java Virtual Machine implementation in Rust, and one of my plans is to write a WASM backend along with a general JIT implementation using Cranelift; if everything were unified to just be Cranelift, that would be great. I might give creating a WASM backend a shot; it would benefit my project as well as Cranelift.
teymour-aldridge commented on issue #2566:
I would be interested in implementing this!
I had a look, and I'm not sure where the correct place to implement this would be. My thought was in `cranelift_codegen::isa`; however, the `MachBackend` trait seems to rely upon the existence of registers in the target architecture (which makes sense, but WebAssembly is sort of a stack machine). Would it make sense to emulate registers by using WebAssembly locals, or is that a bad idea?
teymour-aldridge edited a comment on issue #2566:
I would be interested in implementing this!
I had a look, and I'm not sure where the correct place to implement this would be. My thought was in `cranelift_codegen::isa`; however, the `MachBackend` trait seems to rely upon the existence of registers in the target architecture (which makes sense for other architectures, but WebAssembly is sort of a stack machine). Would it make sense to emulate registers by using WebAssembly locals, or is that a bad idea?
bjorn3 commented on issue #2566:
Wasm is a fundamentally different architecture from all other supported architectures: it is a stack machine rather than a register machine, it uses structured control flow rather than arbitrary jumps, it allows more than one memory, and so on. For this reason I think you should avoid using the existing backend infrastructure. One possible implementation method would be to have an implementation of the `Module` trait from `cranelift_module` in a separate crate, and then write the lowering from clif IR to Wasm bytecode in that crate.
cfallin commented on issue #2566:
Indeed, this would best be structured as a separate framework that consumes Cranelift's IR, I think.
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
We'll need to decide what we do about runtime-specific implementations of Wasm memories, tables, globals, and imports as well. Traditionally Cranelift has these bits filled in by the "environment", and our embedding (Wasmtime, Lucet, SpiderMonkey, something else) fills in some IR that uses e.g. machine-native pointers to implement Wasm memory/table accesses. If we compile back to Wasm, should a table just become a table, and a memory become a memory? Probably, but this will require some careful thought in the frontend/environment too. Possibly we add new IR operators (primitives) and just lower `table.get` 1-to-1 to this primitive, then back to `table.get` in the backend.
FWIW, I have a little side-project (waffle) that is attempting to be a Wasm-to-CFG-of-SSA-to-Wasm framework, which might be useful to look at to see what it takes to compile SSA IR with blockparams back to Wasm. (It kind of works: it roundtrips a hello-world, but has some bugs beyond that. Side-project, no guarantees, etc!)
cfallin edited a comment on issue #2566:
Indeed, this would best be structured as a separate framework that consumes Cranelift's IR, I think.
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
We'll need to decide what we do about runtime-specific implementations of Wasm memories, tables, globals, and imports as well (EDIT: when using the `cranelift-wasm` frontend). Traditionally `cranelift-wasm` has these bits filled in by the "environment", and our embedding (Wasmtime, Lucet, SpiderMonkey, something else) fills in some IR that uses e.g. machine-native pointers to implement Wasm memory/table accesses. If we compile back to Wasm, should a table just become a table, and a memory become a memory? Probably, but this will require some careful thought in the frontend/environment too. Possibly we add new IR operators (primitives) and just lower `table.get` 1-to-1 to this primitive, then back to `table.get` in the backend.
FWIW, I have a little side-project (waffle) that is attempting to be a Wasm-to-CFG-of-SSA-to-Wasm framework, which might be useful to look at to see what it takes to compile SSA IR with blockparams back to Wasm. (It kind of works: it roundtrips a hello-world, but has some bugs beyond that. Side-project, no guarantees, etc!)
Birbe commented on issue #2566:
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with,
https://docs.rs/relooper/latest/relooper/ might be a good way to easily implement this (I believe it actually implements the stackifier algorithm even though it's called relooper)
teymour-aldridge commented on issue #2566:
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
I'm not sure how regalloc could be used to translate operators to stack code. Would you mind elaborating? (A link to something explaining the idea would be great if you have one.)
My thoughts were that this should be done by recursively inserting WebAssembly instructions to compute `Value`s. So something like:

```
function instruction_code():
    for each operand:
        instruction_code(operand)
    code to push result onto stack (apply operator and then push)
```
teymour-aldridge edited a comment on issue #2566:
Thank you for the advice!
There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).
I'm not sure how regalloc could be used to translate operators to stack code. Would you mind elaborating? (A link to something explaining the idea would be great if you have one.)
My thoughts were that this should be done by recursively inserting WebAssembly instructions to compute `Value`s. So something like:

```
function instruction_code():
    for each operand:
        instruction_code(operand)
    code to push result onto stack (apply operator and then push)
```
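That recursive scheme could be sketched in Rust roughly as follows, using a toy expression IR; the `Expr` and `Op` types here are hypothetical stand-ins for illustration, not Cranelift's actual IR:

```rust
// Toy expression tree standing in for SSA values (hypothetical, not clif IR).
enum Expr {
    Const(i32),
    Add(Box<Expr>, Box<Expr>),
}

// Pseudo-Wasm stack instructions we emit.
#[derive(Debug, PartialEq)]
enum Op {
    I32Const(i32),
    I32Add,
}

// Recursively emit code for each operand, then the operator itself, leaving
// the result on the Wasm operand stack -- the scheme described above.
fn instruction_code(e: &Expr, out: &mut Vec<Op>) {
    match e {
        Expr::Const(n) => out.push(Op::I32Const(*n)),
        Expr::Add(a, b) => {
            instruction_code(a, out);
            instruction_code(b, out);
            out.push(Op::I32Add);
        }
    }
}

fn main() {
    // (1 + 2) emitted as postfix stack code.
    let e = Expr::Add(Box::new(Expr::Const(1)), Box::new(Expr::Const(2)));
    let mut out = Vec::new();
    instruction_code(&e, &mut out);
    println!("{:?}", out);
}
```

As cfallin notes below in the thread, this works as-is only for tree-shaped dataflow; multiply-used values need locals or rematerialization.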
cfallin commented on issue #2566:
That scheme is a good start, and works for tree-shaped code, but not DAG-shaped code: as soon as an operator (SSA value) is used more than once, you either need to recompute it or store it in a local and then reload it at each use-site.
The use of locals to store SSA values is the register allocation problem -- with a range of solutions, from a "one local (register) per SSA value" approach up to more sophisticated approaches that reuse locals. Given Wasm engines' limits on locals, and given the way that baseline Wasm compilers allocate separate stack slots for each local (leading to huge stackframes if we are not frugal with locals), we probably want to do something better than one-local-per-value.
The reason I described this as a "combined regalloc / instruction scheduling problem" is that the two are intertwined. Once we've decided to use locals to carry values, we could do the simplest thing and emit `local.get` for every operand, then the operator, then `local.set`. But that's also inefficient; we want to use the stack when we can. So the order of operators matters. Your scheme (recursive value generation) is actually doing implicit code-motion in order to take advantage of stack dataflow (by sinking operators to use-sites), which is great, and what we want. But we have to be careful about it for two reasons:
- Some operators have side-effects, so we have to be careful about moving them. We can slide them over pure operators but not other side-effecting operators. (Trapping is a side-effect too, so this is not just for loads/stores/calls.)
- When a value is used more than once, we need to choose a location to compute it that dominates all of its use-sites. We can't, for example, just generate it onto the stack on demand the first time it's used then save it in a local for later times -- consider when two branches of an if-else use a value computed before the control-flow split.
In the most general form of the problem, we could also consider rematerialization: some operations (such as, e.g., `i32.const 0`) are trivial to compute, so it would probably not be a good idea to compute this once at the top of the function (GVN will often dedup constants like this) and store it in a local thereafter. Rematerialization can be considered part of the regalloc problem (because a regalloc might not want to spill, and remat is then an option) and some regalloc implementations can do this; one can also do it in a more ad-hoc way in the code one generates prior to allocating registers.
A scheme that might work is to "treeify" the operands before generating code. Each operand of an operator could then be:
- `SingleUse(Value)`: We are the only use of the value, and the operator that produces it is pure, so it will be generated in-place on the stack (your scheme);
- `NormalUse(Value)`: We are one of several uses of the value; it should be generated into a local at its original location in the program, then loaded here; or
- `Rematerialize(Value)`: We replicate the operator here. Used for pure ops like `iconst` only.
So one would generate ops in their original program location if used by a `NormalUse` elsewhere, and otherwise only at the use-site. This would result in code that refers to values, which we can subsequently do register allocation over to assign to locals.
There's not really a good document on this that I know of; it's just applications of general concepts (regalloc, remat, instruction scheduling) to a specific and somewhat odd target. We're inventing as we go here. That said, it'd probably be good to understand how e.g. LLVM's Wasm backend does this; I haven't really looked at the dataflow side (just the stackifier).
Anyway hopefully this helps!
cfallin edited a comment on issue #2566:
That scheme is a good start, and works for tree-shaped code, but not DAG-shaped code: as soon as an operator (SSA value) is used more than once, you either need to recompute it or store it in a local and then reload it at each use-site.
The use of locals to store SSA values is the register allocation problem -- with a range of solutions, from a "one local (register) per SSA value" approach up to more sophisticated approaches that reuse locals. Given Wasm engines' limits on locals, and given the way that baseline Wasm compilers allocate separate stack slots for each local (leading to huge stackframes if we are not frugal with locals), we probably want to do something better than one-local-per-value.
The reason I described this as a "combined regalloc / instruction scheduling problem" is that the two are intertwined. Once we've decided to use locals to carry values, we could do the simplest thing and emit `local.get` for every operand, then the operator, then `local.set`. But that's also inefficient; we want to use the stack when we can. So the order of operators matters. Your scheme (recursive value generation) is actually doing implicit code-motion in order to take advantage of stack dataflow (by sinking operators to use-sites), which is great, and what we want. But we have to be careful about it for two reasons:
- Some operators have side-effects, so we have to be careful about moving them. We can slide them over pure operators but not other side-effecting operators. (Trapping is a side-effect too, so this is not just for loads/stores/calls.)
- When a value is used more than once, we need to choose a location to compute it that dominates all of its use-sites. We can't, for example, just generate it onto the stack on demand the first time it's used then save it in a local for later times -- consider when two branches of an if-else use a value computed before the control-flow split.
In the most general form of the problem, we could also consider rematerialization: some operations (such as, e.g., `i32.const 0`) are trivial to compute, so it would probably not be a good idea to compute this once at the top of the function (GVN will often dedup constants like this) and store it in a local thereafter. Rematerialization can be considered part of the regalloc problem (because a regalloc might not want to spill, and remat is then an option) and some regalloc implementations can do this; one can also do it in a more ad-hoc way in the code one generates prior to allocating registers.
A scheme that might work is to "treeify" the operands before generating code. Each operand of an operator could then be:
- `SingleUse(Value)`: We are the only use of the value, and the operator that produces it is pure, so it will be generated in-place on the stack (your scheme);
- `NormalUse(Value)`: We are one of several uses of the value; it should be generated into a local at its original location in the program, then loaded here; or
- `Rematerialize(Value)`: We replicate the operator here. Used for pure ops like `iconst` only.
So one would generate ops in their original program location if used by a `NormalUse` elsewhere, and otherwise only at the use-site. This would result in code that refers to values, which we can subsequently do register allocation over to assign to locals.
There's not really a good document on this that I know of; it's just applications of general concepts (regalloc, remat, instruction scheduling) to a specific and somewhat odd target. We're inventing as we go here. That said, it'd probably be good to understand how e.g. LLVM's Wasm backend does this; I haven't really looked at the dataflow side (just the stackifier).
Anyway hopefully this helps!
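The three operand kinds above could be sketched as a Rust enum plus a simple classification rule. This is a hedged sketch with made-up types: `Value`, the use count, and the purity/triviality queries are hypothetical stand-ins, not waffle's or Cranelift's actual API:

```rust
// Hypothetical value id in an SSA IR.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Value(u32);

// How an operand is materialized when lowering to Wasm stack code.
#[derive(Debug, PartialEq)]
enum OperandMode {
    // Sole use of a pure op: generate it in place, directly on the stack.
    SingleUse(Value),
    // One of several uses: compute into a local at the def site, load here.
    NormalUse(Value),
    // Trivial pure op (e.g. a constant): re-emit it at each use-site.
    Rematerialize(Value),
}

// Classify an operand given its use count and simple properties of its def.
fn classify(v: Value, uses: u32, is_pure: bool, is_trivial: bool) -> OperandMode {
    if is_pure && is_trivial {
        OperandMode::Rematerialize(v)
    } else if is_pure && uses == 1 {
        OperandMode::SingleUse(v)
    } else {
        OperandMode::NormalUse(v)
    }
}

fn main() {
    // A constant used three times: cheap to rematerialize at each use.
    println!("{:?}", classify(Value(0), 3, true, true));
    // A pure add used once: sink it to its single use-site.
    println!("{:?}", classify(Value(1), 1, true, false));
    // A load used twice: impure and multiply used, so park it in a local.
    println!("{:?}", classify(Value(2), 2, false, false));
}
```

A real implementation would also have to respect the side-effect-ordering and dominance caveats listed in the comment above.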
teymour-aldridge commented on issue #2566:
Thanks so much!
teymour-aldridge commented on issue #2566:
I have started to implement something in: https://github.com/teymour-aldridge/cranelift_codegen_wasm
Unfortunately there are quite a few bugs and many things (e.g. data objects, instructions) are not yet supported.
cfallin labeled issue #2566:
Feature
Add a Cranelift backend that compiles to WebAssembly.
Benefit
This would allow a compiler to run in the browser and compile code to run in the browser.
Implementation
No plan yet
Alternatives
Not adding a WebAssembly backend.
cfallin commented on issue #2566:
Since I chanced upon this issue again, I'll leave a note that my waffle crate now has a working Wasm compilation pipeline from an SSA-based IR that is very similar to Cranelift's IR. It could serve either as the model for, or as a codegen library for, a WebAssembly backend for Cranelift. It currently requires reducible control flow (will panic when given an irreducible CFG) but this is "just" an implementation question, as the API is based on a standard CFG. The crate has been fuzzed with roundtrip + differential execution and is in a fairly robust state (can roundtrip complex Wasms like SpiderMonkey).
The main challenge that remains is in the semantic gap between Wasm and CLIF: lowering loses information about tables, heaps, globals and the like, and we would need to "lift" CLIF's raw memory accesses back to Wasm-level concepts. That's maybe not so bad though: one could use a "null" environment when lowering the Wasm, leaving heap accesses to memory zero as-is (address computation is the identity function), and then compile accesses to raw memory in CLIF back to accesses to a single Wasm memory. Function pointers (taking the address of symbols) will require some care. Also, CLIF is not exactly 1-to-1 with Wasm opcodes, especially the SIMD ones. But, doable. This is probably a 1-2 month project for someone at this point.
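The "null environment" idea could be illustrated with a minimal Rust sketch; the two mini-IRs below are invented for illustration only (not CLIF's actual instruction set). With the heap base at address zero, the address computation is the identity, so a raw load lifts 1-to-1 back to a load from the single Wasm memory:

```rust
// Hypothetical mini-IRs for illustration only.
#[derive(Debug, PartialEq)]
enum ClifOp {
    // A raw load from the flat address space at a constant offset.
    LoadRaw { offset: u32 },
}

#[derive(Debug, PartialEq)]
enum WasmOp {
    // A load from Wasm linear memory `memory` at `offset`.
    I32Load { memory: u32, offset: u32 },
}

// With a "null" environment, accesses to memory zero keep their offsets
// verbatim, so lifting a raw access back to Wasm is the identity on the
// offset; everything targets the single Wasm memory (index 0).
fn lift(op: &ClifOp) -> WasmOp {
    match op {
        ClifOp::LoadRaw { offset } => WasmOp::I32Load { memory: 0, offset: *offset },
    }
}

fn main() {
    let lifted = lift(&ClifOp::LoadRaw { offset: 16 });
    println!("{:?}", lifted);
}
```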
cfallin commented on issue #2566:
(I'll note too that some of my thoughts above are specific to a "Wasm roundtripping" context, which probably is actually the least interesting application; serving as a backend for something that starts with a native-code model, like cg_clif, is more interesting, and for that the mappings are a little simpler. One raw address space goes to one Wasm memory, and all function pointers are indices into one table, with entries populated as we observe function-address operators. I guess there is just the question of "relocations" for external symbol references.)
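The function-pointer mapping described in that comment could look roughly like this, as a hedged sketch with made-up types: each function whose address is observed gets a slot in a single Wasm table, and its "address" is simply that table index:

```rust
use std::collections::HashMap;

// Hypothetical scheme: all function pointers are indices into one Wasm
// table, with entries allocated as function-address operators are observed.
// An indirect call then becomes `call_indirect` on this table.
#[derive(Default)]
struct FuncTable {
    slots: Vec<u32>,             // table entries, in order: function indices
    index_of: HashMap<u32, u32>, // function index -> table slot
}

impl FuncTable {
    // Return the table index serving as `func`'s "address", allocating a
    // fresh slot the first time this function's address is taken.
    fn address_of(&mut self, func: u32) -> u32 {
        if let Some(&i) = self.index_of.get(&func) {
            return i;
        }
        let i = self.slots.len() as u32;
        self.slots.push(func);
        self.index_of.insert(func, i);
        i
    }
}

fn main() {
    let mut table = FuncTable::default();
    println!("{}", table.address_of(7)); // first address taken -> slot 0
    println!("{}", table.address_of(9)); // next function -> slot 1
    println!("{}", table.address_of(7)); // same function -> same slot
}
```

External symbol references ("relocations") are the part this sketch leaves open, as the comment notes.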
bjorn3 commented on issue #2566:
I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate. This way everything the wasm frontend and other frontends need is forced to be public.
bjorn3 edited a comment on issue #2566:
I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.
cfallin commented on issue #2566:
I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.
Yes, for sure; the crate's API polish hasn't gotten much attention, it was just codeveloped for a single initial purpose (roundtripping with modifications). Happy to discuss further there and/or take PRs from folks :-)
madsmtm commented on issue #2566:
A WebAssembly backend would likely be very useful for projects like Ruffle as well
skyne98 commented on issue #2566:
Hi, thanks for the amazing work! Any plans for this? Really excited for the feature, will allow me to drop wasm-opt as a dependency in lots of my projects :eyes: (+ working with cranelift is generally very pleasant!).
@bjorn3
cfallin commented on issue #2566:
@skyne98 we don't have anyone looking at this right now, and as it's a large engineering project, it's unlikely to be prioritized by the folks who work on Cranelift full-time. That doesn't mean we wouldn't like to have it, just that there are other things that are more important, and time is limited.
As always with an open-source project, I'd be happy to review a PR from anyone who wants this and builds it!
skyne98 commented on issue #2566:
Of course, thanks for the reply, much appreciated! It seems like a project that requires quite a bit of expertise, based on your comments, so I hope someone can pick it up one day! (Or I get confident enough with it to help out :sweat_smile:).
One quick question, if you don't mind: does running WASM in embedded wasmtime (with optimizations turned on & cranelift) already fully utilize cranelift's optimization capability? And, also, is there potentially a way I can yank the optimized IR out of the pipeline to print it and check it?
cfallin commented on issue #2566:
Wasmtime does turn on Cranelift's optimizations by default, yes (this can be controlled with `-O opt-level=N`, 0 for off and 2 for all on). You can get unoptimized IR with `wasmtime compile --emit-clif`; optimized IR could be printed with a similar option, but you'd need to add similar code around here (and handle the cached case somehow?) using logic similar to this: https://github.com/bytecodealliance/wasmtime/blob/c1741470798dfe3bb8e8273bb35cc11e5299ba64/crates/cranelift/src/compiler.rs#L216-L225. Or, as a hack, `println!()` it directly at that point.
cfallin edited a comment on issue #2566:
Wasmtime does turn on Cranelift's optimizations by default, yes (this can be controlled with `-O opt-level=N`, 0 for off and 2 for all on). You can get unoptimized IR with `wasmtime compile --emit-clif`; optimized IR could be printed with a similar option, but you'd need to add similar code around here (and handle the cached case somehow?) using logic similar to this. Or, as a hack, `println!()` it directly at that point.
skyne98 commented on issue #2566:
Would have been nice to be able to pass optional callbacks to hook into different parts of the pipeline :eyes:
Last updated: Dec 23 2024 at 12:05 UTC