Stream: git-wasmtime

Topic: wasmtime / issue #2566 Add Webassembly backend


view this post on Zulip Wasmtime GitHub notifications bot (Aug 05 2021 at 04:57):

Birbe commented on issue #2566:

I'm working on a Java Virtual Machine implementation in Rust, and one of my plans is to write a WASM backend along with a general JIT implementation using Cranelift; if everything were unified to just be Cranelift, that would be great. I might give creating a WASM backend a shot; it would benefit my project as well as Cranelift.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 16:46):

teymour-aldridge commented on issue #2566:

I would be interested in implementing this!

I had a look, and I'm not sure where the correct place to implement this would be. My thought was to put it in cranelift_codegen::isa; however, the MachBackend trait seems to rely upon the existence of registers in the target architecture (which makes sense, but WebAssembly is sort of a stack machine). Would it make sense to emulate registers by using WebAssembly locals, or is that a bad idea?

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 16:46):

teymour-aldridge edited a comment on issue #2566:

I would be interested in implementing this!

I had a look, and I'm not sure where the correct place to implement this would be. My thought was to put it in cranelift_codegen::isa; however, the MachBackend trait seems to rely upon the existence of registers in the target architecture (which makes sense for other architectures, but WebAssembly is sort of a stack machine). Would it make sense to emulate registers by using WebAssembly locals, or is that a bad idea?

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 17:25):

bjorn3 commented on issue #2566:

Wasm is a fundamentally different architecture from all the other supported architectures: it is a stack machine rather than a register machine, it uses structured control flow rather than arbitrary jumps, it allows more than one memory, and so on. For this reason I think you should avoid using the existing backend infrastructure. One possible implementation method would be to implement the Module trait from cranelift_module in a separate crate and then write the lowering from CLIF IR to Wasm bytecode in that crate.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 20:40):

cfallin commented on issue #2566:

Indeed, this would best be structured as a separate framework that consumes Cranelift's IR, I think.

There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).

We'll need to decide what we do about runtime-specific implementations of Wasm memories, tables, globals, and imports as well. Traditionally Cranelift has these bits filled in by the "environment", and our embedding (Wasmtime, Lucet, SpiderMonkey, something else) fills in some IR that uses e.g. machine-native pointers to implement Wasm memory/table accesses. If we compile back to Wasm, should a table just become a table, and a memory become a memory? Probably, but this will require some careful thought in the frontend/environment too. Possibly we add new IR operators (primitives) and just lower table.get 1-to-1 to this primitive, then back to table.get in the backend.

FWIW, I have a little side-project (waffle) that is attempting to be a Wasm-to-CFG-of-SSA-to-Wasm framework, which might be useful to look at to see what it takes to compile SSA IR with blockparams back to Wasm. (It kind of works: it roundtrips a hello-world, but has some bugs beyond that. Side-project, no guarantees, etc!)

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 20:41):

cfallin edited a comment on issue #2566:

Indeed, this would best be structured as a separate framework that consumes Cranelift's IR, I think.

There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).

We'll need to decide what we do about runtime-specific implementations of Wasm memories, tables, globals, and imports as well (EDIT: when using the cranelift-wasm frontend). Traditionally cranelift-wasm has these bits filled in by the "environment", and our embedding (Wasmtime, Lucet, SpiderMonkey, something else) fills in some IR that uses e.g. machine-native pointers to implement Wasm memory/table accesses. If we compile back to Wasm, should a table just become a table, and a memory become a memory? Probably, but this will require some careful thought in the frontend/environment too. Possibly we add new IR operators (primitives) and just lower table.get 1-to-1 to this primitive, then back to table.get in the backend.

FWIW, I have a little side-project (waffle) that is attempting to be a Wasm-to-CFG-of-SSA-to-Wasm framework, which might be useful to look at to see what it takes to compile SSA IR with blockparams back to Wasm. (It kind of works: it roundtrips a hello-world, but has some bugs beyond that. Side-project, no guarantees, etc!)

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2021 at 22:13):

Birbe commented on issue #2566:

There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with,

https://docs.rs/relooper/latest/relooper/ might be a good way to easily implement this (I believe it actually implements the stackifier algorithm even though it's called relooper)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 01 2022 at 15:34):

teymour-aldridge commented on issue #2566:

There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).

I'm not sure how regalloc could be used to translate operators to stack code. Would you mind elaborating? (A link to something explaining the idea would be great, if you have one.)

My thoughts were that this should be done by recursively inserting WebAssembly instructions to compute Values.

So something like:

function instruction_code()
  for each operand
    instruction_code(operand)
  code to push result onto stack (apply operator and then push)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 01 2022 at 15:57):

teymour-aldridge edited a comment on issue #2566:

Thank you for the advice!

There are lots of resources you could look at for the various parts; the Stackifier algorithm to start with, to recover structured control flow, and then some scheme to map operators back to stack code using locals where necessary (a sort of combined regalloc and instruction-scheduling problem: reordering operators can let you use the stack directly).

I'm not sure how regalloc could be used to translate operators to stack code. Would you mind elaborating? (A link to something explaining the idea would be great, if you have one.)

My thoughts were that this should be done by recursively inserting WebAssembly instructions to compute Values.

So something like:

function instruction_code()
  for each operand
    instruction_code(operand)
  code to push result onto stack (apply operator and then push)

view this post on Zulip Wasmtime GitHub notifications bot (Jan 01 2022 at 19:18):

cfallin commented on issue #2566:

That scheme is a good start, and works for tree-shaped code, but not DAG-shaped code: as soon as an operator (SSA value) is used more than once, you either need to recompute it or store it in a local and then reload it at each use-site.

The use of locals to store SSA values is the register allocation problem -- with a range of solutions, from a "one local (register) per SSA value" approach up to more sophisticated approaches that reuse locals. Given Wasm engines' limits on locals, and given the way that baseline Wasm compilers allocate separate stack slots for each local (leading to huge stackframes if we are not frugal with locals), we probably want to do something better than one-local-per-value.

The reason I described this as a "combined regalloc / instruction scheduling problem" is that the two are intertwined. Once we've decided to use locals to carry values, we could do the simplest thing and emit local.get for every operand, then the operator, then local.set. But that's also inefficient; we want to use the stack when we can. So the order of operators matters. Your scheme (recursive value generation) is actually doing implicit code-motion in order to take advantage of stack dataflow (by sinking operators to use-sites), which is great, and what we want. But we have to be careful about it for two reasons:

In the most general form of the problem, we could also consider rematerialization: some operations (such as, e.g., i32.const 0) are trivial to compute and so it would probably not be a good idea to compute this once at the top of the function (GVN will often dedup constants like this) and store it in a local thereafter. Rematerialization can be considered part of the regalloc problem (because a regalloc might not want to spill and remat is then an option) and some regalloc implementations can do this; one can also do it in a more ad-hoc way in the code one generates prior to allocating registers.

A scheme that might work is to "treeify" the operands before generating code. Basically each operand of an operator could then be:

So one would generate ops in original program location if used by a NormalUse elsewhere, and otherwise only at the use-site. This would result in code that then refers to values, which we can subsequently do register allocation over to assign to locals.

There's not really a good document on this that I know of; it's just applications of general concepts (regalloc, remat, instruction scheduling) to a specific and somewhat odd target. We're inventing as we go here. That said it'd probably be good to understand how e.g. LLVM's Wasm backend does this; I haven't really looked at the dataflow side (just the stackifier).

Anyway hopefully this helps!

view this post on Zulip Wasmtime GitHub notifications bot (Jan 01 2022 at 19:19):

cfallin edited a comment on issue #2566:

That scheme is a good start, and works for tree-shaped code, but not DAG-shaped code: as soon as an operator (SSA value) is used more than once, you either need to recompute it or store it in a local and then reload it at each use-site.

The use of locals to store SSA values is the register allocation problem -- with a range of solutions, from a "one local (register) per SSA value" approach up to more sophisticated approaches that reuse locals. Given Wasm engines' limits on locals, and given the way that baseline Wasm compilers allocate separate stack slots for each local (leading to huge stackframes if we are not frugal with locals), we probably want to do something better than one-local-per-value.

The reason I described this as a "combined regalloc / instruction scheduling problem" is that the two are intertwined. Once we've decided to use locals to carry values, we could do the simplest thing and emit local.get for every operand, then the operator, then local.set. But that's also inefficient; we want to use the stack when we can. So the order of operators matters. Your scheme (recursive value generation) is actually doing implicit code-motion in order to take advantage of stack dataflow (by sinking operators to use-sites), which is great, and what we want. But we have to be careful about it for two reasons:

In the most general form of the problem, we could also consider rematerialization: some operations (such as, e.g., i32.const 0) are trivial to compute and so it would probably not be a good idea to compute this once at the top of the function (GVN will often dedup constants like this) and store it in a local thereafter. Rematerialization can be considered part of the regalloc problem (because a regalloc might not want to spill and remat is then an option) and some regalloc implementations can do this; one can also do it in a more ad-hoc way in the code one generates prior to allocating registers.

A scheme that might work is to "treeify" the operands before generating code. Basically each operand of an operator could then be:

So one would generate ops in original program location if used by a NormalUse elsewhere, and otherwise only at the use-site. This would result in code that then refers to values, which we can subsequently do register allocation over to assign to locals.

There's not really a good document on this that I know of; it's just applications of general concepts (regalloc, remat, instruction scheduling) to a specific and somewhat odd target. We're inventing as we go here. That said it'd probably be good to understand how e.g. LLVM's Wasm backend does this; I haven't really looked at the dataflow side (just the stackifier).

Anyway hopefully this helps!
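
As a rough illustration of the recompute-or-spill idea discussed above, here is a minimal sketch; the Value, Func, and WasmOp types and the one-fresh-local-per-spilled-value policy are hypothetical stand-ins for illustration, not Cranelift's or waffle's actual APIs:

use std::collections::HashMap;

// Hypothetical, simplified types for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Value(u32);

enum WasmOp {
    LocalGet(u32),
    LocalTee(u32),
    Op(&'static str), // stand-in for a lowered operator, e.g. "i32.add"
}

// For each SSA value: its operator, its operands, and its use count.
struct Func {
    defs: HashMap<Value, (&'static str, Vec<Value>, usize)>,
}

// Recursive emission: single-use values are computed directly on the Wasm stack;
// values used more than once are computed once, kept in a local via local.tee,
// and reloaded with local.get at each later use-site.
// (Local indices here ignore function parameters, which occupy the first slots.)
fn emit_value(v: Value, f: &Func, locals: &mut HashMap<Value, u32>, out: &mut Vec<WasmOp>) {
    if let Some(&local) = locals.get(&v) {
        out.push(WasmOp::LocalGet(local)); // already materialized: just reload it
        return;
    }
    let (op, operands, uses) = &f.defs[&v];
    for &operand in operands.iter() {
        emit_value(operand, f, locals, out); // push each operand onto the stack
    }
    out.push(WasmOp::Op(*op)); // apply the operator; its result is now on the stack
    if *uses > 1 {
        // Spill to a fresh local so the remaining uses can reload it.
        let local = locals.len() as u32;
        locals.insert(v, local);
        out.push(WasmOp::LocalTee(local));
    }
}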

view this post on Zulip Wasmtime GitHub notifications bot (Jan 02 2022 at 21:18):

teymour-aldridge commented on issue #2566:

Thanks so much!

view this post on Zulip Wasmtime GitHub notifications bot (Jan 06 2022 at 14:51):

teymour-aldridge commented on issue #2566:

I have started to implement something in: https://github.com/teymour-aldridge/cranelift_codegen_wasm

Unfortunately there are quite a few bugs and many things (e.g. data objects, instructions) are not yet supported.

view this post on Zulip Wasmtime GitHub notifications bot (May 04 2022 at 21:02):

cfallin labeled issue #2566:


Feature

Add a Cranelift backend that compiles to WebAssembly.

Benefit

This would allow a compiler to run in the browser and compile code to run in the browser.

Implementation

No plan yet

Alternatives

Not adding a WebAssembly backend.


view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2022 at 03:37):

cfallin commented on issue #2566:

Since I chanced upon this issue again, I'll leave a note that my waffle crate now has a working Wasm compilation pipeline from an SSA-based IR that is very similar to Cranelift's IR. It could serve either as the model for, or as a codegen library for, a WebAssembly backend for Cranelift. It currently requires reducible control flow (will panic when given an irreducible CFG) but this is "just" an implementation question, as the API is based on a standard CFG. The crate has been fuzzed with roundtrip + differential execution and is in a fairly robust state (can roundtrip complex Wasms like SpiderMonkey).

The main challenge that remains is in the semantic gap between Wasm and CLIF: lowering loses information about tables, heaps, globals and the like, and we would need to "lift" CLIF's raw memory accesses back to Wasm-level concepts. That's maybe not so bad though: one could use a "null" environment when lowering the Wasm, leaving heap accesses to memory zero as-is (address computation is the identity function), and then compile accesses to raw memory in CLIF back to accesses to a single Wasm memory. Function pointers (taking the address of symbols) will require some care. Also, CLIF is not exactly 1-to-1 with Wasm opcodes, especially the SIMD ones. But, doable. This is probably a 1-2 month project for someone at this point.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2022 at 03:42):

cfallin commented on issue #2566:

(I'll note too that some of my thoughts above are specific to a "Wasm roundtripping" context, which probably is actually the least interesting application; serving as a backend for something that starts with a native-code model, like cg_clif, is more interesting, and for that the mappings are a little simpler. One raw address space goes to one Wasm memory, and all function pointers are indices into one table, with entries populated as we observe function-address operators. I guess there is just the question of "relocations" for external symbol references.)
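
To make that mapping concrete, a small sketch; the ClifOp and WasmOp types here are invented for illustration and are not Cranelift's actual IR:

// Hypothetical, simplified opcode fragments for illustration only.
enum ClifOp {
    Load32 { offset: u32 },       // raw 32-bit load from the flat address space
    Store32 { offset: u32 },      // raw 32-bit store to the flat address space
    FuncAddr { func_index: u32 }, // "address of function" (a relocation in native code)
}

enum WasmOp {
    I32Load { memory: u32, offset: u32 },
    I32Store { memory: u32, offset: u32 },
    I32Const(i32), // here: an index into the single funcref table (used via call_indirect)
}

// One raw address space maps onto Wasm memory 0; taking a function's address
// becomes an index into one funcref table, populated as such operators are seen.
fn lower(op: &ClifOp, table: &mut Vec<u32>) -> WasmOp {
    match *op {
        ClifOp::Load32 { offset } => WasmOp::I32Load { memory: 0, offset },
        ClifOp::Store32 { offset } => WasmOp::I32Store { memory: 0, offset },
        ClifOp::FuncAddr { func_index } => {
            // Reuse an existing table slot for this function, or append a new one.
            let slot = match table.iter().position(|&f| f == func_index) {
                Some(i) => i,
                None => {
                    table.push(func_index);
                    table.len() - 1
                }
            };
            WasmOp::I32Const(slot as i32)
        }
    }
}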

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2022 at 12:07):

bjorn3 commented on issue #2566:

I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate. This way everything the wasm frontend and other frontends need is forced to be public.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2022 at 12:07):

bjorn3 edited a comment on issue #2566:

I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2022 at 16:08):

cfallin commented on issue #2566:

I just took a look at waffle, but it seems basically everything necessary to build a module is private. Would it make sense to move the wasm frontend to a separate crate? This way everything the wasm frontend and other frontends need is forced to be public.

Yes, for sure; the crate's API polish hasn't gotten much attention, since it was just codeveloped for a single initial purpose (roundtripping with modifications). Happy to discuss further there and/or take PRs from folks :-)

view this post on Zulip Wasmtime GitHub notifications bot (Sep 05 2023 at 04:02):

madsmtm commented on issue #2566:

A WebAssembly backend would likely be very useful for projects like Ruffle as well

view this post on Zulip Wasmtime GitHub notifications bot (Apr 04 2024 at 05:15):

skyne98 commented on issue #2566:

Hi, thanks for the amazing work! Any plans for this? Really excited for this feature; it would allow me to drop wasm-opt as a dependency in lots of my projects :eyes: (+ working with Cranelift is generally very pleasant!).

@bjorn3

view this post on Zulip Wasmtime GitHub notifications bot (Apr 04 2024 at 14:49):

cfallin commented on issue #2566:

@skyne98 we don't have anyone looking at this right now, and as it's a large engineering project, it's unlikely to be prioritized by the folks who work on Cranelift full-time. That doesn't mean we wouldn't like to have it, just that there are other things that are more important, and time is limited.

As always with an open-source project, I'd be happy to review a PR from anyone who wants this and builds it!

view this post on Zulip Wasmtime GitHub notifications bot (Apr 04 2024 at 14:56):

skyne98 commented on issue #2566:

Of course, thanks for the reply, much appreciated! It seems like a project that requires quite a bit of expertise, based on your comments, so I hope someone can pick it up one day! (Or that I get confident enough with it to help out :sweat_smile:).

One quick question, if you don't mind: does running Wasm in embedded Wasmtime (with Cranelift and optimizations turned on) already fully utilize Cranelift's optimization capability? And also, is there potentially a way I can yank the optimized IR out of the pipeline to print it and check it?

view this post on Zulip Wasmtime GitHub notifications bot (Apr 04 2024 at 15:56):

cfallin commented on issue #2566:

Wasmtime does turn on Cranelift's optimizations by default, yes (this can be controlled with -O opt-level=N: 0 for off, 2 for all on). You can get unoptimized IR with wasmtime compile --emit-clif; optimized IR could be printed with a similar option, but you'd need to add similar code around here (and handle the cached case somehow?) using logic similar to this: https://github.com/bytecodealliance/wasmtime/blob/c1741470798dfe3bb8e8273bb35cc11e5299ba64/crates/cranelift/src/compiler.rs#L216-L225. Or, as a hack, println!() it directly at that point.

view this post on Zulip Wasmtime GitHub notifications bot (Apr 04 2024 at 15:56):

cfallin edited a comment on issue #2566:

Wasmtime does turn on Cranelift's optimizations by default, yes (this can be controlled with -O opt-level=N: 0 for off, 2 for all on). You can get unoptimized IR with wasmtime compile --emit-clif; optimized IR could be printed with a similar option, but you'd need to add similar code around here (and handle the cached case somehow?) using logic similar to this: https://github.com/bytecodealliance/wasmtime/blob/c1741470798dfe3bb8e8273bb35cc11e5299ba64/crates/cranelift/src/compiler.rs#L216-L225. Or, as a hack, println!() it directly at that point.
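
For the embedding side of the question above, a minimal sketch of making the Cranelift optimization level explicit through the wasmtime crate's Config (optimizations are already on by default; anyhow is assumed for error handling):

use wasmtime::{Config, Engine, OptLevel};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Optimizations are on by default; this just makes the level explicit
    // (roughly the embedder-API equivalent of `-O opt-level=2` on the CLI).
    config.cranelift_opt_level(OptLevel::Speed);
    let engine = Engine::new(&config)?;
    // ...compile and instantiate modules against `engine` as usual...
    let _ = engine;
    Ok(())
}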

view this post on Zulip Wasmtime GitHub notifications bot (Apr 15 2024 at 13:54):

skyne98 commented on issue #2566:

Would have been nice to be able to pass optional callbacks to hook into different parts of the pipeline :eyes:


Last updated: Nov 22 2024 at 16:03 UTC