fitzgen opened issue #4712:

What is Cranelift's job (in the context of Wasmtime)? To take Wasm that, 99% of the time, was produced by LLVM and is already optimized, and to do the architecture-specific code generation that LLVM cannot do when targeting Wasm (e.g. instruction selection, regalloc). We don't want to duplicate all of LLVM's mid-end optimizations, only the ones that are beneficial for cleaning up and improving code after we've lowered Wasm memory operations into raw base + offset memory operations, etc. This is an interesting place for a compiler to be, and it means the set of passes and trade-offs we have are different from what one might assume by default.

There is another compiler in a similar space, at least as far as consuming already-optimized-by-LLVM Wasm binaries and attempting to further optimize them: binaryen and its `wasm-opt` tool. The big difference is that `wasm-opt` emits another Wasm binary while we emit machine code. But maybe they have passes that are not specific to targeting Wasm and which are beneficial to run on already-optimized-by-LLVM Wasm binaries? AIUI, the LLVM IR to Wasm lowering introduces some suboptimal code patterns.

So I did an informal census of the passes run by `wasm-opt`, filtering out anything that looked overly specific to targeting Wasm. The results are summarized in the table below, and might give us some food for thought as we start looking into Cranelift's code quality some more. (FWIW, I wasn't 100% sure about some things below, so if you see something that you know is incorrect, feel free to edit this issue and correct it!)

| Pass | Description | Cranelift has equivalent? | Discussion |
| --- | --- | --- | --- |
| `local-cse` | Perform common-subexpression elimination within a block. | Yes | Our GVN should cover all of this. |
| `dce` | Perform dead code elimination to remove unreachable blocks and unused expressions. | Yes | |
| `optimize-instructions` | Peephole optimizations. | Partial | We have some peepholes, but not as many as `wasm-opt`, and could definitely add more. Although, they care a lot about Wasm encoding tricks for peepholes where we do not; probably better to look at LLVM itself here for inspiration. Finally, they also have some Souper-synthesized peepholes, and we should really add some of our own once the e-graphs work lands. |
| `pick-load-signs` | Look at the uses of a load to determine whether the load should sign extend or zero extend (e.g. if the load instruction is `i32.load8_u` but a majority of its uses sign-extend the result with `i32.extend8_s`, then change the load to `i32.load8_s`, remove the now-redundant sign extends from those uses, and insert zero extends for the other uses; see the sketch below the table). | No | Unclear how much this is worth in practice, especially if our primary goal is code speed rather than code size. |
| `precompute[-propagate]` | Constant propagation and folding. | Partial | We have a couple of peepholes in `simple_preopt` that do some of this, but only two levels deep. Should investigate doing this more completely once the WIP e-graphs work merges. |
| `code-pushing` | Push defs down towards uses. Might move a def into a block on the other side of a conditional, so that it is not executed unless needed. | No | Unsure whether their pass is aware of loop boundaries, and whether this might "undo" some manual LICM the programmer/LLVM did (it comes before their LICM in their phase ordering; our LICM won't create partially dead code, FWIW). Although maybe we start (or can start) doing this with the new e-graphs work? |
| `code-folding` | Merge common tails of all of a block's predecessors into the block itself. | No | I don't believe we do any block-level optimizations that look at multiple predecessors or multiple successors at the same time (we can only merge a successor block into its sole predecessor). FWIW, I don't see a dual pass in `wasm-opt` for merging common heads of successor blocks into their predecessor block, but that seems like an obvious thing to implement once you've implemented merging common tails of predecessor blocks. Totally possible it exists and I missed it. |
| `merge-blocks` | Sort of Wasm-specific, but essentially merges a block into its sole predecessor. | Yes | |
| `duplicate-function-elimination` | Interprocedural optimization to deduplicate identical functions. | No | When the new incremental caching infra is enabled, I guess we could get this for free? But that also depends on the implementation (which I am not personally familiar with) and on the order of function compilation scheduling in the face of our parallelism. |
| `inlining` | Inline a callee function into its caller, removing function call overhead and, more importantly, providing opportunities for more optimization based on the actual arguments to the call. | No | We probably don't want this for regular core Wasm modules right now, since LLVM already did all the profitable inlining and has way better heuristics than anything we are going to come up with on the first try. If something is both small and not inlined into its callers by the time we see it, then it was probably marked no-inline or cold or something like that, and we just don't have those annotations anymore. However, with the component model this is going to change: callees will remain a black box until component linking time (so LLVM won't ever have had a chance to inline those calls), and we will have lots of opportunities to do some nice cross-module inlining ourselves. |
| `directize` | Turn `call_indirect`s into `call`s. Devirtualization. | No | Probably not profitable for us on its own, since LLVM already does it, but could be very profitable when done optimistically in concert with PGO data, followed by inlining the callee into the caller. |
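To make the `pick-load-signs` heuristic above concrete, here is a minimal Rust sketch of the majority vote it describes. This is a toy illustration, not Binaryen's actual implementation; the `Ext` enum and `pick_load_sign` are hypothetical names.

```rust
// Toy model of the `pick-load-signs` idea: given all uses of a narrow
// load, choose the extension kind that makes the most explicit extends
// redundant. Hypothetical types; not Binaryen's actual code.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Ext {
    Signed,   // the use sign-extends the loaded byte (i32.extend8_s)
    Unsigned, // the use relies on the zero-extended value (i32.load8_u behavior)
}

/// Pick the load's extension kind by majority vote over its uses. Uses
/// matching the chosen kind drop their explicit extend; the minority
/// keep (or gain) a compensating extend of the other kind.
fn pick_load_sign(uses: &[Ext]) -> Ext {
    let signed = uses.iter().filter(|u| **u == Ext::Signed).count();
    if 2 * signed > uses.len() {
        Ext::Signed
    } else {
        Ext::Unsigned
    }
}

fn main() {
    // Two of three uses sign-extend, so flip `i32.load8_u` to
    // `i32.load8_s` and delete the two now-redundant `i32.extend8_s`s.
    assert_eq!(
        pick_load_sign(&[Ext::Signed, Ext::Signed, Ext::Unsigned]),
        Ext::Signed
    );
}
```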
cfallin commented on issue #4712:
Thanks for this in-depth look! This is a really valuable comparison.
I agree that Cranelift's goal in context, as a backend for Wasmtime with today's Wasm ecosystem, largely results in an optimization design-space that discounts traditional heavyweight analysis and calls for more directed work. I sort of touched on this recently in this comment, with a from-first-principles breakdown of where we should expect Cranelift to be able to optimize further: (i) the Wasm-to-CLIF semantic gap, (ii) the CLIF-to-machine semantic gap, and (iii) regalloc quality. The meat of the (remaining) issue is in (i), and that's what you're pointing to with "cleaning up and improving code after we've lowered Wasm memory operations"; so, fully agreed.
I do also want to emphasize your point about inlining and the component model: late-binding two separately compiled modules that have never met before completely changes the tradeoffs, in that all of the traditional LLVM-style heavyweight analyses will find low-hanging fruit when inlining across that boundary. For that reason, and because Cranelift is also used (and hopefully will be used more in the future) in non-Wasm contexts, I think it's valuable to keep thinking about the heavyweight analyses even if they are turned off in the one-Wasm-module context.
(I suspect that having different meta-settings or "opt levels" would make sense here, with a preset for "one Wasm module, likely produced by an optimizing compiler", a preset for "Wasm components with multiple modules linking together", and a preset for "a non-optimizing frontend driving Cranelift".)
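As a minimal sketch of what such presets could look like, with purely hypothetical naming (`OptPreset` and its variants are not an existing Cranelift or Wasmtime API):

```rust
/// Hypothetical preset shape; each variant would toggle which mid-end
/// passes are worth running for that kind of input.
#[allow(dead_code)]
enum OptPreset {
    /// One Wasm module, likely produced by an optimizing compiler:
    /// skip inlining, keep the cleanup passes (GVN, LICM, peepholes,
    /// bounds-check elimination).
    PreoptimizedWasmModule,
    /// Wasm components with multiple modules linking together: enable
    /// cross-module inlining and the passes that feed on it (e.g.
    /// devirtualization).
    LinkedWasmComponents,
    /// A non-optimizing frontend driving Cranelift directly: run the
    /// full mid-end, including heavier analyses LLVM would normally
    /// have already done.
    UnoptimizedFrontend,
}

fn main() {}
```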
On the particular opts (aside from `inlining`, covered above):

- `precompute-propagate` and `code-pushing` should both be subsumed by the mid-end optimizer with its already-existing work in the prototype: in particular, the constant folding is more complete (it follows an arbitrarily long chain), and the code motion allowed by scoped elaboration naturally pushes defs down when they are "partially redundant" (not used on some paths following the computation).
- We don't have anything like `code-folding`; I'd be curious to see how often it occurs in already-optimized modules. It should primarily benefit code size, and only indirectly speed (by reducing icache footprint), so it is useful when it applies, but IMHO not likely to produce huge gains. I suspect this pass is an artifact of `wasm-opt`'s focus on module size as well as speed.
- `duplicate-function-elimination` should indeed fall out of the incremental compilation cache; in a mode where we care about maximizing this, we could do a MapReduce-style thing where we (i) generate IR and compute cache keys for every function, then (ii) sort and deduplicate, then (iii) compile only once per cache key (a rough sketch follows at the end of this comment). Thoughts @bnjbvr?
- `directize`: yes, absolutely; once we inline across multiple modules, I suspect this will be a very important optimization. One difficulty is that it is generally best as an interprocedural analysis, since one can reach stronger conclusions about constant function pointers that way (e.g., not only the "I directly stored this function pointer" case, but also the "I read this field, and across the whole program the only function pointer stored into this field is this value" case). That might require a bit of scaffolding to do properly.

Not covered above because it is out of scope for Wasm-to-Wasm optimization (`wasm-opt`), but IMHO still important for us to build: bounds-check elimination and the value-range analysis that can feed into it.

That's all I can think of for now, but we should dump more thoughts about building any of the above here and/or split out specific issues as needed. (Also, at least inlining has #4127, but I don't know if the others do already.) Thanks again for the survey!
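To make the MapReduce-style scheme (i)-(iii) concrete, here is a minimal Rust sketch under stated assumptions: `CacheKey`, `CompiledCode`, `cache_key`, and `compile` are hypothetical placeholders, not the real incremental-cache API, and real code would hash the function's IR (e.g. its `FunctionStencil`) rather than a string.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

type CacheKey = u64;

#[derive(Clone)]
struct CompiledCode(Vec<u8>);

// Hypothetical key computation: hash the function's (stringly) IR.
fn cache_key(ir: &str) -> CacheKey {
    let mut h = DefaultHasher::new();
    ir.hash(&mut h);
    h.finish()
}

// Placeholder for real codegen.
fn compile(ir: &str) -> CompiledCode {
    CompiledCode(ir.as_bytes().to_vec())
}

fn compile_module(funcs: &[&str]) -> Vec<CompiledCode> {
    // (i) compute a cache key for every function's IR.
    let keys: Vec<CacheKey> = funcs.iter().map(|f| cache_key(f)).collect();
    // (ii) deduplicate and (iii) compile once per unique key; this loop
    // is where parallelism over the unique keys would go.
    let mut compiled: HashMap<CacheKey, CompiledCode> = HashMap::new();
    for (f, k) in funcs.iter().zip(&keys) {
        compiled.entry(*k).or_insert_with(|| compile(f));
    }
    // Hand every function (a clone of) its group's artifact.
    keys.iter().map(|k| compiled[k].clone()).collect()
}

fn main() {
    let out = compile_module(&["f0: iadd v0, v1", "f1: iadd v0, v1", "f2: imul v0, v1"]);
    assert_eq!(out[0].0, out[1].0); // identical functions compiled once, shared
    assert_ne!(out[0].0, out[2].0);
}
```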
bnjbvr commented on issue #4712:
> `duplicate-function-elimination` should indeed fall out of the incremental compilation cache; in a mode where we care about maximizing this, we could do a MapReduce-style thing where we (i) generate IR and compute cache keys for every function, then (ii) sort and deduplicate, then (iii) compile only once per cache key. Thoughts @bnjbvr?
Indeed, we should get this for free as long as the `FunctionStencil`s are exactly the same for two different functions (think newtypes; (proc-)macro-generated code like cranelift-entity's `entity_impl!` or `derive(serde)`, etc., tends to create lots of duplicated code).

Now, this is the first time I hear of this MapReduce-style idea, and IIUC it is semantically equivalent to what we have right now and just changes the order of operations: in the existing implementation, a `CacheStore` would do the deduplication by reusing hashed entries (and likely run into race conditions, whose handling is deferred to the cache store impl), while in the proposal the deduplication would happen earlier, and only then would the `CacheStore` be probed for existing entries.

If I'm not mistaken, the MapReduce approach makes parallel compilation of functions a bit harder, as it requires precomputing the cache keys for all the functions in a module, and thus reading every function's body before compiling any of them. I don't imagine we expect to perform streaming compilation using Cranelift any time soon, though.

Right now we can compile all the functions in a module in parallel and let the `CacheStore` implementation handle races via locking. To spell out what I mean here: two duplicated functions may start getting compiled at the same time; the `CacheStore` wouldn't see an existing cache entry for either, so it could compile both and store the same compiled artifact twice. The MapReduce approach would prevent all such race conditions in the `CacheStore` by providing a set of unique cache keys in the first place.
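To illustrate why that race is benign, here is a minimal sketch; the `CacheStore` shape below (a locked map with a `get_or_compile` helper) is hypothetical, not Wasmtime's actual trait. Two threads compiling identical functions can both miss, both compile, and store the same artifact twice: wasted work, not wrong results, as long as stores are idempotent.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Toy stand-in for a CacheStore: a locked map from cache key to artifact.
struct CacheStore(Mutex<HashMap<u64, Vec<u8>>>);

impl CacheStore {
    fn get_or_compile(&self, key: u64, compile: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
        if let Some(hit) = self.0.lock().unwrap().get(&key).cloned() {
            return hit;
        }
        // The lock is released here, so another thread compiling an
        // identical function can also miss and compile concurrently;
        // the second insert then overwrites an identical artifact.
        let code = compile();
        self.0.lock().unwrap().insert(key, code.clone());
        code
    }
}

fn main() {
    let store = CacheStore(Mutex::new(HashMap::new()));
    let a = store.get_or_compile(42, || vec![0x90]); // miss: compiles
    let b = store.get_or_compile(42, || unreachable!()); // hit: reuses artifact
    assert_eq!(a, b);
}
```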
akirilov-arm labeled issue #4712.