fitzgen opened Issue #1704:
Profiling suggests that one source of inefficiency in using peepmatic-generated peephole optimizers is branch misses.
![branch-misses](https://user-images.githubusercontent.com/74571/81961017-a4d97780-95c6-11ea-95cf-25253777bd91.png)
This makes sense, since we currently interpret the peepmatic-generated transducer automata, and interpreters (pretty much by definition) don't statically encode what the next opcode will be, so the CPU's branch predictor can't see it coming.
We should investigate compiling these automata into Rust source code that is then compiled into native code with the rest of Cranelift by rustc/LLVM. This should make all transitions between states easily predictable for the CPU.
A first step would be to compile each state into its own function, with transitions being function calls to other states' functions. Even better would be to collapse chains where there is only a single incoming and outgoing transition into a single function (or maybe we can just rely on LLVM inlining for this).
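As a rough illustration of the state-per-function idea (not the actual generated code; the state names, opcodes, and `MatchResult` type are made up for this sketch):

```rust
// Hypothetical shape of compiled automaton states: each state is a function,
// and each transition is a direct call to the next state's function, so the
// branch targets are statically visible to the CPU instead of being decided
// inside an interpreter loop.

enum MatchResult {
    /// Index of the optimization that matched.
    Matched(usize),
    NoMatch,
}

/// State 0: dispatch on the root instruction's opcode.
fn state_0(opcode: u32) -> MatchResult {
    match opcode {
        0x10 => state_1(opcode), // e.g. a transition taken for `iadd`
        0x11 => state_2(opcode), // e.g. a transition taken for `imul`
        _ => MatchResult::NoMatch,
    }
}

/// State 1: part of a single-entry/single-exit chain, so it could be
/// collapsed into its caller (or we just rely on LLVM to inline it).
#[inline]
fn state_1(_opcode: u32) -> MatchResult {
    MatchResult::Matched(0)
}

fn state_2(_opcode: u32) -> MatchResult {
    MatchResult::Matched(1)
}
```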
The generated automata are all DAGs, but we do have implicit back edges via the implicit backtracking that happens when a more-specific optimization fails to fully match and we fall back to a less-specific optimization. This means that we do have loops, but I don't think we need to go full relooper-style control flow analysis. We can determine all back edges statically, and make them function calls, just like regular transitions. This does mean we technically have recursive function calls, but we control the max depth of recursion via our set of optimizations; it is not something that is affected by input clif (or Wasm translated to clif). So I think we'd be okay, and can at least revisit this if too much recursion ever becomes an issue.
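To make the backtracking point concrete, here is a hedged sketch (with made-up pattern and function names) of a fallback edge compiled as an ordinary function call:

```rust
// Hypothetical sketch: when the more-specific pattern fails partway through,
// the state function just calls the state for the less-specific optimization.
// The recursion depth this introduces is bounded by the set of optimizations
// we compiled, not by the input clif being optimized.

fn match_add_with_zero(rhs: i64) -> Option<&'static str> {
    if rhs == 0 {
        // More-specific pattern matched, e.g. `x + 0 => x`.
        Some("rewrite to identity")
    } else {
        // Backtrack along the (statically known) back edge by calling the
        // less-specific state's function.
        match_add_general(rhs)
    }
}

fn match_add_general(_rhs: i64) -> Option<&'static str> {
    // Less-specific optimization; returns `None` if nothing applies.
    None
}
```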
fitzgen labeled Issue #1704.