cfallin commented on issue #27:
cc @fitzgen @jameysharp @elliottt @abrown @avanhatt @mlfbrown
cfallin commented on issue #27:
Current prototype is bytecodealliance/wasmtime#4249; it implements the e-graph roundtripping but not (yet) the rewriting described in this RFC.
fitzgen commented on issue #27:
For example, the https://github.com/bytecodealliance/wasmtime/pull/1647 tooling included tools to "harvest" potential patterns to rewrite from real programs, and then use the superoptimization engine Souper to find rewrite rules offline. We could revisit such tooling and use it to build a significantly more effective body of rules semi-automatically.
FWIW, the harvester still exists and is independent of Peepmatic; it just extracts Souper IR from CLIF.
The only tool we're missing is a Souper-IR-to-ISLE(CLIF) translator, to bring Souper's synthesized optimizations back into ISLE/this new framework.
philzook58 commented on issue #27:
This is incredibly cool work and I am very interested in following it! I have also appreciated your very thorough blog posts in the past. I'm seeing whether it would be useful for people to have a place for e-graph discussion, so I made a Zulip; it's nascent and fairly quiet so far. https://egraphs.zulipchat.com/join/q7bjayghbdw66xsnzgwc2hjb/
fitzgen commented on issue #27:
Super happy with how simple and obvious-after-the-fact your proposed e-graph destruction algorithm is (cycle breaking + scoped elaboration).
fitzgen commented on issue #27:
We will thus need a system of "optimization fuel" that limits rewrite steps overall and possibly per-rule or per-eclass. It is not yet certain what policies will work best; this is a question that we will need to experiment with. We simply wish to acknowledge now that it will be a problem that we will need to solve.
Note that `egg` already has a few knobs we can experiment with before trying to create our own custom thing here. In addition to perform-at-most-N-rewrite-iterations, you can also give it optional limits on the number of e-classes and e-nodes, and halt equality saturation once those are reached. I think the latter will be important for limiting the e-graph's memory overhead.
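For concreteness, here's roughly what those knobs look like on egg's `Runner` (a sketch with egg-0.9-era method names; the rules and limit values here are made up):

```rust
use egg::{rewrite, RecExpr, Rewrite, Runner, SymbolLang};

fn main() {
    // Two toy rules; any real ruleset would be much larger.
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        rewrite!("commute-add"; "(+ ?a ?b)" => "(+ ?b ?a)"),
        rewrite!("add-0"; "(+ ?a 0)" => "?a"),
    ];
    let expr: RecExpr<SymbolLang> = "(+ x 0)".parse().unwrap();
    let runner = Runner::default()
        .with_expr(&expr)
        .with_iter_limit(8) // perform at most N rewrite iterations
        .with_node_limit(10_000) // halt saturation once the e-graph has this many e-nodes
        .with_time_limit(std::time::Duration::from_millis(50))
        .run(&rules);
    println!("stopped because: {:?}", runner.stop_reason);
}
```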
mwillsey commented on issue #27:
Hey all, egg author here. Very, very exciting stuff ahead! I will dive into the RFC soon and try to grok the details. I'm happy to help out as well!
fitzgen commented on issue #27:
How will e-graph-based optimization affect compile time?
Have you run spidermonkey.wasm through sightglass with and without your prototype that just does round tripping through an e-graph? That could give us a tiny taste of what compile time overheads might look like (also I imagine you'd disable the old GVN when it is subsumed by the e-graph round trip). I'd also expect that, in theory, execution time of the generated code would be unchanged, but that would also be a good thing to empirically check.
cfallin commented on issue #27:
How will e-graph-based optimization affect compile time? Have you run spidermonkey.wasm through sightglass with and without your prototype that just does round tripping through an e-graph?
Nope, not yet, but that's next on my plate! Before doing measurements I want to do some basic optimizations to the scheme that I know how to do -- mainly, reuse existing CLIF insts rather than repopulating -- which should dramatically lower overhead. If with that we are "at parity" or "not much worse" than LICM (assuming LICM tweaks here) + GVN today, then I'd be very happy, but I guess how much overhead to accept there (in exchange for having an optimization framework) is another good discussion point.
fitzgen commented on issue #27:
I feel like this RFC is a little different from many of our other RFCs. Usually
we know we can implement the thing, and the RFC process is more about API design
and determining whether to implement the thing as described or in some
alternative way. Merging most RFCs is essentially committing to "yes we will do
this". This RFC, on the other hand, has relatively large open questions around
compilation overhead and whether we will actually solve phase ordering the way
we believe we should in theory. I think merging this RFC doesn't necessarily
mean "yes we will do this" because we still aren't 100% sure we can do
this. We think we can, but we need to actually build more to prove it out. So
I think merging this RFC should mean something more like "we will start landing
(`cfg`'d off) code in `main` and when we (hopefully) reach X, Y, and Z
acceptance criteria, we will stop hiding the code behind `cfg`s and remove the
old, non-egraph code paths."
And I would love this RFC to define and document our ahead-of-time consensus on
what exactly those acceptance criteria are so that we don't have moving goal
posts later on down the line when we are deciding whether to actually, fully,
100% commit to the e-graphs mid-end framework or not.
Just to spark discussion, I'll throw some stuff out there:
- What kind of compile time overhead is acceptable for when we first turn this on (with the expectation that we will have potential compile time wins down the line as documented in the "Overall Cranelift Structure and Migration Path" section)? 5% overhead? 10%?
- What kind of memory overhead during compilation is acceptable? Unlike compile time, I don't think we have any obvious possibilities for improving this in the future other than limiting e-nodes/e-classes. (Please correct me if I'm wrong!) 50% additional max RSS? 20%? 10%?
- Finally, what should the acceptance criteria be for generated code quality? How about: none of our Real World benchmarks in sightglass (SpiderMonkey, markdown, bz2, meshoptimizer, and maybe intgemm-simd but I'm less familiar with that one) get statistically significantly worse execution times and at least one of them gets statistically significantly better execution time. Additionally, we verify by hand that the alias analysis and GVN phase ordering example you mention in the RFC actually gets resolved in practice by our e-graphs framework.

What do other Cranelift stakeholders think our acceptance criteria should be? Does compile time/space and code quality cover everything we need to keep an eye on as far as acceptance quality goes, or am I missing something else?
cfallin commented on issue #27:
These are great questions, thanks @fitzgen -- I should've included more details around what I think the actual decisions should be (now and later)!
I actually am somewhat optimistic that we're close (a-few-weeks-to-a-month-ish?) to a point where we can evaluate the prototype plus basic rewriting rules that have functional parity with today's GVN, LICM, and simple_preopt, and make a decision with this RFC (merge or not) that means "we will actually do this". And to be clear I do want to get that far in the prototype before I personally advocate for merging; I want to ensure there are no unknown-unknowns in the rewrite step. I'd personally favor acceptance criteria that are something like:
- Replicates GVN, LICM, and simple_preopt (and clear path for alias analysis);
- Is clearly an improvement in ease of contributing further optimizations;
- Meets some compile-time threshold.
Open discussion how we balance items 2 and 3 above, but I hope we will have minimal overhead when we have only today's functionality. That would mean that additional compile-time cost is only a factor when actually adding interesting additional rules/optimizations.
On the flipside, I don't know if it necessarily makes sense to have a "something must speed up on initial merge" criterion, though I'm open to that if a bunch of other folks disagree! The reason is that I think once we have (i) a framework and (ii) a replica of today's rewrites, the future upside is clear enough that the work to come up with those additional optimizations can happen as a second step. But, I can see how we might want to require such a speedup if the compile-time overhead is also non-negligible.
Anyway all this is open to a bunch of discussion and, on that note, I'll put an item on the agenda for our next biweekly :-)
cfallin commented on issue #27:
So I did some quick-and-dirty compile-time measurements. On SpiderMonkey.wasm, the overhead of roundtripping through the egraph adds about 23% to compile time (when opts are enabled). All current handwritten opts account for about 10% of compile time.
That's not terrible (not a doubling, for example) but it's not great either, so I dug into which parts were slow. It turns out that the extraction and elaboration take very little time (less than 10% of the egraph addition); most of the time is in the initial egraph build.
I suspect that we'll need to do some careful data-structure engineering to get things within acceptable overheads, but fortunately it looks like there's some relatively low-hanging fruit. Right now in `egg` the `EClass` struct keeps two `Vec`s, one for nodes and one for parents; the `Id`-to-`EClass` map is a `HashMap` while I would expect the `Id` space to be somewhat dense (maybe this is to clear out eclasses that have been merged?); and the `Node` interface forces us to keep args internally in separate lists (I use `SmallVec` currently for this). All of this could be refactored to use central `Vec`s in a pseudo-arena way (like Cranelift does for CLIF args/results currently), at the cost of passing around the "backing `Vec`".
I'll work on that for a bit, then return to hacking on the ISLE integration; once I prove out both of these, I'm happy to make a bunch more measurements!
mwillsey commented on issue #27:
@cfallin Awesome! At first glance, 23% isn't so bad, and I think you're right: there's a lot that could be done to get that down. Some quick points:
- While the `Id` space is initially dense, you actually hope that it ends up sparse, because many e-classes are getting merged.
- I think you are right, we could get some perf back by using more arenas in `EClass`.

Also, the way the e-graph is being used in the prototype is basically a worst-case scenario for performance; for every `Func`, you build an e-graph, do e-graph stuff, extract, and throw away. One big way we've improved performance in other egg applications is by batching many optimization queries at once. That would obviously require re-architecting the way egg and clif interact, which is probably something to do down the road.
Overall, I think this means that there are some serious "high-level" ways to improve performance as well as the lower-level ways you mentioned.
cfallin commented on issue #27:
Thanks @mwillsey! By the way, I'm planning to continue hacking on my fork of `egg` but I plan to upstream everything in due time, once plans are stable...
Re: this point:
Also, the way the e-graph is being used in the prototype is basically a worst-case scenario for performance; for every Func, you build an e-graph, do e-graph stuff, extract and throw away. One big way we've improved performance in other egg applications is by batching many optimization queries at once. That would obviously require re-architecting the way egg and clif interact, which is probably something to do down the road.
I want to dig a bit deeper: it seems you're saying that one way of tolerating overheads is to amortize the construction and extraction over more use of the egraph. I guess that's true in a "fraction of time spent in construction+extraction" sense, but I'm thinking in an absolute sense here: the time spent building the egraph and extracting from it, with no work between the two, is the bare minimum in additional overhead, no? At least as long as we have a separate egraph per function, and given that the overall problem we are solving is "compile this CLIF to machine code"? Or it's possible I'm missing some other way of batching or amortization that you mean to suggest...
cfallin commented on issue #27:
Oh also a quick calibration point re:
At first glance, 23% isn't so bad
In Cranelift context this would be a huge deal actually: the whole regalloc rewrite (0.5 engineer-years) bought us about 20% compile-time improvement, and I fight to not give up 0.5% regularly. This is an open question above but I'll venture a ballpark number on this now just to calibrate (or at least give my own opinion): maybe 5% would be OK, but 20% would require a large increase in generated-code quality to justify, I think.
mwillsey commented on issue #27:
I'm planning to continue hacking on my fork of egg
Makes perfect sense. I'm happy to help hack on the fork if that's useful.
as long as we have a separate egraph per function
As long as the e-graph optimization workflow is per-function, you can't really amortize. The batching I'm talking about is across functions: there's nothing stopping you from adding 1000 functions to the same e-graph (hopefully there's some sharing in there!) and optimizing them simultaneously. I suspect that even just adding 1000 functions to one e-graph is faster than adding 1000 functions to 1000 e-graphs because there is likely some sharing, but I could be wrong here.
Obviously one e-graph per function is cleaner for how the optimization workflow is right now. It just comes with the unfortunate overhead of basically copying your whole program twice: once into the e-graph and once out. Even outside of batching, another way of improving performance would be to eliminate one or both of these copies by increasing the part of the pipeline where you use e-graphs, ideally to some other "translation" boundary where you'd have to lower to another IR anyway. This is clearly a big task, though.
cfallin commented on issue #27:
That's a really interesting idea; we can certainly play with it at some point!
We currently have a hard boundary between each function compilation, in order to allow for parallel compilation, but we'll need to break that boundary (optionally) eventually for inlining as well. And, actually, now that these two ideas are next to each other: putting the inlining and inlined function in the same egraph, linking up the dataflow, and then elaborating out of that is a really nice clean way of doing inlining. So, you've spawned some thoughts here, thanks!
One potentially complicating factor for deduplication efficiency between function bodies is that the "roots" (terminal nodes) will differ: both `f` and `g` may contain `complex_expr(x, y, z)`, but where `x`, `y`, and `z` are different block params, or side-effecting ops (calls, loads), or whatever, in the two functions. I wonder if there is some sort of canonicalization that could help (in the extreme case this feels like a graph isomorphism problem though). In the egraph literature I mostly see expressions in terms of variable terminals `x`, `y`, etc., and so this doesn't seem to be as big a deal. Any thoughts?
cfallin commented on issue #27:
A brief update: I hacked up a custom egraph implementation designed to be more memory-efficient and allocation-light, and got the overhead down to 9%.
The key differences are in how it avoids ever cloning `Node`s: using `indexmap` to allow both node-to-ID dedup and ID-to-node access, and using these IDs in eclasses' lists of nodes (the use of IDs to name nodes is itself different from `egg` too); using Cranelift's `EntityList` to store lists of node IDs and eclass IDs in pooled storage; and using `bumpalo` (an arena allocator) to allocate the argument lists of eclass IDs. All of this is to implement the principles (i) O(1) memory allocations, i.e. we should grow single large containers rather than have nested containers per node or eclass (entity-component-system style), and (ii) avoid copies and store nodes only once. There's a little more we could do in theory with a custom hashmap that doesn't require `Hash`+`Eq` (namely, one that allows us to provide context for those; that would allow us to provide the backing pool separately and not store a fat-pointer slice for args and types in every `Node`, but only a `u32` index).
I haven't built the merging or rule-application bits yet, but the roundtripping (without any rule applications) actually never merges eclasses, so this is enough to evaluate the overhead of this part of the pipeline.
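For illustration, the node-interning scheme looks roughly like this (a toy sketch with made-up node types, not the actual code):

```rust
use indexmap::IndexSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Id(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Node {
    Const(i64),
    Add(Id, Id), // args are earlier Ids, so the graph stays acyclic
}

#[derive(Default)]
struct EGraph {
    // One central container: insertion order gives Id-to-node access,
    // and the set's hashing gives node-to-Id dedup -- no per-node Vecs.
    nodes: IndexSet<Node>,
}

impl EGraph {
    /// Intern a node: returns the existing Id if an identical node was
    /// already added (GVN-style dedup), or a fresh Id otherwise.
    fn add(&mut self, node: Node) -> Id {
        let (index, _was_new) = self.nodes.insert_full(node);
        Id(index as u32)
    }

    fn node(&self, id: Id) -> &Node {
        self.nodes.get_index(id.0 as usize).unwrap()
    }
}

fn main() {
    let mut eg = EGraph::default();
    let x = eg.add(Node::Const(42));
    let one = eg.add(Node::Const(1));
    let a = eg.add(Node::Add(x, one));
    let b = eg.add(Node::Add(x, one)); // dedups to the same Id
    assert_eq!(a, b);
    println!("{:?}", eg.node(a));
}
```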
Given this, and given that GVN+LICM add around 10% to compile time, I'm actually cautiously optimistic that we can slip into the hole left by removing these stages and be almost neutral when no rewrite rules are applied. That's the ideal case in my mind. Or at least have small enough overhead that this roundtrip is always on when opts are on. Later we can remove half the roundtrip by generating egraph nodes straight from the frontend/builder API, as noted in the last section of the RFC (and as you note as well @mwillsey) for further savings, but for now anchoring both ends of the transform in CLIF (which can be printed, manipulated, used for testing, etc) has substantial pragmatic benefits I think.
(I'm not sure if the path forward is to actually use `cranelift-egraph` or `egg` with a bunch of hacks; as a general principle I don't like to go all NIH and hack up my own versions of things; but the data structures and tradeoffs are different enough that this may be the right way, I'm not sure. Right now I just want to pathfind wrt "is this performance profile possible".)
After dropping this tidbit I'm now promptly going to disappear for a week of PTO (...sorry!) but wanted to get that data point out first. More later!
bjorn3 commented on issue #27:
There's a little more we could do in theory with a custom hashmap that doesn't require Hash+Eq (namely, allows us to provide context for those; that would allow us to provide the backing pool separately and not store a fat-pointer slice for args and types in every Node, but only a u32 index).
hashbrown's `RawTable` might help with that.
cfallin commented on issue #27:
There's a little more we could do in theory with a custom hashmap that doesn't require Hash+Eq (namely, allows us to provide context for those; that would allow us to provide the backing pool separately and not store a fat-pointer slice for args and types in every Node, but only a u32 index).
hashbrown's `RawTable` might help with that.
Indeed! I pushed a new implementation that does just that over here. (That could probably use a lot of cleanup, and is a bit lifetime-heavy, but proves out what I wanted to!) Thanks for the pointer to that.
RDambrosio016 commented on issue #27:
This RFC is very interesting to me; I plan to use e-graphs too for a large part of the optimizer for a math-heavy language I'm writing. I considered using ISLE as well for writing patterns; I think it would work pretty well. However, I ran into a couple of open questions I could not resolve:
- Egg by default uses exponential backoff for rules to avoid dominating rewrite time with rules like associativity, but if we just use a single gigantic pattern-matching trie for the rules, how would we disable rules? This is probably doable by keeping some arrays for counting rule "hits", and checking those arrays before matching/falling through, but the matching and memory overhead for this needs to be explored. Using indirect jumps through a lookup table could probably also work (indirect-threaded-interpreter style). Would love to hear if @mwillsey has any ideas for this.
- Is ISLE capable of expressing rules common to multiple types easily? For example, `x * 1 => x` can be applied to floats, ints, and uints of any size. I presume ISLE has a way to handle this, but I haven't explored it much.
- Is it potentially possible to store e-class data separate from the actual language? I.e., the `Language` enum would be 64/128-bit sized and only contain data required for matching (for example, `Add([Id; 2])`), to potentially match rules much faster by exploiting vectorized pattern matching.
- Is egg/ISLE capable of tracking optimization invariants through `Analysis` in a performant manner? This would allow for much deeper optimization potential by allowing the optimizer to track invariants, for example, eliminating `normalize` for vectors if a value is known to already be normalized.

But overall this approach seems promising, and it would be awesome to share rulesets/ideas with cranelift.
mwillsey commented on issue #27:
Hi @RDambrosio016,
egg doesn't use trie-based pattern matching, so you'd essentially have to re-implement e-matching to support this. I believe Z3 does do something like this (Efficient E-matching for SMT Solvers, section 4). We prototyped it at some point, and found it was slower because the rules we were using didn't significantly "overlap". In a high-performance setting, what I would try before tries is just running rules using the `SimpleScheduler`, but with a fixed, small number of iterations.
As far as using `Analysis` for tracking invariants, sure! Just be careful: an analysis is per e-class, so it depends on your notion of equivalence.
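(For concreteness, an invariant-tracking analysis looks roughly like this -- a sketch using approximate egg-0.9-era signatures, so treat the exact API as an assumption. The point is that `Data` is attached to the e-class, and `merge` must reconcile facts when two classes are unioned:)

```rust
use egg::{Analysis, DidMerge, EGraph, SymbolLang};

#[derive(Default)]
struct ConstFold;

impl Analysis<SymbolLang> for ConstFold {
    // Per-*e-class* fact: "this class is known to be this constant".
    type Data = Option<i64>;

    fn make(egraph: &EGraph<SymbolLang, Self>, enode: &SymbolLang) -> Self::Data {
        let child = |i: usize| egraph[enode.children[i]].data;
        match enode.op.as_str() {
            "+" if enode.children.len() == 2 => Some(child(0)? + child(1)?),
            op => op.parse::<i64>().ok(),
        }
    }

    fn merge(&mut self, a: &mut Self::Data, b: Self::Data) -> DidMerge {
        // When two e-classes are declared equal, their facts must agree;
        // here we simply keep whichever side knows more.
        if a.is_none() && b.is_some() {
            *a = b;
            DidMerge(true, false)
        } else {
            DidMerge(false, a.is_some() && b.is_none())
        }
    }
}
```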
RDambrosio016 commented on issue #27:
Interesting. I did a small experiment generating rules with my own DSL, and, while decently fast, it definitely seems like egg spends a lot of its time on rules that could never match if you have a ruleset that is somewhat larger as a consequence of duplicating rules for numerous numeric types. In this particular ruleset I have ~61 unique rules (including bidirectionality), which after being duplicated for multiple numeric types result in 240 total rules, and this is how long egg takes on each step:

```
Number of rules: 240
Time from runner creation: 2.6857ms
Runner report
=============
  Stop reason: Saturated
  Iterations: 7
  Egraph size: 148 nodes, 23 classes, 177 memo
  Rebuilds: 4
  Total time: 0.0026048
    Search:  (0.82) 0.0021298000000000003
    Apply:   (0.17) 0.0004338
    Rebuild: (0.02) 0.0000397

2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0) ----> fma(x, neg(x), fma(y, recip(z), 5))
Cost: 26.5 -> 12
```

This is on release with 1 CGU and LTO on, on an i9-10850K. Search time making up the bulk of the time is not unexpected, but to me it seemed a little bit high considering how small the expression is. So I re-tried it without duplicating rules, and I got these results:

```
Number of rules: 50
Time from runner creation: 1.0181ms
Runner report
=============
  Stop reason: Saturated
  Iterations: 7
  Egraph size: 148 nodes, 23 classes, 177 memo
  Rebuilds: 4
  Total time: 0.0009838
    Search:  (0.66) 0.0006475000000000001
    Apply:   (0.30) 0.0002992
    Rebuild: (0.04) 0.000035999999999999994

2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0) ----> fma(x, neg(x), fma(y, recip(z), 5))
Cost: 26.5 -> 12
```

So it definitely seems like hierarchical matching of some sort would be required to avoid spending a ton of time searching the graph. It's odd that your experiment in trie matching ended up being slower; was it using generated Rust code like ISLE creates, or was it dynamically matched at runtime?
cfallin commented on issue #27:
@RDambrosio016 I'd be happy to talk more about sharing efforts once the Cranelift work here is more mature; for now there is still exploration going on. I suspect that a lot of what we do is going to be fairly Cranelift-specific: there is a large semantic gap between "x*1 == x" at the conceptual level, and the particular matchers, node formats, and idiosyncrasies of the framework. My experience in the past is that trying to paper over such details to give a nice high-level view can sometimes be counterproductive because it either creates subtle bugs, or sweeps efficiency concerns under the rug. But once we've built more, perhaps there's something to be used for your effort too :-)
In the meantime perhaps it's best if we keep this discussion thread scoped to the Cranelift effort? It's likely to get much longer in time anyway (I have some more updates cooking at the moment) but side-discussions about other projects might pull us a bit too far off course... thanks!
RDambrosio016 commented on issue #27:
Absolutely :+1:, didn't mean to steer it into more minute details; I was just wondering if you had considered/discussed this problem in the past, since I looked at the RFC but it did not seem to mention it. Anyways, I'm excited about this RFC, and interested in maybe contributing to implementing it/making it better.
cfallin commented on issue #27:
For sure. Performance is my main concern right now and I'm working to make the whole thing competitive with handwritten optimizations, so you'll hear more about that in a bit!
RDambrosio016 commented on issue #27:
Beating hand-written optimizations is definitely an achievable goal, considering the e-graph approach would have a much easier time knowing when to stop optimizing vs a standard pass-based optimizer. It will also be an interesting problem trying to figure out a good cost metric that maximizes generated code quality. Having a complete e-graph to choose from is great, but trying to select the ideal version from that will definitely prove to be a challenge.
cfallin commented on issue #27:
As an update here, on the running prototype/experimental branch here, I've gotten overhead of Cranelift-with-egraphs (including elaboration that subsumes GVN and LICM, and rewrite rules using ISLE that encompass basic const propagation and algebraic simplifications) vs Cranelift-with-handwritten-opts down to ~1% slower on parallel compile time (and actually slightly faster in terms of total CPU time across cores):
```
cfallin@min:~/work/wasmtime % hyperfine --show-output -- "target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm"
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.693 s ±  0.098 s    [User: 10.747 s, System: 0.234 s]
  Range (min … max):    1.621 s …  1.951 s    10 runs

cfallin@min:~/work/wasmtime % hyperfine --show-output -- "target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm"
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.671 s ±  0.035 s    [User: 10.888 s, System: 0.236 s]
  Range (min … max):    1.606 s …  1.707 s    10 runs
```
(That's measuring on an M1 laptop with SpiderMonkey.wasm, a large Wasm module testcase.)
This is using a new acyclic-egraph approach that I'm calling the "aegraph" (or ægraph for those who love digraphs). I got the conventional egraph rewriting to work last week, measured it, and found compilation time was inflating significantly once rewrites occur -- something like 5x. I then had a series of realizations that the way we're using egraphs here is more limited than a conventional full-equality-saturation approach, and that there are simplifications we could make.
Basically an ægraph contains eclasses as nodes where each node contains an enode, an enode and a parent-eclass pointer, or two parent-eclass pointers. The union operation joins two eclasses together only by adding nodes (or as an optimization, replacing an enode node with an enode-and-parent node). An eclass node can refer only to earlier eclasses. We maintain the acyclicity invariant when adding enodes and unioning eclasses. Rule application occurs immediately when a new node is added. In a sense, eclasses are split into "levels" and a given level of an eclass refers only to the levels of argument eclasses that existed when it was created. It's thus a persistent immutable data structure. An eclass is represented by the id of its latest version (highest id), whereas the traditional union-find approach tends to push ids downward to the lowest id.
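(In code, that node representation might look roughly like this -- names assumed, a sketch rather than the actual implementation:)

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Id(u32);

struct ENode; // operator + argument eclass Ids, elided here

enum EClassNode {
    /// A plain enode.
    Node { enode: ENode },
    /// An enode plus a pointer to an earlier version of the same eclass.
    NodeAndParent { enode: ENode, parent: Id },
    /// A pure union of two earlier eclass versions.
    UnionOfParents { parent1: Id, parent2: Id },
}

struct AEGraph {
    // Append-only: a node may only refer to strictly earlier Ids,
    // so the structure is acyclic by construction.
    nodes: Vec<EClassNode>,
}

impl AEGraph {
    fn push(&mut self, node: EClassNode) -> Id {
        let id = Id(self.nodes.len() as u32);
        self.nodes.push(node);
        id
    }

    /// Union never mutates existing nodes; it appends a new "latest
    /// version" whose Id now names the whole eclass.
    fn union(&mut self, a: Id, b: Id) -> Id {
        self.push(EClassNode::UnionOfParents { parent1: a, parent2: b })
    }
}
```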
This data structure has some really compelling properties:
- Because eclass nodes are immutable once created, we only need to apply rules to a node once. This "single pass" rule application is at the heart of its speed.
- We don't re-canonicalize and re-intern enodes, so we don't need parent pointers and we don't need a rebuild step.
- It plays really nicely with the scoped elaboration, in that we can keep a parallel scoped map with the "best node" (extraction information) and memoize; extraction doesn't need cycle removal any more.
The tradeoff is that it's not at all a full equality saturation (see above re: no recanonicalization). But I think that's OK for our use-case: we mainly (i) introduce new expressions then (ii) apply rules that introduce alternate forms of those expressions, and union them together. In this case the ægraph is pretty ideal: it's like a streaming journal of nodes and we add a new node from the build, then add a few more nodes that are equivalent to it and union them in-place, then keep going, referring to the just-built expression by the latest id. So the ægraph is sort of a "multi-version IR" where every value can have multiple representations, combined with GVN.
(The case it misses that a traditional egraph with parent pointers, recanonicalization, and the rebuild step would get is where we introduce some node A that gets a new eclass, a rewrite rule adds a node B and unions it with A, and B happens to dedup to an already-existing eclass. The new ID for A will point to a union node that joins A and B's existing eclass, but other users of B will not see A because it's "newer". This seems to be of limited use for our specific rules because our rewrite rules tend to go from more expensive/complex expressions to simpler ones, so A is likely not better than B. That's not always the case and I'm sure there are cases where a full egraph does do better. But the multiple-version-representation and ability of rules to match on all versions seems like the key thing that we want from the data structure.)
Anyway, I'll write this up more properly but that's where I'm at now: a prototype that works and is fast. I'll add a bit more to it (integration with alias analysis! and other rules I haven't added yet) and then start to look at generated code quality too.
RDambrosio016 commented on issue #27:
Very interesting! I'm curious to see if that specific limitation will end up limiting any optimization potential in the future. It may be worth trying generated-code benchmarks both without rebuilding and with, although it's definitely possible that associativity rules will hide that limitation. Nevertheless, the performance on something as large as the SpiderMonkey wasm file is impressive! Did you happen to measure how much memory the e-graphs/program overall took?
Excited to try out this approach; I was thinking about perhaps working on algebraic identities for mul/div/rem, mostly based on this paper. Would you accept a PR to your experimental branch to add such rules?
cfallin commented on issue #27:
Sure; more opts are definitely welcome in a PR, thanks!
I haven't done any more detailed measurements; I'm polishing things wrt integrations now and once I have it in a state where I'm happy, I'll do a full comparison and set of measurements (with Sightglass) on compile- and runtime performance.
cfallin commented on issue #27:
Hi all -- I have some updates. It's kind of a mixed bag -- tl;dr, code quality matches existing Cranelift (all optimizations have been reproduced), compilation speed is ~1-5% slower. Theoretical benefits of everything-in-one-fixpoint (GVN and alias analysis and LICM and algebraic simplifications all cooperating) work in filetests, but don't seem to do much to Wasm-based benchmarks in Sightglass.
So the current status of the WIP branch is:
- "acyclic egraph" (ægraph) implementation is working;
- producing and lowering out of this aegraph works fine;
- ISLE-based rewrite rules operate on this aegraph, and cost-based extraction picks the best option for each node after rewrites are done;
- rewrites in the aegraph apply immediately (because of acyclicity, there's no need to revisit a node);
- ISLE itself has been extended to support "multiplicity" in extractors and constructors, so multiple rules can match and matchers on eclass IDs see all available enodes for each eclass;
- The aegraph construction and "scoped elaboration" subsume GVN (global value numbering, a form of common-subexpression elimination) by means of the deduplication map (see the sketch after this list);
- The aegraph elaboration automatically does LICM (loop-invariant code motion) implicitly, by way of loop-nest awareness when traversing the domtree;
- The alias analysis feeds into the rewrite rules and there are rules for redundant-load elimination and store-to-load forwarding;
- Rewrite rules have been written for a bunch of constant-propagation and algebraic-simplification cases, and some other slightly more ad-hoc ones (e.g. reassociating loads to try to make LICM on partially-loop-invariant sums work)
- A simple form of rematerialization has been implemented, where adds-with-one-immediate, and immediates themselves, are regenerated in the block where they are used rather than maximally hoisted by GVN, in order to reduce register pressure.
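To make the GVN-subsumption bullet concrete, here's a toy sketch (assumed types, not Cranelift's actual `ScopedHashMap` or elaboration code) of dominator-scoped deduplication: entering a domtree child pushes a scope, leaving pops it, and any computation already defined in a dominating block is reused instead of re-emitted.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash)]
struct ExprKey(String); // stand-in for (opcode, canonicalized args)

struct ScopedMap {
    scopes: Vec<HashMap<ExprKey, u32>>, // value = id of an existing definition
}

impl ScopedMap {
    fn new() -> Self {
        Self { scopes: vec![HashMap::new()] }
    }
    // Called when the domtree preorder traversal enters/leaves a block.
    fn push_scope(&mut self) {
        self.scopes.push(HashMap::new());
    }
    fn pop_scope(&mut self) {
        self.scopes.pop();
    }
    fn get(&self, k: &ExprKey) -> Option<u32> {
        // Search innermost-out: hits in outer scopes are dominating defs.
        self.scopes.iter().rev().find_map(|s| s.get(k).copied())
    }
    fn insert(&mut self, k: ExprKey, v: u32) {
        self.scopes.last_mut().unwrap().insert(k, v);
    }
}

/// Elaborate one pure computation: reuse a dominating definition if one
/// exists (this is exactly GVN), otherwise emit it and record it.
fn elaborate(map: &mut ScopedMap, key: ExprKey, emit: impl FnOnce() -> u32) -> u32 {
    if let Some(existing) = map.get(&key) {
        existing
    } else {
        let id = emit();
        map.insert(key, id);
        id
    }
}
```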
All this works -- and IMHO the data structures and algorithms are pretty beautiful actually! -- but it's not any faster. It is nicer to update with more rewrite rules -- see cprop.isle and algebraic.isle -- and if it were completely at parity performance-wise, this would be an interesting benefit still IMHO. But there is still a small compile-time penalty, around 5% on average, with everything above incorporated.
I have some more thoughts in my running TODO list but most of the remaining ones are for compile speed, not code quality (runtime speed).
Given all that, I think I'm running up against a wall that's actually mostly supported by some other limits:
- There is actually not a ton of remaining "low-hanging fruit" when running Wasm modules produced by an optimizing compiler. (Any strength reduction or other clever "instcombine"-type optimization I would expect to be done by LLVM before producing the `.wasm`.)
- Lowering, once it occurs, disallows any further optimizations, because we can't edit VCode currently.
The first is just a fact of our benchmarking conditions (but we're oriented this way because we care very much about Wasm performance -- we can't just choose to focus on poorly-optimized CLIF instead and say "see look opts work!"). But the second is where a lot of my thoughts are going lately. In other words, from first principles, if we start with highly-optimized Wasm bytecode, where do we expect inefficiencies and redundancies to creep in? Probably one of the three areas:
- The semantic gap between Wasm bytecode and CLIF, and how it is bridged (heap bounds checks, table accesses, and the like);
- The semantic gap between CLIF and the machine, and how it is bridged (our lowering rules in our backends);
- regalloc quality.
My forehead is already dented from banging against item 3 (regalloc quality) for a long time, and item 1 (Wasm to CLIF) is largely addressed already by designing CLIF semantics to be close to those of Wasm; but item 2 (lowering) presents unharvested opportunities IMHO.
In looking at disassemblies of "hot blocks" when profiling runtime I see a lot of not-great code, and it comes in several varieties: redundant flags computations (cmp / setnp / cmp / sete for float comparisons is a common one); redundant trap conditions that cprop could optimize away, or GVN could combine between multiple ops; address-mode junk produced by our "match up the operand tree and then regenerate an `add` tree for the parts we can't incorporate" logic in both x64 and aarch64; a bunch of "spill x+1, x+3, x+10, x+15 to separate spillslots for use later" (hence my attempt at remat above); and others. A lot of this comes from the inability to optimize code once we lower a CLIF op into multiple MachInsts, or one pseudo-MachInst.
(Those who have been around a while might be screaming: legalization and post-opt! Yes, you're right. I think we threw out some of the value of those in exchange for the speed and simplicity of the single-pass lowering, and the real value that we do get from a two-level IR. But we lost something too.)
So: I'm taking a little time to think about what might make sense as a way to optimize VCode. I don't think this means throwing out the aegraph and ISLE-based mid-end framework; I think instead there may be a way to write some mid-end legalizations that do a lot of the expansion that we currently do in lowering, and then let lowering be simpler; or in the extreme, even have a `Node::MachInst { ... }` in the aegraph, though that has potential efficiency issues. More thought needed.
Anyway, I'd love to hear more thoughts on this, and we can talk about it more in the next Cranelift meeting, but there's my report on what I've learned in the last few weeks :-)
cfallin commented on issue #27:
... and in a typical show of my luck, I discover just after writing the above that my remat was doing something wrong (LICM was overriding its decision). I now have the result:
- on spidermonkey.wasm I see an 11.4% speedup, with 5% extra compile time (the latter I can optimize a bit more I think);
- on bz2.wasm I see a 7.3% speedup, also with 5% extra compile time.
More measurements later but IMHO if I can get the compile time overhead a bit lower this should be compelling on its own (then the post-lowering optimization ideas above become the next step).
cfallin commented on issue #27:
I did some optimization on compiler runtime tonight -- both low-level things like inline pragmas, refactoring some data structures to allow holding borrows rather than lookups, using dense rather than sparse maps in places, preallocating with correct sizes, and similar, and high-level things like algorithmic improvements to memoize more, trim the extraction search space (running upper bound and early exit/pruning), "subsume" all nodes in an eclass with one "final" version (for e.g. cprop), and the like.
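For a flavor of the extraction-side pruning (purely illustrative; names and structure assumed, not the actual code), the idea is memoized best-cost computation that carries a running upper bound and bails out as soon as a partial sum exceeds it:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Id(u32);

#[derive(Clone)]
enum Node {
    Op { cost: u32, args: Vec<Id> },
    Union(Id, Id),
}

struct Extractor {
    nodes: Vec<Node>,       // indexed by Id; acyclic (args point to lower Ids)
    memo: HashMap<Id, u32>, // memoized best cost per eclass version
}

impl Extractor {
    /// Minimum total cost for `id`, or None if it provably exceeds `budget`
    /// (the running upper bound that enables early exit).
    fn best_cost(&mut self, id: Id, budget: u32) -> Option<u32> {
        if let Some(&c) = self.memo.get(&id) {
            return (c <= budget).then_some(c);
        }
        let node = self.nodes[id.0 as usize].clone();
        let result = match node {
            Node::Union(a, b) => {
                let ca = self.best_cost(a, budget);
                // Anything from `b` costing more than `ca` is useless, so
                // tighten the budget before exploring that side.
                let cb = self.best_cost(b, ca.unwrap_or(budget));
                match (ca, cb) {
                    (Some(x), Some(y)) => Some(x.min(y)),
                    (x, y) => x.or(y),
                }
            }
            Node::Op { cost, args } => {
                let mut total = cost;
                let mut ok = total <= budget;
                for a in &args {
                    if !ok {
                        break; // partial sum already blew the budget: prune
                    }
                    match self.best_cost(*a, budget - total) {
                        Some(c) => total += c,
                        None => ok = false,
                    }
                }
                ok.then_some(total)
            }
        };
        if let Some(c) = result {
            self.memo.insert(id, c); // only memoize fully computed costs
        }
        result
    }
}
```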
In the end I got SpiderMonkey.wasm down to a ~1% compile-time regression (relative to baseline Cranelift), and bz2.wasm to a... 15% compile-time speedup (yeah, I know; it's wild; I was focusing entirely on SpiderMonkey and then just checked bz2 now, and would have fallen off my chair if gravity were slightly stronger):
```
[cfallin@xap]~/work/wasmtime% hyperfine -L egraphs false,true 'target/release/wasmtime compile --cranelift-set use_egraphs={egraphs} ../wasm-tests/spidermonkey.wasm'
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.106 s ±  0.051 s    [User: 15.034 s, System: 0.505 s]
  Range (min … max):    1.043 s …  1.201 s    10 runs

Benchmark 2: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.115 s ±  0.033 s    [User: 15.915 s, System: 0.510 s]
  Range (min … max):    1.077 s …  1.162 s    10 runs

Summary
  'target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm' ran
    1.01 ± 0.05 times faster than 'target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm'
```
```
[cfallin@xap]~/work/wasmtime% hyperfine -r 200 -L egraphs false,true 'target/release/wasmtime compile --cranelift-set use_egraphs={egraphs} ../wasm-tests/bz2.wasm'
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm
  Time (mean ± σ):      82.1 ms ±  10.8 ms    [User: 203.2 ms, System: 58.4 ms]
  Range (min … max):    71.4 ms … 106.5 ms    200 runs

Benchmark 2: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm
  Time (mean ± σ):      71.4 ms ±  10.2 ms    [User: 191.0 ms, System: 58.7 ms]
  Range (min … max):    59.6 ms …  96.3 ms    200 runs

Summary
  'target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm' ran
    1.15 ± 0.22 times faster than 'target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm'
```
And for completeness, yesterday's runtime numbers were on my M1 laptop, while today's work is on a Linux/x86-64 desktop (Ryzen 3900X); on this system, SpiderMonkey.wasm sees a 13% speedup, and bz2.wasm a 3% speedup:
```
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm -o base.cwasm
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm -o egraphs.cwasm
[cfallin@xap]~/work/wasmtime% hyperfine -L file base,egraphs 'target/release/wasmtime run --allow-precompiled --dir=. {file}.cwasm ./fib.js'
Benchmark 1: target/release/wasmtime run --allow-precompiled --dir=. base.cwasm ./fib.js
  Time (mean ± σ):      1.345 s ±  0.023 s    [User: 1.338 s, System: 0.005 s]
  Range (min … max):    1.317 s …  1.384 s    10 runs

Benchmark 2: target/release/wasmtime run --allow-precompiled --dir=. egraphs.cwasm ./fib.js
  Time (mean ± σ):      1.195 s ±  0.010 s    [User: 1.189 s, System: 0.003 s]
  Range (min … max):    1.181 s …  1.205 s    10 runs

Summary
  'target/release/wasmtime run --allow-precompiled --dir=. egraphs.cwasm ./fib.js' ran
    1.13 ± 0.02 times faster than 'target/release/wasmtime run --allow-precompiled --dir=. base.cwasm ./fib.js'
```
```
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm -o base.cwasm
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm -o egraphs.cwasm
[cfallin@xap]~/work/wasmtime% hyperfine -L file base,egraphs 'target/release/wasmtime run --allow-precompiled {file}.cwasm'
Benchmark 1: target/release/wasmtime run --allow-precompiled base.cwasm
  Time (mean ± σ):     428.3 ms ±   9.7 ms    [User: 425.3 ms, System: 1.7 ms]
  Range (min … max):   416.6 ms … 441.6 ms    10 runs

Benchmark 2: target/release/wasmtime run --allow-precompiled egraphs.cwasm
  Time (mean ± σ):     414.5 ms ±  12.8 ms    [User: 411.5 ms, System: 1.9 ms]
  Range (min … max):   403.1 ms … 440.7 ms    10 runs

Summary
  'target/release/wasmtime run --allow-precompiled egraphs.cwasm' ran
    1.03 ± 0.04 times faster than 'target/release/wasmtime run --allow-precompiled base.cwasm'
```
I think we're at the point where we can discuss this in the next Cranelift meeting, then hopefully, if consensus exists, I'll start cleaning this up and preparing it for merging (in small, more easily-digestible pieces!).
mwillsey commented on issue #27:
@cfallin Amazing work! These are killer results! I think specializing to the ægraph (great name) is definitely the right decision for this use case; the trade-offs seem to be worth it. Although, it does make me wonder what it would take to make "normal" equality saturation usable in this setting. Exciting future work!
oflatt commented on issue #27:
This is so cool!
Would be really interested to see how much better results are with different tweaks. Does full egraph-style matching do much good? I'm thinking about the case where an enode creates a new match in a parent eclass.
cfallin commented on issue #27:
Thanks @mwillsey! I'm definitely curious what other design points may be possible here. I suspect that benefits of more strongly normalizing approaches (full recanonicalize/rebuild as in a traditional egraph or egg) will become more apparent the more we add interesting rules. Right now I have rules I wrote with my traditional-compiler-author brain, so mostly constant prop and the like where the right-hand side is unambiguously better. (There are a few cases of reassociation where I let the extractor do the right thing though, and also scoped elaboration can make things interesting when another expression forces evaluation of some otherwise-expensive node, shorting its cost to 0 for subsequent uses and altering the best choice. (This is why extraction is done lazily and memoized.) In those cases I suspect having multiple versions around actually helps.)
There are definitely some experiments I'd like to run, if I were to put on my old grad-student hat (it has collected some dust!) -- among them, putting together a full rebuild step as @oflatt suggests, and maybe also a "rewrites always subsume" mode (ie no multi-version capability) with the rules. My day-job means I need to focus on getting this "makes compiler faster and better" magic into production shape first, of course, but I do want to experiment some as time allows :-)
remysucre commented on issue #27:
Hello! Yet another egg contributor here. @mwillsey and I had a great chat with Nico Bruno, who was a main engineer of SQLServer. It turns out one main difference between equality saturation and cascade-style DB query optimization is that the latter maintains the acyclicity of the e-graph, just like aegraph. Acyclicity enables many powerful optimizations like branch-and-bound, and usually runs very fast. I highly recommend checking out some of the literature there, as well as some modern implementations like cockroach and orca.
cfallin commented on issue #27:
@remysucre thanks for the pointers! (As with many other topics in systems work, the database folks got there first... why am I not surprised?!). I'm happy to hear there are similar approaches in use and I'll read up more.
cfallin commented on issue #27:
Hi all -- I've updated the RFC to refer to ægraphs and include the numbers above. A full writeup of how ægraphs work is IMHO out-of-scope here, but I will include a bunch of detail somewhere in `cranelift/codegen/docs/` if this merges. The important decision points here, IMHO, are:
- Should we have a mid-end?
- Should it operate based on rewriting rules?
- Is an implementation based on the prototype acceptable, given its numbers (with polish in a bunch of production details before merging)?
I'd like to kick off a round of discussion and (hopefully!) consensus-building on that. Happy to talk more about this in the next Cranelift meeting but let's see what folks think here first. Thanks!
cfallin commented on issue #27:
Thanks @avanhatt! Updated based on feedback.
cfallin commented on issue #27:
Hi all -- after discussing this briefly yesterday it sounds like no one has any major objections; so given that, and the positive results we've found, I would like to make a
Motion to finalize with a disposition to merge
I've updated the list below to contain all stakeholders who have committed to Cranelift within the past three months, or participated in RFC discussions in the same timeframe.
Stakeholders sign-off
Arm
- [ ] @akirilov-arm
- [ ] @sparker-arm
- [ ] @dheaton-arm
Cornell
- [ ] @avanhatt
Embark Studios
- [ ] @bnjbvr
Fastly
- [ ] @alexcrichton
- [ ] @cfallin
- [ ] @fitzgen
- [ ] @sunfishcode
- [ ] @elliottt
- [ ] @jameysharp
- [ ] @iximeow
IBM
- [ ] @uweigand
Intel
- [ ] @abrown
- [ ] @jlb6740
Unaffiliated
- [ ] @bjorn3
cfallin commented on issue #27:
Given approvals from multiple stakeholder groups, we now enter the 10-day final comment period.
If no objections are registered by 2022-09-03, we will merge this RFC. Thanks all for the discussion!
alexcrichton edited a comment on issue #27:
Hi all -- after discussing this briefly yesterday it sounds like no one has any major objections; so given that, and the positive results we've found, I would like to make a
Motion to finalize with a disposition to merge
I've updated the list below to contain all stakeholders who have committed to Cranelift within the past three months, or participated in RFC discussions in the same timeframe.
Stakeholders sign-off
Arm
- [x] @akirilov-arm
- [ ] @sparker-arm
- [ ] @dheaton-arm
Cornell
- [x] @avanhatt
Embark Studios
- [ ] @bnjbvr
Fastly
- [x] @alexcrichton
- [x] @cfallin
- [x] @fitzgen
- [ ] @sunfishcode
- [x] @elliottt
- [x] @jameysharp
- [ ] @iximeow
IBM
- [x] @uweigand
Intel
- [ ] @abrown
- [ ] @jlb6740
Unaffiliated
- [x] @bjorn3
cfallin commented on issue #27:
The FCP has (more than) elapsed; with no objections registered, I will now merge this RFC! Thanks all for the spirited discussion and ideas. Hopefully we can soon get a proper implementation into Cranelift; my next task is to clean up the prototype and start submitting pieces for review.