Hello! A colleague and I would like to tackle issue #6260, proposing a better approach to instruction scheduling using heuristics.
First, it would be good to know if the proposal is still relevant, given almost a year has passed since it was first published. For example, is there current work that would invalidate this, or, are you aware of someone already trying to implement it?
We should note that while we feel like we adequately understand the heuristics proposed in the paper, we don't have any experience in the cranelift codebase. If you deem it necessary, we would be happy to join the weekly meeting (we also don't know how to do this).
I forgot to add a link to the issue for convenience :)
AFAIK it is still relevant
And I don’t believe anyone is poking at this kind of thing right now
Yes, it's still relevant! In my opinion at least, it'd be really useful to get at least approximate answers to: (i) how much benefit do we see with any sort of scheduling (versus the demand-driven / implicitly-sunk-to-latest approach we have now) across a wider set of benchmarks than the examples we've seen where it matters; and (ii) how low can we push the compile-time cost, to fit with the general ethos of Cranelift's optimizations
To that end my personal take is "simpler might be better?" but it's worth exploring a range of approaches to be sure
in the linked issue with the example that drove this thinking, here I gave a possibly acceptable and very simple approach: try to preserve original ordering when one can (and maybe extend the notion of ordering to "nodes created while rewriting another node"). That addresses the "llvm/user carefully scheduled the inner loop" cases, at least, without getting into complex heuristics
In issue #6260, Jamey Sharp referenced a paper that combines two greedy heuristics: Critical Path (CP), which counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block, and Last-Use Count (LUC), which counts how many live ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet. LUC is used first, CP is the tie breaker, and as a last resort we schedule the instruction with the smallest instruction number first. Other heuristics could be applied too, but I think for a first approach we can try CP+LUC. What do you think?
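To make the intended priority order concrete, here is a rough sketch (our own names, nothing from the Cranelift tree), assuming the ready queue is a max-heap where the greatest entry is scheduled next:

#[derive(PartialEq, Eq)]
struct ReadyEntry {
    luc: u32, // last-use count: live ranges that would end if scheduled now
    cp: u32,  // critical path: longest data-dependent chain to end of block
    seq: u32, // original instruction number
}

impl Ord for ReadyEntry {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.luc
            .cmp(&other.luc)                // primary: higher LUC wins
            .then(self.cp.cmp(&other.cp))   // tie break: higher CP wins
            .then(other.seq.cmp(&self.seq)) // last resort: lower inst number wins
    }
}

impl PartialOrd for ReadyEntry {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}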
That plus “preserve original ordering” above; latter cheaper, lower quality, curious how much lower (and how much faster during compilation) for real benchmarks!
Hello again!
Since this has been our first interaction with cranelift, we took some time (Panagiotis and I) to understand the codebase and the egraph concept a bit. We have some questions on where the arbitrary topo-sort selection of instructions (as described in issue #6260) appears inside the code. We have pinpointed three places in elaborate.rs you might be talking about:
1) elaborate_domtree() (although this should not concern basic blocks): for child in self.ctrl_plane.shuffled(domtree.children(block)) { ... }
2) elaborate_block(): for arg in elab_values.iter_mut() { ... }
3) process_elab_stack(): for arg in self.func.dfg.inst_values(inst).rev() { ... }
Are we (firstly) correct and (secondly) complete with our findings?
Lastly, please tell us if asking for this kind of "mentoring" advice is undesirable at this difficulty level in the codebase.
The elaboration-stack machinery is indeed what decides instruction order today
Starting note: if this seems more like something that should be posted as an issue reply, please tell us.
We started a prototype implementation here for the program order retention approach.
For each Inst removed from the function's layout, we give it a monotonically increasing sequence number using an instance of your SecondaryMap structure, and we place elaborated instructions relative to the before instruction as found in process_elab_stack(). We elaborate the original instructions (their before placements) in ascending sequence order (presumably respecting the original order), and then elaborate the egraph-generated ones (their before placements) each time we finish with a skeleton instruction. Some of the before placements might be erroneous, e.g. around maybe_remat_arg().
The SecondaryMap we use takes an Inst and gives us a sequence number, but does a Layout have duplicate Inst entries, or are all of them unique? With our current implementation we don't reorder among either egraph-generated instructions or skeleton instructions. The latter we believe is already optimal for the non-heuristics approach, but for the former we are thinking the delta approach might give us bigger reorder-ready ranges. We are not sure how frequent such new instruction generations from the egraph are, though, so it might not be worth it in the end to track their generation roots.
We are aware that an InstNode in Layout has a seq field. I think we don't have access to this field currently, so we created a u32 sequence number to simulate this, which wraps around.
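In case a sketch helps picture the bookkeeping, it is roughly this (simplified, with our own names; the actual changes live in our branch's elaborate.rs), assuming SecondaryMap from cranelift-entity and Inst from cranelift_codegen::ir:

use cranelift_codegen::ir::Inst;
use cranelift_entity::SecondaryMap;

struct SeqTracker {
    next: u32,
    seq: SecondaryMap<Inst, u32>,
}

impl SeqTracker {
    fn new() -> Self {
        SeqTracker { next: 0, seq: SecondaryMap::new() }
    }

    // Called as each instruction is pulled out of the layout before elaboration.
    fn record(&mut self, inst: Inst) {
        self.seq[inst] = self.next;
        self.next = self.next.wrapping_add(1); // wraps around, as noted above
    }

    // Original program order: the smaller sequence number elaborates first.
    fn earlier(&self, a: Inst, b: Inst) -> bool {
        self.seq[a] < self.seq[b]
    }
}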
When running cargo test for the entire project (on x86_64 machines), we pass 161/1183 tests. We believe most failed tests are because they expect a fixed sequence of instructions, and we indeed see from the failure outputs that our implementation reorders some of them. There were some, though, that concerned us, since they duplicated some instructions. One such test is disas/load-store/x64/load_store_static_kind_i32_index_0_guard_yes_spectre_i8_access_0x1000_offset.wat.
We tried to simply run the benchmarks from issue #6159 with our release build of wasmtime, and they indeed ran without any observable problems (noting that we are extremely inexperienced here). Even consecutive runs from the same build had measurable variance, so when comparing against the build without our commit the results were not clear. That's probably because we ran them on our laptops, but still, it might be worth verifying the stability of the results from issue #6159.
We had to port the run.sh script from #6159 to the current CLI format. Also note that we couldn't find the option to disable egraph optimization in the new wasmtime CLI.
Updated run.sh:
set -e
wasmtime="../wasmtime/target/release/wasmtime run -C cache=n \
-W relaxed-simd=y,threads=y \
-S threads=y --"
for t in `$wasmtime ./end2end.simd.relaxed.wasm --benchmark_list_tests | rg -v FP16`; do
# echo ====================================================================
for file in end2end.simd.relaxed.wasm; do
# echo == $file ==
echo -n "wasmtime "
$wasmtime ./$file --benchmark_filter=$t 2>&1 | rg $t
done
echo
done
Correction: we fail 161 tests out of the total 1,183 :sweat_smile:
@Chris Fallin Can you provide us with bigger benchmarks so we can test the implementation more rigorously? The results from the existing benchmarks have a lot of variance, and the benchmarks themselves are small.
Have you guys looked into using sightglass? https://github.com/bytecodealliance/sightglass
This is what we usually use to measure changes, it has a few wasm benchmarks that are useful
https://github.com/bytecodealliance/sightglass is the usual suite of benchmarks we use for cranelift and wasmtime
particularly the spidermonkey benchmark is important
the default suite (if you don't explicitly pass a particular benchmark on the CLI) is a good choice in general
the default suite contains:
bz2
you'll want to build wasmtime-bench-api on main and then on your branch to produce two different shared libraries:
$ git checkout main
$ cargo build --release -p wasmtime-bench-api
$ cp target/release/libwasmtime_bench_api.so /tmp/main.so
$ git checkout better-inst-sched # or whatever your branch is named
$ cargo build --release -p wasmtime-bench-api
$ cp target/release/libwasmtime_bench_api.so /tmp/better-inst-sched.so
and finally in sightglass do:
$ cargo run -- benchmark --engine /tmp/main.so --engine /tmp/better-inst-sched.so
Thanks for the pointers! — we didn't have time to check the contributors' documentation extensively...
We also just found Chris Fallin's talk on egraphs in cranelift. We couldn't find the recording from the conference, so until now we relied on the accompanying slides, which was a bit less than ideal.
We'll follow up with results in the next few days probably :)
There's a recording linked from my website; unfortunately the conference itself did not record (there were AV problems)
https://cfallin.org/academics/ under "invited talks"
Yes, that's the one we found a few days ago too :+1:
fitzgen (he/him) said:
https://github.com/bytecodealliance/sightglass is the usual suite of benchmarks we use for cranelift and wasmtime
particularly the spidermonkey benchmark is important
We thank you for your guidance
thank you for digging into this stuff!
A small update on the benchmarks from my PC (Ryzen 5700G, running Opensuse Tumbleweed), comparing our prototype program order retention optimization with main:
We've been trying to tackle the heuristics-approach implementation since yesterday, but if you have any comments on the proposed implementation for the program-order-retention approach, tell us.
If you exclude some comment-notes, the diff for our prototype program-order-retention implementation is quite small — should you want to look at the actual code.
One high-level note: it may be the case that you’ll need to run more benchmark iterations and/or tackle system noise; instantiation time should not change at all if you’re only changing codegen (it involves only data structure + memory setup performed by the runtime, not Cranelift-compiled code) so if it has deltas on the same order as your reported results that likely indicates noise
We have a document on precision benchmarking somewhere written up by Jamey (involving isolating a core and frequency-pinning etc) if you want to go that far; can’t find atm as I’m on phone
Yeah, I basically just closed everything else along with network access while running the benchmarks, but the instantiation differences did concern me a bit indeed...
I'll try to find that document, thanks!
Compilation being faster in pulldown-cmark
was also a bit alarming...
Here we go, found it: https://github.com/bytecodealliance/sightglass/blob/main/docs/cpu-isolation.md
I ran the same set of benchmarks after cpu isolation and frequency pinning (to 400MHz :smiling_face_with_tear:), and the results are quite decisive, albeit with one unexpected instantiation result:
(result screenshots for pulldown-cmark, hex-simd, spidermonkey, and pulldown-cmark)
All other results showed no difference in performance.
We might have missed something that makes our implementation incorrect from first principles.
It might also be the case that our restriction of disallowing restructuring across new, egraph-generated instructions is important after all — any thoughts?
Actually, do we expect performance gains with program-order retention on any benchmark from sightglass at all? I mean, are there such carefully-pipelined simd examples there?
The benchmark from the original "instruction scheduling slowdowns" issue might help?: https://github.com/bytecodealliance/wasmtime/issues/6159
Is it possible to use the XNNPACK benchmark with the same method as the sightglass benchmarks? The XNNPACK repository does not contain any instructions for this. Any thoughts?
cc @Alex Crichton
IIRC I had to apply quite a number of hacks at the time to get XNNPACK to build from source, that was a year or two ago now though so things may be different now. In any case I don't think building from source will be easy.
the other "weird thing" is that I could only get it to build with wasi-threads, which makes it very different from all other benchmarks
so... theoretically possible but will require a lot of elbow grease
XNNPACK is also so big I couldn't figure out at the time what was actually a benchmark to run since there seemed to be hundreds of benchmarks
Does anyone know how the macro named trace can be enabled? We have reached a point in our implementation where we need it.
The Cargo feature trace-log on the cranelift-codegen crate controls that -- when I'm debugging I usually just throw the feature in the default list there (and make sure not to commit it/include it in PRs)
(the intent is to avoid even the cost of the dynamic check of log level in some performance-sensitive parts of the compiler by excluding all the logging stuff statically in ordinary builds)
Btw, he is talking about the heuristics implementation. We believe we have a complete implementation (along with rank pairing heaps), but now we are trying to tackle duplication of instructions and how this would affect us.
We also have an open issue with rematerialization (should we leave it in the scoped elaboration walk, or should we integrate it in the ready queue pass), so we have currently just removed it altogether.
I'll probably ask this more formally in the proposal issue actually...
In elaborate.rs::elaborate_block::768:
// Record the first branch we see in the block; all
// elaboration for args of *any* branch must be inserted
// before the *first* branch, because the branch group
// must remain contiguous at the end of the block.
We have some questions about that.
1) Can we have multiple branches in one single block?
2) What is the meaning of "the branch group must remain contiguous"?
Ah, this is left over from before a cleanup we made a while ago
It used to be that Cranelift represented multiple-target terminators (e.g., a BB ending with a conditional branch) with multiple single-target branches
in effect, the terminator was "exploded" into multiple pieces
(which itself was left over from ancient history where Cranelift used Extended Basic Blocks / EBBs that could have side-exits mid-block)
thankfully Trevor cleaned that up... last year sometime? certainly after that comment was written... so that we have proper single-branch-terminators now
the comment is alluding to subtleties that could happen when a branch other than the first had an arg that required elaboration
now it's safe to ignore it (and please feel free to submit a PR to remove that comment and associated logic too!)
Hello @Chris Fallin and @Jamey Sharp,
Just a small update on the project that @Dimitris Aspetakis and I are working on.
We are still fixing issues related to minor bugs. A few of them still remain.
Hopefully next week we will run some benchmarks and see the results.
Quick question: does register allocation in the lowering backend depend on rematerialization from the midend?
We don't get any errors anymore from testing cranelift-tools, except for 6 files where the order changed or we got errors due to our missing licm or rematerialization optimizations.
I manually checked all 6 and the resulting IR seems valid; after manually changing the check tests to our output, we now pass everything in cranelift-tools.
Exciting!
Register allocation can be impacted by any mid-end change including rematerialization, just because they perturb the input to the register allocator, but I don't think there's any specific dependency between them. I'm not sure I understood the question though?
When running spidermonkey from the sightglass suite we get an error like this:
(screenshot of the error)
That was my intuition behind a possible dependency :sweat_smile:
We just aren't sure where to go next if the tests in cranelift-tools pass, and we seem to have proper IR output in them...
I'd have to look up that assert in compile.rs line 76 and figure out what it's actually checking, because that failure message doesn't mean anything to me :sweat_smile:
Yeah, I get it... What I meant earlier is whether you know of the lowering stage having an explicit dependency on rematerialization from the midend. I figured it might expect a certain number of live SSA ranges at any point, for example.
Lowering and register allocation have to work even when the egraph pass doesn't run because optimizations are disabled, so there "shouldn't" be any dependency like that. But of course there can always be bugs :shrug:
We'll try to look into the compile.rs errors now I guess then...
I would dig into it more with you but I can't today. If you're still having trouble next week I can hopefully take a look then!
+1 to all of Jamey's answers above re: independence of the two bits. Regalloc should be happy with any valid VCode, and we should get valid VCode from valid CLIF (ergo, one way to find this might be to explicitly run the CLIF validator at various points).
Re: the assert: compile.rs:76 is just where regalloc is invoked; this is propagating upward an error that regalloc2 detected. Specifically EntryLivein means that some register was used that was never defined (it was a live-in at entry, which isn't supposed to happen). Likely this means that an instruction is missing somewhere. Running the CLIF validator as above should find this at an earlier point and indicate exactly where in the CLIF the problem is.
Re: this
We just aren't sure where to go next if the tests in cranelift-tools pass, and we seem to have proper IR output in them...
The test suite is fine but mostly has small functions, so especially for algorithmic tweaks that start to get interesting with more substantial block sizes (see: instruction scheduling!), I'd expect one to have to expand the test suite to fully cover its behavior. So it doesn't surprise me that it currently passes the suite but fails on a real program. The way I would go about this is resolving the bug as above -- add the validator, add as many debug prints as needed, extract the CLIF -- then when the bug is fixed, turn it into a regression test (new entry in the suite). In general I think we'd consider something like this feature well-tested when it passes (i) the suite, (ii) fuzzing, and (iii) reasonable real programs like spidermonkey.
Oh yeah, there's a Cranelift option to enable the verifier, right? I don't remember what the option is called. Also, this reminds me that a good step in between the small CLIF filetests and a large program like Spidermonkey might be the disas tests, which you can run with cargo test --test disas, if I remember correctly. They have a little bit more interesting control flow while still being pretty small.
enable_verifier, indeed; one could also sprinkle manual calls to the verifier throughout the compilation pipeline (e.g. just before and after the egraph pass invocation) which may be easier when running tests etc
Thank you both! The pointers are very helpful!
We pass ~1000 and we indeed fail 237 on the disas tests.
(I hesitate to give a "thumbs-up" reaction to that but at least you have more input to make progress! best of luck :-) )
It kinda seemed logical to me that our disas errors might be from checks using the previous ordering, so I gave them my blessing (with WASMTIME_TEST_BLESS=1), and the errors disappeared. We also put the following at the end of egraph_pass() (after our scheduling):
// self.verify_if(fisa)
// verify_function(&self.func, fisa)?;
self.verify(fisa)?;
Ok(())
Was that what you were talking about when mentioning verification? — we tested using both verify_function() and self.verify(). The disas errors disappeared even with those verifications in place.
We then started tackling blake3-simd from Sightglass (which we can see gives us a familiar error when compiling), but got stuck trying to get it to emit egraph-optimized clif.
We want the optimized clif to diff it against the previous implementation, but the --emit-clif option from wasmtime probably gives us unoptimized clif (edit: we think that because it emits the same functions using our implementation and using the previous one).
Do you know of a way to get optimized clif using the wasmtime binary (or another tool), or should we print it explicitly through changes in the code?
You're right that --emit-clif prints as soon as the first CLIF is generated, before any optimization passes. I haven't been very satisfied with the options we have for getting optimized CLIF from there. What I usually do is put the unoptimized CLIF in a file and add
test optimize precise-output
set opt_level=speed
target x86_64
at the top, then run CRANELIFT_TEST_BLESS=1 cargo run -p cranelift-tools -- test <filename>, and finally look at what it wrote into comments at the end of the file.
That's exactly what we did 20 minutes ago :sweat_smile:
And I think self.verify was exactly what Chris was suggesting to do, although I'm not looking at the code right now to make sure.
We're actually passing now quite a few of the sightglass benchmarks! For now, only spidermonkey fails...
Expect benchmark numbers coming in a few hours :upside_down:
We probably fixed all the errors. The error in Spidermonkey was due to the rank pairing heap implementation. After modifying it a bit we ran Spidermonkey with no problems. Tomorrow we will have benchmarks (without the remat and licm optimizations).
So... we are quite bad on compilation (up to 194% worse), but out of the three default benchmarks, two show no difference in execution time, and spidermonkey has 2–4% better performance in cycles. I used the suggested method of pinning a core, but with one difference: I didn't pin it to its min frequency. I hope that's still valid...
I should probably mention that it's our first time writing Rust, so our implementation probably has a lot of room for improvement...
Nice to see a speedup in generated code!
So... we are quite bad on compilation (up to 194% worse)
That is what we call a, uh, "growth opportunity" :-) Worth profiling and seeing if there are simple improvements that could give quick speedups, paying attention to allocations, etc.
FWIW, the alternative "keep sequence numbers from original code" idea I had proposed was made with exactly this worry, that a full recomputation of an instruction order would be too expensive; maybe it's worth trying that too
We kinda tried that, but we only got as far as rearranging among new egraph-generated instructions, and not across them, with mixed results if I remember correctly. Now that we have a much better understanding of everything, we might try to remove that restriction (after we write the project report for our university course :sweat_smile: ).
After a few optimizations following flamegraph sessions, we got compilation a bit closer (but it's still probably not good enough). Surprisingly, even though we didn't change anything that should affect execution performance, we are now worse (consistent with multiple runs, unlike last time). Keep in mind though that we still haven't re-applied LICM and rematerialization (currently working a bit on LICM).
Spidermonkey Benchmark Results
Ohh, I should probably do such updates on the issue from now on actually :sweat_smile:
One thing you could try, just as a data point to isolate your core contribution, would be to compare to a baseline without remat or licm either (manually disabled with a hack of some sort)
iirc, remat in particular is pretty important on SM — hoisted constants create a lot of reg pressure otherwise
Hello, we just completed both licm and remat and we are currently running some benchmarks!
Here are some preliminary results
SpiderMonkey: 3-6% faster on execution time, 42-44% slower on compilation time.
Bz2: no difference in performance.
pulldown-cmark: 2-11% worse on execution time.
While we are following all the instructions from https://github.com/bytecodealliance/sightglass/blob/main/docs/cpu-isolation.md, we get a big variance in the sightglass execution time Δs. Any tips for that problem?
From now on we will use VTune to optimize the compilation time and possibly look into the reasons that the execution time of some benchmarks is worse.
Oh, did anyone suggest the Sightglass option --measure=perf-counters yet? (I might have the name wrong, but it's something like that.)
I think instructions retired is probably best
Instruction scheduling is supposed to help with ipc, which instructions retired explicitly doesn't account for. Looking at the cycles counter would be more reliable for this.
true, but we're talking about the compile time overhead here, not the codegen quality
So, compared to cycles, the motivation for using retired instructions for compilation performance would be to get more stable results, or is there something else you have in mind?
yes exactly
For the compilation overhead, we managed to reduce it from 200% to 44% in a couple of hours using vtune. I am optimistic that we can lower it even more.
I am more concerned about getting accurate results on cpu cycles.
You're making fantastic progress. This is very cool!
@Dimitris Aspetakis is wondering what compilation-time cost is acceptable for an implementation that results in a 3-6% performance gain in cycles in Spidermonkey.
I am thinking less than 2%; what do you think?
That's a complicated question and I think different people would answer it different ways. Personally, I'm more concerned about whether the implementation is clear and easy to maintain; I feel it's okay to have some room for improvement in compile time if we're comfortable with our ability to work on it more later. Unfortunately I still haven't had a chance to carefully review your code.
One thing I'm curious about is whether the heuristics are separable. Hypothetically, could you add options that allow someone to enable or disable the critical path heuristic, or the last-use count heuristic, independently of each other? If so, does one have more impact than the other on either compile time or execution time?
There are downsides to adding options like this (more configurations to test and more confusion for users, for example) but one option I am considering is merging your changes under an off-by-default option. I would really like to have your work upstream! We just need to be confident that we can maintain it and also manage our other goals, such as compile speed.
In my mind they are quite separable, as we just use the metrics in the ordering trait implementations of the ready queue.
Nice. Does computing either metric take a significant amount of time, or does switching to the ready queue make a big difference in compile time?
I wanted to read a bit more documentation in sightglass before I asked the question that Panagiotis mentioned...
But since he did, the question would be about the significance of compilation vs execution time in the Sightglass benchmarks. I remember seeing compilation taking a lot more cycles in e.g. Spidermonkey. Does that conform to real-world expectations, or would a slightly better execution time amortize an e.g. 30% worse compilation time?
Jamey Sharp said:
Nice. Does computing either metric take a significant amount of time, or does switching to the ready queue make a big difference in compile time?
I believe the use of the rank pairing heap takes ~30% of the elaboration function's time. If I'm not mistaken, we only use it due to LUC needing dynamic updates.
We can probably create a branch using only the Critical Paths and a simpler sorted queue.
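Something like this minimal sketch (std-only, our own names, not actual code from our branch) is what I have in mind: since CP is fixed once it's computed for the block, a plain max-heap with no decrease-key would be enough, unlike LUC, which changes as instructions get scheduled:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Entries are (cp, Reverse(seq)): highest CP pops first, ties go to lowest seq.
fn pop_order(ready: Vec<(u32, u32)>) -> Vec<u32> {
    let mut heap: BinaryHeap<(u32, Reverse<u32>)> = ready
        .into_iter()
        .map(|(cp, seq)| (cp, Reverse(seq)))
        .collect();
    let mut order = Vec::new();
    while let Some((_cp, Reverse(seq))) = heap.pop() {
        order.push(seq);
    }
    order
}

(This only shows the queue ordering; the readiness tracking, i.e. pushing an instruction once its operands are placed, would stay the same as in our current pass.)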
The reason that's a complicated question is that it depends on how people are using Cranelift, and that varies a lot. I believe originally Cranelift was intended for use in JIT compilers, so compilation time was extremely important. It's still important, especially for uses like the Rust compiler's Cranelift backend as an alternative to LLVM, where the goal is to get debug builds quickly. But then for example at Fastly where we use Cranelift via Wasmtime, we'll compile a module once and invoke it many times, so compile time is less important there than execution time. When we maintainers make decisions about Cranelift, we're trying to keep all these different uses in mind.
Really cool to hear about this progress!
I think at a high level we should consider upstreaming this under a default-off option at least -- the bar for that is much lower. "Try this option if you want to optimize further", something like an -O3 setting (whereas by analogy today's Cranelift is something like -O2).
I was about to say more but Jamey beat me to it re: compile-time :-) The only thing I'd add is that we've had a fairly high standard of "grind on the thing for weeks/months to squeeze out more performance" (at least, I've personally done this on RA2 and the original aegraphs impl) so going back the other way with say 30% regressions in compile time feels like a huge loss :-) Ideally we'd optimize to a local maximum as much as possible and see how low we can get it before turning it on by default. Again, can definitely happen in-tree as long as it doesn't affect the default build!
Oh, one other thing, decisions/discussions like this are often good to have in the context of the Cranelift weekly; would you be interested in attending and talking about your work here? (Wednesdays at 15:30 UTC)
Yeah, I'm certainly interested!
(and Panagiotis too I believe)
OK cool, I can add you (both?) to the calendar invite with details -- DM me good email addresses for that?
(fwiw next few weeks probably quiet as a bunch of us will be out next week, following week is just before US holiday and folks will be out too; maybe later July is good if you want to put something on the agenda for Jul 17, 24 or 31 in bytecodealliance/meetings with a PR)
Hello @Chris Fallin and @Jamey Sharp
I am afraid I have bad news to share today.
@Dimitris Aspetakis and I are a bit disappointed because, since we completed the implementation,
we have gotten almost no speedup on any of the benchmarks except some of the benchmarks in XNNPACK. We ran the benchmarks many times and we verified this.
We also verified that the implementation works correctly by taking a complex function out of Spidermonkey and executing it separately to see the new order of instructions.
Also, we are a bit stuck on improving the compilation time, which still has a 20-25% overhead. It seems the priority queue has too much overhead that we cannot find a way to avoid, especially the delete/pop operation that we do for every (pure) instruction that we insert into the layout.
We don't want this work to go to waste, so we would like to know if our implementation is close to some other heuristic we can try. Personally, I am thinking that instruction scheduling across blocks might be more beneficial, but it can also be more expensive.
It's always disappointing when that happens! We've certainly had plenty of other Cranelift changes we were excited about that didn't turn out to be helpful so I feel your pain here. If you're not ready to give up though then I'm not either!
I'm not sure from your description which things you've already tried. Did you test a version with only the LUC heuristic and not the CP heuristic? You'd previously thought that perhaps you only needed a priority queue because of the CP heuristic. It would be great to keep the critical path heuristic to improve instruction-level parallelism. But of the two I actually expect bigger wins from the last-use count heuristic reducing register pressure, and therefore reducing the number of stack spills and reloads needed.
In addition to Jamey's thoughts I wanted to add a bit of meta-encouragement / a snapshot of my own philosophy on these things, if it helps:
First off, this exploration is extremely valuable, even if it turns out not to have the perf characteristics that would let us put it in by default. All of the work we're doing is basically research, a kind of "hill-climbing" trying to find ways to move toward better perf (or maintainability or correctness or ...). Finding out that in a certain direction the slope is actually downward gives us a lot of information -- it means others don't spend time on it later, or perhaps we've learned which specific parts were expensive, or are inspired to think about certain aspects of the problem more, etc.
So in that sense I almost think of the current state of the tree -- what's in, what's out, what's on by default, which approach we've taken -- as the sum of all approaches we've tried, and whichever one "wins" for our current metrics is incidental.
Secondly, conditions change over time; maybe we find new benchmarks where good scheduling matters more, or maybe we try something more exotic with our elab pass that requires active scheduling more urgently, or ... basically it's useful to know that we can do this if we need to and that it addressed the one case (xnnpack) where we saw the issue. That doesn't mean we merge it now necessarily but does mean it's useful to keep the branch and notes on its final state and conclusions around.
Basically, to echo Jamey, please feel free to keep going if you want; but we've already gotten value from this so I hope you don't feel like it "failed" in any really substantial way!
Hello from me too!
I just wrote a summary of our progress in issue #6260. Yesterday I also experimented with different implementations (using only LUC, only CP, only sequences), and I put the results in a text file at the end of that summary. The outcomes don't show any particular pattern from what I can tell.
The best case is that we have logical errors in our implementation :sweat_smile: — something that should surface after creating explicit testfiles checking for expected instruction scheduling.
Other than that, the only way I can think of that might get us closer to an ideal scheduling (using the current heuristics) is using weights for the critical path metric according to how "heavy" an opcode is...
Chris Fallin said:
In addition to Jamey's thoughts I wanted to add a bit of meta-encouragement / a snapshot of my own philosophy on these things, if it helps:
First off, this exploration is extremely valuable, even if it turns out not to have the perf characteristics that would let us put it in by default. All of the work we're doing is basically research, a kind of "hill-climbing" trying to find ways to move toward better perf (or maintainability or correctness or ...). Finding out that in a certain direction the slope is actually downward gives us a lot of information -- it means others don't spend time on it later, or perhaps we've learned which specific parts were expensive, or are inspired to think about certain aspects of the problem more, etc.
So in that sense I almost think of the current state of the tree -- what's in, what's out, what's on by default, which approach we've taken -- as the sum of all approaches we've tried, and whichever one "wins" for our current metrics is incidental.
Secondly, conditions change over time; maybe we find new benchmarks where good scheduling matters more, or maybe we try something more exotic with our elab pass that requires active scheduling more urgently, or ... basically it's useful to know that we can do this if we need to and that it addressed the one case (xnnpack) where we saw the issue. That doesn't mean we merge it now necessarily but does mean it's useful to keep the branch and notes on its final state and conclusions around.
Basically, to echo Jamey, please feel free to keep going if you want; but we've already gotten value from this so I hope you don't feel like it "failed" in any really substantial way!
Please find the time to think and propose ideas. Maybe we can try something in the near future.
Personally, I am thinking that the heuristics don't work well because the CPU can do optimal instruction scheduling in almost every single block except inside loops; this is why some of the benchmarks in XNNPACK are better with our heuristics.
Maybe moving instructions from hot paths to other basic blocks could improve performance; that's my basic idea, but I don't think we have enough information to identify a hot path.
An out-of-order CPU should be able to do well with loops too, as long as it predicts the branches well enough. It will just keep looking forward and finding more instructions to run from future loop iterations until it runs out of buffer, which may be hundreds of instructions on a modern CPU.
I think CPUs are less good at dealing with registers getting spilled to the stack and reloaded later. They may be able to optimize away the load if the store is still in the store buffer, but they still have to execute the store, eventually writing all the way to main memory. So anything we can do that reduces register pressure ought to be good, I think.
We do loop-invariant code motion (LICM) already, which is the only form of "moving instructions from hot paths" that I can think of right now, but perhaps we could do it better somehow?
I watched the Unison video from EuroLLVM 2017 that combines instruction selection and register allocation into a single combinatorial problem. Obviously that is too slow, but it showcases examples where LLVM does a poor job, and then you can implement something better based on that observation. If we had something like this in cranelift we could do the same. Perhaps we don't need it at all, if we can use Unison and compare its output with the output of Cranelift on simple test programs to identify where cranelift does a poor job. This could be a good plan in my opinion.
@Chris Fallin and @Jamey Sharp Please give your feedback.
If you can't find the time to elaborate on the video above/further ideas (totally understandable), keep in mind that I'll probably reserve a "slot" in one of the next few meetings to summarize everything and give a few possible directions. I don't know if Panagiotis will be able to join since he's working, but I think I can adequately cover his proposals as well given that we usually talk them through together.
Panagiotis Karouzakis said:
we have gotten almost no speedup on any of the benchmarks except some of the benchmarks in XNNPACK. We ran the benchmarks many times and we verified this.
have you tried running the benchmarks on an in-order processor like the Arm Cortex A53 in the Raspberry Pi 3? I expect scheduling to have a much greater benefit on in-order cpus since they don't have the out-of-order architecture that basically does scheduling again in hardware.
Chris Fallin ... Please give your feedback.
I'm on vacation this week (I jumped in above to give encouragement but my "work brain" is solidly shut down right now!); I'll think more about this next week. I do agree at a high level that starting with examples where we know we could do better seems like a reasonable approach.
have you tried running the benchmarks on an in-order processor like the Arm Cortex A53 in the Raspberry Pi 3?
Not yet, but it certainly was in the back of my mind! The only such CPU I believe I have available is in an rpi zero w, but I have literally no idea how to run Sightglass on it (I have never touched it before :sweat_smile: ). I'll do my best to run it, and if I fail, I might try with a core 2 duo I have lying around for fun and (possibly) profit — although I'm not sure how much less reordering range those had (if any).
for x86, basically everything after the pentium pro has out-of-order execution, so the core 2 duo isn't too likely to be substantially different than modern x86 processors
Actually, the rpi zero w I have has an older 32-bit arm processor (the rpi zero 2 has a quad-core A53)...
So no luck on that side, given that wasmtime (as far as I can tell) supports only 64-bit arm.
Jacob Lifshay said:
Panagiotis Karouzakis said:
we have gotten almost no speedup on any of the benchmarks except some of the benchmarks in XNNPACK. We ran the benchmarks many times and we verified this.
have you tried running the benchmarks on an in-order processor like the Arm Cortex A53 in the Raspberry Pi 3? I expect scheduling to have a much greater benefit on in-order cpus since they don't have the out-of-order architecture that basically does scheduling again in hardware.
I think the point is to do better on out-of-order CPUs, simply because those are currently in a lot of use. We would probably see some improvement on older in-order CPUs, but that isn't the objective here.
Well there are quite a few SBCs still in use that use in-order CPUs, and although it might not be in wasmtime's main targets, I believe cranelift itself does cast a wider net for deployment targets. It would at the very least be interesting to see if the critical path metric gives us anything in ILP. Of course, our main target should probably still be LUC for register spilling — something that (as Jamey Sharp mentioned) we expect to give more universal improvements.
I do think the question of whether the CP heuristic helps on in-order CPUs is interesting, and relevant for Cranelift, so if you can find a system to test that on, feel free! If we can make that heuristic off-by-default and have no impact on compile time when it's disabled, we could certainly merge it.
It's tricky because Cranelift currently only supports 64-bit targets, and so many 64-bit CPUs are out-of-order. I did a little research and as far as I can tell the only Intel CPU that was both in-order and 64-bit was the second-generation Atom, codenamed Pineview, from 2010. I guess ARM has more to choose from here though.
quite a lot of riscv cpus are in-order, maybe try on one of those?
Jamey Sharp said:
I do think the question of whether the CP heuristic helps on in-order CPUs is interesting, and relevant for Cranelift, so if you can find a system to test that on, feel free! If we can make that heuristic off-by-default and have no impact on compile time when it's disabled, we could certainly merge it.
It's tricky because Cranelift currently only supports 64-bit targets, and so many 64-bit CPUs are out-of-order. I did a little research and as far as I can tell the only Intel CPU that was both in-order and 64-bit was the second-generation Atom, codenamed Pineview, from 2010. I guess ARM has more to choose from here though.
For that to work, we need to test at compile time whether the current CPU is in-order or out-of-order. Does cranelift support this? I think LLVM supports this with SchedModel, SubTargetInfo and other classes.
No, I'm suggesting making it an option that someone can manually enable. Later we could automate it but that's not necessary to get it merged.
I just looked again at the paper that proposed LUC+CP for instruction scheduling.
They used SPEC2006 for their benchmarks. Most of the INT2006 benchmarks didn't show any improvement. In FP2006 they did very well on some benchmarks.
They said that the FP benchmarks have larger workloads than the INT benchmarks, and this is why the performance increased there.
They also counted the number of spills by LLVM's regalloc.
So I'm wondering, does Sightglass have large FP programs for benchmarking?
afaict not really, maybe try adding ngspice? (useful since it's a decently large fp program, compiles on just about anything, and could reasonably be used in web-based circuit design software)
turns out ngspice wants dup2, which isn't supported, so i gave up trying to compile it to wasi for now
I have 2.5 years of experience with LLVM, but I don't know much about WebAssembly benchmarks like Sightglass and such, so please help us. We did a lot of work to reach this point.
You did do a lot of work to reach this point! Let's talk about it at the Cranelift meeting tomorrow.
@Jamey Sharp My calendar says that in my local timezone it's at 16:30. I just need to make sure that I don't get the time wrong and miss it.
I like to let Zulip do timezone conversions for me: It's at
It still shows it at 16:30 for me! Thank you
Hello, after a while!
While we have somewhat stagnated with progress, for our convenience I tried adding an --emit-opt-clif option that follows --emit-clif, but prints functions after a (possible?) egraph optimization. It can currently be found in the latest commits of our main branch at github.com/karouzakisp/wasmtime.
I find my solution a bit hacky in a specific place, probably due to my inexperience with the borrow checker. I didn't yet bother with possible CLI checks (like whether the paths for --emit-clif and --emit-opt-clif are the same) because I'm not sure how you would like the CLI to behave anyway. Also, I believe the --emit-clif option is supported by a wasmtime/cranelift "explorer"? — I have not done anything there...
I'm not familiar with github contributions in general, that's why I'm asking here first — should I search/create an issue for that, should I bother with a pull request already?
I feel like its state is not ready for a pull request, and I also don't know where improvements to the branch should happen.
Cool, and good questions! The general reason that projects like this ask people to open an issue first is so that we're aware of what people are working on and can offer advice before you get too far into a project. It can be disappointing to write a bunch of code and then find out that it might not make sense to merge it, as you know, so when we can help avoid that outcome it's nice. But when you already have code written, there's no need to open an issue, at least in my opinion. I would suggest opening a pull request that contains only your changes for --emit-opt-clif; it sounds like a useful bit of work by itself. If there are things you feel are important to improve before merging it, feel free to mention those in your pull request, but I wouldn't worry about integrating this mode with wasmtime explore or anything; that can be done in later work if necessary.
We should also mention that, while we took some time off from this project because of vacations, etc., we are still thinking of ways and implementing some features to get some performance! When we have news we will make an update!
Using this approximation, we found that our implementation of the LUC+CP heuristic results in ~17% more spills in Spidermonkey...
I believe the paper we draw from also notes an increase in spills — with an increase in execution time. Although for their spills, it might be that they compare with a more traditional scheduler as a baseline.
Considering we prioritize LUC, these results seem a bit counter-intuitive to me. We might have issues with the implementation, or maybe the Critical Path heuristic is at fault.
We will probably check out what's going on with spills separately for LUC+sequence and CP+sequence next...
I agree, that's not intuitive to me. Sounds like you have a good plan. Also I want to note that the approximation that Chris suggested is not measuring how many values got spilled but rather how many times spilled values got moved around, which is subtly different. It's a useful measure because it probably correlates better with execution time, but it might be confusing depending on what you're trying to figure out.
Hmm yes, noted. We just tested it, and we still find we're worse in that metric with LUC+original_sequence...
I dunno, at this point we probably have to check manually some examples :sweat_smile:
Yeah, that's what I would suggest! Something small where you know what outcome you want the heuristic to produce.
The other thing you can try, if you come up with any guesses about specific situations where the heuristic might be making choices you don't want, is to add tracing that helps you find examples of those situations, and then run the compiler with that tracing enabled on a reasonably large input like Spidermonkey.
We kinda already have some respectable tracing from our "debugging days" — that was the only way we got some feedback on the correctness of our implementation too. But we surely have to check some examples more extensively.
I'm not sure about this, but maybe the dependency driven approach is already approaching spill optimality.
Heh, I don't have any reason to believe it is, but I don't have evidence that it isn't, either.
If I remember correctly, the elaboration stack is a backward, depth-first pass. So it schedules graph "paths" in a sequence.
If the average graph approaches an upside-down tree, wouldn't this depth-first scheduling avoid scheduling parallel tree paths, which I think mostly hurt spilling?
Ehh, I should have thought on it a bit more before I wrote it down...
The sure thing is that we will examine examples more thoroughly (with the heuristics).
It's often helpful to write things out like this, especially when you're not sure whether you have it right, and I think it's an interesting thought! I guess the situation where the heuristics should matter is when the same value is used multiple times, so it isn't a tree, right? I don't know.
Yeah, that's kinda what I had in mind.
Thinking ahead prematurely, it might be interesting to see if we could apply the heuristics depending on some summarizing descriptors for basic blocks...
If there exist patterns strong enough to separate basic blocks into groups that would benefit from different scheduling polices, that is.
That kind of approach can be great, especially if some heuristics are slow to evaluate but you can quickly prove they won't help sometimes.
I think we need to check how many LUC collisions we have, since LUC is supposed to help with spilling. Then one approach is to revert to the data-driven scheduling. Another approach is to employ a more sophisticated algorithm only on some blocks.
Following the discussion above, I drew a small example to gain stronger intuition on it:
Example of the LUC-CP heuristic VS a dependency-driven scheduling
Although small and possibly not representative, we can see that the dep-driven scheduling is better in terms of live values (given that my application of the algorithms is indeed correct :sweat_smile: ).
Obviously, I don't really consider it a strong indication of any pattern — but it did give an idea: we probably should initialize the LUC heuristic with negative values depending on how many new values an instruction generates. Isolating the choice between (e) and (h) in the example, I believe this would result in a preference for scheduling (e) -> (g) -> (h), which would at least save us from an increase back to 5 live values from 4.
It makes sense to me intuitively. While we like removing live values by scheduling last users, we should also probably consider how many values those last users generate. The heuristic might also need a new name if we decide to make this modification...
I'll probably implement this and report back if we gain anything.
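Concretely, what I have in mind is roughly this (hypothetical helper, not code from our branch): instead of starting the LUC score at zero, start it below zero by the number of results the instruction defines, so an instruction that frees several operands but also defines several new values is no longer preferred over one that frees the same number and defines only one.

// Net reduction in live values if we schedule this instruction now:
// operands whose live ranges end here, minus the new values it defines.
fn adjusted_luc(last_use_count: u32, num_results: u32) -> i32 {
    last_use_count as i32 - num_results as i32
}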
Ooh, yeah, I bet the paper authors didn't consider instructions with multiple results.
Haven't read the paper yet, but would https://dl.acm.org/doi/10.1109/WCSE.2009.39 (A Novel Lightweight Instruction Scheduling Algorithm for Just-in-Time Compiler) be useful here?
Note that the paper specifically targets an in-order CPU. IMO with an OoO CPU you're better off scheduling to minimize register pressure instead.