Better Heuristics for Instruction Scheduling · cranelift

Hello! Me and a colleague of mine would like to tackle issue #6260, proposing a better approach for instruction scheduling using heuristics.

First, it would be good to know if the proposal is still relevant, given almost a year has passed since it was first published. For example, is there current work that would invalidate this, or, are you aware of someone already trying to implement it?

We should note that while we feel like we adequately understand the heuristics proposed in the paper, we don't have any experience in the cranelift codebase. If you deem it necessary, we would be happy to join the weekly meeting (we also don't know how to do this).

Dimitris Aspetakis (Apr 13 2024 at 15:54):

fitzgen (he/him) (Apr 13 2024 at 16:46):

Chris Fallin (Apr 13 2024 at 17:19):

Yes, it's still relevant! In my opinion at least, it'd be really useful to get at least approximate answers to: (i) how much benefit do we see with any sort of scheduling (versus the demand-driven / implicitly-sunk-to-latest approach we have now) across a wider set of benchmarks than the examples we've seen where it matters; and (ii) how low can we push the compile-time cost, to fit with the general ethos of Cranelift's optimizations

Chris Fallin (Apr 13 2024 at 17:19):

To that end my personal take is "simpler might be better?" but it's worth exploring a range of approaches to be sure

Chris Fallin (Apr 13 2024 at 17:21):

in the linked issue with the example that drove this thinking, here I gave a possibly acceptable and very simple approach: try to preserve original ordering when one can (and maybe extend the notion of ordering to "nodes created while rewriting another node"). That addresses the "llvm/user carefully scheduled the inner loop" cases, at least, without getting into complex heuristics

Notification Bot (Apr 16 2024 at 07:42):

Notification Bot (Apr 16 2024 at 07:43):

Panagiotis Karouzakis (Apr 16 2024 at 07:50):

In Issue #6260 Jamey Sharp referenced a paper that combines two greedy techniques Critical Path (CP) that "counts the maximum length of any data-dependent chain of instructions that must execute after this instruction before the end of the block". Last Used Count (LUC) that "It counts how many live-ranges will end if we place this instruction now. Put another way, for each operand, check whether there are any other instructions that will use the same value but haven't been scheduled yet". The LUC is used first, as a tie breaker we have the CP and as a last resort tie breaker we can use the smallest instruction number to be scheduled first. Other heuristics could be applied to but I think for a first approach we can try the CP+LUC first what do you think?

Chris Fallin (Apr 20 2024 at 00:27):

That plus “preserve original ordering” above; latter cheaper, lower quality, curious how much lower (and how much faster during compilation) for real benchmarks!

Dimitris Aspetakis (May 08 2024 at 16:28):

Since this has been our first interaction with cranelift, we took some time (me and Panagiotis) to understand the codebase and the egraph concept a bit. We have some questions on where the arbitrary topo-sort selection of instructions (as described in issue #6260) appears inside code. We have pinpointed three places on elaborate.rs you might be talking about:

Lastly, please tell us if such "mentoring" advice is undesirable for this difficulty level in the codebase.

Chris Fallin (May 08 2024 at 17:39):

Dimitris Aspetakis (May 11 2024 at 15:30):

Starting note: if this seems more like something that should be posted as an issue reply, please tell us.

We started a prototype implementation here for the program order retention approach.

Implementation steps

Possible mistakes

Other notes

With our current implementation we don't reorder among neither egraph-generated instructions, nor skeleton instructions. The latter we believe is already optimal for the non-heuristics approach, but for the former we are thinking the delta approach might give us bigger reorder-ready ranges. We are not sure how frequent are such new instruction generations from the egraph though, so it might not be worth it in the end to track their generation roots.

We are aware that an InstNode in Layout has a seq field. I think we don't have access to this field currently, so we created a u32 sequence number to simulate this that wraps around.

Testing

When running cargo test for the entire project (on x86_64 machines), we pass 161/1183 tests. We believe most failed tests are because they expect a fixed sequence of instructions, and we indeed see from the failure outputs that our implementation reorders some of them. There were some though that concearned us, since they duplicated some instructions. One such test is disas / load-store / x64 / load_store_static_kind_i32_index_0_guard_yes_spectre_i8_access_0x1000_offset.wat.

We tried to simply run the benchmarks from issue #6159 with our release build of wasmtime, and they indeed ran without any observable problems (noted that we are extremely inexperienced here). Even concequtive runs from the same build had measurable variance, hence when comparing to the build without our commit the results were not clear. That's probably because we did them with our laptops, but still, it might be worth it to verify the stability of the results from issue #6159 .

We had to port the run.sh script from the #6159 to the current CLI format. Also noted that we couldn't find the option to disable egraph optimization on the new wasmtime CLI.

set -e
wasmtime="../wasmtime/target/release/wasmtime run -C cache=n \
  -W relaxed-simd=y,threads=y \
  -S threads=y --"

for t in `$wasmtime ./end2end.simd.relaxed.wasm --benchmark_list_tests | rg -v FP16`; do
  # echo ====================================================================

  for file in end2end.simd.relaxed.wasm; do
    # echo == $file ==
    echo -n "wasmtime     "
    $wasmtime ./$file --benchmark_filter=$t 2>&1 | rg $t

  done
  echo

done

Dimitris Aspetakis (May 11 2024 at 22:42):

Panagiotis Karouzakis (May 13 2024 at 18:59):

@Chris Fallin Can you provide us with bigger benchmarks in order to test the implementation with more rigorous results? The results from the existing benchmarks have a lot of variance and they are small.

Afonso Bordado (May 13 2024 at 19:03):

fitzgen (he/him) (May 13 2024 at 19:03):

fitzgen (he/him) (May 13 2024 at 19:05):

the default suite (if you don't explicitly pass a particular benchmark on the CLI) is a good choice in general

fitzgen (he/him) (May 13 2024 at 19:10):

you'll want to build wasmtime-bench-api on main and then on your branch to produce two different shared libraries:

$ git checkout main
$ cargo build --release -p wasmtime-bench-api
$ cp target/release/libwasmtime_bench_api.so /tmp/main.so
$ git checkout better-inst-sched # or whatever your branch is named
$ cargo build --release -p wasmtime-bench-api
$ cp target/release/libwasmtime_bench_api.so /tmp/better-inst-sched.so

Dimitris Aspetakis (May 13 2024 at 19:32):

Thanks for the pointers! — we didn't have time to check the contributors' documentation extensively...

We also just found Chris Fallin's talk on egraphs in cranelift. We couldn't find the recording from the conference, so until now we relied on the accompanying slides, which was a bit less than ideal.

Chris Fallin (May 13 2024 at 19:44):

There's a recording linked from my website; unfortunately the conference itself did not record (there were AV problems)

Chris Fallin (May 13 2024 at 19:44):

Dimitris Aspetakis (May 13 2024 at 19:50):

Panagiotis Karouzakis (May 14 2024 at 09:02):

fitzgen (he/him) (May 14 2024 at 16:23):

Dimitris Aspetakis (May 15 2024 at 05:23):

A small update on the benchmarks from my PC (Ryzen 5700G, running Opensuse Tumbleweed), comparing our prototype program order retention optimization with main:

Compilation

Instantiation

Execution

Dimitris Aspetakis (May 15 2024 at 05:27):

We're trying to tackle the heuristics-approach implementation since yesterday, but if you have any comments on the proposed implementation for the program-order-retention approach tell us.

If you exclude some comment-notes, the diff for our prototype program-order-retention implementation is quite small — should you want to look at the actual code.

Chris Fallin (May 15 2024 at 05:38):

One high-level note: it may be the case that you’ll need to run more benchmark iterations and/or tackle system noise; instantiation time should not change at all if you’re only changing codegen (it involves only data structure + memory setup performed by the runtime, not Cranelift-compiled code) so if it has deltas on the same order as your reported results that likely indicates noise

Chris Fallin (May 15 2024 at 05:39):

We have a document on precision benchmarking somewhere written up by Jamey (involving isolating a core and frequency-pinning etc) if you want to go that far; can’t find atm as I’m on phone

Dimitris Aspetakis (May 15 2024 at 05:42):

Yeah, I basically just closed everything else along with network access while running the benchmarks, but the instantiation differences did concern me a bit indeed...

Dimitris Aspetakis (May 15 2024 at 05:42):

Dimitris Aspetakis (May 15 2024 at 05:46):

Chris Fallin (May 15 2024 at 05:58):

Dimitris Aspetakis (May 15 2024 at 07:33):

I run the same set of benchmarks after cpu isolation and frequency pinning (to 400MHz :smiling_face_with_tear:), and the results quite decisive, albeit with one unexpected instantiation result:

Dimitris Aspetakis (May 15 2024 at 07:36):

We might have missed something that makes our implementation incorrect from first principles.

It might also be the case that our restriction of disallowing restructuring across new, egraph-generated instructions is important after all — any thoughts?

Dimitris Aspetakis (May 15 2024 at 07:58):

Actually, do we expect performance gains with program-order retention on any benchmark from sightglass at all? I mean, are there such carefully-pipelined simd examples there?

Chris Fallin (May 15 2024 at 14:41):

Panagiotis Karouzakis (May 16 2024 at 15:04):

Is it possible to use the XNNPACK benchmark with the same method as sightglass benchmarks are used? The XNNPACK repository does not contain any instructions. Any thoughts?

Chris Fallin (May 16 2024 at 15:04):

Alex Crichton (May 16 2024 at 15:05):

IIRC I had to apply quite a number of hacks at the time to get XNNPACK to build from source, that was a year or two ago now though so things may be different now. In any case I don't think building from source will be easy.

Alex Crichton (May 16 2024 at 15:06):

the other "weird thing" is that I could only get it to build with wasi-threads, which makes it very different from all other benchmarks

Alex Crichton (May 16 2024 at 15:06):

Alex Crichton (May 16 2024 at 15:07):

XNNPACK is also so big I couldn't figure out at the time what was actually a benchmark to run since there seemed to be hundreds of benchmarks

Panagiotis Karouzakis (May 19 2024 at 20:27):

Does anyone know how the macro named trace can be enabled? We have reached a point in our implementation that we need it.

Chris Fallin (May 20 2024 at 03:42):

The Cargo feature trace-log on the cranelift-codegen crate controls that -- when I'm debugging I usually just throw the feature in the default list there (and make sure not to commit it/include it in PRs)

Chris Fallin (May 20 2024 at 03:42):

(the intent is to avoid even the cost of the dynamic check of log level in some performance-sensitive parts of the compiler by excluding all the logging stuff statically in ordinary builds)

Dimitris Aspetakis (May 20 2024 at 03:45):

Btw, he is talking about the heuristics implementation. We believe we have a complete implementation (along with rank pairing heaps), but now we are trying to tackle duplication of instructions and how this would affect us.

Dimitris Aspetakis (May 20 2024 at 03:47):

We also have an open issue with rematerialization (should we leave it in the scoped elaboration walk, or should we integrate it in the ready queue pass), so we have currently just removed it altogether.

Dimitris Aspetakis (May 20 2024 at 03:48):

Panagiotis Karouzakis (May 23 2024 at 19:56):

// Record the first branch we see in the block; all
// elaboration for args of *any* branch must be inserted
// before the *first* branch, because the branch group
// must remain contiguous at the end of the block.

We have some questions about that.
1) Can we have multiple branches in one single block?
2) What is the meaning of the " branch group must remain continuous" ?

Chris Fallin (May 23 2024 at 19:56):

Chris Fallin (May 23 2024 at 19:57):

It used to be that Cranelift represented multiple-target terminators (e.g., a BB ending with a conditional branch) with multiple single-target branches

Chris Fallin (May 23 2024 at 19:57):

(which itself was left over from ancient history where Cranelift used Extended Basic Blocks / EBBs that could have side-exits mid-block)

Chris Fallin (May 23 2024 at 19:58):

thankfully Trevor cleaned that up... last year sometime? certainly after that comment was written... so that we have proper single-branch-terminators now

Chris Fallin (May 23 2024 at 19:58):

the comment is alluding to subtleties that could happen when a branch other than the first had an arg that required elaboration

Chris Fallin (May 23 2024 at 19:58):

now it's safe to ignore it (and please feel free to submit a PR to remove that comment and associated logic too!)

Panagiotis Karouzakis (Jun 09 2024 at 08:30):

Just a small update for our project that me and @Dimitris Aspetakis are working on.
We are still fixing issues related to minor bugs. A few of them still remain.

Dimitris Aspetakis (Jun 13 2024 at 07:42):

Quick question: does register allocation in the lowering backend depend on rematerialization from the midend?

Dimitris Aspetakis (Jun 13 2024 at 07:43):

We don't get any errors anymore from testing cranelift-tools, except from 6 files where the order changed or we got errors due to our missing licm or rematerialization optimizations

Dimitris Aspetakis (Jun 13 2024 at 07:45):

I manually checked all 6 and the resulting IR seems valid, and changing manually the check tests to our output, we now pass everything in cranelift-tools

Jamey Sharp (Jun 13 2024 at 07:46):

Register allocation can be impacted by any mid-end change including rematerialization, just because they perturb the input to the register allocator, but I don't think there's any specific dependency between them. I'm not sure I understood the question though?

Dimitris Aspetakis (Jun 13 2024 at 07:47):

Dimitris Aspetakis (Jun 13 2024 at 07:48):

We just aren't sure where to go next if the tests in cranelift-tools pass, and we seem to have proper IR output in them...

Jamey Sharp (Jun 13 2024 at 07:50):

I'd have to look up that assert in compile.rs line 76 and figure out what it's actually checking, because that failure message doesn't mean anything to me :sweat_smile:

Dimitris Aspetakis (Jun 13 2024 at 07:53):

Yeah, I get it... What I meant earlier is if you know of the lowring stage having an explicit dependency on rematerialization from the midend. I figured it might expect a certain number of live SSA ranges at any point for example.

Jamey Sharp (Jun 13 2024 at 07:55):

Lowering and register allocation have to work even when the egraph pass doesn't run because optimizations are disabled, so there "shouldn't" be any dependency like that. But of course there can always be bugs :shrug:

Dimitris Aspetakis (Jun 13 2024 at 07:55):

Jamey Sharp (Jun 13 2024 at 07:56):

I would dig into it more with you but I can't today. If you're still having trouble next week I can hopefully take a look then!

Chris Fallin (Jun 13 2024 at 15:27):

+1 to all of Jamey's answers above re: independence of the two bits. Regalloc should be happy with any valid VCode, and we should get valid VCode from valid CLIF (ergo, one way to find this might be to explicitly run the CLIF validator at various points).

Re: the assert: compile.rs:76 is just where regalloc is invoked; this is propagating upward an error that regalloc2 detected. Specifically EntryLivein means that some register was used that was never defined (it was a live-in at entry, which isn't supposed to happen). Likely this means that an instruction is missing somewhere. Running the CLIF validator as above should find this at an earlier point and indicate exactly where in the CLIF the problem is.

The test suite is fine but mostly has small functions, so especially for algorithmic tweaks that start to get interesting with more substantial block sizes (see: instruction scheduling!), I'd expect one to have to expand the test suite to fully cover its behavior. So it doesn't surprise me that it currently passes the suite but fails on a real program. The way I would go about this is resolving the bug as above -- add the validator, add as many debug prints as needed, extract the CLIF -- then when the bug is fixed, turn it into a regression test (new entry in the suite). In general I think we'd consider something like this feature well-tested when it passes (i) the suite, (ii) fuzzing, and (iii) reasonable real programs like spidermonkey.

Jamey Sharp (Jun 13 2024 at 15:34):

Oh yeah, there's a Cranelift option to enable the verifier, right? I don't remember what the option is called. Also, this reminds me that a good step in between the small CLIF filetests and a large program like Spidermonkey might be the disas tests, which you can run with cargo test --test disas, if I remember correctly. They have a little bit more interesting control flow while still being pretty small.

Chris Fallin (Jun 13 2024 at 15:40):

enable_verifier, indeed; one could also sprinkle manual calls to the verifier throughout the compilation pipeline (e.g. just before and after the egraph pass invocation) which may be easier when running tests etc

Dimitris Aspetakis (Jun 13 2024 at 15:41):

Thank you both! The pointers are very helpful!
We pass ~1000 and we indeed fail 237 on the disas tests.

Chris Fallin (Jun 13 2024 at 15:41):

(I hesitate to give a "thumbs-up" reaction to that but at least you have more input to make progress! best of luck :-) )

Dimitris Aspetakis (Jun 13 2024 at 19:00):

It kinda seemed logical to me that our disas errors might be from checks using the previous ordering, so I gave them my blessing (with WASMTIME_TEST_BLESS=1), and the errors disappeared. We also put the following at the end of egraph_pass() (after our scheduling):

Was that what you were talking about when mentioning verification? — we tested using both verify_function() and self.verify(). The disas errors disappeared even with those verifications in place.

Dimitris Aspetakis (Jun 13 2024 at 19:06):

We then starting tackling blake3-simd from Sightglass (which we can see when compiling that it gives us a familiar error), but got stuck in trying to get it to emit egraph-optimized clif.

We want the optimized clif to diff it with the previous implementation, but the --emit-clif option from wasmtime probably gives us unoptimized clif (edit: we think that because it emits the same functions using our implementation, and using the previous one).

Do you know of a way to get optimized clif using the wasmtime binary (or another tool), or should we print it explicitly through changes in the code?

Jamey Sharp (Jun 13 2024 at 19:34):

You're right that --emit-clif prints as soon as the first CLIF is generated, before any optimization passes. I haven't been very satisfied with the options we have for getting optimized CLIF from there. What I usually do is put the unoptimized CLIF in a file and add

at the top, then run CRANELIFT_TEST_BLESS=1 cargo run -p cranelift-tools -- test <filename>, and finally look at what it wrote into comments at the end of the file.

Dimitris Aspetakis (Jun 13 2024 at 19:35):

Jamey Sharp (Jun 13 2024 at 19:36):

And I think self.verify was exactly what Chris was suggesting to do, although I'm not looking at the code right now to make sure.

Dimitris Aspetakis (Jun 13 2024 at 20:20):

We're actually passing now quite a few of the sightglass benchmarks! For now, only spidermonkey fails...

Dimitris Aspetakis (Jun 13 2024 at 20:20):

Panagiotis Karouzakis (Jun 13 2024 at 21:03):

We probably fixed all the errors. The error in Spidermonkey was due to the rank pairing heap implementation. After modifying it a bit we ran Spidermonkey with no problems. Tomorrow we will have benchmarks (without the remat and licm optimizations).

Dimitris Aspetakis (Jun 13 2024 at 22:20):

So... we are quite bad on compilation (up to 194% worse), but out of the three default benchmarks, two have no difference in execution time, and spidermonkey has 2–4% better performance in cycles. I used the suggested method pinning a core, but with one difference: I didn't pin it to its min frequency. I hope that's still valid...

Dimitris Aspetakis (Jun 13 2024 at 22:21):

I should probably mention that it's our first time writing Rust, so we our implementation probably has a lot of room for improvements...

Chris Fallin (Jun 13 2024 at 22:22):

That is what we call a, uh, "growth opportunity" :-) Worth profiling and seeing if there are simple improvements that could give quick speedups, paying attention to allocations, etc.

FWIW, the alternative "keep sequence numbers from original code" idea I had proposed was made with exactly this worry, that a full recomputation of an instruction order would be too expensive; maybe it's worth trying that too

Dimitris Aspetakis (Jun 13 2024 at 22:25):

We kinda tried that, but we got up to the point of rearranging among new egraph-generated instructions, and not across them, with mixed results if I remember correctly. Now that we have a much better understanding of everything, we might try to remove that restriction (after we write the project report for our university course :sweat_smile: ).

Dimitris Aspetakis (Jun 15 2024 at 12:15):

After a few optimizations following flamegraph sessions, we got compilation a bit closer (but it's still probably not good enough). Surprisingly, even though we didn't change anything that should affect execution performance, we are now worse (consistent with multiple runs, unlike last time). Keep in mind though that we still haven't re-applied LICM and rematerialization (currently working a bit on LICM).

Spidermonkey Benchmark Results

dimitris_aspetakis@hourmas:~/.../wasmtime-instr_sched_opt/sightglass$ taskset --cpu-list 7 cargo run -- benchmark --measure cycles --pin --processes 1 --iterations-per-process 20 --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_heuristics.so -- benchmarks/spidermonkey/benchmark.wasm
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/sightglass-cli benchmark --measure cycles --pin --processes 1 --iterations-per-process 20 --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_heuristics.so -- benchmarks/spidermonkey/benchmark.wasm`

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 9844247846.30 ± 56071233.34 (confidence = 99%)

  main.so is 1.35x to 1.36x faster than heuristics.so!

  [37470061310 37672977217.40 37781450866] heuristics.so
  [27642976158 27828729371.10 27885085254] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 106659806.00 ± 11041198.41 (confidence = 99%)

  main.so is 1.07x to 1.09x faster than heuristics.so!

  [1375717572 1390076390.70 1403583922] heuristics.so
  [1217294356 1283416584.70 1302562366] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [884146 910732.70 989558] heuristics.so
  [849338 923529.20 1552376] main.so

Dimitris Aspetakis (Jun 15 2024 at 12:18):

Ohh, I should probably do such updates on the issue from now on actually :sweat_smile:

Chris Fallin (Jun 15 2024 at 17:01):

One thing you could try, just as a data point to isolate your core contribution, would be to compare to a baseline without remat or licm either (manually disabled with a hack of some sort)

Chris Fallin (Jun 15 2024 at 17:02):

iirc, remat in particular is pretty important on SM — hoisted constants create a lot of reg pressure otherwise

Panagiotis Karouzakis (Jun 16 2024 at 22:51):

Hello, We just completed both licm and remat and we are currently running some benchmarks!
Here are some preliminary results
SpiderMonkey: 3-6% faster on execution time, 42-44% slower on compilation time.
image.png

Bz2: no difference in performance,
pulldown-cmark 2-11% worse on execution time,

We get a big variance on the sightglass execution time Δs. Any tips for that problem?
From now on we will use Vtune to optimize the compilation time and possibly lookout for the reasons that the execution time of some benchmarks is worse.

Jamey Sharp (Jun 16 2024 at 22:56):

Oh, did anyone suggest the Sightglass option --measure=perf-counters yet? (I might have the name wrong, but it's something like that.)

fitzgen (he/him) (Jun 17 2024 at 16:56):

bjorn3 (Jun 17 2024 at 20:51):

Instruction scheduling is supposed to help with ipc, which instructions retired explicitly doesn't account for. Looking at the cycles counter would be more reliable for this.

fitzgen (he/him) (Jun 17 2024 at 21:34):

true, but we're talking about the compile time overhead here, not the codegen quality

Dimitris Aspetakis (Jun 17 2024 at 21:37):

So, compared to cycles, the motivation for using retired instructions for compilation performance would be to get more stable results, or is there something else you have in mind?

fitzgen (he/him) (Jun 17 2024 at 21:48):

Panagiotis Karouzakis (Jun 18 2024 at 02:50):

For the compilation overhead, we managed to reduce it from 200% to 44% in a couple of hours using vtune. I am optimistic that we can lower it even more.
I am more concerned about getting accurate results on cpu cycles.

Jamey Sharp (Jun 18 2024 at 03:29):

Panagiotis Karouzakis (Jun 19 2024 at 21:15):

@Dimitris Aspetakis is wondering what compilation time cost is acceptable in our implementation that results in 3-6% performance gain in cycles in Spidermonkey.
I am thinking less than 2%, what do you think?

Jamey Sharp (Jun 19 2024 at 21:37):

That's a complicated question and I think different people would answer it different ways. Personally, I'm more concerned about whether the implementation is clear and easy to maintain; I feel it's okay to have some room for improvement in compile time if we're comfortable with our ability to work on it more later. Unfortunately I still haven't had a chance to carefully review your code.

One thing I'm curious about is whether the heuristics are separable. Hypothetically, could you add options that allow someone to enable or disable the critical path heuristic, or the last-use count heuristic, independently of each other? If so, does one have more impact than the other on either compile time or execution time?

There are downsides to adding options like this (more configurations to test and more confusion for users, for example) but one option I am considering is merging your changes under an off-by-default option. I would really like to have your work upstream! We just need to be confident that we can maintain it and also manage our other goals, such as compile speed.

Dimitris Aspetakis (Jun 19 2024 at 21:46):

In my mind they are quite separable, as we just use the metrics in the ordering traits implementations of the ready queue.

Jamey Sharp (Jun 19 2024 at 21:50):

Nice. Does computing either metric take a significant amount of time, or does switching to the ready queue make a big difference in compile time?

Dimitris Aspetakis (Jun 19 2024 at 21:52):

I wanted to read a bit more documentation in sightglass before I asked the question that Panagiotis mentioned...

But since he did, the question would be on the significance of compilation vs execution time in the Sightglass benchmarks. I remember seeing compilation taking a lot more cycles in e.g. Spidermonkey. Is that conforming with real-world expectations, or would a slightly better execution time amortize an e.g. 30% worse compilation time?

Dimitris Aspetakis (Jun 19 2024 at 21:54):

I believe the use of the rank pairing heap takes ~30% of the elaboration function's time. If I'm not mistaken, we only use it due to LUC needing dynamic updates.

Dimitris Aspetakis (Jun 19 2024 at 21:55):

We can probably create a branch using only the Critical Paths and a simpler sorted queue.

Jamey Sharp (Jun 19 2024 at 21:58):

The reason that's a complicated question is that it depends on how people are using Cranelift, and that varies a lot. I believe originally Cranelift was intended for use in JIT compilers, so compilation time was extremely important. It's still important, especially for uses like the Rust compiler's Cranelift backend as an alternative to LLVM, where the goal is to get debug builds quickly. But then for example at Fastly where we use Cranelift via Wasmtime, we'll compile a module once and invoke it many times, so compile time is less important there than execution time. When we maintainers make decisions about Cranelift, we're trying to keep all these different uses in mind.

Chris Fallin (Jun 19 2024 at 22:01):

I think at a high level we should consider upstreaming this under a default-off option at least -- the bar for that is much lower. "Try this option if you want to optimize further", something like an -O3 setting (whereas by analogy today's Cranelift is something like -O2).

I was about to say more but Jamey beat me to it re: compile-time :-) The only thing I'd add is that we've had a fairly high standard of "grind on the thing for weeks/months to squeeze out more performance" (at least, I've personally done this on RA2 and the original aegraphs impl) so going back the other way with say 30% regressions in compile time feels like a huge loss :-) Ideally we'd optimize to a local maximum as much as possible and see how low we can get it before turning it on by default. Again, can definitely happen in-tree as long as it doesn't affect the default build!

Chris Fallin (Jun 19 2024 at 22:03):

Oh, one other thing, decisions/discussions like this are often good to have in the context of the Cranelift weekly; would you be interested in attending and talking about your work here? (Wednesdays at 15:30 UTC)

Dimitris Aspetakis (Jun 19 2024 at 22:04):

Dimitris Aspetakis (Jun 19 2024 at 22:05):

Chris Fallin (Jun 19 2024 at 22:05):

OK cool, I can add you (both?) to the calendar invite with details -- DM me good email addresses for that?

Chris Fallin (Jun 19 2024 at 22:06):

(fwiw next few weeks probably quiet as a bunch of us will be out next week, following week is just before US holiday and folks will be out too; maybe later July is good if you want to put something on the agenda for Jul 17, 24 or 31 in bytecodealliance/meetings with a PR)

Panagiotis Karouzakis (Jul 10 2024 at 21:12):

Me and @Dimitris Aspetakis are a bit disappointed because since we completed the implementation,
we got no speedup almost on all the benchmarks except in some of the benchmarks in XNNPACK. We ran the benchmarks many times and we verified this.
We verified also that the implementation works correctly by taking a complex function out of Spidermonkey and executing separately to see the new order of instructions.

Also we are a bit stuck on improving the compilation time with overhead 20-25%. Seems the priority queue has too much overhead that we cannot find a way to avoid, especially the delete/pop operation that we do on every (pure) Instruction that we insert into the layout.
We don't want this work to go to waste so we would like to know if our implementation is close to some other heuristic we can try. Personally, I am thinking that instruction scheduling across blocks might be more beneficial, but can be more expensive also.

Jamey Sharp (Jul 10 2024 at 21:45):

It's always disappointing when that happens! We've certainly had plenty of other Cranelift changes we were excited about that didn't turn out to be helpful so I feel your pain here. If you're not ready to give up though then I'm not either!

I'm not sure from your description which things you've already tried. Did you test a version with only the LUC heuristic and not the CP heuristic? You'd previously thought that perhaps you only needed a priority queue because of the CP heuristic. It would be great to keep the critical path heuristic to improve instruction-level parallelism. But of the two I actually expect bigger wins from the last-use count heuristic reducing register pressure, and therefore reducing the number of stack spills and reloads needed.

Chris Fallin (Jul 10 2024 at 22:04):

In addition to Jamey's thoughts I wanted to add a bit of meta-encouragement / a snapshot of my own philosophy on these things, if it helps:

Basically, to echo Jamey, please feel free to keep going if you want; but we've already gotten value from this so I hope you don't feel like it "failed" in any really substantial way!

Dimitris Aspetakis (Jul 11 2024 at 13:17):

I just wrote in issue6260 a summary of our progress. Yesterday I also experimented with different implementations (using only LUC, only CP, only sequences), and I put the results in a text file in the end of that summary. The outcomes don't showcase any particular pattern from what I can tell.

Dimitris Aspetakis (Jul 11 2024 at 13:22):

The best case is that we have logical errors in our implementation :sweat_smile: — something that should surface after creating explicit testfiles checking for expected instruction scheduling.

Other than that, the only way I can think of that might get us closer to an ideal scheduling (using the current heuristics) is using weights for the critical path metric according to how "heavy" an opcode is...

Panagiotis Karouzakis (Jul 11 2024 at 19:50):

Please find the time to think and propose ideas. Maybe we can try something in the near future.
Personally I am thinking that the heuristics don't work well because the CPU can do optimal instruction scheduling almost in every single block except inside loops, this is why some of the benchmarks in XNNPACK are better with our heuristics.
Maybe moving instructions from hot paths to other basic blocks may improve performance, this my basic idea, but I don't think we have enough information to identify a hot path

Jamey Sharp (Jul 11 2024 at 20:31):

An out-of-order CPU should be able to do well with loops too, as long as it predicts the branches well enough. It will just keep looking forward and finding more instructions to run from future loop iterations until it runs out of buffer, which may be hundreds of instructions on a modern CPU.

I think CPUs are less good at dealing with registers getting spilled to the stack and reloaded later. They may be able to optimize away the load if the store is still in the store buffer, but they still have to execute the store, eventually writing all the way to main memory. So anything we can do that reduces register pressure ought to be good, I think.

We do loop-invariant code motion (LICM) already, which is the only form of "moving instructions from hot paths" that I can think of right now, but perhaps we could do it better somehow?

Panagiotis Karouzakis (Jul 11 2024 at 21:01):

I watched the Unison video from EuroLLVM2017 that combines Instruction Selection and Register Allocation into a single combinatorial problem. Obviously that is too slow but it showcases examples where LLVM does poor job and then you can implement something better based on this observation. If we had something like this on cranelift we could do the same. Perhaps we don't need it at all, if we can use Unison and compare it with the output of Cranelift to simple test programs to identify where cranelift does poor job. This could be a good plan in my opinion.

Dimitris Aspetakis (Jul 11 2024 at 23:01):

If you can't find the time to elaborate on the video above/further ideas (totally understandable), keep in mind that I'll probably reserve a "slot" in one of the next few meetings to summarize everything and give a few possible directions. I don't know if Panagiotis will be able to join since he's working, but I think I can adequately cover his proposals as well given that we usually talk them through together.

Jacob Lifshay (Jul 11 2024 at 23:53):

have you tried running the benchmarks on an in-order processor like the Arm Cortex A53 in the Raspberry Pi 3? I expect scheduling to have a much greater benefit on in-order cpus since they don't have the out-of-order architecture that basically does scheduling again in hardware.

Chris Fallin (Jul 11 2024 at 23:54):

I'm on vacation this week (I jumped in above to give encouragement but my "work brain" is solidly shut down right now!); I'll think more about this next week. I do agree at a high level that starting with examples where we know we could do better seems like a reasonable approach.

Dimitris Aspetakis (Jul 12 2024 at 00:14):

Not yet, but it certainly was in the back of my mind! The only such CPU I believe I have available is in a rpi zero w, but I have literally no idea how to run Sightglass on it (I have never touched it before :sweat_smile: ). I'll do my best to run it, and if I fail, I might try with a core 2 duo I have lying around for fun and (possibly) profit — although I'm not sure how less reordering range those had (if any).

Jacob Lifshay (Jul 12 2024 at 00:36):

for x86, basically everything after the pentium pro has out-of-order execution, so the core 2 duo isn't too likely to be substantially different than modern x86 processors

Dimitris Aspetakis (Jul 12 2024 at 12:41):

Actually, the rpi zero w I have has an older 32-bit arm processor (the rpi zero 2 has a quad core A53)...

So no luck on that side, given that wasmtine (as far as I can tell) supports only 64-bit arm.

Panagiotis Karouzakis (Jul 12 2024 at 18:13):

I think the point is to do better on out-of-order cpus, simply because those are currently in a lot of use. Probably with older in-order cpus we maybe see improvement but this isn't the objective here.

Dimitris Aspetakis (Jul 12 2024 at 18:19):

Well there are quite a few SBCs still in use that use in-order CPUs, and although it might not be in wasmtime's main targets, I believe cranelift itself does cast a wider net for deployment targets. It would at the very least be interesting to see if the critical path metric gives us anything in ILP. Of course, our main target should probably still be LUC for register spilling — something that (as Jamey Sharp mentioned) we expect to give more universal improvements.

Jamey Sharp (Jul 12 2024 at 19:19):

I do think the question of whether the CP heuristic helps on in-order CPUs is interesting, and relevant for Cranelift, so if you can find a system to test that on, feel free! If we can make that heuristic off-by-default and have no impact on compile time when it's disabled, we could certainly merge it.

It's tricky because Cranelift currently only supports 64-bit targets, and so many 64-bit CPUs are out-of-order. I did a little research and as far as I can tell the only Intel CPU that was both in-order and 64-bit was the second-generation Atom, codenamed Pineview, from 2010. I guess ARM has more to choose from here though.

Jacob Lifshay (Jul 12 2024 at 22:11):

Panagiotis Karouzakis (Jul 13 2024 at 09:51):

For that to work, we need to test at compile time if the current CPU is in-order or out of order. Does cranelift supports this? I think LLVM supports this with SchedModel, SubTargetInfo and other Classes.

Jamey Sharp (Jul 13 2024 at 16:21):

No, I'm suggesting making it an option that someone can manually enable. Later we could automate it but that's not necessary to get it merged.

Panagiotis Karouzakis (Jul 15 2024 at 17:00):

I just looked again at the paper that proposed LUC+CP for instruction scheduling.
They used SPEC2006 for their benchmarks. Most of the INT2006 benchmarks didn't show any improvement. In FP2006 they did very well in some benchmarks.

They said that the FP benchmarks have more workload than INT benchmarks and this is why the performance was increased there.

Jacob Lifshay (Jul 15 2024 at 22:24):

afaict not really, maybe try adding ngspice? (useful since it's a decently large fp program, compiles on just about anything, and could reasonably be used in web-based circuit design software)

Jacob Lifshay (Jul 15 2024 at 23:22):

turns out ngspice wants dup2, which isn't supported, so i gave up trying to compile it to wasi for now

Panagiotis Karouzakis (Jul 16 2024 at 18:50):

I have 2,5 years of experience with LLVM, but I don't know much about webAssembly benchmarks like Sightglass and stuff so please help us.. We did a lot of work to reach this point.

Jamey Sharp (Jul 16 2024 at 18:51):

You did do a lot of work to reach this point! Let's talk about it at the Cranelift meeting tomorrow.

Panagiotis Karouzakis (Jul 16 2024 at 18:54):

@Jamey Sharp My Calendar says that in my local timezone is at 16:30 noon. I just need to make sure that I don't get the time wrong and miss it.

Jamey Sharp (Jul 16 2024 at 18:55):

I like to let Zulip do timezone conversions for me: It's at 2024-07-17T08:30:00-07:00

Panagiotis Karouzakis (Jul 16 2024 at 18:56):

Dimitris Aspetakis (Aug 19 2024 at 12:43):

While we have somewhat stagnated with progress, for our convenience I tried adding an --emit-opt-clif option that follows --emit-clif, but prints functions after a (possible?) egraph optimization. It can be currently found in the latest commits of our main branch from github.com/karouzakisp/wasmtime.

I find my solution a bit hacky in a specific place, probably due to my inexperience with the borrow checker. I didn't yet bother with possible cli checks (like whether the paths for --emit-clif and --emit-opt-clif are the same) because I'm not sure how you would like the cli to behave anyway. Also, I believe the --emit-clif option is supported by a wasmtime/cranelift "explorer"? — I have not done anything there...

Dimitris Aspetakis (Aug 19 2024 at 12:45):

I'm not familiar with github contributions in general, that's why I'm asking here first — should I search/create an issue for that, should I bother with a pull request already?

Dimitris Aspetakis (Aug 19 2024 at 12:47):

I feel like its state is not ready for a pull request, and I also don't know where improvements to the branch should happen.

Jamey Sharp (Aug 19 2024 at 17:57):

Cool, and good questions! The general reason that projects like this ask people to open an issue first is so that we're aware of what people are working on and can offer advice before you get too far into a project. It can be disappointing to write a bunch of code and then find out that it might not make sense to merge it, as you know, so when we can help avoid that outcome it's nice. But when you already have code written, there's no need to open an issue, at least in my opinion. I would suggest opening a pull request that contains only your changes for --emit-opt-clif; it sounds like a useful bit of work by itself. If there are things you feel are important to improve before merging it, feel free to mention those in your pull request, but I wouldn't worry about integrating this mode with wasmtime explore or anything; that can be done in later work if necessary.

Panagiotis Karouzakis (Aug 19 2024 at 19:02):

We should also mention, that while we took some time off regarding this project, cause of vacations, etc. We are still thinking of ways and implementing some features to get some performance! When we have news we will make an update!

Dimitris Aspetakis (Sep 08 2024 at 16:57):

Using this approximation, we found that our implementation of the LUC+CP heuristic results in ~17% more spills in Spidermonkey...

Dimitris Aspetakis (Sep 08 2024 at 17:00):

I believe the paper we draw from also notes an increase in spills — with an increase in execution time. Although for their spills, it might be that they compare with a more traditional scheduler as a baseline.

Dimitris Aspetakis (Sep 08 2024 at 17:08):

Considering we prioritize LUC, these results seem a bit counter-intuitive to me. We might have issues with the implementation, or maybe the Critical Path heuristic is at fault.

Dimitris Aspetakis (Sep 08 2024 at 17:10):

We will probably check out what's going on with spills separately for LUC+sequenc and CP+sequence next...

Jamey Sharp (Sep 08 2024 at 17:20):

I agree, that's not intuitive to me. Sounds like you have a good plan. Also I want to note that the approximation that Chris suggested is not measuring how many values got spilled but rather how many times spilled values got moved around, which is subtly different. It's a useful measure because it probably correlates better with execution time, but it might be confusing depending on what you're trying to figure out.

Dimitris Aspetakis (Sep 08 2024 at 17:24):

Hmm yes, noted. We just tested it, and we still find we're worse in that metric with LUC+original_sequence...

Dimitris Aspetakis (Sep 08 2024 at 17:24):

I dunno, at this point we probably have to check manually some examples :sweat_smile:

Jamey Sharp (Sep 08 2024 at 17:25):

Yeah, that's what I would suggest! Something small where you know what outcome you want the heuristic to produce.

Jamey Sharp (Sep 08 2024 at 17:36):

The other thing you can try, if you come up with any guesses about specific situations where the heuristic might be making choices you don't want, is to add tracing that helps you find examples of those situations, and then run the compiler with that tracing enabled on a reasonably large input like Spidermonkey.

Dimitris Aspetakis (Sep 08 2024 at 17:46):

We kinda already have some respectable tracing from our "debugging days" — that was the only way we got some feedback on the correctness of our implementation too. But we surely have to check some examples more extensively.

Dimitris Aspetakis (Sep 08 2024 at 17:52):

I'm not sure about this, but maybe the dependency driven approach is already approaching spill optimality.

Jamey Sharp (Sep 08 2024 at 17:53):

Heh, I don't have any reason to believe it is, but I don't have evidence that it isn't, either.

Dimitris Aspetakis (Sep 08 2024 at 17:53):

If I remember correctly, the elaboration stack is a backward, depth-first pass. So it schedules graph "paths" in a sequence.

Dimitris Aspetakis (Sep 08 2024 at 17:57):

If the average graph approaches an upside-down tree, wouldn't this depth-first scheduling avoid scheduling parallel tree paths, which I think mostly hurt spilling?

Dimitris Aspetakis (Sep 08 2024 at 17:59):

Ehh, I should have thought on it a bit more before I wrote it down...
The sure thing is that we will examine examples more thoroughly (with the heuristics).

Jamey Sharp (Sep 08 2024 at 18:30):

It's often helpful to write things out like this, especially when you're not sure whether you have it right, and I think it's an interesting thought! I guess the situation where the heuristics should matter is when the same value is used multiple times, so it isn't a tree, right? I don't know.

Dimitris Aspetakis (Sep 08 2024 at 18:32):

Yeah, that's kinda what I had in mind.
Thinking prematurely ahead of time, it might be interesting to see if we could apply the heuristics depending on some summarizing descriptors for basic blocks...

Dimitris Aspetakis (Sep 08 2024 at 18:34):

If there exist patterns strong enough to separate basic blocks into groups that would benefit from different scheduling polices, that is.

Jamey Sharp (Sep 08 2024 at 19:03):

That kind of approach can be great, especially if some heuristics are slow to evaluate but you can quickly prove they won't help sometimes.

Panagiotis Karouzakis (Sep 08 2024 at 20:54):

I think we need to check how many LUC collisions we have, since LUC is supposed to help with spilling. Then one approach is to revert back to the data-driven scheduling. Another approach is employ a more sophisticated algorithm only to some blocks.

Dimitris Aspetakis (Sep 19 2024 at 10:27):

Although small and possibly not representative, we can see that the dep-driven scheduling is better in terms of live values (given that my application of the algorithms is indeed correct :sweat_smile: ).

Obviously, I don't really consider it a strong indication of any pattern — but it did give an idea: we probably should initialize the LUC heuristic with negative values depending on how many new values an instruction generates. Isolating the choice between (e) and (h) in the example, I believe this would result in a preference for scheduling (e) -> (g) -> (h), which would at least save us from an increase back to 5 live values from 4.

Dimitris Aspetakis (Sep 19 2024 at 10:42):

It makes sense to me intuitively. While we like removing live values by scheduling last users, we should also probably consider how many values those last users generate. The heuristic might also need a new name if we decide to make this modification...

Jamey Sharp (Sep 19 2024 at 17:58):

Ooh, yeah, I bet the paper authors didn't consider instructions with multiple results.

bjorn3 (Nov 10 2024 at 18:04):

Haven't read the paper yet, but would https://dl.acm.org/doi/10.1109/WCSE.2009.39 (A Novel Lightweight Instruction Scheduling Algorithm for Just-in-Time Compiler) be useful here?

Amanieu (Nov 10 2024 at 18:37):

Note that the paper specifically targets an in-order CPU. IMO with a OoO CPU you're better off scheduling to minimize register pressure instead.

Stream: cranelift

Topic: Better Heuristics for Instruction Scheduling

Dimitris Aspetakis (Apr 13 2024 at 15:40):

Dimitris Aspetakis (Apr 13 2024 at 15:54):

fitzgen (he/him) (Apr 13 2024 at 16:46):

fitzgen (he/him) (Apr 13 2024 at 16:46):

Chris Fallin (Apr 13 2024 at 17:19):

Chris Fallin (Apr 13 2024 at 17:19):

Chris Fallin (Apr 13 2024 at 17:21):

Notification Bot (Apr 16 2024 at 07:42):

Notification Bot (Apr 16 2024 at 07:43):

Panagiotis Karouzakis (Apr 16 2024 at 07:50):

Chris Fallin (Apr 20 2024 at 00:27):

Dimitris Aspetakis (May 08 2024 at 16:28):

Chris Fallin (May 08 2024 at 17:39):

Dimitris Aspetakis (May 11 2024 at 15:30):

Implementation steps

Possible mistakes

Other notes

Testing

Dimitris Aspetakis (May 11 2024 at 22:42):

Panagiotis Karouzakis (May 13 2024 at 18:59):

Afonso Bordado (May 13 2024 at 19:03):

fitzgen (he/him) (May 13 2024 at 19:03):

fitzgen (he/him) (May 13 2024 at 19:05):

fitzgen (he/him) (May 13 2024 at 19:10):

Dimitris Aspetakis (May 13 2024 at 19:32):

Chris Fallin (May 13 2024 at 19:44):

Chris Fallin (May 13 2024 at 19:44):

Dimitris Aspetakis (May 13 2024 at 19:50):

Panagiotis Karouzakis (May 14 2024 at 09:02):

fitzgen (he/him) (May 14 2024 at 16:23):

Dimitris Aspetakis (May 15 2024 at 05:23):

Compilation

Instantiation

Execution

Dimitris Aspetakis (May 15 2024 at 05:27):

Chris Fallin (May 15 2024 at 05:38):

Chris Fallin (May 15 2024 at 05:39):

Dimitris Aspetakis (May 15 2024 at 05:42):

Dimitris Aspetakis (May 15 2024 at 05:42):

Dimitris Aspetakis (May 15 2024 at 05:46):

Chris Fallin (May 15 2024 at 05:58):

Dimitris Aspetakis (May 15 2024 at 07:33):

Dimitris Aspetakis (May 15 2024 at 07:36):

Dimitris Aspetakis (May 15 2024 at 07:58):

Chris Fallin (May 15 2024 at 14:41):

Panagiotis Karouzakis (May 16 2024 at 15:04):

Chris Fallin (May 16 2024 at 15:04):

Alex Crichton (May 16 2024 at 15:05):

Alex Crichton (May 16 2024 at 15:06):

Alex Crichton (May 16 2024 at 15:06):

Alex Crichton (May 16 2024 at 15:07):

Panagiotis Karouzakis (May 19 2024 at 20:27):

Chris Fallin (May 20 2024 at 03:42):

Chris Fallin (May 20 2024 at 03:42):

Dimitris Aspetakis (May 20 2024 at 03:45):

Dimitris Aspetakis (May 20 2024 at 03:47):

Dimitris Aspetakis (May 20 2024 at 03:48):

Panagiotis Karouzakis (May 23 2024 at 19:56):

Chris Fallin (May 23 2024 at 19:56):

Chris Fallin (May 23 2024 at 19:57):

Chris Fallin (May 23 2024 at 19:57):

Chris Fallin (May 23 2024 at 19:57):

Chris Fallin (May 23 2024 at 19:58):

Chris Fallin (May 23 2024 at 19:58):

Chris Fallin (May 23 2024 at 19:58):

Panagiotis Karouzakis (Jun 09 2024 at 08:30):

Dimitris Aspetakis (Jun 13 2024 at 07:42):

Dimitris Aspetakis (Jun 13 2024 at 07:43):

Dimitris Aspetakis (Jun 13 2024 at 07:45):

Jamey Sharp (Jun 13 2024 at 07:46):

Dimitris Aspetakis (Jun 13 2024 at 07:47):

Dimitris Aspetakis (Jun 13 2024 at 07:47):

Dimitris Aspetakis (Jun 13 2024 at 07:48):

Jamey Sharp (Jun 13 2024 at 07:50):

Dimitris Aspetakis (Jun 13 2024 at 07:53):

Jamey Sharp (Jun 13 2024 at 07:55):

Dimitris Aspetakis (Jun 13 2024 at 07:55):

Jamey Sharp (Jun 13 2024 at 07:56):