Using context for `TargetIsa::compile_function` · cranelift

I was wondering why TargetIsa::compile_function does not take a context argument? There are already more frontend-facing APIs that use context with reusable allocations to optimize compilation, from looking into what happens inside each Isa lowering stage, there are many places where resources can be reused.

My guess is that it's because TargetIsa is immutable but this is not a difficult problem to work around. The backend context can look something like this:

struct IsaCtx {
    // can be smallvec
    ctxs: Vec<Box<dyn std::any::Any>>,
}

impl IsaCtx {
    // this is easier to use since you don't need to worry about passing context to wrong isa
    pub fn get_ctx<T: 'static>(&mut self, or_insert: impl FnOnce() -> T) -> &mut T{
        // rust is stpd and does not let me use find_map
        match self.ctxs.iter_mut().position(|c| c.downcast_mut::<T>().is_some()) {
            Some(i) => self.ctxs[i].downcast_mut().unwrap(),
            None => {
                self.ctxs.push(Box::new(or_insert()));
                self.ctxs.last_mut().unwap().downcast_mut().unwrap()
            }
        }
    }
}

impl TargetIsa for SomeBackend {
    fn compile_function_with_ctx(
        &self,
        func: &Function,
        domtree: &DominatorTree,
        want_disasm: bool,
        ctrl_plane: &mut ControlPlane,
        ctx: &mut IsaCtx,
    ) -> CodegenResult<CompiledCodeBase<Stencil>> {
        ctx.get_ctx(SomeBackendCtx::default).emit(self, func, domtree, want_disasm, ctrl_plane)
    }
    /* ... */
}

I wonder how much would this improve compile times when crunching many functions. (I could volunteer to try it out!)

bjorn3 (Sep 20 2024 at 16:18):

Storing a Option<Box<dyn Any>> in Context for the backend to reuse allocations would be an option.

It would be a lot of work to get rid of all allocations in the backend however. Unlike the frontend where there are a bunch of allocations that live for the entire duration of the Function, the backend allocates and deallocates all the time. The most costly lists already use SmallVec to avoid allocation most of the time however.

Jakub Dóka (Sep 20 2024 at 16:30):

@bjorn3 What I should have asked first: Are backend allocations an actual issue based on profiling?

bjorn3 (Sep 20 2024 at 16:32):

Probably not really. Regalloc is one of the biggest issues. Someone is writing a faster but dumber regalloc for GSOC.

Amanieu (Sep 20 2024 at 17:31):

fitzgen (he/him) (Sep 20 2024 at 17:34):

and it is mostly in the b-trees. I did some experiments a while back forking the std b-trees into both bumpalo-backed allocation and arena-backed allocation with 32-bit ids instead of full 64-bit pointers. neither change provided any wins, unfortunately

Chris Fallin (Sep 20 2024 at 17:38):

To the original question, I suspect reusing allocations for RA2 data structures might be an interesting experiment -- the BTreeMaps (preg allocation maps) I think will hang onto allocated memory if cleared, likewise for the large entity arrays (liverange data, etc)

Chris Fallin (Sep 20 2024 at 17:38):

a big part of the cost in my experience is in cache misses as one touches new memory -- allocation hurts not just because the allocator itself takes time to find some memory but because it implies we're writing out lots of data to new cache lines

Chris Fallin (Sep 20 2024 at 17:39):

so reusing memory helps in that way (likely already in some level of the cache hierarchy)

Chris Fallin (Sep 20 2024 at 17:39):

RA2's API doesn't have the equivalent of Cranelift's Context but it could be added

Chris Fallin (Sep 20 2024 at 17:40):

all that said, I forget how this was wired up with the bumpalo experiment @fitzgen (he/him) -- were you reusing the same arena backing store or was there a fresh one each function compilation?

bjorn3 (Sep 20 2024 at 17:40):

I don't think it does. There is no with_capacity or shrink_to_fit like Vec and HashMap do. I believe BTreeMap allocates every node individually and doesn't keep a cache with unused nodes around.

Chris Fallin (Sep 20 2024 at 17:40):

bjorn3 (Sep 20 2024 at 17:43):

Are the items in the BTreeMap expected to be inserted in the same order as the key sorts? If so maybe a Vec + binary_search for retrieval would work?

fitzgen (he/him) (Sep 20 2024 at 17:43):

yeah it doesn't because it is multiple allocations for each node, not one big table like Vec/HashMap, and I suspect that std takes a conservative approach and says "why do we expect this super generic collection code to do better size-classing/pooling/reuse than the actual allocator?"

fitzgen (he/him) (Sep 20 2024 at 17:44):

whatever cranelift does, which last time I looked was creating a new reg alloc env for each function, iirc

fitzgen (he/him) (Sep 20 2024 at 17:44):

Chris Fallin (Sep 20 2024 at 17:44):

ah, there's the MachineEnv which is the register data; that was indeed fixed at some point

Chris Fallin (Sep 20 2024 at 17:44):

fitzgen (he/him) (Sep 20 2024 at 17:45):

but I could be misremembering. anyways: the perf experiments were running sightglass's spidermonkey/bz2/pulldown-cmark, so however wasmtime uses cranelift uses regalloc2

Chris Fallin (Sep 20 2024 at 17:45):

unfortunately not, the signature here doesn't allow for any mutable persisted state between runs

fitzgen (he/him) (Sep 20 2024 at 17:46):

Chris Fallin (Sep 20 2024 at 17:46):

might be interesting to revive your experiment branch specifically for the btrees -- but reusing the same bumpalo backing store between runs

Chris Fallin (Sep 20 2024 at 17:46):

(I forget the precise API, is there a way to reuse the same blocks from a torn-down arena?)

fitzgen (he/him) (Sep 20 2024 at 17:46):

fitzgen (he/him) (Sep 20 2024 at 17:47):

the reset method will reuse the largest backing chunk from the bump arena, that will quickly give you a steady state of never needing to allocate more backing chunks

fitzgen (he/him) (Sep 20 2024 at 17:47):

Chris Fallin (Sep 20 2024 at 17:48):

/me reaches into his reserves and pulls out a large block of free time to experiment with this

Chris Fallin (Sep 20 2024 at 17:48):

fitzgen (he/him) (Sep 20 2024 at 17:49):

Chris Fallin (Sep 20 2024 at 17:49):

Jakub Dóka (Sep 20 2024 at 17:52):

Seems good, If there is a good way to benchmark I can try this out. Maybe a question: could the whole backend use a bump allocator?

Chris Fallin (Sep 20 2024 at 17:53):

that's a bigger lift I think, would have to be threaded through and many collections changed

Chris Fallin (Sep 20 2024 at 17:53):

Jakub Dóka (Sep 20 2024 at 17:53):

Jakub Dóka (Sep 20 2024 at 17:54):

Jakub Dóka (Sep 20 2024 at 17:58):

@fitzgen (he/him) would you mind linking me to the branch with bumpalo? I am using regalloc2 in my project and already planned to do this on my fork (- the bumpalo)

fitzgen (he/him) (Sep 20 2024 at 18:05):

Use b-trees based on bumpalo arenas by fitzgen · Pull Request #92 · bytecodealliance/regalloc2

Alternative to #88. cc @Amanieu Numbers don't look great though? $ cargo run --release -- benchmark -e ~/scratch/bumpalo-arena.so -e ~/scratch/main.so -m perf-counters --stop-after compilation ...

Jakub Dóka (Sep 20 2024 at 18:16):

One important question: what is the unsafe code policy? I like to use unsafe if it simplifies my life e.g. not sprinkling lifetimes everywhere where arena is used, instead live with few unsafe blocks behind safe API.

Amanieu (Sep 20 2024 at 18:18):

Chris Fallin (Sep 20 2024 at 18:20):

@Jakub Dóka in general we try to avoid unsafe unless absolutely necessary: in our opinion (well, my opinion at least) it makes for a much easier auditing to know for sure that there is no unsafety

Chris Fallin (Sep 20 2024 at 18:20):

we do have unsafe in I think one place in Cranelift to wrap the lower-level raw hashtable interface in CtxHash; RA2 is completely unsafe-free

Jakub Dóka (Sep 20 2024 at 18:39):

I admit that most people get filtered on unsafe, I will see what's possible without it. Note though that safe arenas in Rust require lifetime and I am not sure if we can afford to store that anywhere, definitely not behind dyn std::any::Any.

Chris Fallin (Sep 20 2024 at 18:41):

in general I'd be OK with threading a lifetime through the backend -- it's more "honest" in some sense (if we're going to bump-allocate all compilation temporary data) than trying to hide the arena somehow

Chris Fallin (Sep 20 2024 at 18:42):

(the only use of std::any::Any I see in cranelift-codegen is in the timing submodule, in a thing that I think doesn't need to interact with this at all)

Jakub Dóka (Sep 20 2024 at 18:49):

In order to smuggle context into the TargetIsa trait, It needs to be behind dyn and rust does not allow downcasting things with lifetimes (in safe code at least), I might be able to find an arena that has owned part tho (I have implemented that before so its possible, but did not publish it).

Chris Fallin (Sep 20 2024 at 18:53):

ah, interesting. OK, yeah, if we need a tiny bit of carefully-vetted unsafe somewhere, it's not the end of the world; though ideally we'd wrap it in an abstraction that minimizes misuse

Jakub Dóka (Sep 20 2024 at 18:53):

Taking that back rustc_arena::DroplessArena has no lifetime, I confused it with scoped version sorry, All good.

Jakub Dóka (Sep 20 2024 at 18:59):

Jakub Dóka (Sep 21 2024 at 11:33):

Adding new API that accepts resusable context. by jakubDoka · Pull Request #196 · bytecodealliance/regalloc2

Sadly due to how the code was structured, I needed to change the `Env' fields so basically everything that was used was changed as well. I did not benchmark anything yet (work in progress). Con...

Jakub Dóka (Sep 21 2024 at 18:23):

regalloc with sorted vecs instead of btrees with reused resources : 7,363,597,706 cycles
regalloc with sorted vecs instead of btrees with no resource reuse: 8,900,538,665 cycles
regalloc from the crates.io --------------------------------------: 9,969,449,265 cycles

Amanieu (Sep 21 2024 at 18:33):

Do you just re-sort the vec after every insertion? That's definitely going to cause problems with larger functions.

Jakub Dóka (Sep 21 2024 at 18:34):

Amanieu (Sep 21 2024 at 18:36):

Hmm, that's still O(n) per insertion instead of O(log(n)), so it could still end up being an issue for larger functions.

Amanieu (Sep 21 2024 at 18:37):

Jakub Dóka (Sep 21 2024 at 18:38):

Amanieu (Sep 21 2024 at 18:39):

The way I usually benchmark is by building wasmtime with my local copy of regalloc2 by adding this to Cargo.toml:

[patch.crates-io]
regalloc2 = { path = "../regalloc2" }

Amanieu (Sep 21 2024 at 18:40):

Then grab some .wasm benchmarks from sightglass and compare the 2 builds of wasmtime (with & without your changes) using hyperfine:

hyperfine "./wasmtime.orig compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"

Amanieu (Sep 21 2024 at 18:41):

Jakub Dóka (Sep 21 2024 at 18:41):

Jakub Dóka (Sep 21 2024 at 19:06):

@Amanieu this is just testing difference between vecs (new) and binary trees (old)

hyperfine "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     144.2 ms ±   3.5 ms    [User: 135.4 ms, System: 7.5 ms]
  Range (min … max):   140.4 ms … 152.8 ms    20 runs

Benchmark 2: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     142.0 ms ±   4.3 ms    [User: 134.7 ms, System: 5.8 ms]
  Range (min … max):   137.1 ms … 155.0 ms    21 runs

Summary
  ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
    1.02 ± 0.04 times faster than ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm

Jakub Dóka (Sep 21 2024 at 19:08):

bjorn3 (Sep 21 2024 at 19:11):

You could add --runs 100 as argument to reduce noise a bit by running it 100 times instead of 20 times.

Jakub Dóka (Sep 21 2024 at 19:21):

hyperfine --runs 100 "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     148.9 ms ±   6.6 ms    [User: 139.9 ms, System: 7.2 ms]
  Range (min … max):   139.1 ms … 178.7 ms    100 runs

Benchmark 2: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
  Time (mean ± σ):     156.8 ms ±   4.6 ms    [User: 147.1 ms, System: 7.5 ms]
  Range (min … max):   149.6 ms … 173.3 ms    100 runs

Summary
  ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
    1.05 ± 0.06 times faster than ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm

Jakub Dóka (Sep 21 2024 at 19:29):

It appears that the order in which I run the benchmark matters :smiling_face_with_tear:

Jakub Dóka (Sep 21 2024 at 19:38):

./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
    1.03 ± 0.05 times faster than ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm

Jakub Dóka (Sep 21 2024 at 20:36):

(old  - new) / old =
(15608637381 - 14952009848) / 15608637381 = 0.0420682163

Jakub Dóka (Sep 22 2024 at 07:32):

Jakub Dóka (Sep 22 2024 at 09:33):

The allocations are actually not eliminated that well Regalloc still contains ~22% of all temporary allocations and most of them are from SmallVecs overflowing:
image.png
At this point, using arena-backed Vecs instead of SmallVecs might help.

Jakub Dóka (Sep 22 2024 at 14:04):

Okay, I reduced the temporary allocations within RA2 from 35% to 0.7% (according to heaptrack).
image.png
That's great but the total cycle count did not improve that much

(old - new) / old =
(15,552,110,365 - 14,632,439,684) / 15,552,110,365 = 0.0591347836 (~6%) (cpu_core/cycles/u form perf stat)

The peak memory usage also increased by ~15MB. I have tested this by storing the RA2 context in thread-local on sightglass/benchmarks/spidermonkey/benchmark.wasm

Jakub Dóka (Sep 22 2024 at 18:31):

I am a bit confused about why the program takes fewer cycles but RA2 takes a bigger chunk of runtime. Could it be that allocating less in one place can make other things faster?

Jakub Dóka (Sep 23 2024 at 12:03):

Okay, I managed to avoid lifetimes and unsafe code by using bumpalo::Bump in Rc, I had to implement Allocator trait, which is unsafe but all it does is delegate methods. I then clear the arena through Rc::get_mut (after all arena-backed Vec's are dropped).

fitzgen (he/him) (Sep 23 2024 at 15:52):

seems like you're investigations are unearthing some good stuff, but I do want to note that we can't rely on nightly-only Rust features here

Jakub Dóka (Sep 23 2024 at 15:53):

Jakub Dóka (Sep 23 2024 at 15:54):

fitzgen (he/him) (Sep 23 2024 at 17:35):

Jakub Dóka (Sep 24 2024 at 16:55):

Welp, I managed to save 1b out of 15.5b cycles, with some non-memory optimizations, like replacing hashmaps with vectors (where the hashmap key can index it), and two places do linear merge instead of sorting. This brings the cycles down by another ~150m. I am unsure what more I can do to make this faster. (on Spidermonkey bench) @Chris Fallin Are there any tests that can verify that the generated code is valid? I noticed some bugs I introduced during development were not caught by Fuzzer but when compiling Spidermonkey I would get a crash, so I am not confident If miss-compilation could occur.

Chris Fallin (Sep 24 2024 at 16:56):

Ideally, passing the test-suite in Wasmtime/Cranelift, passing the RA2 fuzzer (ion_checker target -- it can take overnight or longer on 8 cores to find some bugs, let it run for a while), and maybe passing differential fuzz target in Wasmtime for a little bit too

Chris Fallin (Sep 24 2024 at 16:57):

for regalloc work mainly we trust the RA-specific checker (it'll have much higher throughput by banging just on RA) but again it has to run for a while

Chris Fallin (Sep 24 2024 at 16:58):

I neglected to start with though: 1B out of 15.5B cycles saved is great, thanks so much for looking into this! We'll definitely take it assuming it is passing all tests

Chris Fallin (Sep 24 2024 at 16:58):

(or, "most likely" I should say, I still need to review the details of the integration)

Jakub Dóka (Sep 24 2024 at 17:13):

I assume there are some nobs on the fuzzer to use more threads, my notebook can maintain 8 cores at 2ghz is that enough for a night?
I should probably use highest level of optimizations (lto, panic-abort, codegen-units=1) which is not done currently.

Chris Fallin (Sep 24 2024 at 17:14):

-j 8 to cargo fuzz should work; also I typically run with -s none, i.e. no sanitizer instrumentation, because that's not needed to find logic bugs and it's a significant speedup

Jakub Dóka (Sep 25 2024 at 06:51):

#133803205: cov: 12736 ft: 18897 corp: 3233 exec/s 194 oom/timeout/crash: 0/0/0 time: 46251s job: 2599 dft_time: 0
#133872329: cov: 12736 ft: 18897 corp: 3233 exec/s 229 oom/timeout/crash: 0/0/0 time: 46269s job: 2600 dft_time: 0
#133918523: cov: 12736 ft: 18897 corp: 3233 exec/s 153 oom/timeout/crash: 0/0/0 time: 46289s job: 2601 dft_time: 0
#133971880: cov: 12736 ft: 18897 corp: 3233 exec/s 177 oom/timeout/crash: 0/0/0 time: 46308s job: 2602 dft_time: 0
#134018335: cov: 12736 ft: 18897 corp: 3233 exec/s 154 oom/timeout/crash: 0/0/0 time: 46327s job: 2603 dft_time: 0
#134087566: cov: 12736 ft: 18897 corp: 3233 exec/s 230 oom/timeout/crash: 0/0/0 time: 46345s job: 2604 dft_time: 0

Jakub Dóka (Oct 04 2024 at 17:47):

@Chris Fallin I would like to add the changes I reverted during the review of the first pr which would turn into the following PRs:

Chris Fallin (Oct 04 2024 at 18:23):

sure, I can't say for sure until seeing the PRs and we'll iterate on it but in principle I'm ok with these

Stream: cranelift

Topic: Using context for `TargetIsa::compile_function`

Jakub Dóka (Sep 20 2024 at 16:14):

bjorn3 (Sep 20 2024 at 16:18):

Jakub Dóka (Sep 20 2024 at 16:30):

bjorn3 (Sep 20 2024 at 16:32):

Amanieu (Sep 20 2024 at 17:31):

fitzgen (he/him) (Sep 20 2024 at 17:34):

Chris Fallin (Sep 20 2024 at 17:38):

Chris Fallin (Sep 20 2024 at 17:38):

Chris Fallin (Sep 20 2024 at 17:39):

Chris Fallin (Sep 20 2024 at 17:39):

Chris Fallin (Sep 20 2024 at 17:40):

bjorn3 (Sep 20 2024 at 17:40):

Chris Fallin (Sep 20 2024 at 17:40):

bjorn3 (Sep 20 2024 at 17:43):

fitzgen (he/him) (Sep 20 2024 at 17:43):

fitzgen (he/him) (Sep 20 2024 at 17:44):

fitzgen (he/him) (Sep 20 2024 at 17:44):

Chris Fallin (Sep 20 2024 at 17:44):

Chris Fallin (Sep 20 2024 at 17:44):

fitzgen (he/him) (Sep 20 2024 at 17:45):

fitzgen (he/him) (Sep 20 2024 at 17:45):

Chris Fallin (Sep 20 2024 at 17:45):

fitzgen (he/him) (Sep 20 2024 at 17:46):

Chris Fallin (Sep 20 2024 at 17:46):

Chris Fallin (Sep 20 2024 at 17:46):

Chris Fallin (Sep 20 2024 at 17:46):

fitzgen (he/him) (Sep 20 2024 at 17:46):

fitzgen (he/him) (Sep 20 2024 at 17:47):

fitzgen (he/him) (Sep 20 2024 at 17:47):

Chris Fallin (Sep 20 2024 at 17:48):

Chris Fallin (Sep 20 2024 at 17:48):

Chris Fallin (Sep 20 2024 at 17:48):

fitzgen (he/him) (Sep 20 2024 at 17:49):

fitzgen (he/him) (Sep 20 2024 at 17:49):

Chris Fallin (Sep 20 2024 at 17:49):

Jakub Dóka (Sep 20 2024 at 17:52):

Chris Fallin (Sep 20 2024 at 17:53):

Chris Fallin (Sep 20 2024 at 17:53):

Chris Fallin (Sep 20 2024 at 17:53):

Jakub Dóka (Sep 20 2024 at 17:53):

Jakub Dóka (Sep 20 2024 at 17:54):

Jakub Dóka (Sep 20 2024 at 17:58):

fitzgen (he/him) (Sep 20 2024 at 18:05):

Jakub Dóka (Sep 20 2024 at 18:16):

Amanieu (Sep 20 2024 at 18:18):

Chris Fallin (Sep 20 2024 at 18:20):

Chris Fallin (Sep 20 2024 at 18:20):

Jakub Dóka (Sep 20 2024 at 18:39):

Chris Fallin (Sep 20 2024 at 18:41):

Chris Fallin (Sep 20 2024 at 18:42):

Jakub Dóka (Sep 20 2024 at 18:49):

Chris Fallin (Sep 20 2024 at 18:53):

Jakub Dóka (Sep 20 2024 at 18:53):

Jakub Dóka (Sep 20 2024 at 18:59):

Jakub Dóka (Sep 21 2024 at 11:33):

Jakub Dóka (Sep 21 2024 at 18:23):

Amanieu (Sep 21 2024 at 18:33):

Jakub Dóka (Sep 21 2024 at 18:34):

Amanieu (Sep 21 2024 at 18:36):

Amanieu (Sep 21 2024 at 18:37):

Jakub Dóka (Sep 21 2024 at 18:38):

Amanieu (Sep 21 2024 at 18:39):

Amanieu (Sep 21 2024 at 18:40):

Amanieu (Sep 21 2024 at 18:41):

Jakub Dóka (Sep 21 2024 at 18:41):

Jakub Dóka (Sep 21 2024 at 19:06):

Jakub Dóka (Sep 21 2024 at 19:08):

bjorn3 (Sep 21 2024 at 19:11):

Jakub Dóka (Sep 21 2024 at 19:21):

Jakub Dóka (Sep 21 2024 at 19:29):

Jakub Dóka (Sep 21 2024 at 19:38):

Jakub Dóka (Sep 21 2024 at 20:36):

Jakub Dóka (Sep 22 2024 at 07:32):

Jakub Dóka (Sep 22 2024 at 09:33):

Jakub Dóka (Sep 22 2024 at 14:04):

Jakub Dóka (Sep 22 2024 at 18:31):

Jakub Dóka (Sep 23 2024 at 12:03):

fitzgen (he/him) (Sep 23 2024 at 15:52):