Hello there,
I was wondering why TargetIsa::compile_function does not take a context argument. There are already frontend-facing APIs that use a context with reusable allocations to speed up compilation, and from looking into what happens inside each ISA lowering stage, there are many places where resources could be reused.
My guess is that it's because TargetIsa is immutable, but this is not a difficult problem to work around. The backend context could look something like this:
struct IsaCtx {
    // could be a SmallVec
    ctxs: Vec<Box<dyn std::any::Any>>,
}

impl IsaCtx {
    // This is easier to use since you don't need to worry about passing the
    // context to the wrong ISA.
    pub fn get_ctx<T: 'static>(&mut self, or_insert: impl FnOnce() -> T) -> &mut T {
        // The borrow checker does not allow `find_map` here, so look up the
        // index first and then reborrow.
        match self.ctxs.iter_mut().position(|c| c.downcast_mut::<T>().is_some()) {
            Some(i) => self.ctxs[i].downcast_mut().unwrap(),
            None => {
                self.ctxs.push(Box::new(or_insert()));
                self.ctxs.last_mut().unwrap().downcast_mut().unwrap()
            }
        }
    }
}
impl TargetIsa for SomeBackend {
    fn compile_function_with_ctx(
        &self,
        func: &Function,
        domtree: &DominatorTree,
        want_disasm: bool,
        ctrl_plane: &mut ControlPlane,
        ctx: &mut IsaCtx,
    ) -> CodegenResult<CompiledCodeBase<Stencil>> {
        ctx.get_ctx(SomeBackendCtx::default)
            .emit(self, func, domtree, want_disasm, ctrl_plane)
    }
    /* ... */
}
I wonder how much this would improve compile times when crunching many functions. (I could volunteer to try it out!)
Storing an Option<Box<dyn Any>> in Context for the backend to reuse allocations would be an option. It would be a lot of work to get rid of all allocations in the backend, however. Unlike the frontend, where there are a bunch of allocations that live for the entire duration of the Function, the backend allocates and deallocates all the time. The most costly lists already use SmallVec to avoid allocation most of the time, though.
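For illustration, a minimal sketch of that idea; the field and accessor names here are hypothetical, not the actual cranelift_codegen::Context layout:
pub struct Context {
    /* ...existing fields (func, domtree, ...)... */
    /// Opaque, backend-owned scratch state, kept alive across compilations.
    backend_state: Option<Box<dyn std::any::Any>>,
}

impl Context {
    /// Fetch the backend's scratch state, creating it on first use.
    pub fn backend_state_mut<T: std::any::Any + Default>(&mut self) -> &mut T {
        if self.backend_state.as_ref().map_or(true, |s| !s.is::<T>()) {
            self.backend_state = Some(Box::new(T::default()));
        }
        self.backend_state.as_mut().unwrap().downcast_mut::<T>().unwrap()
    }
}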
@bjorn3 What I should have asked first: Are backend allocations an actual issue based on profiling?
Probably not really. Regalloc is one of the biggest issues. Someone is writing a faster but dumber regalloc for GSoC.
regalloc spends ~5-10% of the time in the allocator.
and it is mostly in the B-trees. I did some experiments a while back forking the std B-trees into both bumpalo-backed allocation and arena-backed allocation with 32-bit IDs instead of full 64-bit pointers. Neither change provided any wins, unfortunately.
To the original question, I suspect reusing allocations for RA2 data structures might be an interesting experiment -- the BTreeMaps (preg allocation maps) I think will hang onto allocated memory if cleared, likewise for the large entity arrays (liverange data, etc.)
a big part of the cost in my experience is in cache misses as one touches new memory -- allocation hurts not just because the allocator itself takes time to find some memory but because it implies we're writing out lots of data to new cache lines
so reusing memory helps in that way (likely already in some level of the cache hierarchy)
RA2's API doesn't have the equivalent of Cranelift's Context, but it could be added
all that said, I forget how this was wired up with the bumpalo experiment @fitzgen (he/him) -- were you reusing the same arena backing store or was there a fresh one each function compilation?
the BTreeMaps (preg allocation maps) I think will hang onto allocated memory if cleared
I don't think it does. There is no with_capacity or shrink_to_fit like Vec and HashMap have. I believe BTreeMap allocates every node individually and doesn't keep a cache of unused nodes around.
ah, hmm, that's too bad
Are the items in the BTreeMap expected to be inserted in the same order as the key sorts? If so maybe a Vec + binary_search for retrieval would work?
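As a sketch of that suggestion (illustrative types, not regalloc2's actual ones): a map kept as a Vec of (key, value) pairs sorted by key, using binary search for both lookup and insertion. Insertion still shifts elements (O(n)), which is the concern raised further below.
struct SortedVecMap<K: Ord, V> {
    entries: Vec<(K, V)>,
}

impl<K: Ord, V> SortedVecMap<K, V> {
    fn insert(&mut self, key: K, value: V) {
        match self.entries.binary_search_by(|(k, _)| k.cmp(&key)) {
            Ok(i) => self.entries[i].1 = value,             // overwrite existing key
            Err(i) => self.entries.insert(i, (key, value)), // O(n) shift on insert
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.entries
            .binary_search_by(|(k, _)| k.cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}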
yeah it doesn't, because it is multiple allocations for each node, not one big table like Vec/HashMap, and I suspect that std takes a conservative approach and says "why do we expect this super generic collection code to do better size-classing/pooling/reuse than the actual allocator?"
Chris Fallin said:
all that said, I forget how this was wired up with the bumpalo experiment fitzgen (he/him) -- were you reusing the same arena backing store or was there a fresh one each function compilation?
whatever cranelift does, which last time I looked was creating a new reg alloc env for each function, iirc
although I also vaguely remember someone fixing that maybe?
ah, there's the MachineEnv which is the register data; that was indeed fixed at some point
I'm thinking of the actual allocation data structures though
yeah I thought maybe someone did the actual data structures
but I could be misremembering. Anyways: the perf experiments were running sightglass's spidermonkey/bz2/pulldown-cmark, so regalloc2 was exercised however wasmtime drives cranelift.
unfortunately not, the signature here doesn't allow for any mutable persisted state between runs
ah okay, well we should fix that
indeed, that's what I'm suggesting here :-)
might be interesting to revive your experiment branch specifically for the btrees -- but reusing the same bumpalo backing store between runs
(I forget the precise API, is there a way to reuse the same blocks from a torn-down arena?)
yeah, I think the PRs are still accessible in the github history
Chris Fallin said:
(I forget the precise API, is there a way to reuse the same blocks from a torn-down arena?)
the reset method will reuse the largest backing chunk from the bump arena; that will quickly give you a steady state of never needing to allocate more backing chunks
backing chunks are allocated in a roughly doubling manner
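As a sketch of that reuse pattern (assuming bumpalo's "collections" feature; Func here is a placeholder type): keep one Bump alive across function compilations and reset it between them, so the largest backing chunk is retained and, after warm-up, no new chunks need to be allocated.
use bumpalo::{collections::Vec as BumpVec, Bump};

/// Placeholder input type for this sketch.
struct Func {
    num_insts: u32,
}

fn compile_all(funcs: &[Func]) {
    // One arena, kept alive across all function compilations.
    let mut arena = Bump::new();
    for func in funcs {
        {
            // Per-function temporaries live in the arena; nothing is freed individually.
            let mut scratch = BumpVec::new_in(&arena);
            scratch.extend(0..func.num_insts);
        }
        // reset() keeps the largest backing chunk, so after warm-up no new
        // chunks need to be allocated for later functions.
        arena.reset();
    }
}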
cool, seems totally workable then
/me reaches into his reserves and pulls out a large block of free time to experiment with this
(not really but it should be high on our perf ideas list IMHO!)
I mean it really shouldn't be too hard to try, could be a good first issue
hardest part is probably measuring the perf wins
@Jakub Dóka how do you feel about the above? makes sense or need more details?
Seems good. If there is a good way to benchmark, I can try this out. Maybe a question: could the whole backend use a bump allocator?
that's a bigger lift I think, would have to be threaded through and many collections changed
good eventual goal
but we're talking about a ~200kLoC project, so, one piece at a time!
I guess let's try to lift the regalloc first and see if that helps
darn it, if only this was zig haha
@fitzgen (he/him) would you mind linking me to the branch with bumpalo? I am using regalloc2 in my project and already planned to do this on my fork (minus the bumpalo)
https://github.com/bytecodealliance/regalloc2/pull/92
https://github.com/bytecodealliance/regalloc2/pull/88
One important question: what is the unsafe code policy? I like to use unsafe if it simplifies my life, e.g. not sprinkling lifetimes everywhere an arena is used, and instead living with a few unsafe blocks behind a safe API.
By the way, regalloc3's API is specifically designed to reuse allocations.
@Jakub Dóka in general we try to avoid unsafe unless absolutely necessary: in our opinion (well, my opinion at least) it makes for much easier auditing to know for sure that there is no unsafety
we do have unsafe in I think one place in Cranelift, to wrap the lower-level raw hashtable interface in CtxHash; RA2 is completely unsafe-free
Chris Fallin said:
Jakub Dóka in general we try to avoid unsafe unless absolutely necessary: in our opinion (well, my opinion at least) it makes for much easier auditing to know for sure that there is no unsafety
I admit that most people get filtered on unsafe; I will see what's possible without it. Note though that safe arenas in Rust require a lifetime, and I am not sure if we can afford to store that anywhere, definitely not behind dyn std::any::Any.
in general I'd be OK with threading a lifetime through the backend -- it's more "honest" in some sense (if we're going to bump-allocate all compilation temporary data) than trying to hide the arena somehow
(the only use of std::any::Any I see in cranelift-codegen is in the timing submodule, in a thing that I think doesn't need to interact with this at all)
Chris Fallin said:
(the only use of std::any::Any I see in cranelift-codegen is in the timing submodule, in a thing that I think doesn't need to interact with this at all)
In order to smuggle a context into the TargetIsa trait, it needs to be behind dyn, and Rust does not allow downcasting things with lifetimes (in safe code at least). I might be able to find an arena that has an owned variant though (I have implemented that before, so it's possible, but did not publish it).
ah, interesting. OK, yeah, if we need a tiny bit of carefully-vetted unsafe somewhere, it's not the end of the world; though ideally we'd wrap it in an abstraction that minimizes misuse
Jakub Dóka said:
I admit that most people get filtered on unsafe; I will see what's possible without it. Note though that safe arenas in Rust require a lifetime, and I am not sure if we can afford to store that anywhere, definitely not behind dyn std::any::Any.
Taking that back: rustc_arena::DroplessArena has no lifetime; I confused it with the scoped version, sorry. All good.
Oh jeez, and the allocator API is unstable, I hate this so much.
Here are the changes to regalloc2
https://github.com/bytecodealliance/regalloc2/pull/196
Here are some results from my horrible benchmark:
regalloc with sorted vecs instead of btrees with reused resources : 7,363,597,706 cycles
regalloc with sorted vecs instead of btrees with no resource reuse: 8,900,538,665 cycles
regalloc from the crates.io --------------------------------------: 9,969,449,265 cycles
The benchmark is terrible because:
In any case, this is the best I've gotten
Do you just re-sort the vec after every insertion? That's definitely going to cause problems with larger functions.
Amanieu said:
Do you just re-sort the vec after every insertion? That's definitely going to cause problems with larger functions.
no, I insert with binary search
Hmm, that's still O(n) per insertion instead of O(log(n)), so it could still end up being an issue for larger functions.
It's worth benchmarking compilation of real programs.
Amanieu said:
It's worth benchmarking compilation of real programs.
I'd love to do that, but I don't know how, at least not easily
The way I usually benchmark is by building wasmtime with my local copy of regalloc2 by adding this to Cargo.toml:
[patch.crates-io]
regalloc2 = { path = "../regalloc2" }
Then grab some .wasm benchmarks from sightglass and compare the 2 builds of wasmtime (with & without your changes) using hyperfine:
hyperfine "./wasmtime.orig compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Though I guess perf stat would also work for measuring cycles.
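For example, something along these lines should give a cycle count for a single build (same paths as above, just swapping hyperfine for perf):
perf stat -e cycles ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm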
Awesome, thank you very much
@Amanieu this is just testing the difference between vecs (new) and B-trees (old)
hyperfine "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 144.2 ms ± 3.5 ms [User: 135.4 ms, System: 7.5 ms]
Range (min … max): 140.4 ms … 152.8 ms 20 runs
Benchmark 2: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 142.0 ms ± 4.3 ms [User: 134.7 ms, System: 5.8 ms]
Range (min … max): 137.1 ms … 155.0 ms 21 runs
Summary
./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
1.02 ± 0.04 times faster than ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
How should I understand the result? This seems within the margin of error
though it seems I can't get the old to beat the new
You could add --runs 100 as an argument to reduce noise a bit by running it 100 times instead of 20 times.
bjorn3 said:
You could add --runs 100 as an argument to reduce noise a bit by running it 100 times instead of 20 times.
hyperfine --runs 100 "./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm" "./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm"
Benchmark 1: ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 148.9 ms ± 6.6 ms [User: 139.9 ms, System: 7.2 ms]
Range (min … max): 139.1 ms … 178.7 ms 100 runs
Benchmark 2: ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
Time (mean ± σ): 156.8 ms ± 4.6 ms [User: 147.1 ms, System: 7.5 ms]
Range (min … max): 149.6 ms … 173.3 ms 100 runs
Summary
./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
1.05 ± 0.06 times faster than ./wasmtime.old compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
note: I have a shitty intel CPU that tends to throttle
It appears that the order in which I run the benchmark matters :smiling_face_with_tear:
./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm ran
1.03 ± 0.05 times faster than ./wasmtime.new compile -C parallel-compilation=n ../sightglass/benchmarks/pulldown-cmark/benchmark.wasm
:smiling_face_with_tear:
comparing cycles seems more reliable:
(old - new) / old =
(15608637381 - 14952009848) / 15608637381 = 0.0420682163
Okay, I have tried bumpalo + BTreeMap; the performance degraded
The allocations are actually not eliminated that well: regalloc still contains ~22% of all temporary allocations, and most of them are from SmallVecs overflowing:
image.png
At this point, using arena-backed Vecs instead of SmallVecs might help.
Okay, I reduced the temporary allocations within RA2 from 35% to 0.7% (according to heaptrack).
image.png
That's great but the total cycle count did not improve that much
(old - new) / old =
(15,552,110,365 - 14,632,439,684) / 15,552,110,365 = 0.0591347836 (~6%) (cpu_core/cycles/u from perf stat)
The peak memory usage also increased by ~15MB. I have tested this by storing the RA2 context in a thread-local, on sightglass/benchmarks/spidermonkey/benchmark.wasm
I am a bit confused about why the program takes fewer cycles but RA2 takes a bigger chunk of runtime. Could it be that allocating less in one place can make other things faster?
Okay, I managed to avoid lifetimes and unsafe code by using bumpalo::Bump in an Rc. I had to implement the Allocator trait, which is unsafe, but all it does is delegate methods. I then clear the arena through Rc::get_mut (after all arena-backed Vecs are dropped).
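A minimal sketch of that pattern (the names are illustrative; it assumes bumpalo is built with its allocator-api2 feature so that Bump itself implements allocator_api2::alloc::Allocator):
use std::rc::Rc;
use core::alloc::Layout;
use core::ptr::NonNull;
use allocator_api2::alloc::{AllocError, Allocator};
use bumpalo::Bump;

#[derive(Clone, Default)]
struct RcBump(Rc<Bump>);

// Safety: this only delegates to Bump's own Allocator implementation, and the
// Rc keeps the arena alive for as long as any collection using it exists.
unsafe impl Allocator for RcBump {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        self.0.allocate(layout)
    }
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        unsafe { self.0.deallocate(ptr, layout) }
    }
}

impl RcBump {
    /// Clear the arena between functions; only possible once every
    /// arena-backed collection has been dropped and the Rc is unique again.
    fn reset(&mut self) {
        if let Some(bump) = Rc::get_mut(&mut self.0) {
            bump.reset();
        }
    }
}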
seems like your investigations are unearthing some good stuff, but I do want to note that we can't rely on nightly-only Rust features here
fitzgen (he/him) said:
seems like your investigations are unearthing some good stuff, but I do want to note that we can't rely on nightly-only Rust features here
I used allocator-api2, I know about this
It compiles on stable
great!
Welp, I managed to save 1B out of 15.5B cycles. With some non-memory optimizations, like replacing hashmaps with vectors (where the hashmap key can index into them) and doing a linear merge instead of sorting in two places, the cycles come down by another ~150M. I am unsure what more I can do to make this faster (on the Spidermonkey bench). @Chris Fallin Are there any tests that can verify that the generated code is valid? I noticed some bugs I introduced during development were not caught by the fuzzer, but when compiling Spidermonkey I would get a crash, so I am not confident that miscompilation couldn't occur.
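For context, a sketch of the hashmap-to-vector replacement mentioned above, assuming the key is a dense index (the type and names are illustrative, not the actual regalloc2 ones):
// Illustrative only: a map keyed by a dense index, backed by a plain Vec.
struct DenseMap<V> {
    slots: Vec<Option<V>>,
}

impl<V> DenseMap<V> {
    fn new() -> Self {
        DenseMap { slots: Vec::new() }
    }
    fn insert(&mut self, key: usize, value: V) {
        if key >= self.slots.len() {
            // Grow so that `key` is a valid index; empty slots stay None.
            self.slots.resize_with(key + 1, || None);
        }
        self.slots[key] = Some(value);
    }
    fn get(&self, key: usize) -> Option<&V> {
        self.slots.get(key).and_then(|slot| slot.as_ref())
    }
}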
Ideally, passing the test suite in Wasmtime/Cranelift, passing the RA2 fuzzer (the ion_checker target -- it can take overnight or longer on 8 cores to find some bugs, so let it run for a while), and maybe passing the differential fuzz target in Wasmtime for a little bit too
for regalloc work mainly we trust the RA-specific checker (it'll have much higher throughput by banging just on RA) but again it has to run for a while
I neglected to start with though: 1B out of 15.5B cycles saved is great, thanks so much for looking into this! We'll definitely take it assuming it is passing all tests
(or, "most likely" I should say, I still need to review the details of the integration)
Chris Fallin said:
for regalloc work mainly we trust the RA-specific checker (it'll have much higher throughput by banging just on RA) but again it has to run for a while
I assume there are some knobs on the fuzzer to use more threads; my notebook can maintain 8 cores at 2 GHz, is that enough for a night?
I should probably use the highest level of optimizations (LTO, panic=abort, codegen-units=1), which is not done currently.
-j 8 to cargo fuzz should work; also I typically run with -s none, i.e. no sanitizer instrumentation, because that's not needed to find logic bugs and it's a significant speedup
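Putting those flags together, the invocation would look roughly like this (ion_checker being the fuzz target mentioned earlier):
cargo fuzz run -s none -j 8 ion_checker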
I have run the fuzzer (ion_checker) overnight; it ended with this:
#133803205: cov: 12736 ft: 18897 corp: 3233 exec/s 194 oom/timeout/crash: 0/0/0 time: 46251s job: 2599 dft_time: 0
#133872329: cov: 12736 ft: 18897 corp: 3233 exec/s 229 oom/timeout/crash: 0/0/0 time: 46269s job: 2600 dft_time: 0
#133918523: cov: 12736 ft: 18897 corp: 3233 exec/s 153 oom/timeout/crash: 0/0/0 time: 46289s job: 2601 dft_time: 0
#133971880: cov: 12736 ft: 18897 corp: 3233 exec/s 177 oom/timeout/crash: 0/0/0 time: 46308s job: 2602 dft_time: 0
#134018335: cov: 12736 ft: 18897 corp: 3233 exec/s 154 oom/timeout/crash: 0/0/0 time: 46327s job: 2603 dft_time: 0
#134087566: cov: 12736 ft: 18897 corp: 3233 exec/s 230 oom/timeout/crash: 0/0/0 time: 46345s job: 2604 dft_time: 0
@Chris Fallin I would like to add the changes I reverted during the review of the first PR, which would turn into the following PRs:
sure, I can't say for sure until seeing the PRs and we'll iterate on it but in principle I'm ok with these