Stream: cranelift

Topic: regalloc3 benchmarks


view this post on Zulip Erik Rose (Feb 26 2025 at 20:36):

@Amanieu Per our exchange at today's meeting, here are the benchmarks I ran against wasmtime main (79e1b5374710cd66af9f6330b4c2a231e61d9866) vs. https://github.com/bytecodealliance/wasmtime/commit/eb2acdfb4f27795856c7455824b08bb705d60e64, which pulls in your regalloc3 branch of regalloc2 (but not your adjust-split branch). You had expected a 50% slowdown in compilation, but I recalled a 25% one. Some benchmarks were indeed up in the 20s, but, on average, they were 14% slower on the low end of the CI, 17% slower on the high. I offer this only in case you find it interesting/surprising and by no means want to slow your work on the allocator, which will benefit us all! :-) Again, I ran these on an ARM chip (M3 Max), which has double the GPRs as an x64.

regalloc3 all suite compilation only.txt

I didn't change anything in there yet anyway.

view this post on Zulip fitzgen (he/him) (Feb 26 2025 at 20:39):

I'd suggest focusing on just spidermonkey, bz2, and pulldown-cmark. maybe as a second tier intgemm-simd, meshoptimizer, and libsodium. but all the shootout stuff is just tiny micro benchmarks that aren't really interesting unless you are focused on optimizing the thing they happen to micro bench.

view this post on Zulip fitzgen (he/him) (Feb 26 2025 at 20:40):

in practice, the shootout benchmarks are going to just be noise unless you're trying to focus on their specific thing

view this post on Zulip fitzgen (he/him) (Feb 26 2025 at 20:41):

I guess blake3 is fairly interesting interesting

view this post on Zulip Erik Rose (Feb 26 2025 at 20:42):

So 28-29, 17-20, and 25-27%, respectively.
2nd tier: 25-26, 17-21, and ≈15-20

Doesn't really change the numbers, but sure will save me time benchmarking! Thanks!

view this post on Zulip fitzgen (he/him) (Feb 26 2025 at 20:44):

right yeah, I'm just trying to help make benchmarking easier and make it easier to evaluate the results

view this post on Zulip fitzgen (he/him) (Feb 26 2025 at 20:47):

you can also benchmark particular phases as well, if you want to focus only on compile time for example, you can do

$ sightglass-cli benchmark --benchmark-phase compilation ...

view this post on Zulip Erik Rose (Feb 26 2025 at 20:49):

I can also have it spit out JSON and save myself a lot of regexes, but I have yet to remember to do that until the middle of a long run. ;-)

view this post on Zulip Amanieu (Feb 26 2025 at 22:21):

FYI I run my benchmarks with the shuffling allocator disabled because it makes allocations extremely slow and distorts the runtime. That could be an explanation of why you're seeing different results.

view this post on Zulip Chris Fallin (Feb 26 2025 at 23:01):

ah, right, I filed https://github.com/bytecodealliance/sightglass/issues/280 a while ago to switch the default but then ran out of spare energy to do it; @Erik Rose that could be an easy PR to make

A recent report by @d-sonuga indicates wildly differing data when measuring the impact of regalloc improvements on compile time depending on whether Sightglass's shuffling allocator is enabled or n...

view this post on Zulip Amanieu (Feb 26 2025 at 23:01):

It's a one-line change:

diff --git a/crates/bench-api/Cargo.toml b/crates/bench-api/Cargo.toml
index a171b0beed..de288922ba 100644
--- a/crates/bench-api/Cargo.toml
+++ b/crates/bench-api/Cargo.toml
@@ -35,5 +35,5 @@ clap = { workspace = true }
 wat = { workspace = true }

 [features]
-default = ["shuffling-allocator", "wasi-nn"]
+default = ["wasi-nn"]
 wasi-nn = ["wasmtime-wasi-nn"]

view this post on Zulip Chris Fallin (Feb 26 2025 at 23:03):

I must have been very low on spare energy then :-)

view this post on Zulip Amanieu (Feb 26 2025 at 23:12):

@Erik Rose Here are the results I am getting on my (admittedly quite perf-noisy) machine:

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 43971560.50 ± 34846300.90 (confidence = 99%)

  3-spill.so is 1.08x to 1.66x faster than 2.so!

  [129186085 162930575.50 216937350] 2.so
  [85216530 118959015.00 141328075] 3-spill.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [95009950 116081567.00 158748170] 2.so
  [80138100 111471111.50 141233575] 3-spill.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [1066633785 1108409337.00 1163300915] 2.so
  [1039408055 1077584459.00 1112557740] 3-spill.so

This was using an older version of regalloc3 which didn't have live range splitting (the 50% cost I mentioned in only with live range splitting). This should be the same version you tested based on your Cargo.lock.

view this post on Zulip Amanieu (Feb 26 2025 at 23:14):

However you can see that ra3 is consistently faster than ra2 in terms of compilation speed.

view this post on Zulip Chris Fallin (Feb 26 2025 at 23:14):

how does runtime perf look in the non-spilling version? (I forget, sorry; it'd be helpful to have an up-to-date summary of all the numbers somewhere)

view this post on Zulip Amanieu (Feb 26 2025 at 23:17):

I'm currently in the middle of some perf optimizations, but you can find the older results here: #cranelift > regalloc3 progress update @ 💬

view this post on Zulip Amanieu (Feb 27 2025 at 13:32):

@Erik Rose I've confirmed that the 25% regression you observed is entirely due to the shuffling allocator. regalloc3 doesn't make use of smallvec and instead relies on the caller to reuse the register allocation context for multiple functions (preserving the Vec allocations inside it). Unfortunately Cranelift doesn't use regalloc2::run_with_ctx and therefore cannot take advantage of this.

view this post on Zulip Amanieu (Feb 27 2025 at 13:32):

However the difference is only really noticable with the shuffling allocator enabled since it allocations so slow that compilation takes ~5x longer.

view this post on Zulip Erik Rose (Feb 27 2025 at 14:40):

Fantastic. Thanks, Amanieu! I will update my checkouts, get off the shuffling allocator, and re-run my benchmarks.

I opened a PR to disable the shuffling allocator as well: https://github.com/bytecodealliance/wasmtime/pull/10300.

The slowness of the shuffling obscured performance signal (by dwarfing it) more than the accidental localities it was meant to avoid. Closes bytecodealliance/sightglass#280. This issue came up in h...

view this post on Zulip Amanieu (Feb 27 2025 at 14:46):

Note that if you update your checkouts now, the current main branch default to enabling live range splitting which is what causes the 50% slowdown I previously mentioned.

view this post on Zulip Amanieu (Feb 27 2025 at 14:47):

You can override it by selecting SplitStrategy::Spill in the regalloc3 options. Or just modify your regalloc3 checkout to make that the default temporarily.

view this post on Zulip Leaves (Jun 26 2025 at 03:17):

@Amanieu considering swapping from ra2 to 3. Do you have any sort of design doc available similar to ra2? Mostly I'm just a bit confused on the difference between a bank and a class. In what instance would I need multiple register banks to describe one class?

view this post on Zulip Amanieu (Jun 26 2025 at 04:45):

@Leaves Have you looked at the documentation for the reginfo module? Also there are examples available here.

New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3
New register allocator designed as a successor to regalloc2 - Amanieu/regalloc3

view this post on Zulip Leaves (Jun 26 2025 at 13:17):

Yep, that's exactly what I'm looking for, I had entirely missed that comment at the top!

Edit: Ok that comment explained everything I was curious about, thank you much.

view this post on Zulip Leaves (Jun 27 2025 at 00:20):

@Amanieu Sorry to bug you again, is there a proper way to get the set of all registers used by the allocator? Including registers allocated to instructions, and those used by edits. I'm currently just iterating over all the instructions using output_insts(..) and recording it as I go. But a succinct method to get this would be nice(if it exists).

view this post on Zulip Amanieu (Jun 27 2025 at 05:19):

@Leaves No, there is no other way. Regalloc3 doesn't track this internally, especially for temporary registers that are only used as scratch registers for moves.

view this post on Zulip Amanieu (Jun 27 2025 at 05:20):

What do you need this information for? If it's for stack frame setup (callee-saved registers), you may want to specify those registers as defined on function entry and used by return instructions. That way they will automatically be spilled to a stack slot by the register allocator.

view this post on Zulip Leaves (Jun 27 2025 at 13:20):

Yeah for stack frame setup. I need to know what registers are saved to properly emit push/pop pairs. It's important that these registers are saved in an easy to define way for the various UnwindInfo formats. If RA3 is deciding where they are spilled/filled, I won't be able to locate this to emit proper unwind information. And I need to know how many pairs of push/pops there are, so I can know the entire stack frame size, prior to lowering an intrinsic like __AddressOfReturnAddress() which needs to read from RSP+FrameSize.

Where FrameSize = PushedRegisters * 8 + StackLayout.spillslot_area_size.

So, I need to iterate over all the instructions and edits first, find out which non-volatile registers are written in any way, and only then can I lower that intrinsic.

view this post on Zulip Amanieu (Jun 27 2025 at 13:28):

Alternatively, you can let the register allocator handle saving/restoring into regalloc-managed spillslots. Just specify the constraints that the return instruction takes the original values of the callee-saved registers as fixed-register inputs. Then just get the stack frame size from output.stack_layout().

view this post on Zulip Leaves (Jun 27 2025 at 13:29):

I will not be able to describe these saves with unwind info though, and for certain targets like windows that require support for asyncronous exception support, I need to be very careful with this. Allowing RA3 to arbitrarily determine spill/fill locations for non-volatile registers will make that impossible.

view this post on Zulip Amanieu (Jun 27 2025 at 13:29):

For the unwind information, you can get the location of each value from output.value_locations(). You can use that to get the values of the callee-saved at every point in the function.

view this post on Zulip Leaves (Jun 27 2025 at 13:31):

That is true, however there is a limit to the number of locations that a non-volatile register can be saved into in with the Windows UnwindInfo spec. They must be saved in the prolog, and the prolog cannot be larger than 256 bytes. So, I would need to guarantee that regalloc saves them once there, and only restores from that location at function end.

view this post on Zulip Amanieu (Jun 27 2025 at 13:32):

Right, in that case you probably have to do this yourself. And there is no way other than scanning through the output instructions to find out which registers are used. It's not tracked internally in the allocator, so there isn't any faster way of doing this.

view this post on Zulip Leaves (Jun 27 2025 at 13:32):

Additionally PUSH/POP pairs at the beginning/end of the function are smaller in size, even when encoding a spill/fill with an 8bit displacement. MOV [RSP+8BitDisp]

view this post on Zulip Leaves (Jun 27 2025 at 13:33):

Got it, I implemented that, and it appears to work well. RA2 required something similar anyway but there were more simple allocs/edits slices to iterate over.

view this post on Zulip Amanieu (Jun 27 2025 at 13:34):

Hmm now that I think about it, it's actually possible to make this more efficient by tracking it internally. I'll add an API for it.

view this post on Zulip Amanieu (Jun 27 2025 at 13:35):

It would be something along the lines of fn reg_units_used(&self) -> &RegUnitSet.

view this post on Zulip Leaves (Jun 27 2025 at 13:37):

What would be most helpful would be reg_units_read(&self), reg_units_written(&self). Read alone wouldn't be enough for this case.

Are you open to PRs btw?

view this post on Zulip Amanieu (Jun 27 2025 at 13:39):

That's not as simple as it seems: the initial values for registers is emitted by the entry instruction, so all registers that are used are technically written.

view this post on Zulip Amanieu (Jun 27 2025 at 13:40):

I'm open to PRs but I would prefer if it was discussed first to avoid unnecessary work.

view this post on Zulip Leaves (Jun 27 2025 at 13:40):

Thats right actually, I do skip the defs created by my NativeRegistersIn instruction that is the first instruction in the function. So that would need to be considered as well :grimacing:

view this post on Zulip Amanieu (Jun 27 2025 at 13:43):

Tracking which registers are used can be done relatively efficiently, essentially meaning "does this register at any point hold a regalloc-managed value".

view this post on Zulip Amanieu (Jun 27 2025 at 13:43):

For anything more than that your best bet is still just scanning the output I think.

view this post on Zulip Leaves (Jun 27 2025 at 13:44):

I think so now as well after considering the fact that I need to disregard Defs from certain instructions.

view this post on Zulip Leaves (Jun 27 2025 at 15:56):

I suppose this is really getting away from cranelift/regalloc2 talk now. But if both validators are returning Ok for RA3 and I'm still getting an error about rematerialization(despite always returning None for rematerialization), what direction should I go in for debugging?

Error actually arising is "add_remat called with non-rematerializable value" and it's called on a block parameter(%17) for a small exit block.

block13(%17) freq(1):
; predecessors: block6, block7, block10, block11, block12
    inst66: inst
    inst67: inst Use(%2):r1 Use(%17):r0
    inst68: ret

view this post on Zulip Amanieu (Jun 27 2025 at 15:57):

Can you dump the function and reginfo to text form so I can try it with regalloc3-tool?

view this post on Zulip Leaves (Jun 27 2025 at 15:59):

Do you mean just converting to a GenericFunction then printing it?

view this post on Zulip Leaves (Jun 27 2025 at 16:01):

https://pastebin.com/raw/vpSvqK8X
https://pastebin.com/raw/QzJx8aGE

view this post on Zulip Notification Bot (Jun 27 2025 at 16:18):

42 messages were moved from this topic to #cranelift > regalloc3 debugging by Amanieu.

view this post on Zulip Amanieu (Jul 04 2025 at 15:41):

Latest benchmark results for regalloc3:

compilation
  benchmarks/bz2/benchmark.wasm
    cycles
      [388409280 394336775.00 420104370] ./ra2.so
      [324084635 327185663.00 345267545] ./ra3-spill.so
      [361255790 364677054.00 383229420] ./ra3-split.so
  benchmarks/pulldown-cmark/benchmark.wasm
    cycles
      [796059285 799510071.50 807007075] ./ra2.so
      [716654015 721313358.50 729701735] ./ra3-spill.so
      [782036010 786266075.00 789498185] ./ra3-split.so
  benchmarks/spidermonkey/benchmark.wasm
    cycles
      [20091993285 20178233887.00 20278362790] ./ra2.so
      [18244842070 18398756379.50 18586440250] ./ra3-spill.so
      [19562045300 19644339190.00 19873169225] ./ra3-split.so
execution
  benchmarks/bz2/benchmark.wasm
    cycles
      [118045165 118595158.50 121545340] ./ra2.so
      [112530215 112865336.50 113412705] ./ra3-spill.so
      [111467545 111745469.50 112304780] ./ra3-split.so
  benchmarks/pulldown-cmark/benchmark.wasm
    cycles
      [9640925 9705132.50 9788520] ./ra2.so
      [8748110 8830787.00 9037630] ./ra3-spill.so
      [8934240 9045858.50 9167515] ./ra3-split.so
  benchmarks/spidermonkey/benchmark.wasm
    cycles
      [1136702140 1140818234.50 1153406240] ./ra2.so
      [1090508300 1100439718.00 1122041375] ./ra3-spill.so
      [1113807730 1119928414.50 1128316140] ./ra3-split.so

view this post on Zulip Amanieu (Jul 04 2025 at 19:53):

At this point I consider regalloc3 basically done and ready for use. I'm currently writing up a design document for it, but other than that it's ready for review.

view this post on Zulip Leaves (Jul 04 2025 at 23:30):

I second that. After porting my own compiler form RA2 to RA3, fuzzing extensively, and deploying into product, I am VERY happy with the results. The only change I would make would be relaxing some of the requirements around BB params, but only if it doesn't hurt output code quality. Requiring additional optimization passes to prep for regalloc is somewhat tedius.

view this post on Zulip Chris Fallin (Jul 05 2025 at 01:05):

@Amanieu thanks for these latest results. As we discussed back in January, to get RA3 into Cranelift, we'll need someone to:

This is big enough that we'll probably want an RFC for the transition as well; and please feel free to schedule time on the Cranelift weekly meeting agenda.

view this post on Zulip Amanieu (Jul 06 2025 at 16:15):

@Chris Fallin Sure! I'm currently finishing up the design document. Once that is done, I will post a PR against an empty branch so that you can review the code. I've also submitted https://github.com/bytecodealliance/regalloc2/pull/230 which adds regalloc3 as a back-end for regalloc2.

Currently still a draft because: The regalloc3 crate isn't published yet. This is missing support for debug locations.

Last updated: Dec 06 2025 at 06:05 UTC