Stream: git-wasmtime

Topic: wasmtime / Issue #1066 Optimize live value handling


view this post on Zulip Wasmtime GitHub notifications bot (Feb 03 2021 at 17:09):

bjorn3 commented on Issue #1066:

The new backend framework uses a different register allocator.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 03 2021 at 17:30):

cfallin commented on Issue #1066:

Indeed, closing; addressed with regalloc.rs.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 03 2021 at 17:30):

cfallin closed Issue #1066:

I've done some performance analysis of cranelift that shows handling of live values, particularly the partition_slice function, can be hot. For example, in the oryol-quad benchmark the live value list frequently has over 500 values in it, and having to repeatedly partition it is slow.

I have a sketch of a redesign for LiveValueVec that might speed things up. It's a bit complicated because LiveValueVec gets used in three different ways.

First, in LiveValueTracker::ebb_top, the LiveValueVec is cleared. Then we add the live_in values, followed by the args. We subsequently iterate over the live_in values, then the args. This suggests a structure like so:

  live_ins: Vec<LiveValue>,
  args: Vec<LiveValue>,
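As a rough illustration of that first form, here is a minimal sketch. It uses placeholder Value/Inst types and a simplified LiveValue (the real Cranelift type carries more fields, e.g. an affinity), and the BlockTopValues name is made up for the example:

  // Simplified stand-ins for Cranelift's entity references.
  #[derive(Clone, Copy, Debug)]
  struct Value(u32);

  #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
  struct Inst(u32);

  // Simplified LiveValue: a value plus the instruction at which its
  // live range ends.
  #[derive(Clone, Debug)]
  struct LiveValue {
      value: Value,
      endpoint: Inst,
  }

  // Hypothetical first form: live-ins and args kept separately.
  #[derive(Default)]
  struct BlockTopValues {
      live_ins: Vec<LiveValue>,
      args: Vec<LiveValue>,
  }

  impl BlockTopValues {
      // Mirrors the ebb_top pattern: clear, add live-ins, then add args.
      fn reset(
          &mut self,
          live_ins: impl IntoIterator<Item = LiveValue>,
          args: impl IntoIterator<Item = LiveValue>,
      ) {
          self.live_ins.clear();
          self.args.clear();
          self.live_ins.extend(live_ins);
          self.args.extend(args);
      }

      // Iteration visits the live-ins first, then the args.
      fn iter(&self) -> impl Iterator<Item = &LiveValue> {
          self.live_ins.iter().chain(self.args.iter())
      }
  }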

Second, within process_inst, we need to partition the LiveValues by endpoint (giving throughs and kills) and then add zero, one or two defs. This suggests a structure like so:

  throughs: FxHashMap<Inst, Vec<LiveValue>>,
  kills: Vec<LiveValue>,
  defs: Vec<LiveValue>,

throughs is pre-partitioned -- all LiveValues that share the same endpoint will be in the same Vec. That way partitioning it to extract all the LiveValues with a particular endpoint is just a hash table removal. That removed Vec then becomes kills. And we can just append the defs.
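A minimal sketch of that partition-by-removal step, reusing the placeholder Inst/LiveValue types from the sketch above and std's HashMap as a dependency-free stand-in for FxHashMap (the InstValues name is made up):

  use std::collections::HashMap;

  // Hypothetical second form: throughs is keyed by endpoint, so the
  // partition step is one hash-table removal instead of an O(n) scan.
  struct InstValues {
      throughs: HashMap<Inst, Vec<LiveValue>>,
      kills: Vec<LiveValue>,
      defs: Vec<LiveValue>,
  }

  impl InstValues {
      // Every LiveValue whose endpoint is `inst` dies here and becomes
      // the kill set; whatever remains in `throughs` lives across the
      // instruction. The instruction's own results become the defs.
      fn process_inst(&mut self, inst: Inst, new_defs: impl IntoIterator<Item = LiveValue>) {
          self.kills = self.throughs.remove(&inst).unwrap_or_default();
          self.defs.clear();
          self.defs.extend(new_defs);
      }
  }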

Third, once drop_dead exits, we no longer need the throughs/kills/defs distinction, so we would combine those three pieces into a single structure like so:

  values: FxHashMap<Inst, Vec<LiveValue>>

This is then ready for the next call to process_inst.
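Continuing the hypothetical InstValues sketch, folding whatever survives drop_dead back into a single endpoint-keyed map might look like:

  impl InstValues {
      // Merge throughs, kills, and defs back into one map keyed by
      // endpoint, ready for the next call to process_inst.
      fn into_values(self) -> HashMap<Inst, Vec<LiveValue>> {
          let InstValues { mut throughs, kills, defs } = self;
          for lv in kills.into_iter().chain(defs) {
              throughs.entry(lv.endpoint).or_default().push(lv);
          }
          throughs
      }
  }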

I hope this will speed things up because partitioning is an O(1) operation (a hash table lookup) rather than O(n). But there will be more allocations for the nested data structure, and there's some fiddling about to convert between the three forms as necessary. Also, it's not entirely clear to me what the best way to represent this is... maybe an enum with three variants?
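For what it's worth, one possible shape for that enum, reusing the placeholder types from the sketches above (all names here are illustrative, not from Cranelift):

  // The variant tracks which phase the tracker is in; the representation
  // changes at each phase boundary.
  enum LiveValueSet {
      // Form used by ebb_top: live-ins followed by the args.
      BlockTop {
          live_ins: Vec<LiveValue>,
          args: Vec<LiveValue>,
      },
      // Form used inside process_inst: pre-partitioned by endpoint,
      // plus the current instruction's kills and defs.
      AtInst {
          throughs: HashMap<Inst, Vec<LiveValue>>,
          kills: Vec<LiveValue>,
          defs: Vec<LiveValue>,
      },
      // Form after drop_dead exits: everything keyed by endpoint again.
      Between {
          values: HashMap<Inst, Vec<LiveValue>>,
      },
  }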


Last updated: Nov 22 2024 at 16:03 UTC