wasmtime / PR #1664 Add a transformation pass which remov... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #1664 Add a transformation pass which remov...

Wasmtime GitHub notifications bot (May 06 2020 at 07:17):

julian-seward1 opened PR #1664 from remove-constant-phis to master:

…onstrate

that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.

Wasmtime GitHub notifications bot (May 06 2020 at 07:17):

julian-seward1 requested cfallin for a review on PR #1664.

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

tiny, tiny nit: s/arm64/AArch64/ (we did a bulk find/replace during merge and have tried to standardize on the official name as far as we can)

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

since AbstractValue is Copy, the params can be self, other: AbstractValue and the dereferences below can be removed.

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

We should be able to just add PartialEq, Eq to the #[derive(...)] and get equality automatically, no?

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

It sounds like in practice this converges quickly, but I wonder if it might be worth using a worklist-of-blocks-with-changed-input-state algorithm instead (see e.g. the dataflow analysis in the regalloc checker) to avoid a potential worst case where a small subgraph takes a while to converge within a large function body?

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

Might be worth a comment here that multi-dest transfers (i.e., branch tables) don't carry parameters in our IR, so we only have to care about SingleDest here. (At least my intuition from most analyses otherwise is that one has to consider every CFG edge, which of course isn't the case here.)

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

for absval in state.absvals.values()?

Wasmtime GitHub notifications bot (May 06 2020 at 16:30):

cfallin created PR Review Comment:

for (formal, actual) in dst_formals.iter().zip(src_actuals.iter())?

Wasmtime GitHub notifications bot (May 07 2020 at 04:54):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 04:54):

julian-seward1 created PR Review Comment:

I was being paranoid here .. I wanted to be certain that nothing would mess up join and equality operations on the abstract values. Could use derived Eq I suppose.

Wasmtime GitHub notifications bot (May 07 2020 at 05:12):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 05:12):

julian-seward1 created PR Review Comment:

For the liveness analysis in the register allocator, I used a worklist initialised in reverse postorder, plus a mechanism to avoid duplicates in the worklist (== all bells and whistles I could think of). That achieves around 1.8-2.8 evaluations per block on average, for large functions. From those experiments I saw that the by far the most important factor was to initialise the worklist in reverse postorder, and operate it as a queue thereafter.

This one does something simpler, which is to repeatedly visit the blocks in reverse postorder, until there are no changes. Many inputs require 3 passes (3.0 evaluations per block), and the worst I saw, for joey_big.clif, required 4 passes. For the following reasons, I concluded that using a worklist wasn't worth the extra complexity:

given the above numbers, using a worklist wouldn't give a big reduction in the average evals per block

the worklist itself adds some overhead, and so does having an auxiliary mapping of block to bool, so as to be able to efficiently avoid inserting the same item in the worklist if it is already there

more than half the expense lies in Phase 1 (building the summaries) and in particular Phase 3 (editing the result); in particular removing formal parameters is expensive because CL's cost for removing just one formal parameter is linear in the number of formals, and that adds up when blocks have 300+ formals.

for a nonoptimising CLIF->aarch64 translation, this pass costs circa 0.6% in instruction counts, and will be correspondingly less in the opt_level=speed case.

One simple thing I could do is to filter out summaries for blocks that have neither formal parameters nor actual parameters in their jumps, before iterating. From what I've seen there are many such blocks and they can't possibly contribute anything to the overall result.

Wasmtime GitHub notifications bot (May 07 2020 at 09:34):

julian-seward1 created PR Review Comment:

One simple thing I could do is to filter out summaries for blocks that have neither formal parameters nor actual parameters in their jumps, before iterating.

I tried that. For a couple of large tests, it roughly halved the number of summaries (meaning, from around 1/2 the number of blocks to around 1/4 the number of blocks) but reduced the overall insn count for the pass only by around 3%. And the fixpointing is done over the summaries. So the fixpointing cost must be pretty minimal already.

I'll leave this refinement in the respin patch.

Wasmtime GitHub notifications bot (May 07 2020 at 09:34):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 09:45):

julian-seward1 submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 09:45):

julian-seward1 created PR Review Comment:

Good point. Added.

Wasmtime GitHub notifications bot (May 07 2020 at 09:51):

julian-seward1 edited PR Review Comment.

Wasmtime GitHub notifications bot (May 07 2020 at 09:59):

julian-seward1 updated PR #1664 from remove-constant-phis to master:

…onstrate

that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.

Wasmtime GitHub notifications bot (May 07 2020 at 10:48):

julian-seward1 requested cfallin for a review on PR #1664.

Wasmtime GitHub notifications bot (May 07 2020 at 16:50):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 16:50):

cfallin submitted PR Review.

Wasmtime GitHub notifications bot (May 07 2020 at 16:50):

cfallin created PR Review Comment:

Sounds good -- thanks for the detailed analysis!

Wasmtime GitHub notifications bot (May 07 2020 at 16:50):

cfallin created PR Review Comment:

(detached from other thread because this line changed -- related to Eq discussion)

Yup, I'd prefer to stay idiomatic here and remove fn equals, using the derived Eq instead -- if the derived code isn't doing the right thing then we have bigger problems (and others who maintain the code in the future might expect to use == anyway).

Wasmtime GitHub notifications bot (May 08 2020 at 07:08):

julian-seward1 updated PR #1664 from remove-constant-phis to master:

…onstrate

that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.

Wasmtime GitHub notifications bot (May 08 2020 at 07:41):

julian-seward1 merged PR #1664.

Last updated: Apr 17 2025 at 08:04 UTC