julian-seward1 opened PR #1664 from remove-constant-phis
to master
:
…onstrate
that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 requested cfallin for a review on PR #1664.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
tiny, tiny nit: s/arm64/AArch64/ (we did a bulk find/replace during merge and have tried to standardize on the official name as far as we can)
cfallin created PR Review Comment:
since
AbstractValue
isCopy
, the params can beself, other: AbstractValue
and the dereferences below can be removed.
cfallin created PR Review Comment:
We should be able to just add
PartialEq, Eq
to the#[derive(...)]
and get equality automatically, no?
cfallin created PR Review Comment:
It sounds like in practice this converges quickly, but I wonder if it might be worth using a worklist-of-blocks-with-changed-input-state algorithm instead (see e.g. the dataflow analysis in the regalloc checker) to avoid a potential worst case where a small subgraph takes a while to converge within a large function body?
cfallin created PR Review Comment:
Might be worth a comment here that multi-dest transfers (i.e., branch tables) don't carry parameters in our IR, so we only have to care about
SingleDest
here. (At least my intuition from most analyses otherwise is that one has to consider every CFG edge, which of course isn't the case here.)
cfallin created PR Review Comment:
for absval in state.absvals.values()
?
cfallin created PR Review Comment:
for (formal, actual) in dst_formals.iter().zip(src_actuals.iter())
?
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
I was being paranoid here .. I wanted to be certain that nothing would mess up join and equality operations on the abstract values. Could use derived Eq I suppose.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
For the liveness analysis in the register allocator, I used a worklist initialised in reverse postorder, plus a mechanism to avoid duplicates in the worklist (== all bells and whistles I could think of). That achieves around 1.8-2.8 evaluations per block on average, for large functions. From those experiments I saw that the by far the most important factor was to initialise the worklist in reverse postorder, and operate it as a queue thereafter.
This one does something simpler, which is to repeatedly visit the blocks in reverse postorder, until there are no changes. Many inputs require 3 passes (3.0 evaluations per block), and the worst I saw, for
joey_big.clif
, required 4 passes. For the following reasons, I concluded that using a worklist wasn't worth the extra complexity:
given the above numbers, using a worklist wouldn't give a big reduction in the average evals per block
the worklist itself adds some overhead, and so does having an auxiliary mapping of block to bool, so as to be able to efficiently avoid inserting the same item in the worklist if it is already there
more than half the expense lies in Phase 1 (building the summaries) and in particular Phase 3 (editing the result); in particular removing formal parameters is expensive because CL's cost for removing just one formal parameter is linear in the number of formals, and that adds up when blocks have 300+ formals.
for a nonoptimising CLIF->aarch64 translation, this pass costs circa 0.6% in instruction counts, and will be correspondingly less in the opt_level=speed case.
One simple thing I could do is to filter out summaries for blocks that have neither formal parameters nor actual parameters in their jumps, before iterating. From what I've seen there are many such blocks and they can't possibly contribute anything to the overall result.
julian-seward1 created PR Review Comment:
One simple thing I could do is to filter out summaries for blocks that have neither formal parameters nor actual parameters in their jumps, before iterating.
I tried that. For a couple of large tests, it roughly halved the number of summaries (meaning, from around 1/2 the number of blocks to around 1/4 the number of blocks) but reduced the overall insn count for the pass only by around 3%. And the fixpointing is done over the summaries. So the fixpointing cost must be pretty minimal already.
I'll leave this refinement in the respin patch.
julian-seward1 submitted PR Review.
julian-seward1 submitted PR Review.
julian-seward1 created PR Review Comment:
Good point. Added.
julian-seward1 edited PR Review Comment.
julian-seward1 updated PR #1664 from remove-constant-phis
to master
:
…onstrate
that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 requested cfallin for a review on PR #1664.
cfallin submitted PR Review.
cfallin submitted PR Review.
cfallin created PR Review Comment:
Sounds good -- thanks for the detailed analysis!
cfallin created PR Review Comment:
(detached from other thread because this line changed -- related to
Eq
discussion)Yup, I'd prefer to stay idiomatic here and remove
fn equals
, using the derivedEq
instead -- if the derived code isn't doing the right thing then we have bigger problems (and others who maintain the code in the future might expect to use==
anyway).
julian-seward1 updated PR #1664 from remove-constant-phis
to master
:
…onstrate
that only one value ever flows. Has been observed to improve generated code
run times by up to 8%. Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
julian-seward1 merged PR #1664.
Last updated: Dec 23 2024 at 12:05 UTC