github-actions[bot] commented on Issue #2061:
Subscribe to Label Action
cc @bnjbvr
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64"Thus the following users have been cc'd because of the following labels:
- bnjbvr: cranelift
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
cfallin commented on Issue #2061:
Thanks for the review! A few answers below:
* the weird cases (addends64.len() == 0, etc) that I comment on -- are they all tested?
Indeed, each of the if-branches has at least one test case in
filetests/vcode/aarch64/amodes.clif
.* [per recent discussion[ the CLIF adds and/or shifts that get rolled into some other instruction -- is there some mechanism that ensures that they won't be re-evaluated later? That would be wasteful.
There isn't, but in this case I don't think it actually adds any extra work: the only ops that we bypass are
uextend
/sextend
, and these use the free register-extend mode for the second operand in anadd
. The use-counting still applies, so if the address mode expression was the only place where the extended value was used, then the original extend op will be skipped.* the search back up the DAG for addends -- is there some mechanism that ensures it stays in the same "color" as the root CLIF node (the load or store) at which the search is rooted?
No on this as well, but I think we are still correct: the color-crossing prohibition only applies to impure (side-effecting) ops; a sign/zero-extend is a pure function, as is an addition, so whether we keep the original source values live from many side effects ago and compute the final add/extend at use here, or pre-compute some part of the address in the distant past and carry the 64-bit value across the oceans to the present, the result should be the same. (The inputs here are still SSA, so there can't be redefinitions.) It's possible I'm missing some subtlety here though?
cfallin commented on Issue #2061:
Err, to caveat the above "no extra work", I realized immediately after posting that we can generate extra work by redoing adds. It intuitively seems unlikely to me that this would generate much extra work (emprically, on our one
bz2
test-case at least, we reduce work overall). I'm happy to tweak the logic to avoid looking through adds in other color-zones, if you'd prefer (or save that for when we update the machine-independent lowering logic to have proper rules around this).
cfallin commented on Issue #2061:
I just did a quick experiment to quantify what pulling pure ops across colors (side-effects) gives us. On 'bz2', with merging of ops only within the same color, I see:
1182.421475 task-clock (msec) # 1.000 CPUs utilized 142 context-switches # 0.120 K/sec 0 cpu-migrations # 0.000 K/sec 5,159 page-faults # 0.004 M/sec 3,630,298,549 cycles # 3.070 GHz 4,201,675,603 instructions # 1.16 insn per cycle <not supported> branches 25,751,269 branch-misses 1.182215583 seconds time elapsed
Compared to the baseline (same config as post-results in this PR description):
961.162550 task-clock (msec) # 0.998 CPUs utilized 113 context-switches # 0.118 K/sec 0 cpu-migrations # 0.000 K/sec 5,097 page-faults # 0.005 M/sec 3,035,892,640 cycles # 3.159 GHz 3,837,092,873 instructions # 1.26 insn per cycle <not supported> branches 28,753,633 branch-misses 0.962659380 seconds time elapsed
So that's about a 9.5% instruction-count regression, or 22% runtime regression, if we disallow such merging. I vote we keep it :-)
julian-seward1 commented on Issue #2061:
Thanks for the comments and measurements. Sure, let's retain pulling of pure ops across colours :-)
The potential for duplicate work as a result of not forcing used-more-than-once values into a register, I'm still not happy about. We should measure that at some point. However, it's an issue which is far broader than this PR. Is it easy to modify the generic lowering-ctx machinery to make that impossible, without having to change the target-dependent isel code itself? I volunteer to do the measurements if a simple patch became available :-)
In any case, r+; please land as-is. Thanks for chasing this around.
cfallin commented on Issue #2061:
@julian-seward1, to close the loop on the merging-rules discussion, I wrote another quick patch to limit merges to only instructions with one use, available as the latest commit here (no merge, may break, in particular
ifcmp
with multiple uses not merging would break things, butbz2
seems OK without this): commit in branch eval-merging-restrictionsWith the patch applied, which limits merging to 1-use-only instructions as well as same-color (so it includes the above restriction too), on
bz2
I see a 16.8% regression in instruction count and a 24.8% regression in runtime. This matches my intuition that at least on aarch64, pulling pure ops (extends, etc) into instructions is pretty important. Please do poke holes in the patch and/or play with it more, though, if desired!
Last updated: Jan 24 2025 at 00:11 UTC