Stream: git-wasmtime

Topic: wasmtime / Issue #2061 Aarch64 codegen quality: support m...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2020 at 06:22):

github-actions[bot] commented on Issue #2061:

Subscribe to Label Action

cc @bnjbvr

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:03):

cfallin commented on Issue #2061:

Thanks for the review! A few answers below:

* the weird cases (addends64.len() == 0, etc) that I comment on -- are they all tested?

Indeed, each of the if-branches has at least one test case in filetests/vcode/aarch64/amodes.clif.

* [per recent discussion[ the CLIF adds and/or shifts that get rolled into some other instruction -- is there some mechanism that ensures that they won't be re-evaluated later?  That would be wasteful.

There isn't, but in this case I don't think it actually adds any extra work: the only ops that we bypass are uextend/sextend, and these use the free register-extend mode for the second operand in an add. The use-counting still applies, so if the address mode expression was the only place where the extended value was used, then the original extend op will be skipped.

* the search back up the DAG for addends -- is there some mechanism that ensures it stays in the same "color" as the root CLIF node (the load or store) at which the search is rooted?

No on this as well, but I think we are still correct: the color-crossing prohibition only applies to impure (side-effecting) ops; a sign/zero-extend is a pure function, as is an addition, so whether we keep the original source values live from many side effects ago and compute the final add/extend at use here, or pre-compute some part of the address in the distant past and carry the 64-bit value across the oceans to the present, the result should be the same. (The inputs here are still SSA, so there can't be redefinitions.) It's possible I'm missing some subtlety here though?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 19:09):

cfallin commented on Issue #2061:

Err, to caveat the above "no extra work", I realized immediately after posting that we can generate extra work by redoing adds. It intuitively seems unlikely to me that this would generate much extra work (emprically, on our one bz2 test-case at least, we reduce work overall). I'm happy to tweak the logic to avoid looking through adds in other color-zones, if you'd prefer (or save that for when we update the machine-independent lowering logic to have proper rules around this).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 24 2020 at 20:00):

cfallin commented on Issue #2061:

I just did a quick experiment to quantify what pulling pure ops across colors (side-effects) gives us. On 'bz2', with merging of ops only within the same color, I see:

       1182.421475      task-clock (msec)         #    1.000 CPUs utilized
               142      context-switches          #    0.120 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,159      page-faults               #    0.004 M/sec
     3,630,298,549      cycles                    #    3.070 GHz
     4,201,675,603      instructions              #    1.16  insn per cycle
   <not supported>      branches
        25,751,269      branch-misses

       1.182215583 seconds time elapsed

Compared to the baseline (same config as post-results in this PR description):

        961.162550      task-clock (msec)         #    0.998 CPUs utilized
               113      context-switches          #    0.118 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,097      page-faults               #    0.005 M/sec
     3,035,892,640      cycles                    #    3.159 GHz
     3,837,092,873      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,753,633      branch-misses

       0.962659380 seconds time elapsed

So that's about a 9.5% instruction-count regression, or 22% runtime regression, if we disallow such merging. I vote we keep it :-)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 25 2020 at 07:48):

julian-seward1 commented on Issue #2061:

Thanks for the comments and measurements. Sure, let's retain pulling of pure ops across colours :-)

The potential for duplicate work as a result of not forcing used-more-than-once values into a register, I'm still not happy about. We should measure that at some point. However, it's an issue which is far broader than this PR. Is it easy to modify the generic lowering-ctx machinery to make that impossible, without having to change the target-dependent isel code itself? I volunteer to do the measurements if a simple patch became available :-)

In any case, r+; please land as-is. Thanks for chasing this around.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2020 at 22:02):

cfallin commented on Issue #2061:

@julian-seward1, to close the loop on the merging-rules discussion, I wrote another quick patch to limit merges to only instructions with one use, available as the latest commit here (no merge, may break, in particular ifcmp with multiple uses not merging would break things, but bz2 seems OK without this): commit in branch eval-merging-restrictions

With the patch applied, which limits merging to 1-use-only instructions as well as same-color (so it includes the above restriction too), on bz2 I see a 16.8% regression in instruction count and a 24.8% regression in runtime. This matches my intuition that at least on aarch64, pulling pure ops (extends, etc) into instructions is pretty important. Please do poke holes in the patch and/or play with it more, though, if desired!


Last updated: Oct 23 2024 at 20:03 UTC