jameysharp opened PR #4955 from shortcut-use-var
to main
:
This kept bothering me and I couldn't leave well enough alone. :sweat_smile: It's yet another take on @adambratschikaye's issue #4923.
In the common case where there is a chain of sealed blocks that each have exactly one predecessor, we can keep track of any sub-sequence of those blocks in O(1) space. So there's no need to use the state machine stack to propagate variable definitions back along the chain.
Instead, we can do one loop to find which block to stop at, then either get the variable definition from that block or introduce a block parameter there, and finally do one more loop to update variable definitions in all the intervening blocks.
The existing implementation already had to do a graph traversal to propagate variable definitions correctly, so this doesn't visit any more blocks than before. However, this change also makes it possible to integrate cycle detection with the graph traversal. That eliminates the need for the in_predecessor_cycle flags, and any possibility of spiky performance profiles in maintaining those flags.
To keep the cost of cycle detection down, I used epoch counters: a block is only considered "visited" if its epoch counter is equal to the SSABuilder's counter. This lets me amortize the cost of resetting the "visited" flags across 2^32 graph traversals. I'm guessing that in practice the epoch counter never actually wraps, making flag maintenance effectively O(1).
According to DHAT, this reduces by 0.3% the number of instructions executed when compiling the pulldown-cmark benchmark from Sightglass. It also reduces the number of bytes read and number of bytes written by about 1.5MiB each, despite increasing the total memory allocated by about 165KiB.
According to Sightglass, the runtime improvement is better than the numbers from DHAT suggest. My previous attempt to amortize cycle detection costs improved compiler performance, but this patch improves things even further.
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 30421181.40 ± 21622798.42 (confidence = 99%)
shortcut-use-var-b45072fc5.so is 1.03x to 1.15x faster than main-bd870a9d6.so!
[230303200 371040077.28 447540220] main-bd870a9d6.so
[228324186 340618895.88 406919480] shortcut-use-var-b45072fc5.socompilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 322167491.42 ± 263599878.59 (confidence = 99%)
shortcut-use-var-b45072fc5.so is 1.01x to 1.07x faster than main-bd870a9d6.so!
[9018928022 9172881615.14 9380337158] main-bd870a9d6.so
[5257172910 8850714123.72 9524127710] shortcut-use-var-b45072fc5.socompilation :: cycles :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[156992858 174357302.96 270010020] main-bd870a9d6.so
[120515776 171661174.34 269878252] shortcut-use-var-b45072fc5.so<!--
Please ensure that the following steps are all taken care of before submitting
the PR.
[ ] This has been discussed in issue #..., or if not, please tell us why
here.[ ] A short description of what this does, why it is needed; if the
description becomes long, the matter should probably be discussed in an issue
first.[ ] This PR contains test cases, if meaningful.
- [ ] A reviewer from the core maintainer team has been assigned for this PR.
If you don't know who could review this, please indicate so. The list of
suggested reviewers on the right can help you.Please ensure all communication adheres to the code of conduct.
-->
jameysharp requested cfallin for a review on PR #4955.
cfallin submitted PR review.
cfallin submitted PR review.
cfallin created PR review comment:
Can we have a doc-comment here indicating what a "new epoch" means (amortized visited-bit-clearing) and when we do it?
cfallin created PR review comment:
This code is likely never going to be visited at all, because other implementation limits will prevent us from processing a function of 2^32 blocks. And it'll thus be dead and untested code. I think I'd rather see an assert here that the epoch doesn't wrap (or otherwise do a
checked_add(1).expect("epoch wrapped")
or somesuch).
cfallin created PR review comment:
Can we have a comment here to describe what the semantics of this function are? E.g., when does it return
Some(...)
, and which block is returned? What mustepoch
be?
cfallin created PR review comment:
there's something subtle here about detecting/avoiding cycles vs. not visiting a block more than once. Not visiting more than once certainly avoids cycles; but do we need to do something else for them? Could you comment in which cases cycles may occur and why we are correct (insert the appropriate blockparams) in such cases?
cfallin created PR review comment:
I would have, naively, expected
new_epoch
to be called whenever we modify predecessor info somehow. Perhaps my mental model for what exactly we are caching is wrong. Could we have a comment here (or somewhere) describing the overall scheme?Basically I'm looking for a from-first-principles description of why we do this -- something like: when traversing backward to find defs, we want to not visit a block more than once; we could do this by using visited-flags; we can do this more cheaply by setting a field for "last epoch in which visited" to a freshly allocated epoch number; etc.
The "epoch" name might be confusing me too. Is it really a "variable-def search instance number" (i.e., each search gets a new index/epoch/...)?
jameysharp submitted PR review.
jameysharp created PR review comment:
It isn't currently a new epoch per block: it's per call to
use_var
that doesn't find a definition for the variable within the same block, and there could be a lot of those. I'm not sure whether _that_ can happen 2^32 times before hitting other implementation limits. But I'm happy to turn wrapping the counter into a panic, and revisit it later if anyone actually hits it.
cfallin created PR review comment:
Ah, I wonder what the impact of using a
u64
instead would be then? That would ensure that for all practical purposes we never wrap.
cfallin submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
I considered it, but there were 7 bytes of padding in
SSABlockData
so au32
was the biggest that would avoid making that struct bigger. :laughing:That said, epoch is only used when the block is sealed, and
undef_variables
can only be non-empty when the block is not yet sealed, so I could overlap them with an enum... :thinking:And with _that_ said, I should go try a
SecondaryMap<Block, bool>
to store visited flags instead of any of this epoch machinery and see if that easier-to-explain approach actually has any performance impact.
jameysharp submitted PR review.
jameysharp created PR review comment:
Sightglass claims to be able to tell that the epoch counters are faster, by very small margins. According to hyperfine it's either a wash, or when compiling Spidermonkey, the
SecondaryMap
might be very slightly faster. DHAT says the epoch counters allocate slightly less (naturally), read 0.04% fewer bytes, and write 4% fewer bytes (11MiB).In short, the epoch counters are better, but not enough better to be worth anyone's confusion. I'm switching to a
SecondaryMap
ofbool
.
jameysharp updated PR #4955 from shortcut-use-var
to main
.
jameysharp submitted PR review.
jameysharp created PR review comment:
Actually, switching to
EntitySet
now that I discovered that exists. DHAT says this does 0.5% more writes than epoch counters and 0.07% more reads, which is a small enough margin that I don't care. It also executes 0.2% more instructions thanmain
, which executes 0.2% more instructions than using epoch counters.I've also tested
HashSet
but that executes a lot more instructions as well as doing even more reads and writes.
cfallin submitted PR review.
jameysharp merged PR #4955.
Last updated: Nov 22 2024 at 17:03 UTC