jameysharp commented on issue #4955:
I think this is much easier to reason about compared with both the predicate-cycle approach and the version before that, but I absolutely agree it needs more comments. I'll work on that. Thanks for the suggestions about specific confusing parts. I hope I don't have to document the whole existing state machine before we can merge this though; I do have another couple of branches here to make the rest easier (for me) to understand, but I don't want to do that all in this PR.
The performance impact of this change is actually surprisingly complicated. Note that all measurements are relative to the version with last week's predicate-cycles work included. I'm currently comparing against main at 7b311004b.
I'm still trying to characterize this, but now that I have a lower-variance benchmarking setup, it looks like it ranges from main being very slightly faster ("1.01 ± 0.01 times" per hyperfine, "1.00x to 1.01x faster" per Sightglass+perf cpu-cycles) to a wash ("1.00 ± 0.01" favoring either way), depending on the benchmark.
However, main pretty consistently has fewer cache accesses and fewer cache misses than this PR, to the tune of being "1.02x to 1.04x faster" by cache accesses. For cache misses, Sightglass sometimes reports ranges like "1.04x to 1.28x faster". I find this surprising since DHAT says this PR reads and writes something like 1.5MiB less, and even the .text section is smaller. I guess it either means that this PR has worse memory locality, or that my cache metrics are pretty noisy still.
cfallin commented on issue #4955:
I think this is much easier to reason about compared with both the predicate-cycle approach and the version before that, but I absolutely agree it needs more comments. I'll work on that. Thanks for the suggestions about specific confusing parts. I hope I don't have to document the whole existing state machine before we can merge this though
Ah, no, no need to retroactively document the whole SSA-building algorithm; just the bits that are being updated. For whatever it's worth, it may very well be simpler now, it's just that I found the mechanism quite confusing and difficult to puzzle out in its current state; so it's probably a matter of communicating invariants and design, rather than fundamental complexity.
jameysharp commented on issue #4955:
I think the comments now reflect everything I know about why this works. I've also removed every unrelated change (I'll open another PR with a bunch of small cleanups later) and gotten rid of the epoch counters. Could you give this another review?
Last updated: Jan 24 2025 at 00:11 UTC