Stream: git-wasmtime

Topic: wasmtime / PR #4955 Bypass state machine for single-prede...


view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2022 at 07:17):

jameysharp opened PR #4955 from shortcut-use-var to main:

This kept bothering me and I couldn't leave well enough alone. :sweat_smile: It's yet another take on @adambratschikaye's issue #4923.

In the common case where there is a chain of sealed blocks that each have exactly one predecessor, we can keep track of any sub-sequence of those blocks in O(1) space. So there's no need to use the state machine stack to propagate variable definitions back along the chain.

Instead, we can do one loop to find which block to stop at, then either get the variable definition from that block or introduce a block parameter there, and finally do one more loop to update variable definitions in all the intervening blocks.

The existing implementation already had to do a graph traversal to propagate variable definitions correctly, so this doesn't visit any more blocks than before. However, this change also makes it possible to integrate cycle detection with the graph traversal. That eliminates the need for the in_predecessor_cycle flags, and any possibility of spiky performance profiles in maintaining those flags.

To keep the cost of cycle detection down, I used epoch counters: a block is only considered "visited" if its epoch counter is equal to the SSABuilder's counter. This lets me amortize the cost of resetting the "visited" flags across 2^32 graph traversals. I'm guessing that in practice the epoch counter never actually wraps, making flag maintenance effectively O(1).

According to DHAT, this reduces by 0.3% the number of instructions executed when compiling the pulldown-cmark benchmark from Sightglass. It also reduces the number of bytes read and number of bytes written by about 1.5MiB each, despite increasing the total memory allocated by about 165KiB.

According to Sightglass, the runtime improvement is better than the numbers from DHAT suggest. My previous attempt to amortize cycle detection costs improved compiler performance, but this patch improves things even further.

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

Δ = 30421181.40 ± 21622798.42 (confidence = 99%)

shortcut-use-var-b45072fc5.so is 1.03x to 1.15x faster than main-bd870a9d6.so!

[230303200 371040077.28 447540220] main-bd870a9d6.so
[228324186 340618895.88 406919480] shortcut-use-var-b45072fc5.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

Δ = 322167491.42 ± 263599878.59 (confidence = 99%)

shortcut-use-var-b45072fc5.so is 1.01x to 1.07x faster than main-bd870a9d6.so!

[9018928022 9172881615.14 9380337158] main-bd870a9d6.so
[5257172910 8850714123.72 9524127710] shortcut-use-var-b45072fc5.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[156992858 174357302.96 270010020] main-bd870a9d6.so
[120515776 171661174.34 269878252] shortcut-use-var-b45072fc5.so

<!--

Please ensure that the following steps are all taken care of before submitting
the PR.

Please ensure all communication adheres to the code of conduct.
-->

view this post on Zulip Wasmtime GitHub notifications bot (Sep 25 2022 at 07:17):

jameysharp requested cfallin for a review on PR #4955.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin created PR review comment:

Can we have a doc-comment here indicating what a "new epoch" means (amortized visited-bit-clearing) and when we do it?

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin created PR review comment:

This code is likely never going to be visited at all, because other implementation limits will prevent us from processing a function of 2^32 blocks. And it'll thus be dead and untested code. I think I'd rather see an assert here that the epoch doesn't wrap (or otherwise do a checked_add(1).expect("epoch wrapped") or somesuch).

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin created PR review comment:

Can we have a comment here to describe what the semantics of this function are? E.g., when does it return Some(...), and which block is returned? What must epoch be?

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin created PR review comment:

there's something subtle here about detecting/avoiding cycles vs. not visiting a block more than once. Not visiting more than once certainly avoids cycles; but do we need to do something else for them? Could you comment in which cases cycles may occur and why we are correct (insert the appropriate blockparams) in such cases?

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 22:34):

cfallin created PR review comment:

I would have, naively, expected new_epoch to be called whenever we modify predecessor info somehow. Perhaps my mental model for what exactly we are caching is wrong. Could we have a comment here (or somewhere) describing the overall scheme?

Basically I'm looking for a from-first-principles description of why we do this -- something like: when traversing backward to find defs, we want to not visit a block more than once; we could do this by using visited-flags; we can do this more cheaply by setting a field for "last epoch in which visited" to a freshly allocated epoch number; etc.

The "epoch" name might be confusing me too. Is it really a "variable-def search instance number" (i.e., each search gets a new index/epoch/...)?

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 23:46):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 23:46):

jameysharp created PR review comment:

It isn't currently a new epoch per block: it's per call to use_var that doesn't find a definition for the variable within the same block, and there could be a lot of those. I'm not sure whether _that_ can happen 2^32 times before hitting other implementation limits. But I'm happy to turn wrapping the counter into a panic, and revisit it later if anyone actually hits it.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 23:52):

cfallin created PR review comment:

Ah, I wonder what the impact of using a u64 instead would be then? That would ensure that for all practical purposes we never wrap.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 27 2022 at 23:52):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 00:01):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 00:01):

jameysharp created PR review comment:

I considered it, but there were 7 bytes of padding in SSABlockData so a u32 was the biggest that would avoid making that struct bigger. :laughing:

That said, epoch is only used when the block is sealed, and undef_variables can only be non-empty when the block is not yet sealed, so I could overlap them with an enum... :thinking:

And with _that_ said, I should go try a SecondaryMap<Block, bool> to store visited flags instead of any of this epoch machinery and see if that easier-to-explain approach actually has any performance impact.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 00:42):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 00:42):

jameysharp created PR review comment:

Sightglass claims to be able to tell that the epoch counters are faster, by very small margins. According to hyperfine it's either a wash, or when compiling Spidermonkey, the SecondaryMap might be very slightly faster. DHAT says the epoch counters allocate slightly less (naturally), read 0.04% fewer bytes, and write 4% fewer bytes (11MiB).

In short, the epoch counters are better, but not enough better to be worth anyone's confusion. I'm switching to a SecondaryMap of bool.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 20:55):

jameysharp updated PR #4955 from shortcut-use-var to main.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 20:56):

jameysharp submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 20:56):

jameysharp created PR review comment:

Actually, switching to EntitySet now that I discovered that exists. DHAT says this does 0.5% more writes than epoch counters and 0.07% more reads, which is a small enough margin that I don't care. It also executes 0.2% more instructions than main, which executes 0.2% more instructions than using epoch counters.

I've also tested HashSet but that executes a lot more instructions as well as doing even more reads and writes.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 28 2022 at 22:13):

cfallin submitted PR review.

view this post on Zulip Wasmtime GitHub notifications bot (Sep 29 2022 at 00:05):

jameysharp merged PR #4955.


Last updated: Dec 23 2024 at 12:05 UTC