fitzgen opened issue #12156:
Note: I am forking off a new issue from https://github.com/bytecodealliance/wasmtime/issues/12106. See that issue's discussion for a description of the issue of enode cost saturating to infinity due to extraction not "understanding" shared structure and how this problem is equivalent to weighted-set cover and therefore NP-hard. +cc @cfallin @bongjunj
Chris and I were talking about extraction again yesterday. A couple things came out of it, which I will note down for our future selves and posterity:
I kind of tried to explain this point earlier but don't think I was very clear so I will try again: The problem with this algorithm is that the "number of users" == "number of incoming edges in the (a)egraph", and for this to be a good algorithm in practice, it must be the case that the number of incoming edges in the (a)egraph approximates the number of incoming edges in the set of actually-extracted expressions in practice. It is not clear to me that property is generally true. In fact, I suspect that the more we do "egraph-y" things like add more canonicalizations and neutral rewrites with the hope that they will unlock further rewrites that are beneficial, then the less this property will hold!
I had the realization that if we topo-sort values before computing best costs, then we will reach a fixpoint in a single iteration. This has some nice implications for the "If we aren't maintaining phasing" algorithm I sketched above. Bear with me for a second.
Our current algorithm to compute the best enode (specific value) for each eclass (canonical value) is a fixpoint loop that looks roughly like this:

```rust
fn compute_best_values() {
    let mut keep_going = true;
    while keep_going {
        // Reset each pass; the `|=` below re-sets it whenever anything changed.
        keep_going = false;
        for value in dfg.all_values() {
            let orig_best_value = best[value];
            best[value] = best_value(value, dfg);
            keep_going |= best[value] != orig_best_value;
        }
    }
}
```

It will do
`n + 1` iterations before returning, where `n` is the length of the longest chain of `vNN -> vMM` edges in the dataflow graph where `NN < MM` (and the `+ 1` is to determine that we did actually reach the fixpoint and nothing else is changing anymore).[^typical] Note that at most `n == len(values)`, so we do a pass over all `n` values `n` times, leading to an `O(n^2)` worst-case runtime. But at minimum, in the best case, it will do at least two passes over all `n` values: once to compute the best values and then another to verify that we reached the fixpoint and nothing else is changing.

Topologically sorting the values based on the DFG is also one pass over all `n` values. And then we can do one iteration of the inner loop from the previous algorithm, which is also one pass over all `n` values, and in this case we know that we will have reached the fixpoint immediately because (a) the DFG is acyclic (we don't look through block params) and (b) we are processing defs before uses. So this gives us an `O(n)` worst-case (and best-case) runtime algorithm. Strictly better (in terms of worst-case runtime) than what we do today!

```rust
fn compute_best_values() {
let sorted = topo_sort_values(dfg);
compute_best_values_already_sorted(sorted);
}

fn compute_best_values_already_sorted(sorted) {
for value in sorted {
best[value] = best_value(value, dfg);
}
}
```

But this now plays nicely with incrementally updating costs to reflect certain sub-expressions getting cheaper because we've already extracted them! The following brings us back to `O(n^2)`, but I think it should give us a much better extraction algorithm that "understands" shared subexpressions:

```rust
let sorted = topo_sort_values(dfg);

for inst in skeleton {
    // Compute the best values. If this is the first iteration of the loop,
    // it is equivalent to what we had before. Otherwise, it will update the
    // best values based on the adjusted cost of values we have already
    // extracted.
    compute_best_values_already_sorted(sorted);

    // Elaborate this instruction, extracting the best values for its
    // operands and inserting their defining instructions back into the
    // CFG's basic blocks.
    elaborate_inst(inst);

    // Update the cost for each value we extracted. It is ~free (modulo
    // register pressure) to reuse this value, since we have already computed
    // it and it is now available. The next iteration of the outer loop will
    // recompute best values to reflect these updated costs.
    for value in transitive_values_used_by(inst) {
        // Or could do something very low but non-zero, like `1`, to reflect
        // the cost of additional register pressure. Worth experimenting
        // with, but is also somewhat orthogonal...
        let new_cost = 0;
        set_cost(value, new_cost);
    }
}
```

It could be that this is fast enough to just be what we do by default. If not, it might be something we only use at higher optimization levels.
[^typical]: Typically `n` is fairly small; it is affected by how many new instructions we created and spliced into the middle of the DFG after the CLIF producer gave us the initial CLIF. That is something we generally don't do a ton of, except in our legalization pass and NaN canonicalization, and neither of those does it much.
fitzgen added the cranelift label to Issue #12156.
fitzgen added the cranelift:goal:optimize-speed label to Issue #12156.
fitzgen commented on issue #12156:
Update: I implemented the topo sorting without incremental updates in #12177, which didn't have any statistically significant effect on compilation or execution times and has since been merged. Then I made two branches implementing the two main approaches we've been discussing:
**Incremental update and recompute:** In this branch, after we elaborate a value, we update its cost to zero and then recompute all values' costs again to reflect that values we have already elaborated are already available and therefore "free" to reuse.
This allows the cost function to understand shared substructure across operands to side-effectful skeleton instructions. It does not understand shared substructure within a single operand to a side-effectful skeleton instruction.
**Maintain the set of instructions required to compute a value:** In this branch, we track not just the best cost of each value, but the set of instructions that went into computing that best value. When computing new best costs based on operands, we only add the cost of instructions that aren't already present in the value's set.
This allows the cost function to understand shared substructure both across and within operands to side-effectful skeleton instructions (and within any expression or subexpression).
Illustrative CLIF Example
I've been using the following CLIF test case to see the effect of different cost functions:
```
function %f(i64) {
    fn0 = colocated %fn0(i64)

block0(v0: i64):
    v1 = iadd v0, v0
    v2 = iadd v1, v1
    v3 = iadd v2, v2
    v4 = iadd v3, v3
    v5 = iadd v4, v4
    v6 = iadd v5, v5
    v7 = iadd v6, v6
    v8 = iadd v7, v7
    v9 = iadd v8, v8
    v10 = iadd v9, v9
    v11 = iadd v10, v10
    v12 = iadd v11, v11
    v13 = iadd v12, v12
    v14 = iadd v13, v13
    v15 = iadd v14, v14
    v16 = iadd v15, v15
    v17 = iadd v16, v16
    v18 = iadd v17, v17
    v19 = iadd v18, v18
    v20 = iadd v19, v19
    v21 = iadd v20, v20
    v22 = iadd v21, v21
    v23 = iadd v22, v22
    v24 = iadd v23, v23
    v25 = iadd v24, v24
    v26 = iadd v25, v25
    v27 = iadd v26, v26
    v28 = iadd v27, v27
    v29 = iadd v28, v28
    v30 = iadd v29, v29
    v31 = iadd v30, v30
    v32 = iadd v31, v31
    v33 = iadd v32, v32

    ;; With the old cost function, we should have `cost(v33) == infinity` at
    ;; this point. Of course, we *should* replace `v35`'s definition with
    ;; `ishl v33, 1`. But the (old) cost function cannot tell us to do that
    ;; because `x + infinity == y + infinity`, even when `x < y`. So it must
    ;; arbitrarily choose one of the two. In practice, it happens to choose
    ;; the `ishl` (annoyingly for those of us who are trying to show the
    ;; benefits of a new cost function).
    v34 = iconst.i64 2
    v35 = imul v33, v34
    call fn0(v35)

    ;; And then this undoes the multiply, and we *should* see through that
    ;; via `x * 2 => x + x` and `(x + y) - y => x` and ultimately recognize
    ;; that `v36 == v33`.
    ;;
    ;; The old cost function fails to do this, however, because it considers
    ;; `x * 2 - x` and `x` to have equal cost when `cost(x) == infinity`.
    ;;
    ;; Both the proposed incremental-update and inst-set cost functions do
    ;; manage to simplify `x * 2 - x` to `x` in this case because they do
    ;; _not_ have `cost(v35) = infinity` at the time that we are elaborating
    ;; this second call's operands.
    ;;
    ;; However, if we comment out the first `call fn0(v35)`, then the
    ;; incremental-update cost function does no better than the old cost
    ;; function, and will lead us to extracting `(x << 1) - x` instead of
    ;; simply `x`, because it cannot "see" shared structure *within* a
    ;; side-effectful skeleton's operand, only *across* side-effectful
    ;; skeleton operands (because we only zero-and-update after elaborating
    ;; each side-effectful skeleton operand).
    ;;
    ;; The inst-set cost function, which does "see" shared structure
    ;; *within* a side-effectful skeleton instruction's operand, will still
    ;; report that `cost(x) < cost(x * 2 - x)` in this case.
    v36 = isub v35, v33
    call fn0(v36)
    return
}
```

Here is the post-optimization CLIF for the current (old) cost function:

```
...
v33 = iadd v32, v32
v38 = iconst.i64 1
v39 = ishl v33, v38  ; v38 = 1
call fn0(v39)
v36 = isub v39, v33
call fn0(v36)
return
```

Here it is with the incremental-update and inst-set cost functions:

```
...
v33 = iadd v32, v32
v38 = iconst.i64 1
v39 = ishl v33, v38  ; v38 = 1
call fn0(v39)
call fn0(v33)
return
```

And if we comment out the first `call`, here is the CLIF for both the old cost function and the incremental-update cost function:

```
...
v33 = iadd v32, v32
v38 = iconst.i64 1
v39 = ishl v33, v38  ; v38 = 1
v36 = isub v39, v33
call fn0(v36)
return
```

While the inst-set cost function can still see through `x * 2 - x`:

```
...
v33 = iadd v32, v32
call fn0(v33)
return
```

Will post some Sightglass benchmark results in a little bit.
fitzgen commented on issue #12156:
Initial Sightglass Results
I benchmarked compilation and execution times for
- bz2
- spidermonkey
- pulldown-cmark
- blake3-scalar
- blake3-simd
- shootout-keccak
- shootout-switch
The raw results are below.
Given that our benchmarks' execution isn't affected much by any of these options (though results lean slightly positive for the inst-set cost function), that we (surprisingly) get a compilation speedup with it, and that it accounts for the most substructure sharing in theory of the three options, I'm going to continue pursuing that approach.
Compilation
`incr-cost-update.so` vs. `inst-set.so`

<details>
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 56736.35 ± 28906.20 (confidence = 95%) inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so! [14348083 14424211.45 14508124] incr-cost-update.so [14285793 14367475.10 14484470] inst-set.so compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm Δ = 7695.45 ± 5606.84 (confidence = 95%) inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so! [2100829 2115908.85 2129910] incr-cost-update.so [2094815 2108213.40 2120720] inst-set.so compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm Δ = 23876.40 ± 18942.83 (confidence = 95%) inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so! [8074884 8159311.95 8217976] incr-cost-update.so [8068552 8135435.55 8181995] inst-set.so compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 9228.95 ± 4846.06 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [3544513 3560461.75 3578035] incr-cost-update.so [3538490 3551232.80 3566901] inst-set.so compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm Δ = 3458.40 ± 2365.48 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [1830921 1836239.40 1840748] incr-cost-update.so [1829117 1832781.00 1848816] inst-set.so compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 808809.85 ± 280559.15 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [492921665 494078167.90 494535116] incr-cost-update.so [492325966 493269358.05 493749431] inst-set.so compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm Δ = 615.45 ± 381.06 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [680111 681389.50 683010] incr-cost-update.so [679956 680774.05 681474] inst-set.so</details>
`main.so` vs. `inst-set.so`

<details>
compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm Δ = 7484.40 ± 5754.15 (confidence = 95%) inst-set.so is 1.00x to 1.01x faster than main.so! [2094815 2108213.40 2120720] inst-set.so [2097021 2115697.80 2133060] main.so compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm Δ = 22612.80 ± 17853.16 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [8068552 8135435.55 8181995] inst-set.so [8113305 8158048.35 8218755] main.so compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm Δ = 4569.35 ± 2860.23 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [1829117 1832781.00 1848816] inst-set.so [1831835 1837350.35 1853297] main.so compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 34711.95 ± 30446.18 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [14285793 14367475.10 14484470] inst-set.so [14333253 14402187.05 14508930] main.so compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 872967.45 ± 245452.83 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [492325966 493269358.05 493749431] inst-set.so [493441547 494142325.50 494562345] main.so compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm Δ = 762.35 ± 366.77 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [679956 680774.05 681474] inst-set.so [680707 681536.40 683187] main.so compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm No difference in performance. [3538490 3551232.80 3566901] inst-set.so [3533856 3556067.10 3583550] main.so</details>
`main.so` vs. `incr-cost-update.so`

<details>
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [14348083 14424211.45 14508124] incr-cost-update.so [14333253 14402187.05 14508930] main.so compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm No difference in performance. [3544513 3560461.75 3578035] incr-cost-update.so [3533856 3556067.10 3583550] main.so compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm No difference in performance. [1830921 1836239.40 1840748] incr-cost-update.so [1831835 1837350.35 1853297] main.so compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm No difference in performance. [680111 681389.50 683010] incr-cost-update.so [680707 681536.40 683187] main.so compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm No difference in performance. [8074884 8159311.95 8217976] incr-cost-update.so [8113305 8158048.35 8218755] main.so compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [492921665 494078167.90 494535116] incr-cost-update.so [493441547 494142325.50 494562345] main.so compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm No difference in performance. [2100829 2115908.85 2129910] incr-cost-update.so [2097021 2115697.80 2133060] main.so</details>
Execution
`incr-cost-update.so` vs. `inst-set.so`

<details>
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 93581.75 ± 43.16 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [20865519 20865542.75 20865749] incr-cost-update.so [20771935 20771961.00 20772154] inst-set.so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 563843.20 ± 7.84 (confidence = 95%) incr-cost-update.so is 1.00x to 1.00x faster than inst-set.so! [221233136 221233145.30 221233181] incr-cost-update.so [221796977 221796988.50 221797031] inst-set.so execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 2851242.00 ± 41511.20 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [2112679982 2112745727.45 2112830546] incr-cost-update.so [2109813917 2109894485.45 2110054897] inst-set.so execution :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm Δ = 63.10 ± 0.30 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [1432847 1432847.35 1432848] incr-cost-update.so [1432784 1432784.25 1432785] inst-set.so execution :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm Δ = 12.05 ± 0.25 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so! [1482967 1482967.20 1482968] incr-cost-update.so [1482955 1482955.15 1482956] inst-set.so execution :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm No difference in performance. [77020130 77020136.05 77020159] incr-cost-update.so [77020130 77020133.90 77020151] inst-set.so execution :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm No difference in performance. [162345998 162346008.80 162346034] incr-cost-update.so [162345996 162346009.10 162346036] inst-set.so</details>
`main.so` vs. `inst-set.so`

<details>
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 524604.40 ± 8.60 (confidence = 95%) main.so is 1.00x to 1.00x faster than inst-set.so! [221796977 221796988.50 221797031] inst-set.so [221272372 221272384.10 221272438] main.so execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 1875739.10 ± 40441.13 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [2109813917 2109894485.45 2110054897] inst-set.so [2111698119 2111770224.55 2111838249] main.so execution :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm Δ = 65.10 ± 0.30 (confidence = 95%) inst-set.so is 1.00x to 1.00x faster than main.so! [1432784 1432784.25 1432785] inst-set.so [1432849 1432849.35 1432850] main.so execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 711.95 ± 33.91 (confidence = 95%) main.so is 1.00x to 1.00x faster than inst-set.so! [20771935 20771961.00 20772154] inst-set.so [20771239 20771249.05 20771390] main.so execution :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm No difference in performance. [162345996 162346009.10 162346036] inst-set.so [162345994 162346015.55 162346081] main.so execution :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm No difference in performance. [77020130 77020133.90 77020151] inst-set.so [77020130 77020134.90 77020160] main.so execution :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm No difference in performance. [1482955 1482955.15 1482956] inst-set.so [1482955 1482955.15 1482956] main.so</details>
`main.so` vs. `incr-cost-update.so`

<details>
execution :: instructions-retired :: benchmarks/pulldown-c [message truncated]
bongjunj commented on issue #12156:
The idea here is very interesting and the compilation speedup looks cool.
We need to check if the problem of cost saturation went away (https://github.com/bytecodealliance/wasmtime/issues/12106#issuecomment-3615043494).
If this can prevent the cost from saturating to infinity in the benchmark, we can bring back the removed canonicalizations.

I will look at some results with this strategy.
bongjunj commented on issue #12156:
The infinity cost problem went away in the inst-set version.
`shootout-keccak`:

```
// skipped the first part
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v6922 = Cost::Finite { op_cost: 13476, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v6923 = Cost::Finite { op_cost: 13555, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v6930 = Cost::Finite { op_cost: 13351, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v8131 = Cost::Finite { op_cost: 13350, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> best of union(BestEntry(ImmCostMap { cost: Cost::Finite { op_cost: 13351, depth: 0 }, insts: {inst21, ... , inst5388} }, v8131)
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v6931 = Cost::Finite { op_cost: 13470, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate] -> cost of value v6932 = Cost::Finite { op_cost: 13548, depth: 0 }
```

When I brought back the De Morgan rules (the ones that had caused a performance regression because cost saturation led to poor extraction choices), the costs no longer saturated.
All that remains is to check that the perf regression relative to the no-opt version has been resolved by this.
Benchmark
| bench | inst-set | v-base | Speedup |
|---|---:|---:|---:|
| blake3-scalar | 316,666 | 319,718 | 0.96% |
| blake3-simd | 307,564 | 316,571 | 2.93% |
| bz2 | 86,561,625 | 87,697,336 | 1.31% |
| pulldown-cmark | 6,594,664 | 6,850,795 | 3.88% |
| regex | 210,685,003 | 210,205,862 | -0.23% |
| shootout-ackermann | 7,759,800 | 8,549,359 | 10.17% |
| shootout-base64 | 351,116,296 | 382,052,837 | 8.81% |
| shootout-ctype | 796,474,233 | 831,090,155 | 4.35% |
| shootout-ed25519 | 9,415,587,644 | 9,595,743,123 | 1.91% |
| shootout-fib2 | 3,012,767,106 | 3,015,971,015 | 0.11% |
| shootout-gimli | 5,404,306 | 5,334,580 | -1.29% |
| shootout-heapsort | 2,378,935,527 | 2,378,745,072 | -0.01% |
| shootout-keccak | 21,137,129 | 25,168,432 | 19.07% |
| shootout-matrix | 545,135,532 | 541,156,378 | -0.73% |
| shootout-memmove | 36,621,534 | 38,439,632 | 4.96% |
| shootout-minicsv | 1,294,436,335 | 1,498,544,688 | 15.77% |
| shootout-nestedloop | 673 | 956 | 41.94% |
| shootout-random | 439,752,867 | 630,876,317 | 43.46% |
| shootout-ratelimit | 39,836,311 | 39,279,479 | -1.40% |
| shootout-seqhash | 8,589,789,109 | 8,838,928,513 | 2.90% |
| shootout-sieve | 841,599,406 | 906,693,706 | 7.73% |
| shootout-switch | 153,847,745 | 139,731,817 | -9.18% |
| shootout-xblabla20 | 2,918,569 | 3,015,435 | 3.32% |
| shootout-xchacha20 | 4,478,767 | 4,398,827 | -1.78% |
| spidermonkey | 642,585,002 | 639,089,855 | -0.54% |
`shootout-keccak` tells us that the performance regression induced by the cost saturation went away.
I think we can bring back the rules.
fitzgen commented on issue #12156:
Cleaned up the inst-set branch and put up a PR: https://github.com/bytecodealliance/wasmtime/pull/12230
> `shootout-keccak` tells us that the performance regression induced by the cost saturation went away.
>
> I think we can bring back the rules.

Fantastic! Thanks for looking into that. Would you like to make the PR bringing the rules back?
bongjunj commented on issue #12156:
Sure thing! Thanks for the PR.
bongjunj edited a comment on issue #12156:
Sure thing! Thanks for the PR.
Will bring back the rules if your PR gets merged.
Last updated: Jan 09 2026 at 13:15 UTC