Stream: git-wasmtime

Topic: wasmtime / issue #12156 Cranelift: improve egraph cost fu...


view this post on Zulip Wasmtime GitHub notifications bot (Dec 11 2025 at 19:11):

fitzgen opened issue #12156:

Note: I am forking off a new issue from https://github.com/bytecodealliance/wasmtime/issues/12106. See that issue's discussion for a description of the issue of enode cost saturating to infinity due to extraction not "understanding" shared structure and how this problem is equivalent to weighted-set cover and therefore NP-hard. +cc @cfallin @bongjunj

Chris and I were talking about extraction again yesterday. A couple things came out of it, which I will note down for our future selves and posterity:

[^typical]: Typically n is fairly small; it is affected by how much we created new instructions and spliced their definitions into the middle of the DFG after the CLIF-producer gave us the initial CLIF. That is something we generally don't do a ton of, except in our legalization pass and NaN canonicalization and neither of them do it too much.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 11 2025 at 19:11):

fitzgen added the cranelift label to Issue #12156.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 11 2025 at 19:11):

fitzgen added the cranelift:goal:optimize-speed label to Issue #12156.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 18 2025 at 23:37):

fitzgen commented on issue #12156:

Update: I implemented the topo sorting without incremental updates in #12177, which didn't have any statistically significant affect on compilation or execution and has since merged. Then I made two branches implementing the two main approaches we've been discussing:

  1. **Incremental update and recompute:** In this branch, after we elaborate a value, we update its cost to zero and then recompute all values' costs again to reflect that values we have already elaborated are already available and therefore "free" to reuse.

    This allows the cost function to understand shared substructure across operands to side-effectful skeleton instructions. It does not understand shared substructure within a single operand to a side-effectful skeleton instruction.

  2. **Maintain the set of instructions required to compute a value:** In this branch, we track not just the best cost of each value, but the set of instructions that went into computing that best value. When computing new best costs based on operands, we only add the cost of instructions that aren't already present in the value's set.

    This allows the cost function to understand shared substructure both across and within operands to side-effectful skeleton instructions (and within any expression or subexpression).

Illustrative CLIF Example

I've been using the following CLIF test case to see the effect of different cost functions:

function %f(i64) {
    fn0 = colocated %fn0(i64)

    block0(v0: i64):
        v1 = iadd v0, v0
        v2 = iadd v1, v1
        v3 = iadd v2, v2
        v4 = iadd v3, v3
        v5 = iadd v4, v4
        v6 = iadd v5, v5
        v7 = iadd v6, v6
        v8 = iadd v7, v7
        v9 = iadd v8, v8
        v10 = iadd v9, v9
        v11 = iadd v10, v10
        v12 = iadd v11, v11
        v13 = iadd v12, v12
        v14 = iadd v13, v13
        v15 = iadd v14, v14
        v16 = iadd v15, v15
        v17 = iadd v16, v16
        v18 = iadd v17, v17
        v19 = iadd v18, v18
        v20 = iadd v19, v19
        v21 = iadd v20, v20
        v22 = iadd v21, v21
        v23 = iadd v22, v22
        v24 = iadd v23, v23
        v25 = iadd v24, v24
        v26 = iadd v25, v25
        v27 = iadd v26, v26
        v28 = iadd v27, v27
        v29 = iadd v28, v28
        v30 = iadd v29, v29
        v31 = iadd v30, v30
        v32 = iadd v31, v31
        v33 = iadd v32, v32

        ;; With the old cost function, we should have `cost(v33) == infinite` at
        ;; this point. Of course, we *should* replace `v35`'s definition with
        ;; `ishl v33, 1`. But the (old) cost function cannot tell us to do that
        ;; because `x + infinity == y + infinity`, even when `x < y`. So it must
        ;; arbitrarily choose one of the two. In practice, it happens to choose
        ;; the `ishl` (annoyingly for those us who are trying to show the
        ;; benefits of a new cost function).

        v34 = iconst.i64 2
        v35 = imul v33, v34
        call fn0(v35)

        ;; And then this undoes the multiply, and we *should* see through that
        ;; via `x * 2 => x + x` and `(x + y) - y => x` and ultimately recognize
        ;; that `v36 == v33`.
        ;;
        ;; The old cost function fails to do this, however. This is because it
        ;; considers `x * 2 - x` and `x` to have equal cost when `cost(x) ==
        ;; infinity`.
        ;;
        ;; Both the proposed incremental-update and inst-set cost functions do
        ;; manage to simplify `x * 2 - x` to `x` in this case because they do
        ;; not do _not_ have `cost(v35) = infinity` at the time that we are
        ;; elaborating this second call's operands.
        ;;
        ;; However, if we comment out the first `call fn0(v35)`, then the
        ;; incremental-update cost function does no better than the old cost
        ;; function, and will lead us to extracting `(x << 2) - x` instead of
        ;; simply `x`, because it cannot "see" shared structure *within* a
        ;; side-effectful skeleton's operand, only *across* side-effectful
        ;; skeleton operands (because we only zero-and-update after elaborating
        ;; each side-effectful skeleton operand).
        ;;
        ;; The inst-set cost function, which does "see" shared structure
        ;; *within* a side-effectful skeleton instruction's operand, will still
        ;; report that `cost(x) < cost(x * 2 - 1)` in this case.

        v36 = isub v35, v33
        call fn0(v36)

        return
}

Here is the post-optimization CLIF for the current (old) cost function:

    ...
    v33 = iadd v32, v32
    v38 = iconst.i64 1
    v39 = ishl v33, v38  ; v38 = 1
    call fn0(v39)
    v36 = isub v39, v33
    call fn0(v36)
    return

Here it is with the incremental-update and inst-set cost function:

    ...
    v33 = iadd v32, v32
    v38 = iconst.i64 1
    v39 = ishl v33, v38  ; v38 = 1
    call fn0(v39)
    call fn0(v33)
    return

And if we comment out the first call, here is the CLIF for both the old cost function and the incremental-update cost function:

    ...
    v33 = iadd v32, v32
    v38 = iconst.i64 1
    v39 = ishl v33, v38  ; v38 = 1
    v36 = isub v39, v33
    call fn0(v36)
    return

While the inst-set cost function can still see through x * 2 - x:

    ...
    v33 = iadd v32, v32
    call fn0(v33)
    return

Will post some Sightglass benchmark results in a little bit.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 19 2025 at 01:13):

fitzgen commented on issue #12156:

Initial Sightglass Results

I benchmarked compilation and execution times for

The raw results are below.

Given that our benchmarks' execution isn't affected too much by any of these but is generally leaning positive towards the inst-set cost function, we (surprisingly) get a compilation speedup with it, and that it accounts for the most substructure sharing in theory out the three options, I'm going to continue pursuing that approach.

Compilation

incr-cost-update.so vs. inst-set.so

<details>

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 56736.35 ± 28906.20 (confidence = 95%)

  inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so!

  [14348083 14424211.45 14508124] incr-cost-update.so
  [14285793 14367475.10 14484470] inst-set.so

compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm

  Δ = 7695.45 ± 5606.84 (confidence = 95%)

  inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so!

  [2100829 2115908.85 2129910] incr-cost-update.so
  [2094815 2108213.40 2120720] inst-set.so

compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 23876.40 ± 18942.83 (confidence = 95%)

  inst-set.so is 1.00x to 1.01x faster than incr-cost-update.so!

  [8074884 8159311.95 8217976] incr-cost-update.so
  [8068552 8135435.55 8181995] inst-set.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 9228.95 ± 4846.06 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [3544513 3560461.75 3578035] incr-cost-update.so
  [3538490 3551232.80 3566901] inst-set.so

compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm

  Δ = 3458.40 ± 2365.48 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [1830921 1836239.40 1840748] incr-cost-update.so
  [1829117 1832781.00 1848816] inst-set.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 808809.85 ± 280559.15 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [492921665 494078167.90 494535116] incr-cost-update.so
  [492325966 493269358.05 493749431] inst-set.so

compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm

  Δ = 615.45 ± 381.06 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [680111 681389.50 683010] incr-cost-update.so
  [679956 680774.05 681474] inst-set.so

</details>

main.so vs. inst-set.so

<details>

compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm

  Δ = 7484.40 ± 5754.15 (confidence = 95%)

  inst-set.so is 1.00x to 1.01x faster than main.so!

  [2094815 2108213.40 2120720] inst-set.so
  [2097021 2115697.80 2133060] main.so

compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 22612.80 ± 17853.16 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [8068552 8135435.55 8181995] inst-set.so
  [8113305 8158048.35 8218755] main.so

compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm

  Δ = 4569.35 ± 2860.23 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [1829117 1832781.00 1848816] inst-set.so
  [1831835 1837350.35 1853297] main.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 34711.95 ± 30446.18 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [14285793 14367475.10 14484470] inst-set.so
  [14333253 14402187.05 14508930] main.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 872967.45 ± 245452.83 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [492325966 493269358.05 493749431] inst-set.so
  [493441547 494142325.50 494562345] main.so

compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm

  Δ = 762.35 ± 366.77 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [679956 680774.05 681474] inst-set.so
  [680707 681536.40 683187] main.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [3538490 3551232.80 3566901] inst-set.so
  [3533856 3556067.10 3583550] main.so

</details>

main.so vs. incr-cost-update.so

<details>

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [14348083 14424211.45 14508124] incr-cost-update.so
  [14333253 14402187.05 14508930] main.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [3544513 3560461.75 3578035] incr-cost-update.so
  [3533856 3556067.10 3583550] main.so

compilation :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [1830921 1836239.40 1840748] incr-cost-update.so
  [1831835 1837350.35 1853297] main.so

compilation :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm

  No difference in performance.

  [680111 681389.50 683010] incr-cost-update.so
  [680707 681536.40 683187] main.so

compilation :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [8074884 8159311.95 8217976] incr-cost-update.so
  [8113305 8158048.35 8218755] main.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [492921665 494078167.90 494535116] incr-cost-update.so
  [493441547 494142325.50 494562345] main.so

compilation :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm

  No difference in performance.

  [2100829 2115908.85 2129910] incr-cost-update.so
  [2097021 2115697.80 2133060] main.so

</details>

Execution

incr-cost-update.so vs. inst-set.so

<details>

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 93581.75 ± 43.16 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [20865519 20865542.75 20865749] incr-cost-update.so
  [20771935 20771961.00 20772154] inst-set.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 563843.20 ± 7.84 (confidence = 95%)

  incr-cost-update.so is 1.00x to 1.00x faster than inst-set.so!

  [221233136 221233145.30 221233181] incr-cost-update.so
  [221796977 221796988.50 221797031] inst-set.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2851242.00 ± 41511.20 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [2112679982 2112745727.45 2112830546] incr-cost-update.so
  [2109813917 2109894485.45 2110054897] inst-set.so

execution :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm

  Δ = 63.10 ± 0.30 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [1432847 1432847.35 1432848] incr-cost-update.so
  [1432784 1432784.25 1432785] inst-set.so

execution :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 12.05 ± 0.25 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than incr-cost-update.so!

  [1482967 1482967.20 1482968] incr-cost-update.so
  [1482955 1482955.15 1482956] inst-set.so

execution :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm

  No difference in performance.

  [77020130 77020136.05 77020159] incr-cost-update.so
  [77020130 77020133.90 77020151] inst-set.so

execution :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm

  No difference in performance.

  [162345998 162346008.80 162346034] incr-cost-update.so
  [162345996 162346009.10 162346036] inst-set.so

</details>

main.so vs. inst-set.so

<details>

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 524604.40 ± 8.60 (confidence = 95%)

  main.so is 1.00x to 1.00x faster than inst-set.so!

  [221796977 221796988.50 221797031] inst-set.so
  [221272372 221272384.10 221272438] main.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1875739.10 ± 40441.13 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [2109813917 2109894485.45 2110054897] inst-set.so
  [2111698119 2111770224.55 2111838249] main.so

execution :: instructions-retired :: benchmarks/blake3-simd/benchmark.wasm

  Δ = 65.10 ± 0.30 (confidence = 95%)

  inst-set.so is 1.00x to 1.00x faster than main.so!

  [1432784 1432784.25 1432785] inst-set.so
  [1432849 1432849.35 1432850] main.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 711.95 ± 33.91 (confidence = 95%)

  main.so is 1.00x to 1.00x faster than inst-set.so!

  [20771935 20771961.00 20772154] inst-set.so
  [20771239 20771249.05 20771390] main.so

execution :: instructions-retired :: benchmarks/shootout/shootout-switch.wasm

  No difference in performance.

  [162345996 162346009.10 162346036] inst-set.so
  [162345994 162346015.55 162346081] main.so

execution :: instructions-retired :: benchmarks/shootout/shootout-keccak.wasm

  No difference in performance.

  [77020130 77020133.90 77020151] inst-set.so
  [77020130 77020134.90 77020160] main.so

execution :: instructions-retired :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [1482955 1482955.15 1482956] inst-set.so
  [1482955 1482955.15 1482956] main.so

</details>

main.so vs. incr-cost-update.so

<details>

execution :: instructions-retired :: benchmarks/pulldown-c
[message truncated]

view this post on Zulip Wasmtime GitHub notifications bot (Dec 22 2025 at 06:20):

bongjunj commented on issue #12156:

The idea here is very interesting and the compilation speedup looks cool.

We need to check if the problem of cost saturation went away (https://github.com/bytecodealliance/wasmtime/issues/12106#issuecomment-3615043494).
If this can prevent the cost from saturating to infinity in the benchmark, we can bring back the removed canonicalizations.

I will look at some results with this strategy.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 28 2025 at 08:02):

bongjunj commented on issue #12156:

The infinity cost problem went away in the inst-set version.

shootout-keccak

// skipped the first part
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v6922 = Cost::Finite { op_cost: 13476, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v6923 = Cost::Finite { op_cost: 13555, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v6930 = Cost::Finite { op_cost: 13351, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v8131 = Cost::Finite { op_cost: 13350, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> best of union(BestEntry(ImmCostMap { cost: Cost::Finite { op_cost: 13351, depth: 0 }, insts: {inst21, ... , inst5388} }, v8131)
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v6931 = Cost::Finite { op_cost: 13470, depth: 0 }
[2025-12-28T06:19:25Z TRACE cranelift_codegen::egraph::elaborate]  -> cost of value v6932 = Cost::Finite { op_cost: 13548, depth: 0 }

When I brought back the DeMorgan rules which caused the performance regression due to the poor judgement caused by the cost saturation, the cost saturation disappeared.

All that remains is to check the perf regression from the no-opt version has been resolved with this.

Benchmark

bench inst-set v-base Speedup
blake3-scalar 316,666 319,718 0.96%
blake3-simd 307,564 316,571 2.93%
bz2 86,561,625 87,697,336 1.31%
pulldown-cmark 6,594,664 6,850,795 3.88%
regex 210,685,003 210,205,862 -0.23%
shootout-ackermann 7,759,800 8,549,359 10.17%
shootout-base64 351,116,296 382,052,837 8.81%
shootout-ctype 796,474,233 831,090,155 4.35%
shootout-ed25519 9,415,587,644 9,595,743,123 1.91%
shootout-fib2 3,012,767,106 3,015,971,015 0.11%
shootout-gimli 5,404,306 5,334,580 -1.29%
shootout-heapsort 2,378,935,527 2,378,745,072 -0.01%
shootout-keccak 21,137,129 25,168,432 19.07%
shootout-matrix 545,135,532 541,156,378 -0.73%
shootout-memmove 36,621,534 38,439,632 4.96%
shootout-minicsv 1,294,436,335 1,498,544,688 15.77%
shootout-nestedloop 673 956 41.94%
shootout-random 439,752,867 630,876,317 43.46%
shootout-ratelimit 39,836,311 39,279,479 -1.40%
shootout-seqhash 8,589,789,109 8,838,928,513 2.90%
shootout-sieve 841,599,406 906,693,706 7.73%
shootout-switch 153,847,745 139,731,817 -9.18%
shootout-xblabla20 2,918,569 3,015,435 3.32%
shootout-xchacha20 4,478,767 4,398,827 -1.78%
spidermonkey 642,585,002 639,089,855 -0.54%

shoout-keccak tells us that the performance regression induced by the cost saturation went away.
I think we can bring back the rules.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 29 2025 at 18:41):

fitzgen commented on issue #12156:

Cleaned up the inst-set branch and put up a PR: https://github.com/bytecodealliance/wasmtime/pull/12230

shoout-keccak tells us that the performance regression induced by the cost saturation went away.
I think we can bring back the rules.

Fantastic! Thanks for looking into that. Would you like to make the PR bringing the rules back?

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2025 at 05:59):

bongjunj commented on issue #12156:

Sure thing! Thanks for the PR.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 30 2025 at 07:58):

bongjunj edited a comment on issue #12156:

Sure thing! Thanks for the PR.
Will bring back the rules if your PR gets merged.


Last updated: Jan 09 2026 at 13:15 UTC