gaaraw opened issue #13258:
Describe the bug
table.init appears to have a very expensive non-empty path in Wasmtime in a minimal repeated microbenchmark.
I first found this in a generated differential benchmark, then reduced it to a much smaller testcase. The slowdown remains after removing loop-derived operand shaping and shrinking the table/element resources to the minimum needed.
The smallest clear reproducer I found is:
primary_reproducer_table_init_len1.wat
A close control with len = 0 is:
supporting_control_table_init_len0.wat
Test Case
Primary reproducer loop body:
i32.const 0 i32.const 0 i32.const 1 table.init 0 0
Minimal resources:
(table $tab0 1 funcref) (elem funcref (ref.null func))
Supporting controls:
- supporting_control_table_init_len0.wat (len = 0)
- supporting_len2_table_init.wat (len = 2 with table/elem size 2)
- supporting_table_fill_len1.wat
- supporting_table_copy_len1.wat
Steps to Reproduce
- Build the primary testcase:
wat2wasm --enable-all primary_reproducer_table_init_len1.wat -o primary_reproducer_table_init_len1.wasm
- Warm up once:
wasmtime primary_reproducer_table_init_len1.wasm
- Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_table_init_len1.wasm
- For comparison, run the same flow on:
supporting_control_table_init_len0.wasm
supporting_len2_table_init.wasm
supporting_table_fill_len1.wasm
supporting_table_copy_len1.wasm
If helpful, I can also provide the exact commands I used for the other runtimes in the comparison table.
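The steps above repeat the same three commands for each testcase, so they can be scripted. The following is a minimal sketch in Python that only prints the commands (a dry run); the helper names `commands_for` and `render_script` are hypothetical, and the command lines themselves are the ones given in the steps above.

```python
import shlex

# Testcase base names taken from the report; the .wat files themselves
# are not included here and must be supplied separately.
TESTCASES = [
    "primary_reproducer_table_init_len1",
    "supporting_control_table_init_len0",
    "supporting_len2_table_init",
    "supporting_table_fill_len1",
    "supporting_table_copy_len1",
]


def commands_for(name, repeats=3):
    """Return the build, warm-up, and measurement commands for one testcase."""
    wat, wasm = f"{name}.wat", f"{name}.wasm"
    return [
        ["wat2wasm", "--enable-all", wat, "-o", wasm],  # build
        ["wasmtime", wasm],                             # warm up once
        ["perf", "stat", "-r", str(repeats), "-e", "task-clock",
         "wasmtime", wasm],                             # measure
    ]


def render_script(names=TESTCASES):
    """Render every command as one shell-quoted line (nothing is executed)."""
    return "\n".join(
        shlex.join(argv) for name in names for argv in commands_for(name)
    )


print(render_script())
```

Printing rather than executing keeps the sketch safe to run on a machine without wat2wasm, wasmtime, or perf installed; the output can be piped into a shell once the .wat files are in place.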
Expected and actual Results
Primary reduced table.init results
| testcase | shape | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|---|
| const_len0 | dst=0, src=0, len=0, table=1, elem=1 | 13.2085 | 6.2617 | 2.8532 | 13.4362 | 59.9080 | 3.2286 |
| const_len1 | dst=0, src=0, len=1, table=1, elem=1 | 13.8520 | 9.0505 | 4.1151 | 13.9670 | 99.9186 | 4.6532 |
| const_len2 | dst=0, src=0, len=2, table=2, elem=2 | 14.6396 | 9.0903 | 4.41133 | 14.6610 | 132.7836 | 4.9468 |
| const_src1_len1 | dst=0, src=1, len=1, table=2, elem=2 | 13.7660 | 9.0285 | 4.1430 | 14.1467 | 99.7570 | 4.6662 |
Observed pattern:
- Wasmtime is already much slower than the comparison runtimes for len = 0.
- The cost rises sharply for len = 1 and again for len = 2.
- Changing src from 0 to 1 does not materially change the result.
Target-removed control
A target-removed control with the same outer loop / stack shaping but no table.init is very fast:
| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|
| control_no_target | 0.011744 | 0.022739 | 0.015508 | 0.29056 | 0.28542 | 0.43075 |
So this does not look like a loop/scaffold artifact. The expensive part seems tied to table.init itself.
Related bulk-table instructions
I also compared matched table.fill / table.copy cases with len = 1:
| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|
| table.fill len=1 | 5.0919 | 4.89015 | 2.18633 | 5.36801 | 12.0544 | 2.6832 |
| table.copy len=1 | 6.32213 | 8.8099 | 4.8734 | 6.64548 | 18.5358 | 6.4398 |
Wasmtime is not the fastest there either, but the slowdown is much less dramatic than for table.init.
So the anomaly looks more specific to table.init than to all small bulk-table operations in general.
Versions and Environment
- Wasmtime version: wasmtime 41.0.0 (4898322a4 2025-12-18)
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
If useful, I can also attach the generated CLIF for the reduced testcase.
Extra Info
For the reduced const_len1 testcase, Wasmtime still keeps the hot loop alive and still lowers the operation through the table.init builtin/helper path.
I generated CLIF with:
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_table_init_len1.wasm
In the generated CLIF for the reduced case, the hot loop still contains a per-iteration call equivalent to:
call fn0(vmctx, 0, 0, 0, 0, 1)
So this does not appear to be caused by dead-code elimination or by loop-derived operand shaping.
Based on the measurements, the strongest trigger condition I can currently support is:
- repeated table.init 0 0
- in-bounds
- minimal table / passive element segment
- especially the non-empty path (len > 0)
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.
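To make the comparison across instructions easier to read, the reported timings can be condensed into a single ratio per testcase: Wasmtime's time divided by the fastest other runtime's time. The sketch below does that in Python; the timing values are copied from the tables in this report, while the `slowdown` helper and the choice of "fastest other runtime" as the baseline are assumptions made for illustration. The control_no_target row is left out because its absolute times are all well under a second.

```python
# Column order matches the tables in this report.
RUNTIMES = [
    "wasmer_llvm", "wasmedge_jit", "wamr_llvm_jit",
    "wasmer_cranelift", "wasmtime", "wamr_fast_jit",
]

# Timings in seconds, copied from the report.
TIMINGS = {
    "const_len0":       [13.2085, 6.2617, 2.8532, 13.4362, 59.9080, 3.2286],
    "const_len1":       [13.8520, 9.0505, 4.1151, 13.9670, 99.9186, 4.6532],
    "const_len2":       [14.6396, 9.0903, 4.41133, 14.6610, 132.7836, 4.9468],
    "const_src1_len1":  [13.7660, 9.0285, 4.1430, 14.1467, 99.7570, 4.6662],
    "table.fill len=1": [5.0919, 4.89015, 2.18633, 5.36801, 12.0544, 2.6832],
    "table.copy len=1": [6.32213, 8.8099, 4.8734, 6.64548, 18.5358, 6.4398],
}


def slowdown(row, runtime="wasmtime"):
    """Ratio of `runtime`'s time to the fastest of the other runtimes."""
    times = dict(zip(RUNTIMES, row))
    target = times.pop(runtime)
    return target / min(times.values())


for name, row in TIMINGS.items():
    print(f"{name}: {slowdown(row):.1f}x slower than the fastest other runtime")
```

On the reported numbers this gives roughly 21x to 30x for the table.init cases, against roughly 5.5x for table.fill and 3.8x for table.copy, which matches the observation that the anomaly is concentrated in table.init.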
gaaraw added the bug label to Issue #13258.
gaaraw added the fuzz-bug label to Issue #13258.
alexcrichton added the wasm-proposal:gc label to Issue #13258.
alexcrichton added the performance label to Issue #13258.
alexcrichton commented on issue #13258:
Thanks for the report! This is a known issue with the performance of table-related intrinsics right now. Notably the inner loops here are "very generically written" insofar as they're using high-level APIs which are known to not optimize well. The fix here will be to reimplement/refactor things internally using lower-level APIs. Part of this is due to historical oddities and part of this is due to just how things are right now.
Wasmtime notably implements the full GC proposal, which makes these intrinsics significantly more complicated than they were before the GC proposal. Wasmtime doesn't currently have a fast path for "GC proposal disabled" or something like that.
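The distinction drawn above, between an inner loop written against high-level per-element APIs and one written against lower-level bulk operations, can be illustrated with a small model. The Python sketch below is purely illustrative and is not Wasmtime's code: the names `Trap`, `checked_get`, `checked_set`, `table_init_generic`, and `table_init_bulk` are hypothetical, and tables and element segments are modeled as plain lists. Both variants follow the table.init semantics of trapping up front when either range is out of bounds.

```python
class Trap(Exception):
    """Models a WebAssembly trap."""


def check_bounds(table, segment, dst, src, length):
    # table.init traps before writing anything if either range is out of bounds.
    if src + length > len(segment) or dst + length > len(table):
        raise Trap("out of bounds table access")


def checked_get(items, index):
    # Stand-in for a high-level accessor that re-validates on every call.
    if not 0 <= index < len(items):
        raise Trap("out of bounds table access")
    return items[index]


def checked_set(items, index, value):
    # Stand-in for a high-level setter that re-validates on every call.
    if not 0 <= index < len(items):
        raise Trap("out of bounds table access")
    items[index] = value


def table_init_generic(table, segment, dst, src, length):
    """Per-element path: one checked read and one checked write per entry."""
    check_bounds(table, segment, dst, src, length)
    for i in range(length):
        checked_set(table, dst + i, checked_get(segment, src + i))


def table_init_bulk(table, segment, dst, src, length):
    """Bulk path: a single bounds check followed by one slice copy."""
    check_bounds(table, segment, dst, src, length)
    table[dst:dst + length] = segment[src:src + length]


# Both paths produce the same table contents.
a, b = [None, None], [None, None]
table_init_generic(a, ["f0", "f1"], 0, 0, 2)
table_init_bulk(b, ["f0", "f1"], 0, 0, 2)
print(a == b)
```

The two functions are observably equivalent; the difference is only in how much per-element work the inner loop repeats, which is the kind of overhead a refactor toward lower-level APIs would aim to remove.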
alexcrichton removed the bug label from Issue #13258.
alexcrichton removed the fuzz-bug label from Issue #13258.
fitzgen commented on issue #13258:
@alexcrichton did this get fixed in one of your recent PRs?
alexcrichton commented on issue #13258:
To the best of my measurements, which is somewhat difficult to correlate given the wall-of-text of this issue and how the reproducers don't exactly line up with test names, yes I believe this is fixed. There's one more PR on top of https://github.com/bytecodealliance/wasmtime/pull/13438 I've got which handles table.init which will fully close this, so I'll leave this open til then.
cfallin closed issue #13258.
Last updated: Jun 01 2026 at 09:49 UTC