github-actions[bot] opened issue #12821:
See https://github.com/bytecodealliance/wasmtime/actions/runs/23394837677
alexcrichton commented on issue #12821:
I've started seeing this spuriously on PRs too:
- https://github.com/bytecodealliance/wasmtime/actions/runs/23384684882/job/68029697928
- https://github.com/bytecodealliance/wasmtime/actions/runs/23382746300/job/68024739930
- https://github.com/bytecodealliance/wasmtime/actions/runs/23394837677/job/68055922685
- https://github.com/bytecodealliance/wasmtime/actions/runs/23378185726/job/68013470956
so I don't think this is spurious...
cfallin commented on issue #12821:
I'll try to root-cause this today...
cfallin commented on issue #12821:
A little more info: this is definitely a real stack overflow; I've reproduced locally (not perfectly deterministically) with `cargo test --test all`.
I've got a coredump and the overflow is in Cranelift compilation, in ISLE-generated Rust code. Here's the smoking gun -- check out this stackprobe sequence:

    (gdb) disas
    Dump of assembler code for function _ZN17cranelift_codegen4opts14generated_code20constructor_simplify17he4342ea94d307425E:
       0x000055d480399980 <+0>:     mov    %rsp,%r11
       0x000055d480399983 <+3>:     sub    $0x44000,%r11
       0x000055d48039998a <+10>:    sub    $0x1000,%rsp
    => 0x000055d480399991 <+17>:    movq   $0x0,(%rsp)
       0x000055d480399999 <+25>:    cmp    %r11,%rsp
       0x000055d48039999c <+28>:    jne    0x55d48039998a <_ZN17cranelift_codegen4opts14generated_code20constructor_simplify17he4342ea94d307425E+10>
       0x000055d48039999e <+30>:    sub    $0x998,%rsp
       0x000055d4803999a5 <+37>:    mov    %rdi,0x18de8(%rsp)
       0x000055d4803999ad <+45>:    mov    %esi,0x18df4(%rsp)
       0x000055d4803999b4 <+52>:    mov    %rdx,0x18de0(%rsp)
       0x000055d4803999bc <+60>:    mov    %rdi,0x3ed70(%rsp)
       0x000055d4803999c4 <+68>:    mov    %esi,0x3ed7c(%rsp)
       0x000055d4803999cb <+75>:    mov    %rdx,0x3ed80(%rsp)
       0x000055d4803999d3 <+83>:    movb   $0x0,0x3ed6f(%rsp)

That `0x44000` indicates that we have a stack frame size for one (1) frame of `simplify` of... 272 KiB (!!). With our rewrite depth limit of 5, that's a little over a megabyte of stack.
Now, a stack frame size that large is absurd: no single point in the generated code should have more than a few live variables. Note that this is in a debug build, so it's likely we're getting terrible or no regalloc for all the locals, and there are a ton of locals throughout the thousands of matched cases. I wonder if we can force optimization (or at least regalloc).
Splitting the body of the matches I don't think would help much: there still needs to be one single large level with all the opcode cases, so we'll still have a bunch of locals.
We could bound recursion even more tightly, say to only two levels, but that does fundamentally limit performance upside (chained rewrites) too.
I'll go read the generated source and think a bit more...
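For reference, the frame-size arithmetic in this comment can be checked with a short Rust sketch. The two constants are read off the `sub $0x44000,%r11` and `sub $0x998,%rsp` instructions in the disassembly, and the depth of 5 is the rewrite depth limit quoted above. The constant and function names are made up for illustration, and the estimate assumes all five recursive frames are `constructor_simplify` frames, ignoring any intermediate callers and return addresses.

```rust
/// Bytes covered by the stack-probe loop (`sub $0x44000,%r11`).
const PROBED_BYTES: u64 = 0x44000;
/// Remaining bytes allocated after the probe loop (`sub $0x998,%rsp`).
const TAIL_BYTES: u64 = 0x998;
/// Rewrite depth limit quoted in the comment (assumed to equal recursion depth).
const REWRITE_DEPTH: u64 = 5;

/// Total size of one frame, as allocated by the prologue.
fn frame_bytes() -> u64 {
    PROBED_BYTES + TAIL_BYTES
}

/// Rough worst case: one frame per rewrite level, nothing else on the stack.
fn worst_case_bytes() -> u64 {
    frame_bytes() * REWRITE_DEPTH
}

fn main() {
    println!("probed region: {} KiB", PROBED_BYTES / 1024);
    println!("one frame:     {} bytes", frame_bytes());
    println!(
        "depth {}:       {:.2} MiB",
        REWRITE_DEPTH,
        worst_case_bytes() as f64 / (1024.0 * 1024.0)
    );
}
```

The probed region alone is 272 KiB, and five such frames come to roughly 1.3 MiB, which matches the "a little over a megabyte" estimate.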
cfallin commented on issue #12821:
And a little more: naively slapping `#[optimize(speed)]` on the generated constructor bodies (in the hope of forcing real regalloc) -- even though this is a nightly-only option -- doesn't shrink the stack frame, because I think the size is coming from the many eclass iterators (which internally hold a `SmallVec<[Value; 8]>` traversal stack). The size of that smallvec isn't the issue; rather the issue is that every single use-site creates a separate stackslot and these are not merged. (See rust-lang/rust#61849.) So our stackframe size is linear in the number of rules we have (more precisely, the sum of their term counts in their left-hand sides), even though the number of actively used slots in the stackframe is linear only in the live-set of a given rule matching path.
My current best idea is to do some sort of manual regalloc of iterators -- we need only one per nesting level of the big match function, so we should be able to have O(levels) top-level uninitialized `Option<EtorIter>` and use the one for the appropriate level at each point. I'll play with that in `islec`.
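A minimal sketch of the "one iterator slot per nesting level" idea, to make the shape of the transformation concrete. `EtorIter` here is a hypothetical stand-in with a fixed inline traversal stack, not the type `islec` actually emits, and the two "rules" are invented. The sketch only shows that both forms compute the same result; it makes no claim about how rustc lays out the resulting stack frame.

```rust
/// Hypothetical stand-in for a multi-extractor iterator with an inline
/// traversal stack (the real generated type is not shown in this thread).
struct EtorIter {
    stack: [u32; 8],
    len: usize,
}

impl EtorIter {
    fn new(items: &[u32]) -> Self {
        let mut stack = [0u32; 8];
        let len = items.len().min(8);
        stack[..len].copy_from_slice(&items[..len]);
        EtorIter { stack, len }
    }
}

impl Iterator for EtorIter {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        Some(self.stack[self.len])
    }
}

/// One fresh local per use site: without slot merging, each `let` may
/// get its own stack slot even though the two are never live together.
fn match_fresh_locals(a: &[u32], b: &[u32]) -> u32 {
    let mut total = 0;
    let iter_rule1 = EtorIter::new(a);
    for v in iter_rule1 {
        total += v;
    }
    let iter_rule2 = EtorIter::new(b);
    for v in iter_rule2 {
        total += 2 * v;
    }
    total
}

/// One shared, initially uninitialized slot for this nesting level,
/// re-assigned at each use site.
fn match_shared_slot(a: &[u32], b: &[u32]) -> u32 {
    let mut total = 0;
    let mut level0: Option<EtorIter>;

    level0 = Some(EtorIter::new(a));
    while let Some(v) = level0.as_mut().and_then(|it| it.next()) {
        total += v;
    }

    level0 = Some(EtorIter::new(b));
    while let Some(v) = level0.as_mut().and_then(|it| it.next()) {
        total += 2 * v;
    }
    total
}

fn main() {
    let (a, b) = ([1, 2, 3], [4, 5]);
    println!("fresh locals: {}", match_fresh_locals(&a, &b));
    println!("shared slot:  {}", match_shared_slot(&a, &b));
}
```

A nested match would need one such slot per nesting depth (`level0`, `level1`, ...), which is where the O(levels) bound comes from.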
cfallin commented on issue #12821:
I spent the afternoon trawling through the generated code and disassemblies thinking through a few strategies here. I implemented multi-extractor iterator reuse, but unfortunately despite consolidating the iters it increases stackframe size of `constructor_simplify` to `0x45000` bytes.
I suspect now (unfortunately I'm not aware of good tooling to actually explain/blame stack frame size) that this is actually the aggregation of all locals in the debug build -- every single `Value`, or array-of-two-values, or little result struct, or whatever.
This is confirmed by looking at the release-mode compilation of `clif-util`: `constructor_simplify` has a stackframe size of `0x1000` bytes (4KiB).
I am not sure what we can do without a stable rustc feature to force optimization for ISLE-generated functions. I suppose we could use the function-splitting that was added in #12303 and enable it by default in debug builds with a very low threshold -- thoughts? (I'm out for the day but I'll pick this back up tomorrow)
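To illustrate the general idea behind function splitting (this is not the mechanism from #12303, whose details aren't shown in this thread), here is a small sketch. The `Op` enum and all function names are hypothetical. The assumption is that in an unoptimized build every local in a function gets its own slot in that function's frame, so moving each match arm's locals into a separate non-inlined helper leaves the dispatcher with few locals of its own, and only one helper frame is live at a time.

```rust
/// Hypothetical opcode enum standing in for the real instruction set.
#[derive(Clone, Copy)]
enum Op {
    Add,
    Mul,
    Neg,
}

/// Monolithic form: the locals of every arm belong to this one function.
fn simplify_monolithic(op: Op, x: i64, y: i64) -> i64 {
    match op {
        Op::Add => {
            let lhs = x;
            let rhs = y;
            lhs.wrapping_add(rhs)
        }
        Op::Mul => {
            let lhs = x;
            let rhs = y;
            lhs.wrapping_mul(rhs)
        }
        Op::Neg => {
            let arg = x;
            arg.wrapping_neg()
        }
    }
}

// Split form: each arm's locals live in the helper's own frame.
#[inline(never)]
fn simplify_add(x: i64, y: i64) -> i64 {
    let lhs = x;
    let rhs = y;
    lhs.wrapping_add(rhs)
}

#[inline(never)]
fn simplify_mul(x: i64, y: i64) -> i64 {
    let lhs = x;
    let rhs = y;
    lhs.wrapping_mul(rhs)
}

#[inline(never)]
fn simplify_neg(x: i64) -> i64 {
    let arg = x;
    arg.wrapping_neg()
}

/// Dispatcher: only the match itself, with no per-arm locals.
fn simplify_split(op: Op, x: i64, y: i64) -> i64 {
    match op {
        Op::Add => simplify_add(x, y),
        Op::Mul => simplify_mul(x, y),
        Op::Neg => simplify_neg(x),
    }
}

fn main() {
    for op in [Op::Add, Op::Mul, Op::Neg] {
        println!(
            "{} {}",
            simplify_monolithic(op, 6, 7),
            simplify_split(op, 6, 7)
        );
    }
}
```

Whether this helps in practice depends on how much of the frame comes from per-arm locals versus values the dispatcher itself must hold, which is the open question in the comment above.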
fitzgen closed issue #12821:
See https://github.com/bytecodealliance/wasmtime/actions/runs/23394837677
Last updated: Apr 12 2026 at 23:10 UTC