alexcrichton opened issue #4060:
Fuzzing just found this wasm module which times out while fuzzing. Locally I see:
$ time RAYON_NUM_THREADS=1 ./target/release/wasmtime compile testcase1.wasm --epoch-interruption --wasm-features memory64
RAYON_NUM_THREADS=1 ./target/release/wasmtime compile testcase1.wasm  3.58s user 0.04s system 99% cpu 3.618 total
cc @cfallin
alexcrichton labeled issue #4060:
cfallin commented on issue #4060:
Interesting -- this has an ~11k-level-deep loop nest. Epoch interruption instruments the top of every loop. So we're getting a very complex CFG, substantially increasing the cost of many parts of the compilation pipeline.
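For a concrete sense of the shape involved, a module like this can be approximated by mechanically nesting `loop` blocks. Below is a minimal sketch using only the Rust standard library; the depth is taken from the ~11k figure above, while the trivial `nop` body and file name are made up for illustration and not derived from the actual fuzz input:

```rust
// Emit a WAT module whose single function is `depth` nested loops.
// With --epoch-interruption enabled, an epoch check is inserted at the top
// of each loop, so the CFG grows with the nesting depth.
fn nested_loop_wat(depth: usize) -> String {
    let mut body = String::new();
    for _ in 0..depth {
        body.push_str("loop\n");
    }
    body.push_str("nop\n");
    for _ in 0..depth {
        body.push_str("end\n");
    }
    format!("(module (func\n{body}))")
}

fn main() -> std::io::Result<()> {
    // Roughly the nesting depth mentioned above.
    let wat = nested_loop_wat(11_000);
    std::fs::write("deep-loops.wat", wat)?;
    // Then, e.g.: wasmtime compile deep-loops.wat --epoch-interruption
    Ok(())
}
```

Each of those loop headers becomes an instrumented back-edge target, which is where the CFG blowup described above comes from.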
I don't know that this can be squarely pinned on regalloc2 per se: just before the merge (at commit bfae6384aa7fb6f64046bc33a62e36d0ba2a1b84) on my machine, I see:
cfallin@cfallin-tower:~/work/wasmtime2$ time RAYON_NUM_THREADS=1 target/release/wasmtime compile ~/testcase1.wasm --epoch-interruption --wasm-features memory64

real    0m0.613s
user    0m0.585s
sys     0m0.028s
and with current main (5c2db166f1792ea0db2abc4ae1994cc1af44cede), I see:
cfallin@cfallin-tower:~/work/wasmtime2$ time RAYON_NUM_THREADS=1 target/release/wasmtime compile ~/testcase1.wasm --epoch-interruption --wasm-features memory64

real    0m0.821s
user    0m0.777s
sys     0m0.044s
So that's a 34% slowdown. It's unclear from the profile if anything is really out-of-whack; time is spent approximately where I would expect.
In general when transitioning from one large, complex system to another I don't think we can expect uniform speedups: for an NP-hard problem like register allocation, where heuristics are everything, there will always be edge-cases and tradeoffs. Unless we design the system for completely deterministic linear compilation (a la baseline JIT compilers, which have different tradeoffs e.g. ~10x slower runtime) there will be inputs that cause slowdowns relative to some other algorithm.
So all that to say, I don't see anything actionable here; I don't know what to do with it. We can keep it open if we like, but it's unclear what the "acceptance criteria" are, to me: do we want compilation to never time out for the fuzz testcase generator parameters that we set? If so, we should tune the fuzz testcase generator. But that doesn't strike me as very useful -- we're just hammering it into shape around our existing code's performance characteristics. Or I suppose we just want there never to be an actual infinite loop or degenerate (exponential time) case: then the timeout should be much much higher relative to the size of inputs.
This is actually an open-ended question for me: what are our goals? "As fast as possible", sure, but when is performance a bug?
cfallin edited a comment on issue #4060:
Interesting -- this has an ~11k-level-deep loop nest. Epoch interruption instruments the top of every loop. So we're getting a very complex CFG, substantially increasing the cost of many parts of the compilation pipeline.
~I don't know that this can be squarely pinned on regalloc2 per se~ (EDIT: on looking at the perf diff more it is mostly a difference in regalloc time): just before the merge (at commit bfae6384aa7fb6f64046bc33a62e36d0ba2a1b84) on my machine, I see:
cfallin@cfallin-tower:~/work/wasmtime2$ time RAYON_NUM_THREADS=1 target/release/wasmtime compile ~/testcase1.wasm --epoch-interruption --wasm-features memory64

real    0m0.613s
user    0m0.585s
sys     0m0.028s
and with current main (5c2db166f1792ea0db2abc4ae1994cc1af44cede), I see:
cfallin@cfallin-tower:~/work/wasmtime2$ time RAYON_NUM_THREADS=1 target/release/wasmtime compile ~/testcase1.wasm --epoch-interruption --wasm-features memory64

real    0m0.821s
user    0m0.777s
sys     0m0.044s
So that's a 34% slowdown. It's unclear from the profile if anything is really out-of-whack; time is spent approximately where I would expect.
In general when transitioning from one large, complex system to another I don't think we can expect uniform speedups: for an NP-hard problem like register allocation, where heuristics are everything, there will always be edge-cases and tradeoffs. Unless we design the system for completely deterministic linear compilation (a la baseline JIT compilers, which have different tradeoffs e.g. ~10x slower runtime) there will be inputs that cause slowdowns relative to some other algorithm.
So all that to say, I don't see anything actionable here; I don't know what to do with it. We can keep it open if we like, but it's unclear what the "acceptance criteria" are, to me: do we want compilation to never time out for the fuzz testcase generator parameters that we set? If so, we should tune the fuzz testcase generator. But that doesn't strike me as very useful -- we're just hammering it into shape around our existing code's performance characteristics. Or I suppose we just want there never to be an actual infinite loop or degenerate (exponential time) case: then the timeout should be much much higher relative to the size of inputs.
This is actually an open-ended question for me: what are our goals? "As fast as possible", sure, but when is performance a bug?
alexcrichton commented on issue #4060:
To be clear, I'm just the messenger here: I'm reporting timeouts from oss-fuzz. I don't mean to presume that this is better or worse than before, or that it should or should not be fixed. I would prefer not to get timeouts on oss-fuzz, so if this isn't fixable in regalloc then I think we need to add configuration to wasm-smith to not generate modules of this shape.
cfallin commented on issue #4060:
Oh for sure, sorry, I don't mean to pick on you specifically as the transporter of bugs from there to here. I'm more annoyed by the oss-fuzz situation (or I guess the combination of oss-fuzz and our configuration) where there's an arbitrary timeout line, over which we have a "bug", without a more explicit conversation about what constitutes acceptable behavior vs. a bug.
The most precise exclusion we could make here, I think, would be to say that deep nesting is problematic: pre-parse inputs in our fuzz targets to find the maximum block/loop nesting level, and reject anything over, say, 1000 levels as "likely to time out".
Or alternately we could reject all fuzz inputs with more than N wasm opcodes -- N = 1k maybe? In general that seems like a not-terrible tradeoff between keeping programs large enough to still be interesting and keeping the fuzzer away from statically-too-large programs.
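A rough sketch of what such a pre-scan could look like, assuming the wasmparser crate's Parser/Payload/Operator API (details vary by version); the `scan` helper and the thresholds in the trailing comment are made up for illustration:

```rust
use wasmparser::{Operator, Parser, Payload};

/// Compute the maximum block/loop/if nesting depth and the total opcode
/// count across all function bodies in `wasm`.
fn scan(wasm: &[u8]) -> wasmparser::Result<(usize, usize)> {
    let (mut max_depth, mut opcodes) = (0usize, 0usize);
    for payload in Parser::new(0).parse_all(wasm) {
        if let Payload::CodeSectionEntry(body) = payload? {
            let mut depth = 0usize;
            let mut reader = body.get_operators_reader()?;
            while !reader.eof() {
                opcodes += 1;
                match reader.read()? {
                    Operator::Block { .. } | Operator::Loop { .. } | Operator::If { .. } => {
                        depth += 1;
                        max_depth = max_depth.max(depth);
                    }
                    Operator::End => depth = depth.saturating_sub(1),
                    _ => {}
                }
            }
        }
    }
    Ok((max_depth, opcodes))
}

// In a fuzz target one might then reject inputs up front, e.g.:
//   let (depth, count) = scan(&wasm)?;
//   if depth > 1_000 || count > 1_000 { return; } // "likely to time out"
```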
Now, I say all that, but it was a fuzz input with deep block nesting, IIRC, that found the O(n^2) behavior in the wasm spec interpreter, so it can still be useful... thoughts?
alexcrichton commented on issue #4060:
If there is no reasonable recourse to fix issues like this, then personally I feel this is a turning point for Wasmtime and means we need to update our documentation and such. For example our README currently says:
Fast. Wasmtime is built on the optimizing Cranelift code generator to quickly generate high-quality machine code at runtime.
and if we can't fix issues like this then I don't think that's correct to leave in the README (specifically the "quickly generate" part). For example, issues like https://github.com/bytecodealliance/wasmtime/issues/3523 come from real-world wasm modules that users are generating, and we would basically need to close that as "WONTFIX" (since that's an old bug, I just measured it: the 0.36.0 release compiles the extract.wasm.gz module in one of my comments in 17 seconds, and the main branch compiles it in 7 seconds).
Separately from that, for the fuzz input I don't feel like there are any great options here. If we artificially limit the size of wasms then we're going to fundamentally have a less interesting space of modules to explore. My gut is that we should implement simple heuristics one at a time and basically play whack-a-mole until we stop getting timeouts on oss-fuzz. It doesn't seem to me that min-maxing to find the most interesting modules that Cranelift can still compile in time is going to be too worthwhile; we'll get more mileage from just getting fewer timeouts on oss-fuzz.
cfallin commented on issue #4060:
So, I do appreciate how nice it would be to never have outliers in compilation time. I just think this is a matter of setting valid/reasonable expectations. Optimizing compilers involve slower-than-linear algorithms, and register allocation is NP-hard under certain assumptions (splitting in particular blows up the possibility space, and register constraints make the solver's life harder too). If we can make Cranelift-with-optimizing-settings "quickly generate code" in all cases, then we have solved P=NP. I don't mean to be glib, but I do feel that there is a widely varying algorithmic-complexity spectrum underlying superficially similar problems here that is sometimes not fully appreciated. This is a "years of research and a PhD dissertation comes out of it" level of challenge if we want to do the heuristics well and ultimately slow/stop the trickle of outliers, not a simple bugfix.
That challenge isn't uniquely ours, either: one can find "slow compilation" bugs for other compilers too. I spoke with Julian Seward recently as he was tackling a slowdown in IonMonkey (also regalloc-related). That's a backend with engineer-decades of tuning!
So the main thing I'd push back on is the framing: I am not sure this is a "turning point", unless I've misunderstood the extent of our positioning/marketing; to my knowledge we have never said that we guarantee linear-time compilation, or anything of the sort. If compilation is super-linear, then there will be outliers.
If others feel differently, that we must never have compilation time outliers, then this could be justification for building a baseline compiler. For what it's worth that is probably about a year of work to reach parity with current functionality and to optimize it well enough to rely on as a first tier.
Anyway, that's my perspective on this; sorry, I don't mean to be too rigid; I just need to say "not possible" when something seems not possible (at the requested level) to me :-)
alexcrichton commented on issue #4060:
I don't mean to trivialize register allocation by saying "it should always be lightning fast"; I'm well aware of the difficulties of register allocation and of optimizing an allocator. I'm pointing out that if a user hits a slow compilation it's probably going to be surprising, because:
- AFAIK most documentation about Wasmtime says that it's suitable in a JIT context. In my mind, though, a JIT context means that latency-to-code-running is extremely important. That, however, is not really of the utmost importance to Wasmtime at this time and fundamentally won't ever be optimized for unless we have a baseline compiler of sorts.
- Additionally if users report a slow compilation we're not really motivated to look into it at all. It's very easy to open up a profiler, see register allocation, and then forget about it. In a context where we do care about compilation speed and at least getting close to a JIT-like performance profile this seems worrisome to me.
I'm not proposing we go off and create a baseline JIT. What I'm saying is that if we have reached the line in the sand where further "big" optimizations to register allocation are no longer going to happen, then we need to reflect the current state of the world in how we communicate about Wasmtime (both in the README and in how we talk about it). Whenever we mention that Wasmtime can be used as a JIT we need to have a footnote explaining that there will be outliers that take a very long time to compile. We also need to take fuzz bugs back to wasm-smith and make sure it doesn't generate such a slow-to-compile module any more.
Basically we need to do something. I don't think it's reasonable to close a bug like this (and the other slow-to-compile bugs we have) and just ignore them.
cfallin commented on issue #4060:
> Additionally if users report a slow compilation we're not really motivated to look into it at all. It's very easy to open up a profiler, see register allocation, and then forget about it. In a context where we do care about compilation speed and at least getting close to a JIT-like performance profile this seems worrisome to me.
Sure, to be clear, I took a look at this bug and am happy to do so for future slowdown bugs, and if there is anything obvious ("hey we forgot about this O(n^2) pass here, we can do better") I'm happy to fix it ASAP. The particular input is causing a lot of splits though and that's kind of fundamental to the design; a heuristic to band-aid this case might break others or cause worse performance on average. The agonizingly annoying reality of problems like this is that we can never win the game of whack-a-mole. In other words, I could go write regalloc3, get it completed and merged in 2023... and this timeout might go away, but then we'll have new ones.
So I'm not really saying that "big optimizations won't happen" so much as: this kind of bug won't go away, there will always be outliers, and we should be prepared for that reality somehow.
Re: how we advertise, all I can say I guess is that we're in good company -- most other compilers will have outliers too -- so we could add the asterisk, but then LLVM needs that, and SpiderMonkey (*for the optimizing tier), and Golang, and ...
I talked with @tschneidereit a bit yesterday about this and setting expectations with optimizing compilers and outliers; Till, any more thoughts here?
tschneidereit commented on issue #4060:
My view on this is that we should aim to have roughly the same profile as browser engines' optimizing tiers—and those do have these kinds of pathological cases for the reasons Chris describes.
That does mean that there are use cases for which Cranelift isn't the right fit: those that need guarantees around linear(-ish) compilation time. For those, a baseline compiler is needed on a pretty fundamental level.
In turn this means that there very clearly is room for a baseline tier or even an interpreter in wasmtime, and I expect that to materialize eventually. For use cases where these kinds of pathological cases aren't acceptable, but the default should still be optimized compilation, one could imagine setups like using Cranelift with a timeout, and falling back to baseline if the timeout is hit.
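A very rough sketch of what that fallback setup could look like; since no baseline tier exists in Wasmtime today, `compile_optimizing` and `compile_baseline` below are hypothetical stand-ins rather than real APIs:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins: in a real embedding these would invoke the
// optimizing (Cranelift) pipeline and a baseline tier respectively.
fn compile_optimizing(wasm: Vec<u8>) -> Vec<u8> { wasm }
fn compile_baseline(wasm: &[u8]) -> Vec<u8> { wasm.to_vec() }

/// Try the optimizing tier first, but fall back to the baseline tier if it
/// does not finish within `budget`. Note that the spawned optimizing compile
/// keeps running in the background after a timeout; a production setup would
/// want to cancel it or keep its result for later.
fn compile_with_fallback(wasm: &[u8], budget: Duration) -> Vec<u8> {
    let (tx, rx) = mpsc::channel();
    let input = wasm.to_vec();
    thread::spawn(move || {
        // The receiver may have given up already; ignore send errors.
        let _ = tx.send(compile_optimizing(input));
    });
    match rx.recv_timeout(budget) {
        Ok(code) => code,
        Err(_) => compile_baseline(wasm),
    }
}

fn main() {
    let module = b"\0asm\x01\0\0\0"; // minimal 8-byte wasm header, for illustration only
    let _code = compile_with_fallback(module, Duration::from_millis(500));
}
```

One design question such a setup would need to answer is what to do with the still-running optimizing compile after a timeout: discard it, or let it finish and swap in the better code later, as tiering browser engines do.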
alexcrichton commented on issue #4060:
I agree with @tschneidereit but I don't feel that our current documentation and thinking of Wasmtime reflects this. We can't bill ourselves as "suitable for JIT in all cases" given the compile-time outliers we're fundamentally going to have without a baseline compiler in my opinion. We need to make sure documentation and such indicates that the JIT part of Wasmtime is only suitable if outliers in compile times are acceptable.
> Re: how we advertise, all I can say I guess is that we're in good company -- most other compilers will have outliers too -- so we could add the asterisk, but then LLVM needs that, and SpiderMonkey (*for the optimizing tier), and Golang, and ...
Personally this is what I want to push back against. Wasmtime is notably different from SpiderMonkey due to its lack of a baseline compiler or interpreter, which in my mind makes that an apples-to-oranges comparison.
For other compilers, I'm under the impression that no one expects LLVM to be fast, and I suspect Go folks similarly don't expect 100% of code to compile quickly. Users on the web, however, do generally expect that 100% of code at least gets running quickly.
Overall I'm probably going to just ignore timeouts in fuzz bugs for now and we'll figure out a way to deal with them if that ever becomes a problem. Otherwise I don't think it's worthwhile keeping an issue open for this, so I'm going to close it.
alexcrichton closed issue #4060: