tiran opened issue #4214:
Test Case
wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.
Steps to Reproduce
- extract zipfile
cd wasmtime-stack-py
- Run
wasmtime run --dir . python.wasm -- recursive.py 1500
with wasmtime 0.36 and 0.37Expected Results
wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 3574: 0x193776 - <unknown>!run_mod 3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject 3577: 0x1b23d3 - <unknown>!Py_RunMain 3578: 0x1b2ede - <unknown>!pymain_main 3579: 0x1b2f8d - <unknown>!Py_BytesMain 3580: 0x4c52 - <unknown>!main 3581: 0x31b09e - <unknown>!__main_void 3582: 0x31b01d - <unknown>!__original_main 3583: 0x4c36 - <unknown>!_start 3584: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Actual Results
wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 1145: 0x193776 - <unknown>!run_mod 1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject 1148: 0x1b23d3 - <unknown>!Py_RunMain 1149: 0x1b2ede - <unknown>!pymain_main 1150: 0x1b2f8d - <unknown>!Py_BytesMain 1151: 0x4c52 - <unknown>!main 1152: 0x31b09e - <unknown>!__main_void 1153: 0x31b01d - <unknown>!__original_main 1154: 0x4c36 - <unknown>!_start 1155: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Versions and Environment
Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux
Operating system: Linux (Fedora 36)
Architecture: X84_64
WASI SDK: 16.0
Python 3.12-dev (latest commit on main as of today)
Extra Info
You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/
tiran labeled issue #4214:
Test Case
wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.
Steps to Reproduce
- extract zipfile
cd wasmtime-stack-py
- Run
wasmtime run --dir . python.wasm -- recursive.py 1500
with wasmtime 0.36 and 0.37Expected Results
wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 3574: 0x193776 - <unknown>!run_mod 3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject 3577: 0x1b23d3 - <unknown>!Py_RunMain 3578: 0x1b2ede - <unknown>!pymain_main 3579: 0x1b2f8d - <unknown>!Py_BytesMain 3580: 0x4c52 - <unknown>!main 3581: 0x31b09e - <unknown>!__main_void 3582: 0x31b01d - <unknown>!__original_main 3583: 0x4c36 - <unknown>!_start 3584: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Actual Results
wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 1145: 0x193776 - <unknown>!run_mod 1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject 1148: 0x1b23d3 - <unknown>!Py_RunMain 1149: 0x1b2ede - <unknown>!pymain_main 1150: 0x1b2f8d - <unknown>!Py_BytesMain 1151: 0x4c52 - <unknown>!main 1152: 0x31b09e - <unknown>!__main_void 1153: 0x31b01d - <unknown>!__original_main 1154: 0x4c36 - <unknown>!_start 1155: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Versions and Environment
Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux
Operating system: Linux (Fedora 36)
Architecture: X84_64
WASI SDK: 16.0
Python 3.12-dev (latest commit on main as of today)
Extra Info
You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/
alexcrichton commented on issue #4214:
cc @cfallin, this is probably related to regalloc2 perhaps?
cfallin commented on issue #4214:
It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).
@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific
.wasm
that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected.
We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-)
Thanks for reporting the issue!
cfallin commented on issue #4214:
@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.
Nevermind this, I fail at reading occasionally; I see your .zip at the top of the comment. Thanks!
cfallin edited a comment on issue #4214:
It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).
@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific.wasm
that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected.
We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-)
Thanks for reporting the issue!
alexcrichton commented on issue #4214:
I got an odd result with bisection locally reproducing with the
*.zip
above. I'm on an aarch64 machine and the v0.37.0 release actually runs therecursive.py
just fine. Onmain
, however, I get the stack overflow as reported here. Bisection points to 0824abbae47a88e3c0fa651e61be3c49fcddc0fc as the "cause" and I can confirm that before that commitrecursive.py
runs ok but on that commitrecursive.py
runs out of stack.I suspect there's probably x86_64-vs-aarch64 differences which makes it such that v0.37.0 fails for @tiran but doesn't fail for me. Otherwise though there may be something going on with the alias analysis added that accidentally increases stack sizes. (or it was really close to the limit prior and alias analysis just happened to push it over the edge or something like that)
cfallin commented on issue #4214:
I did a quick analysis of compiled objects from the
python.wasm
in the zip before (0.35.2 that I happened to have lying around) and after (main
from today). I grepped out allsub $IMM, %rsp
instructions, which appear in prologues to allocate stack frames, then sorted and counted instances to form histograms.Before:
2517 0x10 1654 0x20 1069 0x30 1376 0x40 864 0x50 495 0x60 302 0x70 150 0x80 121 0x90 61 0xa0 39 0xb0 20 0xc0 15 0xd0 19 0xe0 8 0xf0 5 0x100 5 0x110 2 0x120 1 0x130 1 0x160 2 0x170 1 0x180 2 0x190 2 0x1a0 1 0x260
after:
2471 0x10 1640 0x20 1212 0x30 1471 0x40 791 0x50 424 0x60 273 0x70 141 0x80 101 0x90 62 0xa0 32 0xb0 33 0xc0 26 0xd0 20 0xe0 11 0xf0 3 0x100 6 0x110 8 0x120 8 0x130 6 0x140 1 0x150 2 0x180 1 0x190 2 0x1a0 1 0x1b0 1 0x1c0 1 0x1e0 2 0x1f0 2 0x200 1 0x310 2 0x3e0 1 0x550 1 0x620 1 0x950
so we are seeing significantly larger stack frames in some cases.
My current hypothesis is that alias analysis is creating longer liveranges by doing what it does (finding aliases, using values from further up the function body), and so this is potentially causing more spilling. This is a pretty classic code-motion problem with the usual NP-hard flavor -- do we reuse a value, but pay for storing it, or do we recompute it (in this case reloading); the former is better if we have free registers, the latter if we don't, but compiler phasing puts this decision before we get to regalloc and know for sure.
The alias analysis got some benefits but they were middling at best (see #4163), because it actually would do better when combined with GVN and other analyses concurrently; I'm happy to keep it in-tree but turn it off by default if we think that's the more predictable path for now, and turn it back on as part of a bigger mid-end optimization effort. Thoughts?
cfallin commented on issue #4214:
and with alias analysis disabled, the distribution is:
2473 0x10 1644 0x20 1202 0x30 1500 0x40 807 0x50 431 0x60 255 0x70 142 0x80 83 0x90 60 0xa0 32 0xb0 32 0xc0 16 0xd0 21 0xe0 14 0xf0 5 0x100 6 0x110 6 0x120 8 0x130 3 0x140 1 0x150 2 0x180 3 0x190 2 0x1b0 1 0x1e0 4 0x1f0 1 0x360 1 0x370 1 0x3d0 1 0x560 1 0x640 1 0x930
so there are still some large stack frames, but slightly fewer. On my x86-64 machine, with alias analysis disabled I still get a Wasm stack overflow trap with the repro above. I suspect the difference wrt @alexcrichton 's results on aarch64 is that aarch64 has twice as many registers so needs smaller stackframes generally.
So:
- I can take a look at stackslot merging in RA2 to see if it's doing anything clearly wrong, or easy to fix;
- We can turn off alias analysis for now.
alexcrichton commented on issue #4214:
I picked the biggest stack frame in the object which was for
_wasm_function_3078
(_PyEval_EvalFrameDefault
) and compared the sizes of it:
0.36.0 0.37.0 x86_64 608 2816 aarch64 528 1168 This function is indeed part of the recursive trace in the stack traces above, which means that inflating the stack size of this function indeed reduces the amount of recursion available.
Given that 0.37.0 doesn't have alias analysis while I'm sure it affects stack usage somewhat (as is evidenced from my bisection run on aarch64) this seems more worrisome. While different stack sizes are expected is it expected that ra2 generates a stack size that's 4x larger for equivalent code on x86_64?
cfallin commented on issue #4214:
Indeed, that corresponds to the 0x260 (608) in the histogram above; 0x950 (2384) bytes on
main
.Definitely not expected! I'm looking into the stackslot merging as mentioned above...
tiran commented on issue #4214:
Thank you for looking into the issue. I appreciate your promptly and detailed feedback.
I'm in the process of adding experimental support for wasm32-emscripten and wasm32-wasi to CPython upstream. Basic Emscripten support is mostly done. We now have all downstream patches from Pyodide integrated into CPython upstream and are able to run the test suite successfully under Node. WASI support is in an earlier state. I had all crashes fixed with WASI SDK 15 and wasmtime 0.36. WASI SDK 16 fixed a bug in utimensat that affected CPython. Right now I'm going through all remaining test failures and added skip markers to test cases that do not work with wasmtime (e.g. symlinks with absolute paths, tests that depend on working chmod/umask, etc.).
Once I have WASI builds + wasmtime figured out and stable, I plan to look into other runtime. Does WASI offer an API to introspect the call stack at runtime so we could determinate the Python recursion limit dynamically?
@alexcrichton The function
_PyEval_EvalFrameDefault
is the core of Python's ceval loop and executes Python byte code. It's a very long function with lots of locals. On GCC and clang it normally uses computed gotos. Since WASM target does not support computed gotos, it falls back to a long switch statement.
alexcrichton commented on issue #4214:
Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be.
Otherwise though it definitely makes sense that
_PyEval_EvalFrameDefault
needs to be pretty optimized! @cfallin is investigating the possibility of an accidental bug being introduced when Wasmtime switched from one register allocator to another. The new register allocator sems to produce much larger stack sizes than before for_PyEval_EvalFrameDefault
which is believed to be an unintended result of switching register allocators.The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well
alexcrichton edited a comment on issue #4214:
Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be.
Otherwise though it definitely makes sense that
_PyEval_EvalFrameDefault
needs to be pretty optimized! @cfallin is investigating the possibility of an accidental bug being introduced when Wasmtime switched from one register allocator to another. The new register allocator seems to produce much larger stack sizes than before for_PyEval_EvalFrameDefault
which is believed to be an unintended result of switching register allocators.The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well
cfallin commented on issue #4214:
Alright, I've discovered a truly boneheaded mistake of my own making in the spillslot allocation in regalloc2; PR incoming, but the histogram is now:
2471 0x10 1640 0x20 1212 0x30 1471 0x40 829 0x50 432 0x60 274 0x70 151 0x80 97 0x90 57 0xa0 43 0xb0 23 0xc0 15 0xd0 11 0xe0 5 0xf0 7 0x100 5 0x110 4 0x120 3 0x130 1 0x170 3 0x190 1 0x1b0 1 0x1e0 2 0x2f0
or a maximum frame size of 752 bytes (on x86-64). The remaining delta from regalloc.rs I believe can be explained by: (i) regalloc.rs reuses frame space between integer and float/vec freelists, while RA2 does not, for simplicity (they are different size classes); and (ii) RA2 bounds the probing (testing for overlap of liveranges) of spillslots to 10 attempts max, to avoid quadratic behavior, while regalloc.rs does not have such a bound, so gets more compaction sometimes at the cost of compile time.
cfallin commented on issue #4214:
Changes are up in a PR; maybe if we're lucky, we can get it reviewed, version-bumped, and pulled into Cranelift today before the 0.38 branch starts its two-week freeze/fuzzing period on Jun 5 (this weekend)!
cfallin closed issue #4214:
Test Case
wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.
Steps to Reproduce
- extract zipfile
cd wasmtime-stack-py
- Run
wasmtime run --dir . python.wasm -- recursive.py 1500
with wasmtime 0.36 and 0.37Expected Results
wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 3574: 0x193776 - <unknown>!run_mod 3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject 3577: 0x1b23d3 - <unknown>!Py_RunMain 3578: 0x1b2ede - <unknown>!pymain_main 3579: 0x1b2f8d - <unknown>!Py_BytesMain 3580: 0x4c52 - <unknown>!main 3581: 0x31b09e - <unknown>!__main_void 3582: 0x31b01d - <unknown>!__original_main 3583: 0x4c36 - <unknown>!_start 3584: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Actual Results
wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.
Error: failed to run main module `python.wasm` Caused by: 0: failed to invoke command default 1: wasm trap: call stack exhausted wasm backtrace: 0: <unknown>!_PyEval_EvalFrameDefault 1: 0x14e399 - <unknown>!_PyEval_Vector 2: 0x57e85 - <unknown>!_PyFunction_Vectorcall 3: 0x57d9d - <unknown>!PyObject_CallOneArg 4: 0xc85dd - <unknown>!slot_tp_repr ... 1145: 0x193776 - <unknown>!run_mod 1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject 1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject 1148: 0x1b23d3 - <unknown>!Py_RunMain 1149: 0x1b2ede - <unknown>!pymain_main 1150: 0x1b2f8d - <unknown>!Py_BytesMain 1151: 0x4c52 - <unknown>!main 1152: 0x31b09e - <unknown>!__main_void 1153: 0x31b01d - <unknown>!__original_main 1154: 0x4c36 - <unknown>!_start 1155: 0x3343f7 - <unknown>!_start.command_export note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information
Versions and Environment
Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux
Operating system: Linux (Fedora 36)
Architecture: X84_64
WASI SDK: 16.0
Python 3.12-dev (latest commit on main as of today)
Extra Info
You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/
cfallin commented on issue #4214:
@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!
tiran commented on issue #4214:
@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!
Thank you for the quick analysis and fix. Much appreciated!
The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well
@alexcrichton Will do! :) As I mentioned before we are at a very early stage of WASI support. I'm new to WebAssembly world, too. I started with Emscripten support for CPython little more than half a year ago. I'll start to open feature requests once I have a better understanding of WASI. In the mean time you can find my notes at https://github.com/python/cpython/blob/main/Tools/wasm/README.md
Last updated: Nov 22 2024 at 16:03 UTC