Stream: git-wasmtime

Topic: wasmtime / issue #4214 wasmtime 0.37 has much smaller cal...


view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 14:20):

tiran opened issue #4214:

Test Case

wasmtime-stack-py.zip -- a wasm32-wasi build of the CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.

Steps to Reproduce

Expected Results

wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         3574: 0x193776 - <unknown>!run_mod
         3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         3577: 0x1b23d3 - <unknown>!Py_RunMain
         3578: 0x1b2ede - <unknown>!pymain_main
         3579: 0x1b2f8d - <unknown>!Py_BytesMain
         3580: 0x4c52 - <unknown>!main
         3581: 0x31b09e - <unknown>!__main_void
         3582: 0x31b01d - <unknown>!__original_main
         3583: 0x4c36 - <unknown>!_start
         3584: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Actual Results

wasmtime 0.37 fails with a much lower recursion limit: it starts to run into call stack exhaustion at about 480 Python recursion levels and 1155 Wasm stack frames. Note that the deepest Wasm backtrace frame number with wasmtime 0.36 is 3584.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         1145: 0x193776 - <unknown>!run_mod
         1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         1148: 0x1b23d3 - <unknown>!Py_RunMain
         1149: 0x1b2ede - <unknown>!pymain_main
         1150: 0x1b2f8d - <unknown>!Py_BytesMain
         1151: 0x4c52 - <unknown>!main
         1152: 0x31b09e - <unknown>!__main_void
         1153: 0x31b01d - <unknown>!__original_main
         1154: 0x4c36 - <unknown>!_start
         1155: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Versions and Environment

Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux

Operating system: Linux (Fedora 36)

Architecture: x86_64

WASI SDK: 16.0

Python 3.12-dev (latest commit on main as of today)

Extra Info

You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 14:20):

tiran labeled issue #4214:

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 14:59):

alexcrichton commented on issue #4214:

cc @cfallin, perhaps this is related to regalloc2?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 16:13):

cfallin commented on issue #4214:

It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected.

We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-)

Thanks for reporting the issue!
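
For reference, a minimal embedding-side sketch of the per-instance stack configuration mentioned above, assuming the `wasmtime` crate's `Config::max_wasm_stack` knob (the 4 MiB value and the module path are placeholders, not recommendations):

    // Sketch only: enlarge the Wasm stack budget before instantiating a module.
    use wasmtime::{Config, Engine, Module, Store};

    fn main() -> anyhow::Result<()> {
        let mut config = Config::new();
        config.max_wasm_stack(4 << 20); // bytes of stack available to Wasm code
        let engine = Engine::new(&config)?;
        let _module = Module::from_file(&engine, "python.wasm")?;
        let _store: Store<()> = Store::new(&engine, ());
        // ... add WASI to a Linker, instantiate, and invoke the command's
        // default export here, as `wasmtime run` does.
        Ok(())
    }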

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 16:49):

cfallin commented on issue #4214:

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

Nevermind this, I fail at reading occasionally; I see your .zip at the top of the comment. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 16:49):

cfallin edited a comment on issue #4214:

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 17:10):

alexcrichton commented on issue #4214:

I got an odd result bisecting locally while reproducing with the *.zip above. I'm on an aarch64 machine, and the v0.37.0 release actually runs recursive.py just fine. On main, however, I get the stack overflow as reported here. Bisection points to 0824abbae47a88e3c0fa651e61be3c49fcddc0fc as the "cause", and I can confirm that before that commit recursive.py runs ok but on that commit it runs out of stack.

I suspect there are x86_64-vs-aarch64 differences that make v0.37.0 fail for @tiran but not for me. Otherwise there may be something going on with the newly added alias analysis that accidentally increases stack sizes (or the program was already really close to the limit and alias analysis just happened to push it over the edge).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 17:23):

cfallin commented on issue #4214:

I did a quick analysis of compiled objects from the python.wasm in the zip, before (0.35.2, which I happened to have lying around) and after (main from today). I grepped out all `sub $IMM, %rsp` instructions, which appear in prologues to allocate stack frames, then sorted and counted them to form histograms.
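
A rough way to reproduce this kind of tally (an assumption about the pipeline, not the exact commands used above) is to disassemble the compiled object and feed the text to a small counter, for example this Rust sketch, which reads `objdump -d`-style output on stdin and prints a frame-size histogram:

    // Sketch only: tally `sub $IMM,%rsp` prologue immediates from disassembly on stdin.
    use std::collections::BTreeMap;
    use std::io::{self, BufRead};

    fn main() {
        let mut histogram: BTreeMap<u64, u64> = BTreeMap::new();
        let stdin = io::stdin();
        for line in stdin.lock().lines().flatten() {
            // Only consider `sub` instructions that adjust %rsp by an immediate.
            if !line.contains("sub") || !line.contains(",%rsp") {
                continue;
            }
            // Parse the hex immediate out of e.g. `sub    $0x260,%rsp`.
            if let Some(start) = line.find("$0x") {
                if let Some(end) = line[start..].find(",%rsp") {
                    if let Ok(imm) = u64::from_str_radix(&line[start + 3..start + end], 16) {
                        *histogram.entry(imm).or_insert(0) += 1;
                    }
                }
            }
        }
        for (imm, count) in &histogram {
            println!("{:7} {:#x}", count, imm);
        }
    }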

Before:

   2517 0x10
   1654 0x20
   1069 0x30
   1376 0x40
    864 0x50
    495 0x60
    302 0x70
    150 0x80
    121 0x90
     61 0xa0
     39 0xb0
     20 0xc0
     15 0xd0
     19 0xe0
      8 0xf0
      5 0x100
      5 0x110
      2 0x120
      1 0x130
      1 0x160
      2 0x170
      1 0x180
      2 0x190
      2 0x1a0
      1 0x260

After:

   2471 0x10
   1640 0x20
   1212 0x30
   1471 0x40
    791 0x50
    424 0x60
    273 0x70
    141 0x80
    101 0x90
     62 0xa0
     32 0xb0
     33 0xc0
     26 0xd0
     20 0xe0
     11 0xf0
      3 0x100
      6 0x110
      8 0x120
      8 0x130
      6 0x140
      1 0x150
      2 0x180
      1 0x190
      2 0x1a0
      1 0x1b0
      1 0x1c0
      1 0x1e0
      2 0x1f0
      2 0x200
      1 0x310
      2 0x3e0
      1 0x550
      1 0x620
      1 0x950

so we are seeing significantly larger stack frames in some cases.

My current hypothesis is that alias analysis is creating longer liveranges by doing what it does (finding aliases, using values from further up the function body), and so this is potentially causing more spilling. This is a pretty classic code-motion problem with the usual NP-hard flavor -- do we reuse a value, but pay for storing it, or do we recompute it (in this case reloading); the former is better if we have free registers, the latter if we don't, but compiler phasing puts this decision before we get to regalloc and know for sure.

The alias analysis brought some benefits, but they were middling at best (see #4163), because it would really do better when combined with GVN and other analyses running together. I'm happy to keep it in-tree but turn it off by default if we think that's the more predictable path for now, and turn it back on as part of a bigger mid-end optimization effort. Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 17:35):

cfallin commented on issue #4214:

and with alias analysis disabled, the distribution is:

   2473 0x10
   1644 0x20
   1202 0x30
   1500 0x40
    807 0x50
    431 0x60
    255 0x70
    142 0x80
     83 0x90
     60 0xa0
     32 0xb0
     32 0xc0
     16 0xd0
     21 0xe0
     14 0xf0
      5 0x100
      6 0x110
      6 0x120
      8 0x130
      3 0x140
      1 0x150
      2 0x180
      3 0x190
      2 0x1b0
      1 0x1e0
      4 0x1f0
      1 0x360
      1 0x370
      1 0x3d0
      1 0x560
      1 0x640
      1 0x930

so there are still some large stack frames, but slightly fewer. On my x86-64 machine I still get a Wasm stack-overflow trap with the repro above even with alias analysis disabled. I suspect the difference w.r.t. @alexcrichton's results on aarch64 is that aarch64 has twice as many registers, so it generally needs smaller stack frames.

So:

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 18:04):

alexcrichton commented on issue #4214:

I picked the biggest stack frame in the object, which was for _wasm_function_3078 (_PyEval_EvalFrameDefault), and compared its size across versions (in bytes):

          0.36.0   0.37.0
x86_64       608     2816
aarch64      528     1168

This function is part of the recursive trace in the backtraces above, which means that inflating its stack frame directly reduces the amount of recursion available.

Given that 0.37.0 doesn't have alias analysis (though I'm sure alias analysis affects stack usage somewhat, as evidenced by my bisection run on aarch64), this seems more worrisome. While different stack sizes are expected, is it expected that ra2 generates a stack frame that's roughly 4x larger for equivalent code on x86_64?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 18:21):

cfallin commented on issue #4214:

Indeed, that corresponds to the 0x260 (608) in the histogram above; 0x950 (2384) bytes on main.

Definitely not expected! I'm looking into the stackslot merging as mentioned above...

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 18:26):

tiran commented on issue #4214:

Thank you for looking into the issue. I appreciate your prompt and detailed feedback.

I'm in the process of adding experimental support for wasm32-emscripten and wasm32-wasi to CPython upstream. Basic Emscripten support is mostly done: we now have all downstream patches from Pyodide integrated into CPython upstream and are able to run the test suite successfully under Node. WASI support is in an earlier state. I had all crashes fixed with WASI SDK 15 and wasmtime 0.36, and WASI SDK 16 fixed a bug in utimensat that affected CPython. Right now I'm going through all remaining test failures and adding skip markers to test cases that do not work with wasmtime (e.g. symlinks with absolute paths, tests that depend on working chmod/umask, etc.).

Once I have WASI builds + wasmtime figured out and stable, I plan to look into other runtimes. Does WASI offer an API to introspect the call stack at runtime so we could determine the Python recursion limit dynamically?

@alexcrichton The function _PyEval_EvalFrameDefault is the core of Python's ceval loop and executes Python bytecode. It's a very long function with lots of locals. With GCC and Clang it normally uses computed gotos; since the WASM target does not support computed gotos, it falls back to a long switch statement.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 18:34):

alexcrichton commented on issue #4214:

Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be.

Otherwise, it definitely makes sense that _PyEval_EvalFrameDefault needs to be pretty optimized! @cfallin is investigating the possibility that an accidental bug was introduced when Wasmtime switched from one register allocator to another: the new register allocator seems to produce much larger stack frames than before for _PyEval_EvalFrameDefault, which is believed to be an unintended result of the switch.

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and if you've got thoughts about WASI in general I'm sure the WASI repository would welcome issues too.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 18:35):

alexcrichton edited a comment on issue #4214:

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 21:05):

cfallin commented on issue #4214:

Alright, I've discovered a truly boneheaded mistake of my own making in the spillslot allocation in regalloc2; PR incoming, but the histogram is now:

   2471 0x10
   1640 0x20
   1212 0x30
   1471 0x40
    829 0x50
    432 0x60
    274 0x70
    151 0x80
     97 0x90
     57 0xa0
     43 0xb0
     23 0xc0
     15 0xd0
     11 0xe0
      5 0xf0
      7 0x100
      5 0x110
      4 0x120
      3 0x130
      1 0x170
      3 0x190
      1 0x1b0
      1 0x1e0
      2 0x2f0

or a maximum frame size of 752 bytes (on x86-64). The remaining delta from regalloc.rs I believe can be explained by: (i) regalloc.rs reuses frame space between integer and float/vec freelists, while RA2 does not, for simplicity (they are different size classes); and (ii) RA2 bounds the probing (testing for overlap of liveranges) of spillslots to 10 attempts max, to avoid quadratic behavior, while regalloc.rs does not have such a bound, so gets more compaction sometimes at the cost of compile time.
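
As a schematic illustration of point (ii) only (not regalloc2's actual data structures), bounded spillslot probing looks roughly like this: try a limited number of existing slots for one whose live ranges don't conflict with the new one, and otherwise allocate a fresh slot, trading some stack-frame compaction for compile time:

    // Sketch only: reuse a spillslot if a non-conflicting one is found within the
    // probe budget; otherwise allocate a new slot (which grows the stack frame).
    const MAX_PROBES: usize = 10;

    #[derive(Clone, Copy)]
    struct LiveRange {
        start: u32, // half-open range [start, end) in instruction-index space
        end: u32,
    }

    struct SpillSlot {
        ranges: Vec<LiveRange>,
    }

    impl SpillSlot {
        fn conflicts_with(&self, r: LiveRange) -> bool {
            self.ranges.iter().any(|x| x.start < r.end && r.start < x.end)
        }
    }

    fn assign_spillslot(slots: &mut Vec<SpillSlot>, range: LiveRange) -> usize {
        for (i, slot) in slots.iter_mut().enumerate().take(MAX_PROBES) {
            if !slot.conflicts_with(range) {
                slot.ranges.push(range);
                return i;
            }
        }
        slots.push(SpillSlot { ranges: vec![range] });
        slots.len() - 1
    }

    fn main() {
        let mut slots: Vec<SpillSlot> = Vec::new();
        let a = assign_spillslot(&mut slots, LiveRange { start: 0, end: 10 });
        let b = assign_spillslot(&mut slots, LiveRange { start: 5, end: 12 });  // conflicts -> new slot
        let c = assign_spillslot(&mut slots, LiveRange { start: 20, end: 25 }); // reuses slot 0
        println!("assigned slots: {} {} {}", a, b, c); // 0 1 0
    }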

view this post on Zulip Wasmtime GitHub notifications bot (Jun 03 2022 at 21:28):

cfallin commented on issue #4214:

Changes are up in a PR; maybe if we're lucky, we can get it reviewed, version-bumped, and pulled into Cranelift today before the 0.38 branch starts its two-week freeze/fuzzing period on Jun 5 (this weekend)!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 04 2022 at 00:12):

cfallin closed issue #4214:

view this post on Zulip Wasmtime GitHub notifications bot (Jun 04 2022 at 00:13):

cfallin commented on issue #4214:

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 04 2022 at 12:34):

tiran commented on issue #4214:

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

Thank you for the quick analysis and fix. Much appreciated!

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and if you've got thoughts about WASI in general I'm sure the WASI repository would welcome issues too.

@alexcrichton Will do! :) As I mentioned before, we are at a very early stage of WASI support, and I'm new to the WebAssembly world, too. I started on Emscripten support for CPython a little more than half a year ago. I'll start to open feature requests once I have a better understanding of WASI. In the meantime you can find my notes at https://github.com/python/cpython/blob/main/Tools/wasm/README.md

