Stream: git-wasmtime

Topic: wasmtime / issue #4431 `wasmtime`: Implement fast Wasm st...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 11 2022 at 19:52):

fitzgen commented on issue #4431:

TODO

view this post on Zulip Wasmtime GitHub notifications bot (Jul 11 2022 at 20:04):

github-actions[bot] commented on issue #4431:

Subscribe to Label Action

cc @fitzgen, @peterhuene

<details>
This issue or pull request has been labeled: "wasmtime:api", "wasmtime:ref-types"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 12 2022 at 01:09):

github-actions[bot] commented on issue #4431:

Subscribe to Label Action

cc @fitzgen

<details>
This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 12 2022 at 07:42):

uweigand commented on issue #4431:

I still need to have a closer look. But just one comment on this:

Within a sequence of Wasm-to-Wasm calls, we can use
frame pointers (which Cranelift preserves) to find the next older Wasm frame on
the stack, and we keep doing this until we reach the entry stack pointer,
meaning that the next older frame will be a host frame.

Actually, the Cranelift s390x back end does not maintain frame pointers at all in Wasm code - frame pointers are really only necessary in our ABI for frames that use dynamic stack allocation, and that never happens in Wasm, so right now we don't ever use frame pointers ... I guess it would be possible to add (the equivalent of) frame pointers, but that would incur a runtime overhead for all Wasm code.
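
For context, the frame-pointer walk described in the quote amounts to chasing a linked list of saved frame pointers until the entry stack pointer is reached. Below is a minimal, hypothetical Rust sketch — not Wasmtime's actual implementation — assuming an x86-64/aarch64-style frame layout with [saved caller FP][return address] at the address FP points to (exactly the layout the s390x back end does not currently maintain):

    /// One recovered frame: program counter plus the frame pointer it was
    /// found through. Hypothetical type, for illustration only.
    struct Frame {
        pc: usize,
        fp: usize,
    }

    /// Walk Wasm frames from the most recent Wasm FP until we hit the entry
    /// stack pointer, i.e. the point where the next older frame is host code.
    ///
    /// Assumes every Wasm frame preserves FP and stores
    /// [saved caller FP][return address] at the address FP points to.
    unsafe fn walk_wasm_frames(mut fp: usize, entry_sp: usize, mut visit: impl FnMut(Frame)) {
        while fp != entry_sp {
            // The return address sits one word above the saved frame pointer.
            let pc = unsafe { *((fp + core::mem::size_of::<usize>()) as *const usize) };
            visit(Frame { pc, fp });
            // Follow the chain to the next older frame.
            fp = unsafe { *(fp as *const usize) };
        }
    }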

view this post on Zulip Wasmtime GitHub notifications bot (Jul 13 2022 at 21:43):

fitzgen commented on issue #4431:

Thanks for responding @uweigand.

Actually, the Cranelift s390x back end does not maintain frame pointers at all in Wasm code - frame pointers are really only necessary in our ABI for frames that use dynamic stack allocation, and that never happens in Wasm, so right now we don't ever use frame pointers ... I guess it would be possible to add (the equivalent of) frame pointers, but that would incur a runtime overhead for all Wasm code.

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

So I guess our options are to

  1. preserve frame pointers in s390x (or equivalent; unsure what you meant by "or equivalent of" since I'm not very familiar with s390x nor its calling conventions),
  2. continue to depend on the backtrace crate and the system unwinder on s390x, or
  3. write a custom stack walker for s390x that is not based on frame pointers (and is instead based on... DWARF/eh_frame? something else?)

I think (1) would be simplest, although it will incur some slight overhead to Wasm execution, but we pay this overhead on other architectures and it is generally considered acceptable. We could definitely make (2) work, but it wouldn't be as tidy as I'd like, since we would need to add more cfgs in more places than just choosing which bits of global assembly or frame pointer recovery code we'd like to compile with. Option (3) is the most nebulous to me, and if we are talking about DWARF/eh_frame then it is probably the most complicated as well. I'd like to avoid this option unless you have other implementation ideas to cut down on complexity here.

So ultimately, I'm personally leaning towards (1) but could be convinced of (2) and I'd prefer not to do (3) unless I'm missing something.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 13 2022 at 22:17):

fitzgen edited a comment on issue #4431:

TODO

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 11:13):

uweigand commented on issue #4431:

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

To be honest, I never understood that either. In my mind, a _frame pointer_, i.e. a reserved register separate from the stack pointer, which is used by generated code to access local variables, spill slots, and other objects in a function's stack frame, is only ever necessary if those objects cannot be accessed relative to the stack pointer instead. This is normally only the case if the function performs dynamic stack space allocation (the C alloca or some equivalent), because then you no longer know at compile time what the offset of your stack frame objects relative to the stack pointer is.

Given that there is no dynamic stack allocation in Cranelift, I do not think maintaining an actual frame pointer register is ever needed for any architecture, so I do not understand why the Intel and Arm back-ends are currently doing so. I would expect that as optimization of Cranelift proceeds, we would want to re-examine that choice - I believe unnecessarily maintaining a frame pointer does cost a noticeable performance overhead. Note that this overhead is not (or not only) due to the extra saves and restores in prolog and epilog, but primarily due to the fact that we've blocked one whole register that we could have used freely for register allocation otherwise.

I do understand that, once we do maintain a frame pointer register, there are ancillary benefits for stack walking, since the frame pointer must be callee-saved, and the saved copies of that register on the stack happen to form a nice chain that can be easily walked to perform a simple backtrace. However, as soon as we enforce maintaining a frame pointer just for that purpose, this is not actually "free", but does incur the runtime overhead described above, which would otherwise be unnecessary. I think there may be alternative solutions for the debugging problem without that overhead ...

So I guess our options are to

1. preserve frame pointers in s390x (or equivalent; unsure what you meant by "or equivalent of" since I'm not very familiar with s390x nor its calling conventions),

2. continue to depend on the `backtrace` crate and the system unwinder on s390x, or

3. write a custom stack walker for s390x that is not based on frame pointers (and is instead based on... DWARF/eh_frame? something else?)

I think (1) would be simplest, although it will incur some slight overhead to Wasm execution, but we pay this overhead on other architectures and it is generally considered acceptable. We could definitely make (2) work, but it wouldn't be as tidy as I'd like, since we would need to add more cfgs in more places than just choosing which bits of global assembly or frame pointer recovery code we'd like to compile with. Option (3) is the most nebulous to me, and if we are talking about DWARF/eh_frame then it is probably the most complicated as well. I'd like to avoid this option unless you have other implementation ideas to cut down on complexity here.

So ultimately, I'm personally leaning towards (1) but could be convinced of (2) and I'd prefer not to do (3) unless I'm missing something.

As to the "equivalent" I mention under (1), sorry for being vague :-) What I meant is in the s390x ABI we actually have an optional feature to maintain an explicit _stack backchain_, which forms a linked list of frames on the stack, just like the saved copies of a frame pointer register would, except we're not _actually_ maintaining a frame pointer register. Basically, the backchain link is formed by prolog code storing a pointer to the caller's stack frame into the lowest word of the callee's stack frame - and that's it, there is no frame pointer register reserved throughout the function.

Now, we decided more than 20 years ago to even switch this feature off by default in the Linux toolchains, because with DWARF CFI it is no longer needed for debugging, and even that simple store has measurable overhead. (It is still used in special circumstances, e.g. in Linux kernel code.) But I could certainly enable that feature for Cranelift in the s390x back end.

That said, this would still leave us with the problem of frameless leaf functions, just like in the Arm case. I could also force any function that contains a trap to always allocate a frame anyway, just so we can store the backchain and the return address. But that would be even further run-time overhead ...

There may be a more elegant solution to the frameless leaf function problem, also for Arm. Once we know we are currently in a frameless leaf function, this is actually simple to handle: to unwind from a leaf function, you only need to notice that the caller's SP equals the current SP, and the caller's PC equals the value currently in the link register. (Assuming you have one; on Intel you'd get the return address from the lowest word of the stack instead.)

The only difficulty is to know whether we are in a frameless leaf function or not. But we should (be able to) know that - after all, we did compile that function ourselves earlier on :-) So, assuming we can associate the current PC with a data structure identifying the function (which I believe we must do anyway to get at the function name - which is the whole point of generating stack backtraces in the first place), the only thing we'd need to do is to have the compiler set a flag in that structure that this function is frameless.

In fact, spinning that thought experiment a bit further: if we'd store not just a "frameless" flag, but the actual _size_ of the stack frame, we should be able to use that information to unwind from non-leaf functions as well without requiring any explicit backchain or frame pointer: just compute the caller's SP as the callee's SP plus the callee's (known) stack frame size. Note that this makes explicit use of the fact that frame sizes in Cranelift are constants known at compile time, because there is no dynamic stack allocation. (Glossing over some details here; this assumes that the stack pointer doesn't change at all during function execution, at least not at the "relevant" places - traps and function call sites. This is true on s390x and -as far as I know- on Arm, but not on Intel. There'd be extra information required to handle Intel correctly.)
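
For illustration, a rough sketch of how such a frame-size side table could drive the unwind step (hypothetical types and names, assuming as described above that the SP does not move at call sites and trap sites):

    use std::collections::BTreeMap;

    /// Hypothetical compile-time metadata: each function's code range mapped
    /// to its fixed frame size (0 for a frameless leaf function).
    struct FrameSizeTable {
        /// start PC -> (end PC, frame size in bytes)
        ranges: BTreeMap<usize, (usize, usize)>,
    }

    impl FrameSizeTable {
        fn frame_size(&self, pc: usize) -> Option<usize> {
            let (_start, &(end, size)) = self.ranges.range(..=pc).next_back()?;
            (pc < end).then_some(size)
        }
    }

    /// Unwind one level: the caller's SP is the callee's SP plus the callee's
    /// statically known frame size. The caller's PC would then come from the
    /// link register (frameless leaf) or from a saved return-address slot;
    /// locating that slot is the per-architecture detail glossed over above.
    fn caller_sp(table: &FrameSizeTable, callee_pc: usize, callee_sp: usize) -> Option<usize> {
        Some(callee_sp + table.frame_size(callee_pc)?)
    }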

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 12:21):

akirilov-arm commented on issue #4431:

... I do not think maintaining an actual frame pointer register is _ever_ needed for _any_ architecture, so I do not understand why the Intel and Arm back-ends are currently doing so.

This is easy to answer in the AArch64 case - because the ABI (and its implementation on Linux in particular) pretty much requires it.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 12:43):

uweigand commented on issue #4431:

... I do not think maintaining an actual frame pointer register is _ever_ needed for _any_ architecture, so I do not understand why the Intel and Arm back-ends are currently doing so.

This is easy to answer in the AArch64 case - because the ABI (and its implementation on Linux in particular) pretty much requires it.

Huh, interesting. The ABI text still seems to leave it open to the "platform", up to and including the choice:

It may elect not to maintain a frame chain and to use the frame pointer register as a general-purpose callee-saved register.

Do you know why Linux specifically requires the frame pointer?

Anyway, I agree if that's the platform ABI standard, then we need to follow it in Cranelift.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 16:17):

alexcrichton commented on issue #4431:

Personally I see two routes for s390x w.r.t. this PR:

I don't think we can keep around the backtrace-based traces for maintenance reasons. I feel that the code necessary to keep that working is so different than the code for the frame-pointer-based approach that it's too much to take on for something that most of us aren't running locally to test.

Otherwise this PR will require some more work regardless to handle things like leaf frames on AArch64. I think it would be worth evaluating at that point whether the frame pointer approach makes sense. If some metadata is plumbed through for leaf frames it might be "easy enough" to simply plumb through stack frame sizes as well making frame pointers unnecessary. If it's not as easy as originally thought though the frame pointers may still be needed.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 16:27):

uweigand commented on issue #4431:

To be clear, I certainly agree it wouldn't make sense to keep the DWARF backtrace in place just for s390x. As I said, I'd be happy to implement the backchain based mechanism on s390x described above, that's easy enough. That would still leave the leaf function problem to be solved, just like on aarch64. Just wondering whether this is best approach overall ...

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 16:49):

fitzgen commented on issue #4431:

In fact, spinning that thought experiment a bit further: if we'd store not just a "frameless" flag, but the actual _size_ of the stack frame, we should be able to use that information to unwind from non-leaf functions as well without requiring any explicit backchain or frame pointer: just compute the caller's SP as the callee's SP plus the callee's (known) stack frame size. Note that this makes explicit use of the fact that frame sizes in Cranelift are constants known at compile time, because there is no dynamic stack allocation. (Glossing over some details here; this assumes that the stack pointer doesn't change at all during function execution, at least not at the "relevant" places - traps and function call sites. This is true on s390x and -as far as I know- on Arm, but not on Intel. There'd be extra information required to handle Intel correctly.)

When discussing how to speed up stack walking with Chris, I originally proposed a side table approach, and he pushed back on that (and I ultimately agreed with him) because frame pointers are so much simpler and easier to implement correctly. This is an important property for anything that is critical for safety[^0].

[^0]: We walk the stack to identify on-stack GC roots, for example, and failing to find a stack frame can lead to the collector reclaiming something it shouldn't, resulting in use-after-free bugs. In fact, we've had problems with system unwinders before where they fail to fully walk the stack, and if we fully trusted the system unwinder we would have exactly these use-after-free bugs! This was the motivation for the "stack canary" infrastructure that Wasmtime has in main now (and which is obsoleted by this PR).

Frame pointers are a local property of compiling a single function, one that Cranelift can make sure is preserved, and they are hard to get wrong.

Side tables on the other hand are more of a global property of all of the module's compilation and we also have to make sure that the runtime is interpreting and doing lookup in the table correctly. It isn't hard to have bugs with this kind of global integration between two far away components, as we've seen with multiple stack map-related CVEs.

This is why I've been pursuing frame pointers rather than side table metadata, even though I am aware that frame pointers are not strictly necessary (outside of ABI constraints) and impose some overhead (which we already happen to be paying on our most-used backends).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 16:55):

fitzgen commented on issue #4431:

My current hope for leaf functions is that we won't need a side table for identifying whether we are in a leaf function at the start of the backtrace, and instead there is an easy way to identify whether a leaf function has trapping instructions and maintain frame pointers in that case (as discussed at the last Cranelift meeting).

If that doesn't work out tho, I may need to maintain an "is this PC in a leaf function?" side table, and at that point it may be worth re-evaluating whether frame pointers are the right approach, or if we can use a side table of static frame sizes, as you described @uweigand.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 18:04):

uweigand commented on issue #4431:

If that doesn't work out tho, I may need to maintain an "is this PC in a leaf function?" side table

As an alternate approach for this part, I think this question will only ever get asked for PCs that are trap locations, and we already have a side table for trap locations (holding the trap code), so this could just be an extra bit in there.
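
For illustration, a hedged sketch of what that extra bit could look like (hypothetical field names; the real trap side table in Cranelift/Wasmtime is shaped differently):

    /// Hypothetical entry in the existing trap-location side table, extended
    /// with a flag telling the unwinder that the trapping PC lies in a
    /// frameless leaf function, so the return address must be taken from the
    /// link register rather than from a saved slot on the stack.
    struct TrapEntry {
        /// Offset of the trapping instruction within the module's code.
        code_offset: u32,
        /// The trap code already recorded today.
        trap_code: u8,
        /// New bit proposed here: is this PC inside a frameless leaf function?
        in_frameless_leaf: bool,
    }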

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 18:22):

fitzgen edited a comment on issue #4431:

TODO

view this post on Zulip Wasmtime GitHub notifications bot (Jul 14 2022 at 19:39):

bjorn3 commented on issue #4431:

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

To be honest, I never understood that either. In my mind, a _frame pointer_, i.e. a reserved register separate from the stack pointer, which is used by generated code to access local variables, spill slots, and other objects in a function's stack frame, is only ever necessary if those objects cannot be accessed relative to the stack pointer instead. This is normally only the case if the function performs dynamic stack space allocation (the C alloca or some equivalent), because then you no longer know at compile time what the offset of your stack frame objects relative to the stack pointer is.

Frame pointers are often used so that sampling profilers can do lightning-fast stack unwinding, which is necessary for accurate profiling. For example, on macOS frame pointers are mandated for Instruments to work, I believe. They don't even support DWARF unwinding AFAIK.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2022 at 18:18):

akirilov-arm commented on issue #4431:

@uweigand

Do you know why Linux specifically requires the frame pointer?

No, I don't. TBH I think that all major AArch64 platforms have the same requirement, but I am most familiar with Linux.

My guess is that this choice makes it possible to have reliable backtraces in the absence of DWARF CFI, which is presumably worth having (e.g. consider the sampling profiler use case mentioned above) in light of the fact that the 64-bit Arm architecture is not as register-starved as others, e.g. even Cranelift, which imposes further limitations on the usable general-purpose registers on top of the base ABI, has access to 25 of them without restrictions (other than some of them being callee-saved).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 15 2022 at 19:17):

uweigand commented on issue #4431:

@uweigand

Do you know why Linux specifically requires the frame pointer?

No, I don't. TBH I think that all major AArch64 platforms have the same requirement, but I am most familiar with Linux.

My guess is that this choice makes it possible to have reliable backtraces in the absence of DWARF CFI, which is presumably worth having (e.g. consider the sampling profiler use case mentioned above) in light of the fact that the 64-bit Arm architecture is not as register-starved as others, e.g. even Cranelift, which imposes further limitations on the usable general-purpose registers on top of the base ABI, has access to 25 of them without restrictions (other than some of them being callee-saved).

Right, I guess that makes sense. With 32 general-purpose registers, it's much less of a problem. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2022 at 04:16):

cfallin commented on issue #4431:

Regarding explicit stack pointers: @uweigand you're correct that one can gain additional performance by omitting them, at the cost of unwind performance and some additional complexity. And indeed, as @fitzgen noted above, I expressed a preference to keep explicit stack pointers previously (and I still feel that way). I want to elaborate a bit more on the benefits I see to this:

So, given all that, I'd prefer to converge to "always use FP" on all platforms. Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2022 at 04:16):

cfallin edited a comment on issue #4431:

Regarding explicit frame pointers: @uweigand you're correct that one can gain additional performance by omitting them, at the cost of unwind performance and some additional complexity. And indeed, as @fitzgen noted above, I expressed a preference to keep explicit frame pointers previously (and I still feel that way). I want to elaborate a bit more on the benefits I see to this:

So, given all that, I'd prefer to converge to "always use FP" on all platforms. Thoughts?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 18 2022 at 13:02):

uweigand commented on issue #4431:

Fair enough. I agree stack walking must be correct - that's the same in the DWARF CFI case, which is used e.g. for C++ exception handling, which also must always be correct. If we cannot or do not want to use CFI for performance reasons, then an explicit linked chain of stack frames is an alternative that is easy to be shown correct.

As I said, I'd be happy to implement the s390x "-mbackchain" feature, which is equivalent to what you get with -fno-omit-frame-pointer on Intel w.r.t. your considerations above. I can do that as part of this PR (once it's progressed enough), or as a stand-alone change (but I'd prefer to wait until the SIMD patch is merged to avoid merge conflicts in the ABI code).

However, there still remains the frameless leaf function problem to be solved - this is equivalent to the problem on AArch64, and can be solved using the same method we use there (either reliably detecting such functions, or else never generating them in the first place).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2022 at 22:06):

fitzgen commented on issue #4431:

As I said, I'd be happy to implement the s390x "-mbackchain" feature, which is equivalent to what you get with -fno-omit-frame-pointer on Intel w.r.t. your considerations above. I can do that as part of this PR (once it's progressed enough), or as a stand-alone change (but I'd prefer to wait until the SIMD patch is merged to avoid merge conflicts in the ABI code).

Thanks, much appreciated. I think there are two s390x TODO items here:

More than happy to wait until your SIMD patch is merged.

As already mentioned, I think we can merge the new preserve-frame-pointers flag PR as soon as it gets review approval and CI is green, since it doesn't break anything by itself yet.

But I'd really prefer to not land this PR -- since doing so would break s390x support in Wasmtime -- until we've resolved both the frame pointers item and the trampolines item. Do you think this is something you can help out with within the next couple of weeks? I'd prefer not to keep this PR open indefinitely, so if you are swamped and don't have the time, then we could consider temporarily disabling s390x in CI until we can get it working again, but I would personally consider this an unfortunate outcome.

But yeah, I consider this PR to be pretty stable and generally ready for review now, so I don't think it makes sense to wait on it to be more fleshed out anymore.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2022 at 22:23):

github-actions[bot] commented on issue #4431:

Label Messager: wasmtime:config

It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:

[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html


<details>

To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.

To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.

Learn more.

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 19 2022 at 22:27):

fitzgen edited a comment on issue #4431:

Label Messager: wasmtime:config

It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:

[fuzzing-config]: https://github.com/bytecodealliance/wasmtime/blob/ca0e8d0a1d8cefc0496dba2f77a670571d8fdcab/crates/fuzzing/src/generators.rs#L182-L194
[fuzzing-docs]: https://docs.wasmtime.dev/contributing-fuzzing.html


<details>

To modify this label's message, edit the <code>.github/label-messager/wasmtime-config.md</code> file.

To add new label messages or remove existing label messages, edit the
<code>.github/label-messager.json</code> configuration file.

Learn more.

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 12:23):

uweigand commented on issue #4431:

Implement frame pointer preservation. I've split out https://github.com/bytecodealliance/wasmtime/pull/4469 to add a new flag for Cranelift that controls whether a backend should preserve frame pointers (or equivalent). I think that PR can land without s390x support, since all tests will still pass, and then we can add s390x support in a follow-up PR and add the equivalent filetests for s390x that I added for aarch64 in that PR.

I've already replied on that PR.

Create s390x implementations of the host-to-Wasm and Wasm-to-host trampolines in crates/runtime/src/trampolines/s390x.rs. I don't think I have enough s390x expertise to implement this myself, but I can definitely help clear up any questions around what the trampolines should do, and why they have the constraints that they do.

Sure, I can implement the trampolines. The one concern I have is that the implementation seems to require global_asm!, and this is not yet supported in Rust for s390x - at least not yet stable.

More than happy to wait until your SIMD patch is merged.

That's already happened now.

But yeah, I consider this PR to be pretty stable and generally ready for review now, so I don't think it makes sense to wait on it to be more fleshed out anymore.

I still see this PR failing the CI tests (not just on s390x, but also elsewhere). Does it still make sense for me to try to use the current patch as basis for working on s390x support?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 14:51):

alexcrichton commented on issue #4431:

Sure, I can implement the trampolines. The one concern I have is that the implementation seems to require global_asm!, and this is not yet supported in Rust for s390x - at least not yet stable.

I think this can use the same trick as the wasmtime-fiber crate where x86_64/aarch64 use global_asm! and s390x uses an external assembly file built with the cc crate.
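
For illustration, a minimal sketch of that trick in a build script, assuming the assembly lives at src/trampolines/s390x.S (the path that later shows up in the assembler warnings below); the library name is illustrative:

    // build.rs (sketch): x86_64/aarch64 trampolines come from global_asm! in
    // Rust source, so only s390x needs an out-of-line assembly file compiled
    // with the cc crate.
    fn main() {
        let target = std::env::var("TARGET").unwrap();
        if target.starts_with("s390x") {
            println!("cargo:rerun-if-changed=src/trampolines/s390x.S");
            cc::Build::new()
                .file("src/trampolines/s390x.S")
                .compile("wasmtime-trampolines-s390x");
        }
    }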

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 14:58):

uweigand commented on issue #4431:

Sure, I can implement the trampolines. The one concern I have is that the implementation seems to require global_asm!, and this is not yet supported in Rust for s390x - at least not yet stable.

I think this can use the same trick as the wasmtime-fiber crate where x86_64/aarch64 use global_asm! and s390x uses an external assembly file built with the cc crate.

How would this work with the wasm_to_libcall_trampoline macro, which has the function name as a parameter?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 15:07):

fitzgen commented on issue #4431:

I still see this PR failing the CI tests (not just on s390x, but also elsewhere). Does it still make sense for me to try to use the current patch as basis for working on s390x support?

I think so. There are some little macos and windows specific bits that I need to add, but I don't expect the overall architecture of the patch to change from here on out.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 15:30):

alexcrichton commented on issue #4431:

For s390x libcalls I think everything would need to be manually listed in the inline assembly file. I presume though that a preprocessor macro could be used to have something like LIBCALL_TRAMPOLINE(memory32_grow) which expands to everything necessary. It ideally would be pretty simple to maintain the list for others too

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 18:49):

uweigand commented on issue #4431:

@fitzgen I've had to disable these asserts:

        assert!(
            (last_wasm_exit_fp != 0 && last_wasm_exit_pc != 0)
                || (last_wasm_exit_fp == 0 && last_wasm_exit_pc == 0)
        );

and

            assert!(
                (state.old_last_wasm_exit_fp != 0 && state.old_last_wasm_exit_pc != 0)
                    || (state.old_last_wasm_exit_fp == 0 && state.old_last_wasm_exit_pc == 0)
            );

These can falsely trigger in the case where a host-to-wasm transition is immediately followed by a wasm-to-host transition. In this case, last_wasm_exit_pc will point to the host code that started the first of these transitions (so it is not 0), and last_wasm_exit_fp will contain the FP value of that host code. Since host code does not maintain FP, the contents are random, and may be 0. This may happen only rarely on Intel, but with the backchain implementation on s390x, this seems to happen frequently.

With this change, I've got an initial implementation that passes all CI tests on s390x. I'll attach a patch here; feel free to include it in this PR to make it pass CI.

diff-pr4431-s390x.txt

view this post on Zulip Wasmtime GitHub notifications bot (Jul 20 2022 at 19:32):

uweigand commented on issue #4431:

With this change, I've got an initial implementation that passes all CI tests on s390x.

Just to be clear, passing CI also requires https://github.com/bytecodealliance/wasmtime/pull/4477 to be merged.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 18:17):

uweigand commented on issue #4431:

@fitzgen, I see this error:

warning: src/trampolines/s390x.S: Assembler messages:
warning: src/trampolines/s390x.S:5: Error: invalid machine `z14'

Looks like the assembler on the build machine is a bit old. You can try replacing .machine z14 with .machine z13 in the s390x.S file; that should still be sufficient. (z13 is supported in gas since 2015, so that should be good enough ...)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 25 2022 at 23:06):

fitzgen edited a comment on issue #4431:

TODO

view this post on Zulip Wasmtime GitHub notifications bot (Jul 26 2022 at 17:56):

fitzgen edited a comment on issue #4431:

As I said, I'd be happy to implement the s390x "-mbackchain" feature, which is equivalent to what you get with -fno-omit-frame-pointer on Intel w.r.t. your considerations above. I can do that as part of this PR (once it's progressed enough), or as a stand-alone change (but I'd prefer to wait until the SIMD patch is merged to avoid merge conflicts in the ABI code).

Thanks, much appreciated. I think there are two s390x TODO items here:

More than happy to wait until your SIMD patch is merged.

As already mentioned, I think we can merge the new preserve-frame-pointers flag PR as soon as it gets review approval and CI is green, since it doesn't break anything by itself yet.

But I'd really prefer to not land this PR -- since doing so would break s390x support in Wasmtime -- until we've resolved both the frame pointers item and the trampolines item. Do you think this is something you can help out with within the next couple of weeks? I'd prefer not to keep this PR open indefinitely, so if you are swamped and don't have the time, then we could consider temporarily disabling s390x in CI until we can get it working again, but I would personally consider this an unfortunate outcome.

But yeah, I consider this PR to be pretty stable and generally ready for review now, so I don't think it makes sense to wait on it to be more fleshed out anymore.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 26 2022 at 20:47):

fitzgen commented on issue #4431:

Okay! I've got CI green! I think this is ready for another round of review!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 08:39):

bjorn3 commented on issue #4431:

Are native frames still listed in backtraces? Are they shown with a marker like <native frames>, or will something like wasm a -> native b -> wasm c look like a -> c in the backtrace?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 15:41):

fitzgen commented on issue #4431:

Are native frames still listed in backtraces? Are they shown with a marker like <native frames>, or will something like wasm a -> native b -> wasm c look like a -> c in the backtrace?

We don't list native frames in backtraces right now, we filter to only the Wasm frames, so this doesn't change that behavior. If, in the future, we wanted to insert <host frames> in between sequences of contiguous Wasm frames in the backtrace's display, we could do that in a follow up.

https://github.com/bytecodealliance/wasmtime/blob/285bc5ce24109e41650594c2746087ae89f0953a/crates/wasmtime/src/trap.rs#L355

https://github.com/bytecodealliance/wasmtime/blob/285bc5ce24109e41650594c2746087ae89f0953a/crates/wasmtime/src/trap.rs#L315

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 18:24):

alexcrichton commented on issue #4431:

I've double-checked locally again and at least on an arm64 machine this is a significant regression in the overheads of calling in/out of wasm:

sync/no-hook/host-to-wasm - typed - nop
                        time:   [39.549 ns 39.608 ns 39.660 ns]
                        change: [+51.084% +51.260% +51.444%] (p = 0.00 < 0.05)
                        Performance has regressed.
sync/no-hook/host-to-wasm - typed - nop-params-and-results
                        time:   [42.601 ns 42.680 ns 42.765 ns]
                        change: [+39.134% +39.382% +39.673%] (p = 0.00 < 0.05)
                        Performance has regressed.
sync/no-hook/wasm-to-host - nop - typed
                        time:   [9.0086 ns 9.0131 ns 9.0181 ns]
                        change: [+50.049% +50.098% +50.154%] (p = 0.00 < 0.05)
                        Performance has regressed.
sync/no-hook/wasm-to-host - nop-params-and-results - typed
                        time:   [12.741 ns 12.771 ns 12.805 ns]
                        change: [+41.357% +41.672% +42.013%] (p = 0.00 < 0.05)
                        Performance has regressed.

I'm not going to say this is a showstopper but at the same time I don't think we've worked to quantify where all this extra time is going.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 18:30):

cfallin commented on issue #4431:

I just checked out this branch on my aarch64-apple-darwin system and confirmed that ./ci/run-tests.sh passes all tests, as of commit 10b1f2e8a36ab544c7431d9dbd9d66d0120b87a8.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 19:18):

fitzgen commented on issue #4431:

I've double-checked locally again and at least on an arm64 machine this is a significant regression in the overheads of calling in/out of wasm:

This seems significant in terms of % but in terms of actual time it seems pretty plausible/okay?

For example

sync/no-hook/wasm-to-host - nop - typed
                        time:   [9.0086 ns 9.0131 ns 9.0181 ns]
                        change: [+50.049% +50.098% +50.154%] (p = 0.00 < 0.05)

This is 50% overhead, but that's like an extra ~3 nanoseconds compared to main, which makes sense since we are doing a couple of memory stores that we didn't do before.

The others are similar, although the host-to-wasm variants might have a little more room for improvement since it seems like they got ~10 nanoseconds slower; I guess they are also doing a cmp and a cmov rather than just an unconditional store?

But ultimately, I think this just "looks" worse than it really is due to the combination of micro micro micro benchmarks and percentage display. Like, I think we can eat a performance regression on the order of ten nanoseconds for calls, since actual calls will not be nops and doing just about literally anything will make that new overhead disappear in the noise.

But let me know if you disagree, or if I am misreading these benchmark numbers.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 19:57):

alexcrichton commented on issue #4431:

I don't disagree that a handful of ns here and there is unlikely to make or break the bank, but at the same time I do think it's still worth putting some time in to quantify what the runtime impact of this change is. Some quick benchmarking locally shows that Instance::runtime_limits is not inlined across crates, for example, but provides a small perf boost if it is (and it wasn't necessary before this commit).

Additionally, logic added in the latest commit greatly inflates the size of the return value, which may not be easily optimizable (moving large-ish types around in Rust) and might be better modeled with a Drop for SomeCustomStruct to handle the unwind cleanup rather than deferring the resume_unwind.

I mostly just want to make sure that the slowdowns all come from expected sources. The extra indirect calls are expected to be slower along with the extra memory traffic of maintaining pc/fp along with all the old pc/fp/etc. Additionally the stack frame for catch_trap is bigger with a bigger CallThreadState so the extra memory traffic in that function is also expected.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 20:09):

akirilov-arm commented on issue #4431:

@alexcrichton @fitzgen This is about the trampoline code, right? If yes, there are a couple of micro-optimizations that could be applied to it, i.e. loading and storing pairs of registers instead of individual ones (which is also going to reduce the instruction count) and some scheduling, but I didn't bother commenting because it seemed insignificant.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 20:12):

alexcrichton commented on issue #4431:

Indeed yeah, the benchmarks I listed above are measuring the overhead it takes getting from Rust into wasm and the other way around as well. Those are affected by trampolines and additionally all the other infrastructure that we have for catching traps/etc. They're definitely micro-benchmarks but it's the bound of speed in how fast any embedder can call into wasm.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 20:23):

fitzgen commented on issue #4431:

@akirilov-arm would love to have your suggested micro optimizations in a follow up PR!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 27 2022 at 21:34):

fitzgen commented on issue #4431:

I've tried tweaking a number of little bits to help improve the call benchmarks, but most changes are within the noise and changes that individually seem like a win for one benchmark are becoming losses across the board when put together.

Things I've tried:

At this point I don't think there are any easy wins due to accidental lack of inlining or growing the sizes of function returns. For posterity, a branch with these changes is over here: https://github.com/fitzgen/wasmtime/tree/fast-stack-walking-with-attempts-to-improve-call-perf

In the meantime, I'm going to go ahead and merge this PR, and then we can work on making sure that the regressions to the call benchmarks are exactly the overhead of the new trampolines, and not anything extra/accidental included as well in follow up investigations.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 29 2022 at 14:16):

alexcrichton commented on issue #4431:

🎉

