Stream: git-wasmtime

Topic: wasmtime / issue #1105 Add alloca support


Wasmtime GitHub notifications bot (May 04 2022 at 20:56):

cfallin labeled issue #1105:

This is necessary to implement the unsized_locals Rust feature.

cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/15

Wasmtime GitHub notifications bot (May 25 2023 at 10:18):

bjorn3 commented on issue #1105:

As per https://github.com/rust-lang/compiler-team/issues/630 unsized locals will be removed from the compiler. No need to implement them in cg_clif any longer. It would still be necessary for implementing a C compiler based on Cranelift though.

Wasmtime GitHub notifications bot (May 25 2023 at 10:39):

jyn514 commented on issue #1105:

I no longer maintain rcc and don't have time to work on this.

Wasmtime GitHub notifications bot (May 25 2023 at 10:39):

jyn514 edited a comment on issue #1105:

I no longer maintain saltwater-cc and don't have time to work on this.

Wasmtime GitHub notifications bot (May 25 2023 at 17:25):

jameysharp commented on issue #1105:

Okay, I guess let's close this issue. If somebody wants this feature in the future, feel free to re-open this issue then.

Wasmtime GitHub notifications bot (May 25 2023 at 17:25):

jameysharp closed issue #1105.

Wasmtime GitHub notifications bot (Feb 08 2024 at 10:14):

bryal commented on issue #1105:

I want to implement something similar to Swift's approach to unboxed polymorphism using Value Witness Tables.[1] alloca is needed to be able to put generic intermediate values on the stack, or we'll have to make a heap allocation for the out-parameter of every other function call. Of course there are optimizations that can alleviate the issue even without alloca, but it would be much more performant and convenient to just have alloca in the first place.

[1] Implementing Swift Generics @ 2017 LLVM Developer's Meeting

Wasmtime GitHub notifications bot (Feb 08 2024 at 21:16):

jameysharp reopened issue #1105.

Wasmtime GitHub notifications bot (Feb 08 2024 at 21:16):

jameysharp commented on issue #1105:

That use-case makes sense to me, @bryal.

I gather your interest is related to https://git.sr.ht/~jojo/kapreolo/commit/93672f5, right? I like your current workaround of allocating a fixed-size stack slot and falling back to a heap allocation if you need more space, but we can definitely discuss how alloca could work in Cranelift.
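The workaround mentioned above (a fixed-size stack slot with a heap fallback) can be sketched roughly as follows. This is an illustrative Rust analogue, not code from kapreolo or Cranelift: the name `Temp`, its methods, and the 64-byte `INLINE_CAP` threshold are all assumptions made for the example.

```rust
/// Inline capacity in bytes. The value 64 is an arbitrary assumption.
const INLINE_CAP: usize = 64;

/// Scratch space for a value whose size is only known at run time.
/// (Hypothetical type, for illustration only.)
enum Temp {
    /// The value fits in a fixed-size buffer that lives in the stack frame.
    Inline { buf: [u8; INLINE_CAP], len: usize },
    /// The value is too large: fall back to a heap allocation,
    /// which is freed when the `Temp` is dropped.
    Heap(Vec<u8>),
}

impl Temp {
    /// Pick the inline buffer when `len` fits, the heap otherwise.
    fn new(len: usize) -> Temp {
        if len <= INLINE_CAP {
            Temp::Inline { buf: [0; INLINE_CAP], len }
        } else {
            Temp::Heap(vec![0; len])
        }
    }

    /// Mutable view of exactly `len` bytes, wherever they live.
    fn bytes_mut(&mut self) -> &mut [u8] {
        match self {
            Temp::Inline { buf, len } => &mut buf[..*len],
            Temp::Heap(v) => v.as_mut_slice(),
        }
    }

    /// True when no heap allocation was needed.
    fn is_inline(&self) -> bool {
        matches!(self, Temp::Inline { .. })
    }
}
```

With this shape, small temporaries cost no allocation at all, and only values larger than the threshold pay for a heap round trip.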

The suggestions that folks made several years ago have some associated costs, both in runtime when accessing stack slots, and in maintenance time. We'll just have to consider those costs carefully in this discussion.

Wasmtime GitHub notifications bot (Feb 09 2024 at 09:21):

bryal commented on issue #1105:

Exactly, @jameysharp, that's the one.

I'm not intimately familiar with any concrete ISAs. Before Cranelift, my only experience with code at this level was using LLVM. I've had to consider calling conventions, but not much more than that. That is to say, I'm not sure I have much to contribute in discussion of how alloca should work here.

That being said, if we manage to come up with a clear plan, I'd be happy to help out with the manual labour.

Wasmtime GitHub notifications bot (Feb 09 2024 at 09:21):

bryal edited a comment on issue #1105:

Exactly, @jameysharp, that's the one.

I'm not intimately familiar with any concrete ISAs. Before Cranelift, my only experience with code at this level was using LLVM. I've had to consider calling conventions, but not much more than that. That is to say, I'm not sure I have much to contribute to the discussion of how alloca should work here.

That being said, if we manage to come up with a clear plan, I'd be happy to help out with the manual labour.

Wasmtime GitHub notifications bot (Feb 12 2024 at 22:00):

jameysharp commented on issue #1105:

I talked with several of the other people working on Cranelift (@cfallin, @fitzgen, @elliottt, and @lpereira) about this today and there is quite a bit to say about it, which I will try to organize here. If I misrepresent any of their positions I hope they will speak up.

First off, we would welcome a PR demonstrating how this could work! But at least among the people I talked with, working on alloca support is unlikely to be a priority for the moment. We think it's surprisingly complicated to support in conjunction with Cranelift's other goals, and the complexity is difficult for us to justify committing to without a more substantial use case. We have suggestions for things you could try instead though.

There are several reasons why a fixed-size stack frame is much easier to deal with. One is that Windows requires stack-probing in the function prologue, and while I assume there's a way to make that work with alloca, the specifics are a research question that we'd need somebody to answer.

A larger reason is that accessing stack slots for spilled registers needs to be as cheap as possible. Currently there are two registers we can add fixed offsets to in order to find any stack slot in the current frame: each target has a frame pointer and a stack pointer. On x86-64 we could choose to use the frame pointer to access everything, which would allow alloca to move the stack pointer without doing any harm. On aarch64, however, there's a cost to using negative offsets, and for frames larger than something like 128 bytes, accessing stack slots would need an extra instruction. That said, the ARM ABI doesn't require the frame layout we use now ("The location of the frame record within a stack frame is not specified"), so a step toward making this work could be to change our frame layout on aarch64 so that all stack slots are at positive offsets from the frame pointer. We're not certain of other consequences of that change, though.

You already have a workaround where you fall back to malloc for large allocations, which is a good approach. In a comment you noted that you're "not sure we can [free] this manually. The temporary may be passed as an arg in a tail call." I'd note that alloca wouldn't work in that case either, because the allocated space would be part of the caller's frame, which is overwritten by the tail call. Anywhere that you can use alloca, you can also safely use malloc/free, at some performance cost.

To avoid the performance cost of heap allocations, one suggestion that we came up with is that you could allocate a separate stack that is under your compiler's control, rather than trying to share it with the stack used for calling conventions and register allocation. This kind of "shadow stack" is a common solution when data lifetimes are tied to function call scopes. Allocation and deallocation are constant-time in the common and amortized cases, just like alloca, but you don't need magic code-generator support. I think this is your best bet.
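As a rough illustration of the shadow-stack idea above, here is a minimal Rust sketch. It is not taken from Cranelift or from anyone's compiler in this thread: `ShadowStack`, `Mark`, and every method name are hypothetical. It hands out byte offsets rather than raw pointers, which is an assumption chosen so that growing the backing buffer cannot invalidate earlier allocations.

```rust
/// A bump-allocated region kept separate from the native call stack.
/// (Hypothetical type, for illustration only.)
struct ShadowStack {
    mem: Vec<u8>,
    top: usize,
}

/// A saved stack top, restored when the allocating function returns.
#[derive(Clone, Copy)]
struct Mark(usize);

impl ShadowStack {
    fn with_capacity(cap: usize) -> Self {
        ShadowStack { mem: vec![0; cap], top: 0 }
    }

    /// Record the current top; intended to be called on function entry.
    fn mark(&self) -> Mark {
        Mark(self.top)
    }

    /// Reserve `size` bytes aligned to `align` (a power of two) and
    /// return the offset of the reservation. Constant time unless the
    /// backing buffer has to grow, which is amortized by doubling.
    fn alloc(&mut self, size: usize, align: usize) -> usize {
        assert!(align.is_power_of_two());
        let start = (self.top + align - 1) & !(align - 1);
        let end = start + size;
        if end > self.mem.len() {
            self.mem.resize(end.max(self.mem.len() * 2), 0);
        }
        self.top = end;
        start
    }

    /// Free everything allocated since `mark` in constant time;
    /// intended to be called on function exit.
    fn release(&mut self, mark: Mark) {
        debug_assert!(mark.0 <= self.top);
        self.top = mark.0;
    }

    /// Mutable view of a previously reserved range.
    fn slice_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.mem[offset..offset + size]
    }
}
```

A generated function would call `mark` on entry, `alloc` for each dynamically sized temporary, and `release` on exit. Where exactly `release` belongs when the function ends in a tail call is left open here, since that depends on whether the callee still reads the temporaries.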

I'm going to go ahead and close this issue again to reflect that this is not currently planned, but we still welcome further discussion.

Wasmtime GitHub notifications bot (Feb 12 2024 at 22:02):

jameysharp closed issue #1105.

Wasmtime GitHub notifications bot (Feb 13 2024 at 10:18):

bryal commented on issue #1105:

Thank you @jameysharp for your work and thank you all for your input! Indeed, I had not yet stopped to consider how my stack temporaries would play with tail calls. As I intend to employ optimized tail calls extensively in my generated code, you're of course right that this approach will not work as I had planned. I assume Swift does not (or, in 2017, did not) guarantee TCO in the same way, which would explain why alloca works for them.

I'll see about using a shadow stack instead -- thanks for the suggestion! Now I need to think a bit about indirectly stored temporaries, tail recursion, and memory leaks.

For my part, there's no need to implement this anymore.


Last updated: Oct 23 2024 at 20:03 UTC