wasmtime / issue #4565 Missing ABI features required for ... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #4565 Missing ABI features required for ...

Wasmtime GitHub notifications bot (Jul 31 2022 at 14:05):

uweigand opened issue #4565:

In implementing support for struct-args and i128 to enable rustc_codegen_cranelift on s390x, I ran into two features that are not currently available with the common abi_impl implementation.

Struct arguments using an explicit buffer address

Many commonly used ABIs have special provisions for (large) struct arguments that cannot be passed in registers. However, there are two typical flavors of these, and abi_impl only supports one of them.

Both these flavors have in common that the actual argument is copied by the caller into a buffer on the stack; the callee will then read the argument from that buffer, and is even free to modify it. However, the two flavors differ in how the address of that buffer is _communicated_ from the caller to the callee:

Implicit buffer address. With this flavor, that ABI implicitly defines the location of the buffer, usually by making the buffer part of the same area used for stack slots for overflow scalar arguments. Both caller and callee can compute the location of the buffer at compile time simply based on the function call signature; the address is not materialized as part of the run-time call sequence.

Explicit buffer address. Here, the ABI does not define the location of the buffer itself, but leaves the compiler free to place it anywhere. The actual location of the buffer is passed at run-time from caller to callee, typically in a register (or overflow stack slot) as if we had a pointer at this place in the function call signature.

There are pros and cons to either approach, leading to different choices of method across platform ABIs. In particular, x86_64 nearly always uses the implicit buffer address method, with the only exception of structures whose size is not a compile-time constant. Many other platforms, including s390x, use the the explicit buffer address method always, which allows the opportunity to the compiler to avoid actually copying the structure, e.g. if the original copy can be shown to be unused after the call so it wouldn't matter if the callee clobbers its contents.

However, the current cranelift abi_impl _only_ supports the implicit buffer address method. This should mostly be OK for x64_64 - I'm not sure if variable-sized structures passed by value even exist with the rustc frontend. However, even for aarch64 I believe the current implementation is already incorrect - as far as I can see, GCC and LLVM use the explicit buffer address method on this architecture. This is definitely also the case on s390x.

This should be straightforward to fix, and I already have an implementation. The main change is to the ABIArg data type, which currently holds either a vector of ABISlots (registers or overflow stack slots) or a StructArg identifying a buffer location on the stack. I've changed this to optionally provide both these items - where the semantics of providing both means that the ABISlot identifies the register (or overflow stack slot) needed to hold the address of the buffer at run time.

Type legalization in the back-end

If a function signature contains an argument of a type that cannot be directly passed in registers or overflow stack slots, but needs to use the (explicit or implicit) struct-arg buffer method, the function signature needs to be legalized: this consists of replacing the argument of struct type by a pointer to that struct type, and adding code around function call sites as well as prologue and epilogue to convert between the two (taking the address of the struct, or dereferencing the pointer, as appropriate).

Now, usually this legalization is already done by the language front-end (whether clang or rustc), and by the time the (LLVM or cranelift) IR gets to the back end, we already have a pointer type. And in fact the current abi_impl assumes that exactly this is the case.

But this is unfortunately not universally true. There are a few cases whether the front end does not perform any legalization, but rather relies on the (currently LLVM) back end for that. This is the case in particular for the i128 data type on s390x. This remains as a scalar integer type in function call signatures, but is actually _implemented_ using a hidden buffer like the struct-arg case. With LLVM, this legalization is performed in the back-end - and to implement i128 in cranelift, we probably need to do the same here.

Interestingly enough, the AbiParam data type in cranelift contains a legalized_to_pointer flag which would appear to address exactly this situation. However, as far as I can tell this flag is currently never set or used in all of cranelift. It is not clear to me how this was originally intended to be used, or where exactly this legalization would have been done.

I have implemented the functionality for now using yet another flavor of the ABIArg data type, but that seems a bit of a hack. One problem is that when implementing the pointer dereferencing in the prologue, the ABIArg slots only carry type information for the pointer (usually I64), and possibly the size of the buffer, but not the type of the buffer, so it is a bit unclear which instruction to use to implement the dereferencing.

Another problem is that if the buffer address is passed explicitly, but in a stack slot, we require a temporary register to load that address in order to dereference it (since the into_regs will hold the final i128 value, which lives in the vector register set on s390x, it cannot be used as address). It's not quite clear how to allocate a temp at this stage in the compilation process. (I'm using the "stack limit register" right now, which appears to work, but I'm not sure this is guaranteed to be OK on all platforms.)

Finally, this legalization needs to be done not just for function arguments, but also return values. Here, abi_impl already has code to allocate a buffer in the caller and pass its address to the callee so it can place the return value there. This is exactly what we need on s390x for a i128 return value - except for one implementation detail: cranelift passes that implicit pointer as last argument (because this is expected by wasmtime when used for the multiple return value extension), while the platform ABI requires the implicit pointer as first argument. To work around this, I'm now passing the return pointer first if the enable_llvm_abi_extensions flag is true, and last otherwise. This seems to work with both wasmtime and rustc_codegen_cranelift, but also feels like a hack.

@cfallin these are the problems I mentioned in our last meeting. I'd be happy to implement a solution, but I'd appreciate some guidance / discussion on what the correct approach should be.

Wasmtime GitHub notifications bot (Jul 31 2022 at 14:38):

bjorn3 commented on issue #4565:

For i128 in the C ABI, rustc already performs the pass-by-ref transformation itself, so Cranelift just sees a pointer. For the Rust ABI however you are right that the transformation needs to happen in the backend.

except for one implementation detail: cranelift passes that implicit pointer as last argument (because this is expected by wasmtime when used for the multiple return value extension), while the platform ABI requires the implicit pointer as first argument. To work around this, I'm now passing the return pointer first if the enable_llvm_abi_extensions flag is true, and last otherwise. This seems to work with both wasmtime and rustc_codegen_cranelift, but also feels like a hack.

Wasmtime has a separate call conv variant called WasmtimeSystemV which is documented as affecting the way values are returned. I think you should match on the call conv, not the enable_llvm_abi_extensions flag. Also i128 isn't used at all by wasmtime, so it is probably fine to simply error on i128 returns on WasmtimeSystemV.

Wasmtime GitHub notifications bot (Jul 31 2022 at 16:33):

uweigand commented on issue #4565:

Wasmtime has a separate call conv variant called WasmtimeSystemV which is documented as affecting the way values are returned. I think you should match on the call conv, not the enable_llvm_abi_extensions flag. Also i128 isn't used at all by wasmtime, so it is probably fine to simply error on i128 returns on WasmtimeSystemV.

This seems to be working fine for me, and makes sense. Thanks for the suggestion!

Wasmtime GitHub notifications bot (Aug 01 2022 at 16:28):

uweigand commented on issue #4565:

Interestingly enough, the AbiParam data type in cranelift contains a legalized_to_pointer flag which would appear to address exactly this situation. However, as far as I can tell this flag is currently never set or used in all of cranelift. It is not clear to me how this was originally intended to be used, or where exactly this legalization would have been done.

Doing a bit of archaeology on this, this was used to implement ValueConversion::Pointer, which was introduced in https://github.com/bytecodealliance/wasmtime/pull/1670 and only ever used by the 32-bit x86 backend. So after that backend went away, the only user was gone, and when old-style legalization was removed completely a bit later on, support for ValueConversion went away with it, and no equivalent feature was ever added to the new ABI code used by new-style backends.

Wasmtime GitHub notifications bot (Aug 01 2022 at 18:17):

cfallin commented on issue #4565:

@uweigand thanks for the detailed writeup on these issues! At this point you will have likely paged in much more of the option-space here than I have; I wrote much of this code but my mental L2 cache has been thrashed many times over. So I'd defer to your suggested approaches above, as they sound reasonable. If my understanding is right, this is kind of a dual to the struct-arg mechanism: for struct args, the IR value is a pointer, and the ABI code allocates the space (implicit location, as you say); while here the IR value is the value stored in memory. That seems like a reasonable option to support.

I do think that if we can manage it, avoiding a separate legalization pass is ideal; I'd prefer that we handle everything inline in abi_impl, as long as the complexity is not too bad. But it seems that we could consider the large-types-passed-as-pointers when computing the stackframe layout on the caller side, and allocate extra space, and generate the stores (or loads for returns); and on the callee side, load from the pointer (or store to it).

In any case, happy to look at PRs as a next step!

Wasmtime GitHub notifications bot (Aug 02 2022 at 14:02):

uweigand commented on issue #4565:

@cfallin thanks for the comments! After some back-and-forth, I've now come up with the following data structure changes:

To implement the first issue (struct arguments with explicit pointer register), I'm simply adding a field to ABIArg::StructArg:
        /// Register or stack slot holding a pointer to the buffer as passed
        /// by the caller to the callee.  If None, the ABI defines the buffer
        /// to reside at a well-known location (i.e. at `offset` below).
        pointer: Option<ABIArgSlot>,
To implement the second issue (type legalization in the back end), I'm adding a new enum member ABIArg::ImplicitArg:
    /// Implicit argument. Similar to a StructArg, except that we have the
    /// target type, not a pointer type, at the CLIF-level. This argument is
    /// still being passed via reference implicitly.
    ImplicitArg {
        /// Register or stack slot holding a pointer to the buffer.
        pointer: ABIArgSlot,
        /// Offset of the argument buffer.
        offset: i64,
        /// Type of the implicit argument.
        ty: Type,
        /// Purpose of this arg.
        purpose: ir::ArgumentPurpose,
    },
With these, the implementation looks relatively straightforward and resides fully in abi_impl plus the associated ISLE and back-end code parts.

I'm working on finishing up the PRs and will hopefully submit them soon.

Wasmtime GitHub notifications bot (Aug 04 2022 at 14:05):

akirilov-arm commented on issue #4565:

In particular, x86_64 nearly always uses the implicit buffer address method, with the only exception of structures whose size is not a compile-time constant. Many other platforms, including s390x, use the the explicit buffer address method always [...] However, the current cranelift abi_impl _only_ supports the implicit buffer address method. This should mostly be OK for x64_64 - I'm not sure if variable-sized structures passed by value even exist with the rustc frontend. However, even for aarch64 I believe the current implementation is already incorrect - as far as I can see, GCC and LLVM use the explicit buffer address method on this architecture.

Interesting; I must admit that I am not really familiar with the relevant ABI code paths because I haven't worked at all on the cg_clif integration, but it looks like there is an issue there. The standard ABI is actually a bit more nuanced - structures (or composite types, as the specification calls them) that are 16 bytes or smaller are passed as parameters in registers and/or implicitly on the stack, and via an explicit buffer address otherwise (but I am simplifying things a bit).

I128 (integral quad-word) is a fundamental data type according to the ABI and as such never uses an explicit buffer address.

Finally, this legalization needs to be done not just for function arguments, but also _return values_. Here, abi_impl already has code to allocate a buffer in the caller and pass its address to the callee so it can place the return value there. This is exactly what we need on s390x for a i128 return value - except for one implementation detail: cranelift passes that implicit pointer as _last_ argument (because this is expected by wasmtime when used for the multiple return value extension)...

Return values either fit fully into registers, or go through an explicit buffer address that is kept in a register, x8, that normally is not used for parameters and return values; it is in fact the next register after the last one used for that purpose.

Wasmtime GitHub notifications bot (Aug 04 2022 at 15:10):

uweigand commented on issue #4565:

In particular, x86_64 nearly always uses the implicit buffer address method, with the only exception of structures whose size is not a compile-time constant. Many other platforms, including s390x, use the the explicit buffer address method always [...] However, the current cranelift abi_impl _only_ supports the implicit buffer address method. This should mostly be OK for x64_64 - I'm not sure if variable-sized structures passed by value even exist with the rustc frontend. However, even for aarch64 I believe the current implementation is already incorrect - as far as I can see, GCC and LLVM use the explicit buffer address method on this architecture.

Interesting; I must admit that I am not really familiar with the relevant ABI code paths because I haven't worked at all on the cg_clif integration, but it looks like there is an issue there. The standard ABI is actually a bit more nuanced - structures (or composite types, as the specification calls them) that are 16 bytes or smaller are passed as parameters in registers and/or implicitly on the stack, and via an explicit buffer address otherwise (but I am simplifying things a bit).

Yes, it's similar on most platforms. The decision which structs are passed in registers/stack slots and which are passed via buffer is made by the front end: the front end decides to either represent the parameter as some scalar type (if passed in registers / stack slots), or as a pointer type with the StructArgument attribute if passed in a buffer. The back end then simply has to implement that decision.

This is all unaffected by this particular issue: this issue is only about the case where the front end has already decided to use the StructArgument method. The question raised here is whether the implementation of that method in the back end is actually correct, because even just this also differs between platforms. Specifically, some platforms (notably Intel) define the address of the buffer implicitly in the ABI, so caller and callee automatically agree on it - while other platforms allow the caller to choose any address, which is then passed explicitly via register or stack slot to the callee.

The question I had about AArch64 was that it is currently (in Cranelift) using the implicit method like Intel, while looking at the GCC and LLVM implementation, I see those using the explicit method. So I suspect this is actually incorrect in Cranelift, but as I have no AArch64 hardware to verify, I didn't implement any change.

Now that https://github.com/bytecodealliance/wasmtime/pull/4585 has landed, it should be straightforward to fix this in the AArch64 back end as well (if indeed something needs to be fixed).

I128 (integral quad-word) is a fundamental data type according to the ABI and as such never uses an explicit buffer address.

Yes, s390x is really special here, unfortunately.

Wasmtime GitHub notifications bot (Aug 04 2022 at 20:46):

uweigand closed issue #4565:

In implementing support for struct-args and i128 to enable rustc_codegen_cranelift on s390x, I ran into two features that are not currently available with the common abi_impl implementation.

Struct arguments using an explicit buffer address

Many commonly used ABIs have special provisions for (large) struct arguments that cannot be passed in registers. However, there are two typical flavors of these, and abi_impl only supports one of them.

Both these flavors have in common that the actual argument is copied by the caller into a buffer on the stack; the callee will then read the argument from that buffer, and is even free to modify it. However, the two flavors differ in how the address of that buffer is _communicated_ from the caller to the callee:

Implicit buffer address. With this flavor, that ABI implicitly defines the location of the buffer, usually by making the buffer part of the same area used for stack slots for overflow scalar arguments. Both caller and callee can compute the location of the buffer at compile time simply based on the function call signature; the address is not materialized as part of the run-time call sequence.

Explicit buffer address. Here, the ABI does not define the location of the buffer itself, but leaves the compiler free to place it anywhere. The actual location of the buffer is passed at run-time from caller to callee, typically in a register (or overflow stack slot) as if we had a pointer at this place in the function call signature.

There are pros and cons to either approach, leading to different choices of method across platform ABIs. In particular, x86_64 nearly always uses the implicit buffer address method, with the only exception of structures whose size is not a compile-time constant. Many other platforms, including s390x, use the the explicit buffer address method always, which allows the opportunity to the compiler to avoid actually copying the structure, e.g. if the original copy can be shown to be unused after the call so it wouldn't matter if the callee clobbers its contents.

However, the current cranelift abi_impl _only_ supports the implicit buffer address method. This should mostly be OK for x64_64 - I'm not sure if variable-sized structures passed by value even exist with the rustc frontend. However, even for aarch64 I believe the current implementation is already incorrect - as far as I can see, GCC and LLVM use the explicit buffer address method on this architecture. This is definitely also the case on s390x.

This should be straightforward to fix, and I already have an implementation. The main change is to the ABIArg data type, which currently holds either a vector of ABISlots (registers or overflow stack slots) or a StructArg identifying a buffer location on the stack. I've changed this to optionally provide both these items - where the semantics of providing both means that the ABISlot identifies the register (or overflow stack slot) needed to hold the address of the buffer at run time.

Type legalization in the back-end

If a function signature contains an argument of a type that cannot be directly passed in registers or overflow stack slots, but needs to use the (explicit or implicit) struct-arg buffer method, the function signature needs to be legalized: this consists of replacing the argument of struct type by a pointer to that struct type, and adding code around function call sites as well as prologue and epilogue to convert between the two (taking the address of the struct, or dereferencing the pointer, as appropriate).

Now, usually this legalization is already done by the language front-end (whether clang or rustc), and by the time the (LLVM or cranelift) IR gets to the back end, we already have a pointer type. And in fact the current abi_impl assumes that exactly this is the case.

But this is unfortunately not universally true. There are a few cases whether the front end does not perform any legalization, but rather relies on the (currently LLVM) back end for that. This is the case in particular for the i128 data type on s390x. This remains as a scalar integer type in function call signatures, but is actually _implemented_ using a hidden buffer like the struct-arg case. With LLVM, this legalization is performed in the back-end - and to implement i128 in cranelift, we probably need to do the same here.

Interestingly enough, the AbiParam data type in cranelift contains a legalized_to_pointer flag which would appear to address exactly this situation. However, as far as I can tell this flag is currently never set or used in all of cranelift. It is not clear to me how this was originally intended to be used, or where exactly this legalization would have been done.

I have implemented the functionality for now using yet another flavor of the ABIArg data type, but that seems a bit of a hack. One problem is that when implementing the pointer dereferencing in the prologue, the ABIArg slots only carry type information for the pointer (usually I64), and possibly the size of the buffer, but not the type of the buffer, so it is a bit unclear which instruction to use to implement the dereferencing.

Another problem is that if the buffer address is passed explicitly, but in a stack slot, we require a temporary register to load that address in order to dereference it (since the into_regs will hold the final i128 value, which lives in the vector register set on s390x, it cannot be used as address). It's not quite clear how to allocate a temp at this stage in the compilation process. (I'm using the "stack limit register" right now, which appears to work, but I'm not sure this is guaranteed to be OK on all platforms.)

Finally, this legalization needs to be done not just for function arguments, but also return values. Here, abi_impl already has code to allocate a buffer in the caller and pass its address to the callee so it can place the return value there. This is exactly what we need on s390x for a i128 return value - except for one implementation detail: cranelift passes that implicit pointer as last argument (because this is expected by wasmtime when used for the multiple return value extension), while the platform ABI requires the implicit pointer as first argument. To work around this, I'm now passing the return pointer first if the enable_llvm_abi_extensions flag is true, and last otherwise. This seems to work with both wasmtime and rustc_codegen_cranelift, but also feels like a hack.

@cfallin these are the problems I mentioned in our last meeting. I'd be happy to implement a solution, but I'd appreciate some guidance / discussion on what the correct approach should be.

Last updated: Apr 18 2025 at 08:04 UTC