Stream: git-wasmtime

Topic: wasmtime / Issue #2124 Support WebAssembly memory instruc...


view this post on Zulip Wasmtime GitHub notifications bot (Aug 10 2020 at 18:13):

uweigand opened Issue #2124:

<!-- Please try to describe precisely what you would like to do in
Cranelift/Wasmtime and/or expect from it. You can answer the questions below if
they're relevant and delete this text before submitting. Thanks for opening an
issue! -->

Feature

<!-- What is the feature or code improvement you would like to do in
Cranelift/Wasmtime? -->
Current wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instruction are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to use little-endian byte order always.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

In addition, there are cases where -while not strictly necessary for correctness- it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, when implementing code such as inlined copies for small memcpy etc.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate
the WebAssembly instructions), and native memory operation (used for everything else).

Benefit

<!-- What is the value of adding this in Cranelift/Wasmtime? -->

Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation

<!-- Do you have an implementation plan, and/or ideas for data structures or
algorithms to use? -->
My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex

Advantages of this approach include:

But there are disadvantages:

Alternatives

<!-- Have you considered alternative implementations? If so, how are they
better or worse than your proposal? -->
There's various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that
specifies the requested byte order. A detail question is whether the flag is
little-endian vs. native
little-endian vs. big-endian
little-endian vs. big-endian vs. native

Advantages:

Disadvantages:

B) Add an additional flag bit to the existing MemFlags
Advantages:

Disadvantages:

C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice-versa for stores. This would probably require addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).

Advantages:

Disadvantages:

view this post on Zulip Wasmtime GitHub notifications bot (Aug 10 2020 at 18:15):

uweigand commented on Issue #2124:

@cfallin @julian-seward1 - opened issue as discussed in our call today.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 09 2020 at 16:55):

uweigand commented on Issue #2124:

See the PR I just posted. This is not necessarily ready to go in as-is, it is intended to showcase a possible implementation.

The PR implements a variant of method (C) above, where I'm creating just two new Cranelift IR opcodes, LoadRev and StoreRev - load reverse and store reverse. (These map directly to instructions IBM Z has, and a number of other platforms have as well.)

Note that there are no extending load / truncating store versions of those (I'm not aware of any ISA that directly has them, and in any case any new back-end could just match the combination if necessary). Likewise, there are no _complex versions, those are no longer really used with new back-end anyway.

view this post on Zulip Wasmtime GitHub notifications bot (Nov 30 2020 at 17:34):

fitzgen commented on Issue #2124:

Thanks for exploring the design space and laying out our choices @uweigand.

I would personally be in favor of alternative A. I think having clif's semantics defined irrespective of the current target is valuable and will make hacking on the compiler easier.

(I think B is also a strong contender, but A will be easier from an engineering perspective, since we can leverage compiler errors to automatically tell us exactly which code needs updates, rather than manually hunting down places where mem flags are silently dropped and/or created with defaults.)

Although I don't strongly hold this opinion, I'm leaning towards only discriminating between big and little endian, not including a native endian option.

If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

We can expose the target's endianness relatively easily, since our target-lexicon crate already has this information, and the cranelift-frontend builders can implement a load-with-native-endian builder method.

https://docs.rs/target-lexicon/0.11.1/target_lexicon/enum.Endianness.html

view this post on Zulip Wasmtime GitHub notifications bot (Nov 30 2020 at 17:54):

cfallin commented on Issue #2124:

+1 to the combination of two choices:

In other words we want existing users of the builder API to continue to see the same semantics, and new builder methods can take an Endianness. Removing the "native" option at the CLIF level fully closes the target-specific semantics hole, which is nice!

view this post on Zulip Wasmtime GitHub notifications bot (Dec 02 2020 at 11:59):

uweigand commented on Issue #2124:

I've made an initial attempt at implementing something along those lines here: https://github.com/uweigand/wasmtime/pull/1

Note that this is currently not ready to even attempt merging upstream, in particular because the addition of the extra endianness member to various clif instructions makes the InstructionData structure exceed its target size of 16 bytes.

So either we find some way to encode InstructionData more efficiently (but that would need advice from somebody much more familiar with Rust than I am ...), we accept the (significantly) memory consumption, or else we go back and try something else.

However, adding the explicit endianness member also makes the code more verbose, if you look at the patch above you'll see a large number of mostly "boilerplate" changes. As an aside, I have not been able to implement this suggestion:

Make the current load-instruction builders choose the target's native endian.

as is (or maybe I misunderstood it). The builders that are used by most of the code are the ones automatically generated by the "meta" machinery, and not only did I not find an easy way to provide a default value for any member, but the builders don't seem to even have the required information (they don't have access to the TargetIsa or triple or anything else to determine the default target endianness, unless I'm missing something). That said, changing the callers to provide an explicit endianness (usually retrieved from the TargetIsa) was mostly straightforward.

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

I'll see if I can come up with another implementation along those lines.

Finally, I'd appreciate any further comments on the patch above. In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

view this post on Zulip Wasmtime GitHub notifications bot (Dec 02 2020 at 19:04):

fitzgen commented on Issue #2124:

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

This sounds very reasonable to me.

In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

This sounds good to me.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 04 2020 at 16:50):

uweigand commented on Issue #2124:

Here's an updated patch set along those lines: https://github.com/uweigand/wasmtime/pull/2

I'm still not completely happy, in particular because of the following two issues:

Comment on these questions as well as general comments on the approach are welcome - thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Dec 04 2020 at 18:32):

cfallin commented on Issue #2124:

At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

Hmm, yeah, I think we may want to reconsider this. I had refactored things away from attaching a SourceLoc to every may-trap memory instruction explicitly, as that was becoming untenable, and plumbing the flags through was simple enough; but considering the bits that actually exist in the MemFlags, the trap-ness is the only thing that wouldn't otherwise translate to a machine-specific thing (e.g. alignedness should manifest in how the lowering chooses instructions, if at all, not in flags on one machine instruction).

In particular, the endianness flag at the CLIF level should translate to a different instruction sequence at the VCode level on most machines (excepting only a hypothetical machine that has a "swap bytes" bit on every load instruction). So it would be suboptimal to be able to represent meaningless flags at the VCode level.

An alternative we had considered was to instead make the trap-ness part of a new "instruction metadata", along with the safepoint bit, that lives alongside the MachInsts in the body rather than inside of them. I'm starting to warm up to this more; trap-ness is analogous to safepoint-ness in that both result in metadata attached to the MachBuffer during emission, and both are properties of the instruction that are known when it is given to the LowerCtx.

So I might suggest the following: (i) create a MachInstMetadata that has, for now, two flags (safepoint, may_trap); (ii) unify LowerCtx::emit() and LowerCtx::emit_safepoint() to emit_with_opts() and emit(), the latter forwarding to former with MachInstMetadata::default(); (iii) store the metadata alongside instructions; (iv) feed the metadata to EmitState where we currently attach the stackmap on safepoints.

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 08 2020 at 13:15):

uweigand commented on Issue #2124:

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

Thanks @cfallin, I've now updated the patch to carry a MemFlags in GlobalValueData::Load and left the MemFlags on machine instructions for now. The updated patch is now posted as PR #2488 above.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2020 at 22:00):

cfallin commented on Issue #2124:

Fixed with #2492.

view this post on Zulip Wasmtime GitHub notifications bot (Dec 14 2020 at 22:00):

cfallin closed Issue #2124:

<!-- Please try to describe precisely what you would like to do in
Cranelift/Wasmtime and/or expect from it. You can answer the questions below if
they're relevant and delete this text before submitting. Thanks for opening an
issue! -->

Feature

<!-- What is the feature or code improvement you would like to do in
Cranelift/Wasmtime? -->
Current wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instruction are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to use little-endian byte order always.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

In addition, there are cases where -while not strictly necessary for correctness- it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, when implementing code such as inlined copies for small memcpy etc.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate
the WebAssembly instructions), and native memory operation (used for everything else).

Benefit

<!-- What is the value of adding this in Cranelift/Wasmtime? -->

Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation

<!-- Do you have an implementation plan, and/or ideas for data structures or
algorithms to use? -->
My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex

Advantages of this approach include:

But there are disadvantages:

Alternatives

<!-- Have you considered alternative implementations? If so, how are they
better or worse than your proposal? -->
There's various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that
specifies the requested byte order. A detail question is whether the flag is
little-endian vs. native
little-endian vs. big-endian
little-endian vs. big-endian vs. native

Advantages:

Disadvantages:

B) Add an additional flag bit to the existing MemFlags
Advantages:

Disadvantages:

C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice-versa for stores. This would probably require addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).

Advantages:

Disadvantages:


Last updated: Nov 22 2024 at 16:03 UTC