wasmtime / Issue #2124 Support WebAssembly memory instruc... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #2124 Support WebAssembly memory instruc...

Wasmtime GitHub notifications bot (Aug 10 2020 at 18:13):

uweigand opened Issue #2124:

Feature


Current wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instruction are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to use little-endian byte order always.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

Memory accesses added by platform ABI code (implicit pointers for argument or return values), in particular if this needs to be compatible with native code.

Memory accesses to values prepared by trampoline code at the boundaries of VM native code and JITted code.

Memory accesses to parts of the VMContext that is also accessed by VM native code.

In addition, there are cases where -while not strictly necessary for correctness- it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, when implementing code such as inlined copies for small memcpy etc.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate
the WebAssembly instructions), and native memory operation (used for everything else).

Benefit



Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation


My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex

Advantages of this approach include:

Most code that creates load/store instructions can remain unchanged, the WebAssembly translator simply always uses the new instructions.

It's already implemented and working :-)

But there are disadvantages:

All back-ends must implement all the new instructions (usually by just mapping them back to normal loads/stores), or else the target will stop working.

Middle-end code changes operating on loads/stores (e.g. the code that recognizes and creates _complex operations) should be adapted or else we can get performance regressions.

Alternatives


There's various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that
specifies the requested byte order. A detail question is whether the flag is
little-endian vs. native
little-endian vs. big-endian
little-endian vs. big-endian vs. native

Advantages:

no new IR instructions required

existing back-ends could simply ignore the flag

Disadvantages:

it's still an IR change as that flag must be considered part of the IR (e.g. parsing IR, serialization ...)

All creators of loads/stores (including outside of cranelift, and possibly even outside of wasmtime!) must be updated. If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

B) Add an additional flag bit to the existing MemFlags
Advantages:

no new IR (should be covered by existing serialization ...)

can be ignored by existing back-ends

no change to (most) creators of loads/stores necessary

Disadvantages:

MemFlags can no longer be dropped, it becomes required for correctness to always preserve in in the middle end

C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice-versa for stores. This would probably require addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).

Advantages:

Only "bswap" as new IR element, can be ignored by back-ends for little-endian architectures and everywhere else.

Disadvantages:

No major ones I can see - this would be my preferred approach.

Wasmtime GitHub notifications bot (Aug 10 2020 at 18:15):

uweigand commented on Issue #2124:

@cfallin @julian-seward1 - opened issue as discussed in our call today.

Wasmtime GitHub notifications bot (Nov 09 2020 at 16:55):

uweigand commented on Issue #2124:

See the PR I just posted. This is not necessarily ready to go in as-is, it is intended to showcase a possible implementation.

The PR implements a variant of method (C) above, where I'm creating just two new Cranelift IR opcodes, LoadRev and StoreRev - load reverse and store reverse. (These map directly to instructions IBM Z has, and a number of other platforms have as well.)

Note that there are no extending load / truncating store versions of those (I'm not aware of any ISA that directly has them, and in any case any new back-end could just match the combination if necessary). Likewise, there are no _complex versions, those are no longer really used with new back-end anyway.

Wasmtime GitHub notifications bot (Nov 30 2020 at 17:34):

fitzgen commented on Issue #2124:

Thanks for exploring the design space and laying out our choices @uweigand.

I would personally be in favor of alternative A. I think having clif's semantics defined irrespective of the current target is valuable and will make hacking on the compiler easier.

(I think B is also a strong contender, but A will be easier from an engineering perspective, since we can leverage compiler errors to automatically tell us exactly which code needs updates, rather than manually hunting down places where mem flags are silently dropped and/or created with defaults.)

Although I don't strongly hold this opinion, I'm leaning towards only discriminating between big and little endian, not including a native endian option.

If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

We can expose the target's endianness relatively easily, since our target-lexicon crate already has this information, and the cranelift-frontend builders can implement a load-with-native-endian builder method.

https://docs.rs/target-lexicon/0.11.1/target_lexicon/enum.Endianness.html

Wasmtime GitHub notifications bot (Nov 30 2020 at 17:54):

cfallin commented on Issue #2124:

+1 to the combination of two choices:

Only "big endian" and "little endian" as choices (no "native");

Make the current load-instruction builders choose the target's native endian.

In other words we want existing users of the builder API to continue to see the same semantics, and new builder methods can take an Endianness. Removing the "native" option at the CLIF level fully closes the target-specific semantics hole, which is nice!

Wasmtime GitHub notifications bot (Dec 02 2020 at 11:59):

uweigand commented on Issue #2124:

I've made an initial attempt at implementing something along those lines here: https://github.com/uweigand/wasmtime/pull/1

Note that this is currently not ready to even attempt merging upstream, in particular because the addition of the extra endianness member to various clif instructions makes the InstructionData structure exceed its target size of 16 bytes.

So either we find some way to encode InstructionData more efficiently (but that would need advice from somebody much more familiar with Rust than I am ...), we accept the (significantly) memory consumption, or else we go back and try something else.

However, adding the explicit endianness member also makes the code more verbose, if you look at the patch above you'll see a large number of mostly "boilerplate" changes. As an aside, I have not been able to implement this suggestion:

Make the current load-instruction builders choose the target's native endian.

as is (or maybe I misunderstood it). The builders that are used by most of the code are the ones automatically generated by the "meta" machinery, and not only did I not find an easy way to provide a default value for any member, but the builders don't seem to even have the required information (they don't have access to the TargetIsa or triple or anything else to determine the default target endianness, unless I'm missing something). That said, changing the callers to provide an explicit endianness (usually retrieved from the TargetIsa) was mostly straightforward.

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

I'll see if I can come up with another implementation along those lines.

Finally, I'd appreciate any further comments on the patch above. In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

When dumping an instruction for debug purposes, I'm now always including endian information (by adding "little" or "big" to the output, alongside where MemFlags are printed).

When reading an instruction (e.g. for clif-util testing), the parser will accept "little" or "big" as well. If neither is present, the endianness will default to the ISA currently in operation (similarly for how register names are parsed); if no ISA is currently in operation, the parser will reject any memory instruction without explicit "little" or "big". (This happens just once in the existing filetests.)

Wasmtime GitHub notifications bot (Dec 02 2020 at 19:04):

fitzgen commented on Issue #2124:

As a result of those issues with adding and explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness with the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptically about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus this cannot be silently dropped.

This sounds very reasonable to me.

In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

This sounds good to me.

Wasmtime GitHub notifications bot (Dec 04 2020 at 16:50):

uweigand commented on Issue #2124:

Here's an updated patch set along those lines: https://github.com/uweigand/wasmtime/pull/2

I'm still not completely happy, in particular because of the following two issues:

New-style back ends now attach MemFlags to machine instructions, which is used only to decide whether the emitted instruction can trap or not. Having endianness encoded into MemFlags can make this more tedious, in particular with test cases that create new MemFlags (e.g. emit_tests.rs). At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

GlobalValue "load" instructions use somelike similar, but not quite identical to MemFlags; it is printed and parsed as if it were MemFlags, but in the data structure there's actually just a "readonly" flag. Should this now include endianness in printing and parsing? Should the data structure transition to a full MemFlags instead?

Comment on these questions as well as general comments on the approach are welcome - thanks!

Wasmtime GitHub notifications bot (Dec 04 2020 at 18:32):

cfallin commented on Issue #2124:

At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

Hmm, yeah, I think we may want to reconsider this. I had refactored things away from attaching a SourceLoc to every may-trap memory instruction explicitly, as that was becoming untenable, and plumbing the flags through was simple enough; but considering the bits that actually exist in the MemFlags, the trap-ness is the only thing that wouldn't otherwise translate to a machine-specific thing (e.g. alignedness should manifest in how the lowering chooses instructions, if at all, not in flags on one machine instruction).

In particular, the endianness flag at the CLIF level should translate to a different instruction sequence at the VCode level on most machines (excepting only a hypothetical machine that has a "swap bytes" bit on every load instruction). So it would be suboptimal to be able to represent meaningless flags at the VCode level.

An alternative we had considered was to instead make the trap-ness part of a new "instruction metadata", along with the safepoint bit, that lives alongside the MachInsts in the body rather than inside of them. I'm starting to warm up to this more; trap-ness is analogous to safepoint-ness in that both result in metadata attached to the MachBuffer during emission, and both are properties of the instruction that are known when it is given to the LowerCtx.

So I might suggest the following: (i) create a MachInstMetadata that has, for now, two flags (safepoint, may_trap); (ii) unify LowerCtx::emit() and LowerCtx::emit_safepoint() to emit_with_opts() and emit(), the latter forwarding to former with MachInstMetadata::default(); (iii) store the metadata alongside instructions; (iv) feed the metadata to EmitState where we currently attach the stackmap on safepoints.

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

Wasmtime GitHub notifications bot (Dec 08 2020 at 13:15):

uweigand commented on Issue #2124:

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

Thanks @cfallin, I've now updated the patch to carry a MemFlags in GlobalValueData::Load and left the MemFlags on machine instructions for now. The updated patch is now posted as PR #2488 above.

Wasmtime GitHub notifications bot (Dec 14 2020 at 22:00):

cfallin commented on Issue #2124:

Fixed with #2492.

Wasmtime GitHub notifications bot (Dec 14 2020 at 22:00):

cfallin closed Issue #2124:

Feature


Current wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instruction are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to use little-endian byte order always.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

Memory accesses added by platform ABI code (implicit pointers for argument or return values), in particular if this needs to be compatible with native code.

Memory accesses to values prepared by trampoline code at the boundaries of VM native code and JITted code.

Memory accesses to parts of the VMContext that is also accessed by VM native code.

In addition, there are cases where -while not strictly necessary for correctness- it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, when implementing code such as inlined copies for small memcpy etc.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate
the WebAssembly instructions), and native memory operation (used for everything else).

Benefit



Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation


My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex

Advantages of this approach include:

Most code that creates load/store instructions can remain unchanged, the WebAssembly translator simply always uses the new instructions.

It's already implemented and working :-)

But there are disadvantages:

All back-ends must implement all the new instructions (usually by just mapping them back to normal loads/stores), or else the target will stop working.

Middle-end code changes operating on loads/stores (e.g. the code that recognizes and creates _complex operations) should be adapted or else we can get performance regressions.

Alternatives


There's various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that
specifies the requested byte order. A detail question is whether the flag is
little-endian vs. native
little-endian vs. big-endian
little-endian vs. big-endian vs. native

Advantages:

no new IR instructions required

existing back-ends could simply ignore the flag

Disadvantages:

it's still an IR change as that flag must be considered part of the IR (e.g. parsing IR, serialization ...)

All creators of loads/stores (including outside of cranelift, and possibly even outside of wasmtime!) must be updated. If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

B) Add an additional flag bit to the existing MemFlags
Advantages:

no new IR (should be covered by existing serialization ...)

can be ignored by existing back-ends

no change to (most) creators of loads/stores necessary

Disadvantages:

MemFlags can no longer be dropped, it becomes required for correctness to always preserve in in the middle end

C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice-versa for stores. This would probably require addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).

Advantages:

Only "bswap" as new IR element, can be ignored by back-ends for little-endian architectures and everywhere else.

Disadvantages:

No major ones I can see - this would be my preferred approach.

Last updated: Jan 10 2026 at 02:36 UTC