MemFlag endianness on x64/arm64 · cranelift

noxim (Nov 20 2022 at 15:09):

Hello, I am implementing a JIT application and found that the endianness field of MemFlags is ignored on x86_64 and aarch64. I see this was already reported in #3625, but no fix has been made. I would like to take a crack at fixing it, but I am not familiar with Cranelift's architecture. After a quick peek, I see that the raw instructions are emitted in (for example) codegen/src/isa/aarch64/inst/emit.rs.

I guess a trivial fix would be to emit a rev instruction there after the load, based on the endianness of the ISA and the flags. However, this would be suboptimal in cases like load rx, be ptr_a; store be ptr_b, rx, where the two byte swaps cancel each other out. I see a relevant bswap instruction was added in #5147, so perhaps I should emit those somewhere at a higher level so they can be optimized out later. Where would be the right place to do that, and is this the correct approach in the first place?
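To make the "trivial fix" concrete, here is a minimal Rust model of its semantics, not Cranelift code. It assumes a little-endian host, as on the x86_64 and aarch64 targets discussed here. The names Endianness, NATIVE, lower_load_u32 and lower_store_u32 are hypothetical. A load does a plain native-order access and adds one byte swap only when the requested order differs from the ISA's. A store does the mirror image. The model also shows why load be followed by store be is wasteful: the two swaps cancel, so the bytes are copied unchanged.

```rust
/// Byte order requested by a memory access; a stand-in (hypothetical)
/// for the endianness carried in a load/store's flags.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
}

/// Assumed host byte order (little-endian, as on x86_64/aarch64 here).
pub const NATIVE: Endianness = Endianness::Little;

/// Naive lowering of a 32-bit load: native-order load, then swap the
/// bytes only if the requested order differs from the ISA's order.
pub fn lower_load_u32(mem: [u8; 4], requested: Endianness, native: Endianness) -> u32 {
    // Plain load in the ISA's own byte order.
    let raw = match native {
        Endianness::Little => u32::from_le_bytes(mem),
        Endianness::Big => u32::from_be_bytes(mem),
    };
    // The extra rev/bswap the naive fix would emit after the load.
    if requested != native { raw.swap_bytes() } else { raw }
}

/// Naive lowering of a 32-bit store: swap first if needed, then do a
/// plain native-order store.
pub fn lower_store_u32(value: u32, requested: Endianness, native: Endianness) -> [u8; 4] {
    let v = if requested != native { value.swap_bytes() } else { value };
    match native {
        Endianness::Little => v.to_le_bytes(),
        Endianness::Big => v.to_be_bytes(),
    }
}
```

With mem = [0x12, 0x34, 0x56, 0x78], a big-endian load yields 0x1234_5678. Storing that value big-endian writes back exactly the same four bytes, so both swaps were redundant.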

noxim (Nov 20 2022 at 15:22):

Additionally: on x86_64 there is an instruction, movbe, which is effectively a load and a bswap in one. Using it slightly lowers the instruction decode cost, but it is not part of the base instruction set. Intel has had it since Haswell, AMD since Excavator. What is the policy on using such instructions?
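As a rough illustration of "a load and a bswap in one", the Rust sketch below models the x86 sequences as plain functions. The function names are hypothetical, and the sketch only shows the value each sequence produces, not encoding or cost. One assumption worth stating: on x86, BSWAP with a 16-bit operand has an undefined result. The 16-bit fallback here therefore uses a rotate by 8, which exchanges the two bytes.

```rust
/// Model of `movbe r32, m32`: load 4 bytes and reverse them in a single
/// instruction. x86 is little-endian, so this is a big-endian load.
pub fn movbe_load32(mem: [u8; 4]) -> u32 {
    u32::from_le_bytes(mem).swap_bytes()
}

/// Fallback without MOVBE: `mov r32, m32` followed by `bswap r32`.
pub fn mov_bswap_load32(mem: [u8; 4]) -> u32 {
    let r = u32::from_le_bytes(mem); // mov: native little-endian load
    r.swap_bytes() // bswap: reverse the four bytes in the register
}

/// 16-bit fallback: a rotate by 8 (like `rol r16, 8`) swaps the two
/// bytes, since BSWAP on a 16-bit register is undefined on x86.
pub fn mov_rol_load16(mem: [u8; 2]) -> u16 {
    u16::from_le_bytes(mem).rotate_left(8)
}
```

All three produce the same value as a big-endian read (u32::from_be_bytes / u16::from_be_bytes). The only difference is how many instructions the backend has to emit.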

bjorn3 (Nov 20 2022 at 21:22):

For the last point, you could add a new movbe target flag and enable it for Haswell and Excavator. You can then check this target flag before emitting the instruction.
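A minimal sketch of "check the target flag before emitting", assuming a plain Rust model rather than Cranelift's generated settings code. IsaFlags, has_movbe, preset and MachInst are all hypothetical names. The CPU preset turns the flag on, and the lowering picks a single movbe when the flag is set, or mov followed by bswap otherwise.

```rust
/// Hypothetical ISA feature flags; `has_movbe` stands in for the new
/// target flag suggested above (not Cranelift's real settings type).
#[derive(Clone, Copy, Debug, Default)]
pub struct IsaFlags {
    pub has_movbe: bool,
}

/// Hypothetical CPU presets. Per the discussion, Intel has MOVBE since
/// Haswell and AMD since Excavator; a real preset table would also cover
/// the later CPU models that inherit these features.
pub fn preset(cpu: &str) -> IsaFlags {
    match cpu {
        "haswell" | "excavator" => IsaFlags { has_movbe: true },
        _ => IsaFlags::default(), // baseline x86-64: flag off
    }
}

/// Hypothetical machine instructions produced by the lowering.
#[derive(Debug, PartialEq, Eq)]
pub enum MachInst {
    Mov,
    Bswap,
    Movbe,
}

/// Lower a big-endian 32-bit load, consulting the flag first.
pub fn lower_be_load32(flags: IsaFlags) -> Vec<MachInst> {
    if flags.has_movbe {
        vec![MachInst::Movbe] // one fused load-and-swap
    } else {
        vec![MachInst::Mov, MachInst::Bswap] // portable two-step fallback
    }
}
```

Keeping the fallback as the default means code compiled for a baseline x86-64 target never relies on an instruction the CPU might lack.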

Chris Fallin (Nov 21 2022 at 18:06):

@noxim thanks for the interest in this; it is indeed a missing feature on aarch64/x64 at the moment. Emitting byteswap instructions immediately after loads is a perfectly reasonable first implementation; we can introduce optimizations later to rewrite swapped-store-of-swapped-load and byteswap-of-swapped-load.
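The two rewrites mentioned here can be sketched on a toy expression type. This is a hypothetical Rust model, not Cranelift's IR. In Cranelift such rewrites would presumably be expressed as ISLE rules, and this sketch only shows the logic. The model folds a bswap into the load beneath it by flipping the load's byte order, and cancels a bswap of a bswap. A swapped store of a swapped load becomes a native store of a native load.

```rust
/// Toy IR (hypothetical): a 32-bit load, or an explicit byte swap.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Expr {
    /// Load from `addr`; `swapped` means non-native byte order.
    Load { addr: u32, swapped: bool },
    /// Explicit byte swap of a value.
    Bswap(Box<Expr>),
}

/// Toy store (hypothetical); `swapped` means non-native byte order.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Store {
    pub addr: u32,
    pub value: Expr,
    pub swapped: bool,
}

/// byteswap-of-swapped-load: bswap(load) flips the load's byte order,
/// and bswap(bswap(x)) cancels to x.
pub fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Bswap(inner) => match simplify(*inner) {
            Expr::Load { addr, swapped } => Expr::Load { addr, swapped: !swapped },
            Expr::Bswap(x) => *x,
        },
        load => load,
    }
}

/// swapped-store-of-swapped-load: the two swaps cancel, so both
/// accesses can use the native byte order.
pub fn simplify_store(s: Store) -> Store {
    match simplify(s.value) {
        Expr::Load { addr, swapped: true } if s.swapped => Store {
            addr: s.addr,
            value: Expr::Load { addr, swapped: false },
            swapped: false,
        },
        value => Store { addr: s.addr, value, swapped: s.swapped },
    }
}
```

Applied to the case from the first message (load be, then store be of the same value), simplify_store removes both swaps. Only a native load and a native store remain.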

Jamey Sharp (Nov 21 2022 at 18:09):

To expand on bjorn3's comment: for example, on x64, there's a special case for 2-lane SIMD 64-bit integer multiplies if the AVX-512VL and AVX-512DQ instruction set extensions are available. That's implemented in the ISLE lowering rules like this:

```
(rule 3 (lower (has_type (and (avx512vl_enabled $true)
                              (avx512dq_enabled $true)
                              (multi_lane 64 2))
                         (imul x y)))
      (x64_vpmullq x y))
```
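For readers less familiar with ISLE, here is a rough Rust analogue of what that rule expresses. It is an illustrative model only, and X64Flags, Ty, Lowered and lower_imul are hypothetical names. ISLE picks the highest-priority rule whose patterns match, so rule 3 wins over lower-priority generic rules. In this sketch, match arms tried top to bottom play a similar role: the specialized arm fires only when both feature flags are set, and everything else falls through to a generic lowering.

```rust
/// Hypothetical feature flags mirroring avx512vl/avx512dq.
#[derive(Clone, Copy, Debug, Default)]
pub struct X64Flags {
    pub avx512vl: bool,
    pub avx512dq: bool,
}

/// A couple of vector types (hypothetical subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty {
    I64X2,
    I32X4,
}

/// What the lowering chose (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Lowered {
    Vpmullq,
    Generic,
}

pub fn lower_imul(ty: Ty, f: X64Flags) -> Lowered {
    match ty {
        // Like the priority-3 rule: 64-bit x 2 lanes AND both extensions present.
        Ty::I64X2 if f.avx512vl && f.avx512dq => Lowered::Vpmullq,
        // Like lower-priority rules: the generic (multi-instruction) path.
        _ => Lowered::Generic,
    }
}
```

A movbe-based load lowering would follow the same shape: a higher-priority case guarded by the new target flag, with the plain load-plus-bswap sequence as the fallback.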

Flags like avx512vl are declared in cranelift/codegen/meta/src/isa/x86.rs and, unfortunately, in several other places as well.

Last updated: Apr 09 2025 at 14:04 UTC