Stream: cranelift

Topic: MemFlag endianness on x64/arm64


noxim (Nov 20 2022 at 15:09):

Hello, I am implementing a JIT application and found that the endianness field of MemFlags is ignored on x86_64 and aarch64. I see this was reported before in #3625, but no fix has been made. I would like to take a crack at fixing this, but I am not familiar with Cranelift's architecture. After a quick peek I see that the raw instructions are emitted in (for example) codegen/src/isa/aarch64/inst/emit.rs. I guess a trivial fix would be to emit a rev instruction there after the load, based on the endianness of the ISA and the flags. However, this would be non-optimal in cases like load rx, be ptr_a; store be ptr_b, rx (the byte swaps cancel each other out). I see a relevant bswap instruction got added in #5147, so perhaps I should emit those somewhere higher-level so they can later be optimised out. Where would be the relevant place to do it, and is this the correct approach in the first place?
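A minimal sketch of the idea in the message above, written in plain Rust rather than against Cranelift's API: on a little-endian host, a big-endian load is a native-order load followed by a byte swap, and a big-endian store is a byte swap followed by a native-order store. The function names `load_be_via_swap` and `store_be_via_swap` are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: read a big-endian u32 the way a backend without
/// native big-endian loads would, i.e. native-order load plus byte swap.
fn load_be_via_swap(bytes: [u8; 4]) -> u32 {
    // Native-order load (what a plain load instruction gives us).
    let native = u32::from_ne_bytes(bytes);
    // Only little-endian hosts need the swap (rev on aarch64, bswap on x64).
    if cfg!(target_endian = "little") {
        native.swap_bytes()
    } else {
        native
    }
}

/// Hypothetical helper: write a u32 in big-endian order using a byte swap
/// followed by a native-order store.
fn store_be_via_swap(value: u32) -> [u8; 4] {
    let native = if cfg!(target_endian = "little") {
        value.swap_bytes()
    } else {
        value
    };
    native.to_ne_bytes()
}
```

Copying a value with `store_be_via_swap(load_be_via_swap(bytes))` returns the original bytes, which is why the two swaps in the load/store example are redundant.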

noxim (Nov 20 2022 at 15:22):

Moreover, on x86_64 there is an instruction, movbe, which is effectively a load and bswap in one. Using it slightly lowers the instruction decode cost, but it's not part of the base instruction set. Intel has had it since Haswell, AMD since Excavator. What is the policy on using such an instruction?

bjorn3 (Nov 20 2022 at 21:22):

For the last point, you would add a new movbe target flag and enable it for haswell and excavator. You can then check this target flag before emitting the instruction.
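A rough sketch of what checking a target flag before emitting the instruction could look like. This is not Cranelift's actual API: `TargetFlags`, `Inst`, and `lower_be_load` are hypothetical names used only to show the shape of the decision.

```rust
/// Hypothetical stand-in for a backend's target feature flags.
#[derive(Clone, Copy)]
struct TargetFlags {
    /// Assumed flag: true when the selected CPU supports movbe.
    has_movbe: bool,
}

/// Hypothetical, heavily simplified machine instructions.
#[derive(Debug, PartialEq)]
enum Inst {
    Load,
    Bswap,
    Movbe,
}

/// Choose the instruction sequence for a big-endian load on x86_64.
fn lower_be_load(flags: TargetFlags) -> Vec<Inst> {
    if flags.has_movbe {
        // One fused instruction when the extension is available.
        vec![Inst::Movbe]
    } else {
        // Baseline fallback: plain load followed by a byte swap.
        vec![Inst::Load, Inst::Bswap]
    }
}
```

The fallback path keeps the generated code valid on CPUs without the extension, so the flag only ever selects a cheaper encoding of the same operation.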

Chris Fallin (Nov 21 2022 at 18:06):

@noxim thanks for the interest in this; indeed it is a missing feature on aarch64/x64 at the moment. Emitting byteswap instructions immediately after loads is a perfectly reasonable first implementation; we can introduce optimizations later to rewrite swapped-store-of-swapped-load and byteswap-of-swapped-load.
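A toy illustration of the kind of later optimization mentioned here, assuming a made-up expression tree rather than Cranelift's real IR: a rewrite that removes a byte swap applied directly to another byte swap. `Expr` and `simplify` are hypothetical names.

```rust
/// Hypothetical miniature IR: an opaque value or a byte swap of an expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// An opaque value identified by a number (illustrative only).
    Value(u32),
    /// Byte swap of the inner expression.
    Bswap(Box<Expr>),
}

/// Rewrite bswap(bswap(x)) to x, working from the innermost expression out.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Bswap(inner) => match simplify(*inner) {
            // Two adjacent swaps cancel each other out.
            Expr::Bswap(x) => *x,
            // A single swap has to stay.
            other => Expr::Bswap(Box::new(other)),
        },
        value => value,
    }
}
```

An odd number of nested swaps collapses to a single swap, and an even number disappears entirely.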

Jamey Sharp (Nov 21 2022 at 18:09):

To expand on bjorn3's comment: For example, on x64, there's a special case for 2-lane SIMD 64-bit integer multiplies if the AVX-512 instruction set extension is available. That's implemented in the ISLE lowering rules like this:

(rule 3 (lower (has_type (and (avx512vl_enabled $true)
                            (avx512dq_enabled $true)
                            (multi_lane 64 2))
                       (imul x y)))
      (x64_vpmullq x y))

Flags like avx512vl are declared in cranelift/codegen/meta/src/isa/x86.rs along with, unfortunately, several other places.


Last updated: Jan 24 2025 at 00:11 UTC