Stream: git-wasmtime

Topic: wasmtime / issue #4496 [WIP] x64: Shrink Inst size from 7...


view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 17:12):

fitzgen commented on issue #4496:

Would be really interested in sightglass benchmark results for this work when you get it in a good spot!

https://github.com/bytecodealliance/sightglass#comparing-a-feature-branch-to-main

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 17:22):

elliottt commented on issue #4496:

Would be really interested in sightglass benchmark results for this work when you get it in a good spot!

https://github.com/bytecodealliance/sightglass#comparing-a-feature-branch-to-main

Currently it shows a slowdown when running on the spidermonkey benchmark, but my guess is that it's from boxing SyntheticAmode. I'm experimenting with packing SyntheticAmode into 16 bytes to see if that leads to a similar space savings without the indirection :)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 19:57):

jameysharp commented on issue #4496:

Huh, looking at PackedAmode now, I'm pretty sure you can get the compiler to produce code at least this good using a regular Rust enum. When we talked about this before I hadn't realized every field was aligned to some multiple of 8 bits.

The key thing is to make SyntheticAmode copy all the cases of Amode and just add a couple more. Converting between them requires a slightly tedious match, but this ensures that all five cases are in a single one-byte tag, instead of needing to pack a second tag somewhere. In addition, if I'm not mistaken, Amode is 4-byte aligned so embedding it in one of the cases of SyntheticAmode wastes three bytes in padding.

So the result is something like the following. Happily, converting from Amode to SyntheticAmode can't fail.

enum SyntheticAmode {
    ImmReg { simm32: u32, base: Reg, flags: MemFlags },
    ImmRegRegShift { simm32: u32, base: Gpr, index: Gpr, shift: u8, flags: MemFlags },
    RipRelative { target: MachLabel },
    NominalSPOffset { simm32: u32 },
    ConstantOffset(VCodeConstant),
}

impl From<Amode> for SyntheticAmode {
    fn from(src: Amode) -> SyntheticAmode {
        match src {
            Amode::ImmReg { simm32, base, flags } => SyntheticAmode::ImmReg { simm32, base, flags },
            ...
        }
    }
}

One nice thing here is that by default Rust will reorder struct fields to pack them with as little padding as possible. Also, it picks the smallest tag representation that is big enough to distinguish all the enum cases. So I think the representation it'll pick for the above SyntheticAmode is actually exactly what you did by hand. Convenient, right?

I'd hope that LLVM is clever enough to notice that the above implementation of from doesn't change the memory representation, too.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 20:04):

cfallin commented on issue #4496:

One thing @elliottt and I have discussed is that it's likely possible to get this down to 12 bytes, actually, by taking advantage of the fact that VReg indices are limited to 21 bits (because regalloc2 also plays bitpacking tricks...). Then the worst case is 2*vreg (42 bits) + simm32 (32 bits) + tag (3 bits) + flags (5 bits now, up to 8 bits) --> 82..85 bits. To do that we do need custom bitpacking, unfortunately (though if rustc supported C-style bitfields in structs... no wait, please don't throw me out the window...).

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 20:19):

jameysharp commented on issue #4496:

Can you define a type equivalent to (VReg, VReg) that's six bytes, and avoid doing magic bit-packing for anything else?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 20:23):

jameysharp edited a comment on issue #4496:

Can you define a type equivalent to (VReg, VReg) that's six bytes, and avoid doing magic bit-packing for anything else?

Edit: It'd need to be 2-byte aligned at most, so [u16; 3] under the hood, I imagine. And would it make sense to also use that type in Amode to shave four bytes off there too?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 20:41):

elliottt commented on issue #4496:

Can you define a type equivalent to (VReg, VReg) that's six bytes, and avoid doing magic bit-packing for anything else?

Edit: It'd need to be 2-byte aligned at most, so [u16; 3] under the hood, I imagine. And would it make sense to also use that type in Amode to shave four bytes off there too?

I'd love it if we could find a way to make that work! My experiment on godbolt suggests that the tag bumps the enum back up to 16 bytes: https://rust.godbolt.org/z/1E8s3rns8.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 20:42):

elliottt edited a comment on issue #4496:

Can you define a type equivalent to (VReg, VReg) that's six bytes, and avoid doing magic bit-packing for anything else?

Edit: It'd need to be 2-byte aligned at most, so [u16; 3] under the hood, I imagine. And would it make sense to also use that type in Amode to shave four bytes off there too?

I'd love it if we could find a way to make that work -- the bit shifting implementation is a little less readable :)

My experiment on godbolt suggests that the tag bumps the enum back up to 16 bytes: https://rust.godbolt.org/z/1E8s3rns8.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 21:01):

jameysharp commented on issue #4496:

I miscounted, but my new idea is to pack base, index, and also shift into 6 bytes. I believe x86 shifts are limited to something like two or three bits, right? So six bits should be plenty.

...I just really want to use a normal enum okay :laughing:

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 21:05):

cfallin commented on issue #4496:

@jameysharp if you can come up with a working Godbolt demo with a 12-byte size that'd be great; but I think the main focus right now is just to get something working and evaluate the performance gains of a smaller Inst. Then after that we can incrementally refine :-)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 21:28):

jameysharp commented on issue #4496:

I mean, it's easy enough as long as it's safe to use only six bits for the shift: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=114c56c3b49d2465ce46456ef752f68d

So, _is_ it safe to use six or fewer bits for the shift?

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 21:35):

cfallin commented on issue #4496:

Yep, the shift amount is always one of 1, 2, 4, 8; so that representation should work. Thanks!

view this post on Zulip Wasmtime GitHub notifications bot (Jul 21 2022 at 21:36):

cfallin commented on issue #4496:

(Or rather, the scale is 1, 2, 4, 8; shift is 0, 1, 2, 3, so two bits.)

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2022 at 19:34):

MaxGraey commented on issue #4496:

Can you define a type equivalent to (VReg, VReg) that's six bytes, and avoid doing magic bit-packing for anything else?
Edit: It'd need to be 2-byte aligned at most, so [u16; 3] under the hood, I imagine. And would it make sense to also use that type in Amode to shave four bytes off there too?

I'd love it if we could find a way to make that work -- the bit shifting implementation is a little less readable :)

My experiment on godbolt suggests that the tag bumps the enum back up to 16 bytes: https://rust.godbolt.org/z/1E8s3rns8.

But you can switch to union with packed repr for ImmRegRegShift. It will turn union layout to 12 bytes without bits magic:
https://rust.godbolt.org/z/aWW939b7b

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2022 at 19:54):

jameysharp commented on issue #4496:

But you can switch to union with packed repr for ImmRegRegShift. It will turn union layout to 12 bytes without bits magic: https://rust.godbolt.org/z/aWW939b7b

There still has to be a tag somewhere. With that union we'd have no idea which variant was being used. We'd also have to use unsafe everywhere we wanted to read from these values, for the same reason.

Fortunately, just packing the base, index, and shift fields together, in the one case that uses all three, gets us down to 12 bytes anyway.

view this post on Zulip Wasmtime GitHub notifications bot (Jul 22 2022 at 22:03):

elliottt commented on issue #4496:

Closing this PR in favor of #4514.


Last updated: Dec 23 2024 at 12:05 UTC