wasmtime / issue #5931 x64: Add more support for more AVX... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / issue #5931 x64: Add more support for more AVX...

Wasmtime GitHub notifications bot (Mar 04 2023 at 22:45):

github-actions[bot] commented on issue #5931:

Subscribe to Label Action

cc @cfallin, @fitzgen

<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle

fitzgen: isle

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

</details>

Wasmtime GitHub notifications bot (Mar 09 2023 at 22:38):

alexcrichton commented on issue #5931:

Oh good points! Shame on me for not actually reading all the way through on these bits...

So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

Wasmtime GitHub notifications bot (Mar 09 2023 at 23:01):

abrown commented on issue #5931:

> So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Yeah, that's a good point. I guess we should remember that, beyond just the normal overhead of switching between guest and host, this YMM transition penalty could add to the switch overhead. Maybe it's worthwhile to think about running VZEROUPPER in the "host to guest" trampoline so that we feel more sure that guest code will be in the "Clean UpperState"? cc: @cfallin, @elliottt, @jameysharp; I guess this is a "better safe than sorry" kind of thought, but that goes along with the intent of this PR.

> Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

Yeah, I wanted to say it earlier but didn't want to sound cavalier: one might have to work rather hard to make the partial register dependency become a noticeable issue in a real benchmark. I'm not saying it can't be done or that we shouldn't try to avoid it, just... the StackOverflow answer ("you are experiencing a penalty for 'mixing' non-VEX SSE and VEX-encoded instructions") felt more alarmist than I thought was warranted.

Wasmtime GitHub notifications bot (Mar 09 2023 at 23:08):

alexcrichton commented on issue #5931:

Oh sorry, I didn't mean to raise any alarms or convey any sense of urgency. I should probably put it more succinctly as "I was interested in filling out more AVX instructions, but had no technical motivation to document as the reason to do so, so I picked the first Google result and pasted it here".

I'll need to read up more on VZEROUPPER as I'm not sure what it does and how it affects performance myself.

Wasmtime GitHub notifications bot (Mar 09 2023 at 23:15):

jameysharp commented on issue #5931:

The optimization manual says that vzeroupper "has zero latency" so I guess the only cost is instruction decode. Given that, adding one no-operand instruction to Wasmtime's trampolines sounds reasonable to me. (I guess it should be added for transitions in both directions between host and guest, based on the optimization manual's recommendations.)

I think I remember at least one of those trampolines does a tail-call, so it doesn't have the opportunity to do this when the callee returns, which I suppose could lead to surprising results too.

Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.
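
To make the "both directions" suggestion concrete, here is a minimal Rust sketch, assuming a plain closure stands in for the guest call. The names `zero_upper`, `clear_upper_ymm_state`, and `call_guest` are hypothetical and are not Wasmtime APIs. Real trampolines are emitted as machine code rather than written like this. The only real API used is the standard-library intrinsic `_mm256_zeroupper`, which executes VZEROUPPER.

```rust
// Hypothetical sketch: none of these function names exist in Wasmtime.

/// Executes VZEROUPPER through the standard-library intrinsic.
/// Callers must ensure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn zero_upper() {
    core::arch::x86_64::_mm256_zeroupper();
}

/// Clears the upper YMM state when the CPU supports AVX.
/// Returns `true` only if VZEROUPPER was actually executed.
fn clear_upper_ymm_state() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was checked at runtime just above.
            unsafe { zero_upper() };
            return true;
        }
    }
    false
}

/// Stand-in for a host-to-guest trampoline (an assumption for illustration):
/// clear the upper state on the way in and again on the way out.
fn call_guest<R>(guest: impl FnOnce() -> R) -> R {
    clear_upper_ymm_state(); // host -> guest
    let result = guest();
    clear_upper_ymm_state(); // guest -> host
    result
}
```

A call such as `call_guest(|| 40 + 2)` behaves like calling the closure directly. Note that the second `clear_upper_ymm_state()` only works because the wrapper regains control after the callee returns. A trampoline that tail-calls its callee, as mentioned above, would have no such opportunity.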

Wasmtime GitHub notifications bot (Mar 09 2023 at 23:44):

abrown commented on issue #5931:

> Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.

Honestly, I hadn't thought much about this idea until today, so I don't know. But if we did add VZEROUPPER in, e.g., the "host to guest" direction, I think we would want to do so before we fill in any registers with passed v128 values.

Last updated: Dec 06 2025 at 06:05 UTC