github-actions[bot] commented on issue #5931:
Subscribe to Label Action
cc @cfallin, @fitzgen
<details>
This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
- cfallin: isle
- fitzgen: isle
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
</details>
alexcrichton commented on issue #5931:
Oh good points! Shame on me for not actually reading all the way through on these bits...
So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?
Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.
abrown commented on issue #5931:
> So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?
Yeah, that's a good point. I guess we should remember that, beyond just the normal overhead of switching between guest and host, this YMM transition penalty could add to the switch overhead. Maybe it's worthwhile to think about running VZEROUPPER in the "host to guest" trampoline so that we feel more sure that guest code will be in the "Clean UpperState"? cc: @cfallin, @elliottt, @jameysharp; I guess this is a "better safe than sorry" kind of thought, but that goes along with the intent of this PR.
> Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.
Yeah, I wanted to say it earlier but don't want to sound cavalier: one might have to work rather hard to make the partial register dependency become a noticeable issue in a real benchmark. I'm not saying it can't be done and we shouldn't try to avoid it, just... the StackOverflow answer ("you are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions") felt more alarmist than I thought was warranted.
alexcrichton commented on issue #5931:
Oh sorry, I didn't mean to raise any alarms or convey any sense of urgency. I should probably more succinctly put it as "I was interested in filling out more AVX instructions, but had no technical motivation to document as the reason to do so, so I picked the first Google result and pasted it here".
I'll need to read up more on VZEROUPPER as I'm not sure what it does and how it affects performance myself.
jameysharp commented on issue #5931:
The optimization manual says that vzeroupper "has zero latency" so I guess the only cost is instruction decode. Given that, adding one no-operand instruction to Wasmtime's trampolines sounds reasonable to me. (I guess it should be added for transitions in both directions between host and guest, based on the optimization manual's recommendations.)
I think I remember at least one of those trampolines does a tail-call, so it doesn't have the opportunity to do this when the callee returns, which I suppose could lead to surprising results too.
Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.
abrown commented on issue #5931:
> Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.
Honestly, hadn't thought too much about this idea until today so I don't know, but if we did add VZEROUPPER in the "host to guest" direction, e.g., I think we would want to do so before we fill in any registers with passed v128 values.
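(Illustration, not from the thread.) A minimal Rust sketch of the ordering suggested above: clear the upper YMM state before crossing the boundary, and again on return. It is not Wasmtime's trampoline code; real trampolines are generated machine code and would presumably emit the instruction directly. The helper names `zero_upper`, `clear_upper_state`, and `call_across_boundary` are hypothetical. The only real API used is the `_mm256_zeroupper` intrinsic from `core::arch::x86_64`, which emits VZEROUPPER.

```rust
// Hypothetical helpers for illustration only; NOT Wasmtime's trampolines.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn zero_upper() {
    // Emits VZEROUPPER: zeroes bits 128 and above of every YMM register.
    core::arch::x86_64::_mm256_zeroupper();
}

/// Clears the upper YMM state when the CPU supports AVX; otherwise a no-op.
#[cfg(target_arch = "x86_64")]
fn clear_upper_state() {
    if std::is_x86_feature_detected!("avx") {
        // SAFETY: AVX support was checked at runtime just above.
        unsafe { zero_upper() };
    }
}

/// On other architectures there is no YMM state to clear.
#[cfg(not(target_arch = "x86_64"))]
fn clear_upper_state() {}

/// Hypothetical stand-in for a host/guest transition: the upper state is
/// cleared before the callee runs (so before any vector arguments would be
/// placed in registers) and again after it returns.
fn call_across_boundary<T>(callee: impl FnOnce() -> T) -> T {
    clear_upper_state();
    let result = callee();
    clear_upper_state();
    result
}
```

Calling `call_across_boundary(|| guest_entry())`, with `guest_entry` standing in for whatever crosses the boundary, returns the callee's result unchanged. As noted earlier in the thread, a trampoline that tail-calls its callee would have no place to run the second clear.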
Last updated: Jan 24 2025 at 00:11 UTC