bnjbvr edited Issue #2839:
> > Ok I think I got everything working now! Since it was a very recent change, @cfallin can you confirm you saw 95f86be? That was necessary to fix some segfaults that were cropping up on Windows, since the 4-byte return values were getting stored as 8-byte results. The comment there makes me suspicious though...
>
> That was indeed a suspicious comment. I am not sure exactly what the original intent was, but I double-checked just now that this `store` helper is not being used to codegen stores; those directly produce correctly-sized instructions. I suspect this originally came from the spill/reload helpers, in which case "always spill/reload the full reg" is a reasonable conservative approach. But even then, we have the invariant that the upper bits (beyond a type's width) in a register are undefined and extended when needed by an op, so this should be fine, I think. Thanks for calling it out!

_Originally posted by @cfallin in https://github.com/bytecodealliance/wasmtime/issues/2806#issuecomment-814437743_
I guess I'm coming in after the battle, but this particular change introduced a regression. See https://github.com/bytecodealliance/wasmtime/pull/2277 for context; it added an optimization for the following situation. Imagine we have the following CLIF:

```
v1 = iadd.i32 ...
v2 = uextend.i64 v1
```
Thus `v1` is 32 bits wide and `v2` is 64 bits wide. During lowering, we pattern-match this situation: in theory, the `iadd.i32` is compiled to an `addl`, which clears out the upper bits, so the zero-extension is spurious and can be redefined as a plain copy of its input (a move!). So this might be lowered into the following vcode:

```
%v1 = addl ...
movq %v1, %v2
```
Now, from the point of view of register allocation, both vcode values use `RegClass::I64`, so the two virtual registers can be coalesced and allocated to the same register. That's all right... unless the register is spilled:

```
// say %v1 and %v2 are allocated to %r1
%r1 = addl ...
movl %r1, 0x20(%rsp)  // spill to a stack slot as an i32, since %v1 is an i32
// ... later
movq 0x20(%rsp), %r1  // reload as an i64, since %v2 is an i64. The high 32 bits are undefined!
```
The line ending with "spill to a stack slot as an i32" results from the change Alex made. Before this, the value would be stored as an i64, so the high bits left by the `addl` would be stored in the stack slot and correctly reloaded. The comment's wording didn't quite help in describing this situation. Also, there should have been a test case, so that part is on me.

So the issue is that register allocation loses the precise type, and that a store to a stack slot should always store the full width of the slot, so that there are no undefined bits when storing and reloading values within a given `RegClass` (here, `RegClass::I64`). Alternatively, we could remove the `uextend` elimination optimization, but that would increase register pressure and slow down the decoding pipeline a bit, so I'd rather go with the first solution.
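For illustration, here is a minimal, self-contained Rust sketch of that first solution. The names (`RegClass`, `Ty`, `spill_reload_ty`) are hypothetical stand-ins, not Cranelift's actual spill/reload code:

```rust
/// Hypothetical stand-ins for the relevant concepts; a sketch of the idea,
/// not Cranelift's real spill/reload helpers.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum RegClass { I64, V128 }

#[derive(Clone, Copy, PartialEq, Debug)]
struct Ty { bits: u32, is_int: bool }

const I32: Ty = Ty { bits: 32, is_int: true };
const I64: Ty = Ty { bits: 64, is_int: true };

/// Pick the type used for a spill or reload: integer values are widened to
/// the full register width, so the bytes written by the spill cover the
/// whole register even when the vregs sharing it carry different CLIF types.
fn spill_reload_ty(rc: RegClass, ty: Ty) -> Ty {
    match rc {
        RegClass::I64 if ty.is_int && ty.bits < 64 => I64,
        _ => ty, // float/vector types keep their precise width
    }
}

fn main() {
    // The i32 side of the coalesced pair spills as a full i64, matching the
    // i64 reload on the other side.
    assert_eq!(spill_reload_ty(RegClass::I64, I32), I64);
}
```

The key point is that the spill and the reload would both go through the same promotion, so the store width and the reload width always agree.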
bnjbvr labeled Issue #2839:
bnjbvr labeled Issue #2839:
abrown commented on Issue #2839:
Did my refactoring of stores in 6bdef48 (from #2833) make this worse, better, or is it unrelated?
bnjbvr commented on Issue #2839:
Unrelated, I think. The real issue was the lack of a test for the behavior explained above. For what it's worth, I've got a test case now; it was a bit hard to handwrite, but it will make sure we don't undo this again later.
cfallin commented on Issue #2839:
Fantastic find, @bnjbvr -- thank you for tracking this down!
cfallin closed Issue #2839:
uweigand commented on Issue #2839:
Looks like this patch has now introduced a new regression on s390x. The patch changes `store_spillslot` but not `load_spillslot`, which means that some values may now be stored as i64 and then loaded as i32 from the same address. This does not work on big-endian platforms.
Shouldn't `load_spillslot` use the same type as `store_spillslot`? If that is not possible or desirable for some reason, then we need to adjust the offset the value is loaded from on big-endian systems in such cases. (However, it is not immediately clear how we'd know the target endianness in `load_spillslot`; it seems this would require a new `ABIMachineSpec` callback.)
I'd be happy to come up with a patch to fix this, just let me know which option you prefer.
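As a self-contained illustration of the big-endian hazard (plain Rust, not Cranelift code): if the spill writes 64 bits but the reload reads only 32 bits from the slot's base address, the reload sees the low half on little-endian but the high half on big-endian.

```rust
fn main() {
    // Simulate a spill slot: store a 64-bit value, then reload only 32 bits
    // from the slot's lowest address, mimicking the mismatched widths.
    let spilled: u64 = 0x1122_3344_AABB_CCDD;
    let slot = spilled.to_ne_bytes(); // the bytes as they would sit in memory

    let reloaded = u32::from_ne_bytes(slot[0..4].try_into().unwrap());

    // On little-endian this recovers the low 32 bits (0xAABBCCDD); on
    // big-endian it reads the high 32 bits (0x11223344) instead.
    println!("reloaded = {:#010x}", reloaded);
}
```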
bnjbvr commented on Issue #2839:
Sorry about that! I agree: since the `store_spillslot` function is target-independent and this target-dependent constraint has leaked upwards, the `load_spillslot` function should explicitly behave the same way. @cfallin, thoughts?
cfallin commented on Issue #2839:
Hmm, yes, I think the straightforward fix is to do the same type-promotion logic in `load_spillslot()`, as you suggest.

It might be worth documenting what invariants we're upholding, too, just to make this clearer for future readers: specifically, for integer registers, the lowering logic can always assume that the full register will be preserved across spills/reloads, while for floating-point and vector registers, only the used portion is preserved. The other side of this contract is that the lowering must be precise with types on floats/vecs, but is free to generate moves between integer registers holding types of different sizes.
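One way to pin that contract down would be doc comments on the two spill-path helpers; the signatures below are hypothetical placeholders for illustration, not the real `ABIMachineSpec` code or the patch that eventually landed:

```rust
// Sketch only: hypothetical signatures showing where the invariant described
// above could be documented.

/// Spill a register to its slot.
///
/// Invariant: for integer registers, the *full* register is written, even if
/// the vreg's CLIF type is narrower (e.g. i32), so lowering may freely move
/// differently sized integer values between integer registers. For
/// float/vector registers only the used portion is written, so lowering must
/// be precise with those types.
#[allow(dead_code)]
fn store_spillslot(/* slot, register, type ... */) {}

/// Reload a register from its slot.
///
/// Invariant: uses exactly the same type promotion as `store_spillslot`, so
/// the load width (and therefore the address, on big-endian targets) matches
/// the earlier store.
#[allow(dead_code)]
fn load_spillslot(/* slot, register, type ... */) {}

fn main() {}
```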
uweigand commented on Issue #2839:
> Hmm, yes, I think the straightforward fix is to do the same type-promotion logic in `load_spillslot()`, as you suggest.

OK, https://github.com/bytecodealliance/wasmtime/pull/2843 implements this and fixes the regression for me.