afonso360 opened issue #5922:
:wave: Hey,
Me and @alexcrichton were discussing this via Zulip. The
stack_load
gets translated into aload.f64x2 notrap aligned v34
, and that's why we do the load sinking in the backend.Is
stack_load
required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.
.clif
Test Casetest run target x86_64 target aarch64 target s390x function %a() -> f64 system_v { ss0 = explicit_slot 69 block0: v0 = iconst.i64 0 stack_store.i64 v0, ss0+20 stack_store.i64 v0, ss0+28 stack_store.i64 v0, ss0+36 v29 = stack_load.f64x2 ss0+23 v30 = extractlane v29, 1 return v30 } ; run: %a() == 0.0
Steps to Reproduce
clif-util test ./the-above.clif
Expected Results
The test to pass
Actual Results
Finished dev [unoptimized + debuginfo] target(s) in 0.25s Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif` Segmentation fault
This passes on AArch64 and S390x.
Versions and Environment
Cranelift version or commit: main
Operating system: Linux
Architecture: x64
Extra Info
<details>
<summary>Generated assembly:</summary>Disassembly of 60 bytes: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 50 subq $0x50, %rsp 8: 4c 8d 54 24 14 leaq 0x14(%rsp), %r10 d: 4d 31 db xorq %r11, %r11 10: 4d 89 1a movq %r11, (%r10) 13: 4c 8d 5c 24 1c leaq 0x1c(%rsp), %r11 18: 48 31 f6 xorq %rsi, %rsi 1b: 49 89 33 movq %rsi, (%r11) 1e: 48 8d 74 24 24 leaq 0x24(%rsp), %rsi 23: 48 31 ff xorq %rdi, %rdi 26: 48 89 3e movq %rdi, (%rsi) 29: 48 8d 7c 24 17 leaq 0x17(%rsp), %rdi 2e: 66 0f 70 07 ee pshufd $0xee, (%rdi), %xmm0 33: 48 83 c4 50 addq $0x50, %rsp 37: 48 89 ec movq %rbp, %rsp 3a: 5d popq %rbp 3b: c3 retq
</details>
afonso360 labeled issue #5922:
:wave: Hey,
Me and @alexcrichton were discussing this via Zulip. The
stack_load
gets translated into aload.f64x2 notrap aligned v34
, and that's why we do the load sinking in the backend.Is
stack_load
required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.
.clif
Test Casetest run target x86_64 target aarch64 target s390x function %a() -> f64 system_v { ss0 = explicit_slot 69 block0: v0 = iconst.i64 0 stack_store.i64 v0, ss0+20 stack_store.i64 v0, ss0+28 stack_store.i64 v0, ss0+36 v29 = stack_load.f64x2 ss0+23 v30 = extractlane v29, 1 return v30 } ; run: %a() == 0.0
Steps to Reproduce
clif-util test ./the-above.clif
Expected Results
The test to pass
Actual Results
Finished dev [unoptimized + debuginfo] target(s) in 0.25s Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif` Segmentation fault
This passes on AArch64 and S390x.
Versions and Environment
Cranelift version or commit: main
Operating system: Linux
Architecture: x64
Extra Info
<details>
<summary>Generated assembly:</summary>Disassembly of 60 bytes: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 50 subq $0x50, %rsp 8: 4c 8d 54 24 14 leaq 0x14(%rsp), %r10 d: 4d 31 db xorq %r11, %r11 10: 4d 89 1a movq %r11, (%r10) 13: 4c 8d 5c 24 1c leaq 0x1c(%rsp), %r11 18: 48 31 f6 xorq %rsi, %rsi 1b: 49 89 33 movq %rsi, (%r11) 1e: 48 8d 74 24 24 leaq 0x24(%rsp), %rsi 23: 48 31 ff xorq %rdi, %rdi 26: 48 89 3e movq %rdi, (%rsi) 29: 48 8d 7c 24 17 leaq 0x17(%rsp), %rdi 2e: 66 0f 70 07 ee pshufd $0xee, (%rdi), %xmm0 33: 48 83 c4 50 addq $0x50, %rsp 37: 48 89 ec movq %rbp, %rsp 3a: 5d popq %rbp 3b: c3 retq
</details>
afonso360 labeled issue #5922:
:wave: Hey,
Me and @alexcrichton were discussing this via Zulip. The
stack_load
gets translated into aload.f64x2 notrap aligned v34
, and that's why we do the load sinking in the backend.Is
stack_load
required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.
.clif
Test Casetest run target x86_64 target aarch64 target s390x function %a() -> f64 system_v { ss0 = explicit_slot 69 block0: v0 = iconst.i64 0 stack_store.i64 v0, ss0+20 stack_store.i64 v0, ss0+28 stack_store.i64 v0, ss0+36 v29 = stack_load.f64x2 ss0+23 v30 = extractlane v29, 1 return v30 } ; run: %a() == 0.0
Steps to Reproduce
clif-util test ./the-above.clif
Expected Results
The test to pass
Actual Results
Finished dev [unoptimized + debuginfo] target(s) in 0.25s Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif` Segmentation fault
This passes on AArch64 and S390x.
Versions and Environment
Cranelift version or commit: main
Operating system: Linux
Architecture: x64
Extra Info
<details>
<summary>Generated assembly:</summary>Disassembly of 60 bytes: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 50 subq $0x50, %rsp 8: 4c 8d 54 24 14 leaq 0x14(%rsp), %r10 d: 4d 31 db xorq %r11, %r11 10: 4d 89 1a movq %r11, (%r10) 13: 4c 8d 5c 24 1c leaq 0x1c(%rsp), %r11 18: 48 31 f6 xorq %rsi, %rsi 1b: 49 89 33 movq %rsi, (%r11) 1e: 48 8d 74 24 24 leaq 0x24(%rsp), %rsi 23: 48 31 ff xorq %rdi, %rdi 26: 48 89 3e movq %rdi, (%rsi) 29: 48 8d 7c 24 17 leaq 0x17(%rsp), %rdi 2e: 66 0f 70 07 ee pshufd $0xee, (%rdi), %xmm0 33: 48 83 c4 50 addq $0x50, %rsp 37: 48 89 ec movq %rbp, %rsp 3a: 5d popq %rbp 3b: c3 retq
</details>
afonso360 labeled issue #5922:
:wave: Hey,
Me and @alexcrichton were discussing this via Zulip. The
stack_load
gets translated into aload.f64x2 notrap aligned v34
, and that's why we do the load sinking in the backend.Is
stack_load
required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.
.clif
Test Casetest run target x86_64 target aarch64 target s390x function %a() -> f64 system_v { ss0 = explicit_slot 69 block0: v0 = iconst.i64 0 stack_store.i64 v0, ss0+20 stack_store.i64 v0, ss0+28 stack_store.i64 v0, ss0+36 v29 = stack_load.f64x2 ss0+23 v30 = extractlane v29, 1 return v30 } ; run: %a() == 0.0
Steps to Reproduce
clif-util test ./the-above.clif
Expected Results
The test to pass
Actual Results
Finished dev [unoptimized + debuginfo] target(s) in 0.25s Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif` Segmentation fault
This passes on AArch64 and S390x.
Versions and Environment
Cranelift version or commit: main
Operating system: Linux
Architecture: x64
Extra Info
<details>
<summary>Generated assembly:</summary>Disassembly of 60 bytes: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 50 subq $0x50, %rsp 8: 4c 8d 54 24 14 leaq 0x14(%rsp), %r10 d: 4d 31 db xorq %r11, %r11 10: 4d 89 1a movq %r11, (%r10) 13: 4c 8d 5c 24 1c leaq 0x1c(%rsp), %r11 18: 48 31 f6 xorq %rsi, %rsi 1b: 49 89 33 movq %rsi, (%r11) 1e: 48 8d 74 24 24 leaq 0x24(%rsp), %rsi 23: 48 31 ff xorq %rdi, %rdi 26: 48 89 3e movq %rdi, (%rsi) 29: 48 8d 7c 24 17 leaq 0x17(%rsp), %rdi 2e: 66 0f 70 07 ee pshufd $0xee, (%rdi), %xmm0 33: 48 83 c4 50 addq $0x50, %rsp 37: 48 89 ec movq %rbp, %rsp 3a: 5d popq %rbp 3b: c3 retq
</details>
bjorn3 commented on issue #5922:
I think cg_clif can emit unaligned stack_load and stack_store instructions. I can change it if the consensus is that unaligned accesses are not allowed.
jameysharp commented on issue #5922:
As far as I can tell, Wasmtime doesn't use Cranelift's explicit stack support. So I think cg_clif gets to have a particularly large say on this question.
My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making
InstructionData
any larger.If not that, I think at least that legalization should stop generating the
aligned
flag: since these instructions allow arbitrary offsets, it makes sense to me that someone might ask for unaligned offsets. I think thenotrap
flag is okay though?
bjorn3 commented on issue #5922:
I think the notrap flag is okay though?
I agree. The stack should never be unmapped. For stack overflow there is stack probing.
If not that, I think at least that legalization should stop generating the aligned flag: since these instructions allow arbitrary offsets, it makes sense to me that someone might ask for unaligned offsets.
It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a
stack_load
orstack_store
is aligned based on the stack slot alignment and the offset within the stack slot.
afonso360 commented on issue #5922:
My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.
This is also nice for cross endian
stack_{load,store}
s. They currently have to be manually emitted asstack_addr
+{load,store}
since we always pick the native endianness. And it removes one source of the native endianness flags.Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!
It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.
Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.
Adding an alignment field to stack slots seems like a great idea to me! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer. Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.
afonso360 edited a comment on issue #5922:
My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.
This is also nice for cross endian
stack_{load,store}
s. They currently have to be manually emitted asstack_addr
+{load,store}
since we always pick the native endianness. And it removes one source of the native endianness flags.Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!
It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.
Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.
Adding an alignment field to stack slots seems like a great idea to me! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer (though that has a bunch of issues as well). Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.
afonso360 edited a comment on issue #5922:
My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.
This is also nice for cross endian
stack_{load,store}
s. They currently have to be manually emitted asstack_addr
+{load,store}
since we always pick the native endianness. And it removes one source of the native endianness flags.Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!
It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.
Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.
Adding an alignment field to stack slots seems like a great idea to me (though that has a bunch of issues as well)! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer. Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.
jameysharp commented on issue #5922:
I like the idea of removing the
stack_load
andstack_store
instructions, and instead always pairingstack_addr
with a regularload
orstore
, for most of the same reasons as #5386. It nicely resolves the question of whether these instructions need aMemFlags
operand, and was already necessary anyway if you wanted to use the narrowing stores or widening loads.I think cranelift-fuzzgen might need to start remembering which variables hold addresses, along with metadata like whether it's a
func_addr
orstack_addr
and, in the stack case, what minimum alignment and maximum offset it has. I haven't thought through the details. I hope there's a simpler approach than I've thought of so far.I spent a moment thinking about whether
stack_addr
should be removed too before remembering that, unlike wasm heaps and tables, Cranelift really is responsible for the stack.And yeah, adding alignment specifiers to stack slots makes sense to me. As a later memory-usage optimization maybe we can sort the stack slots (both explicit and spill) by alignment to minimize padding.
afonso360 closed issue #5922:
:wave: Hey,
Me and @alexcrichton were discussing this via Zulip. The
stack_load
gets translated into aload.f64x2 notrap aligned v34
, and that's why we do the load sinking in the backend.Is
stack_load
required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.
.clif
Test Casetest run target x86_64 target aarch64 target s390x function %a() -> f64 system_v { ss0 = explicit_slot 69 block0: v0 = iconst.i64 0 stack_store.i64 v0, ss0+20 stack_store.i64 v0, ss0+28 stack_store.i64 v0, ss0+36 v29 = stack_load.f64x2 ss0+23 v30 = extractlane v29, 1 return v30 } ; run: %a() == 0.0
Steps to Reproduce
clif-util test ./the-above.clif
Expected Results
The test to pass
Actual Results
Finished dev [unoptimized + debuginfo] target(s) in 0.25s Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif` Segmentation fault
This passes on AArch64 and S390x.
Versions and Environment
Cranelift version or commit: main
Operating system: Linux
Architecture: x64
Extra Info
<details>
<summary>Generated assembly:</summary>Disassembly of 60 bytes: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 48 83 ec 50 subq $0x50, %rsp 8: 4c 8d 54 24 14 leaq 0x14(%rsp), %r10 d: 4d 31 db xorq %r11, %r11 10: 4d 89 1a movq %r11, (%r10) 13: 4c 8d 5c 24 1c leaq 0x1c(%rsp), %r11 18: 48 31 f6 xorq %rsi, %rsi 1b: 49 89 33 movq %rsi, (%r11) 1e: 48 8d 74 24 24 leaq 0x24(%rsp), %rsi 23: 48 31 ff xorq %rdi, %rdi 26: 48 89 3e movq %rdi, (%rsi) 29: 48 8d 7c 24 17 leaq 0x17(%rsp), %rdi 2e: 66 0f 70 07 ee pshufd $0xee, (%rdi), %xmm0 33: 48 83 c4 50 addq $0x50, %rsp 37: 48 89 ec movq %rbp, %rsp 3a: 5d popq %rbp 3b: c3 retq
</details>
Last updated: Jan 24 2025 at 00:11 UTC