Stream: git-wasmtime

Topic: wasmtime / issue #5922 Cranelift: Segfault with unaligned...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2023 at 17:08):

afonso360 opened issue #5922:

:wave: Hey,

Me and @alexcrichton were discussing this via Zulip. The stack_load gets translated into a load.f64x2 notrap aligned v34, and that's why we do the load sinking in the backend.

Is stack_load required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.

.clif Test Case

test run
target x86_64
target aarch64
target s390x

function %a() -> f64 system_v {
    ss0 = explicit_slot 69

block0:
    v0 = iconst.i64 0
    stack_store.i64 v0, ss0+20
    stack_store.i64 v0, ss0+28
    stack_store.i64 v0, ss0+36

    v29 = stack_load.f64x2 ss0+23
    v30 = extractlane v29, 1
    return v30
}

; run: %a() == 0.0

Steps to Reproduce

Expected Results

The test to pass

Actual Results

    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif`
Segmentation fault

This passes on AArch64 and S390x.

Versions and Environment

Cranelift version or commit: main

Operating system: Linux

Architecture: x64

Extra Info

<details>
<summary>Generated assembly:</summary>

Disassembly of 60 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 50             subq    $0x50, %rsp
   8:   4c 8d 54 24 14          leaq    0x14(%rsp), %r10
   d:   4d 31 db                xorq    %r11, %r11
  10:   4d 89 1a                movq    %r11, (%r10)
  13:   4c 8d 5c 24 1c          leaq    0x1c(%rsp), %r11
  18:   48 31 f6                xorq    %rsi, %rsi
  1b:   49 89 33                movq    %rsi, (%r11)
  1e:   48 8d 74 24 24          leaq    0x24(%rsp), %rsi
  23:   48 31 ff                xorq    %rdi, %rdi
  26:   48 89 3e                movq    %rdi, (%rsi)
  29:   48 8d 7c 24 17          leaq    0x17(%rsp), %rdi
  2e:   66 0f 70 07 ee          pshufd  $0xee, (%rdi), %xmm0
  33:   48 83 c4 50             addq    $0x50, %rsp
  37:   48 89 ec                movq    %rbp, %rsp
  3a:   5d                      popq    %rbp
  3b:   c3                      retq

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2023 at 17:08):

afonso360 labeled issue #5922:

:wave: Hey,

Me and @alexcrichton were discussing this via Zulip. The stack_load gets translated into a load.f64x2 notrap aligned v34, and that's why we do the load sinking in the backend.

Is stack_load required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.

.clif Test Case

test run
target x86_64
target aarch64
target s390x

function %a() -> f64 system_v {
    ss0 = explicit_slot 69

block0:
    v0 = iconst.i64 0
    stack_store.i64 v0, ss0+20
    stack_store.i64 v0, ss0+28
    stack_store.i64 v0, ss0+36

    v29 = stack_load.f64x2 ss0+23
    v30 = extractlane v29, 1
    return v30
}

; run: %a() == 0.0

Steps to Reproduce

Expected Results

The test to pass

Actual Results

    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif`
Segmentation fault

This passes on AArch64 and S390x.

Versions and Environment

Cranelift version or commit: main

Operating system: Linux

Architecture: x64

Extra Info

<details>
<summary>Generated assembly:</summary>

Disassembly of 60 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 50             subq    $0x50, %rsp
   8:   4c 8d 54 24 14          leaq    0x14(%rsp), %r10
   d:   4d 31 db                xorq    %r11, %r11
  10:   4d 89 1a                movq    %r11, (%r10)
  13:   4c 8d 5c 24 1c          leaq    0x1c(%rsp), %r11
  18:   48 31 f6                xorq    %rsi, %rsi
  1b:   49 89 33                movq    %rsi, (%r11)
  1e:   48 8d 74 24 24          leaq    0x24(%rsp), %rsi
  23:   48 31 ff                xorq    %rdi, %rdi
  26:   48 89 3e                movq    %rdi, (%rsi)
  29:   48 8d 7c 24 17          leaq    0x17(%rsp), %rdi
  2e:   66 0f 70 07 ee          pshufd  $0xee, (%rdi), %xmm0
  33:   48 83 c4 50             addq    $0x50, %rsp
  37:   48 89 ec                movq    %rbp, %rsp
  3a:   5d                      popq    %rbp
  3b:   c3                      retq

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2023 at 17:08):

afonso360 labeled issue #5922:

:wave: Hey,

Me and @alexcrichton were discussing this via Zulip. The stack_load gets translated into a load.f64x2 notrap aligned v34, and that's why we do the load sinking in the backend.

Is stack_load required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.

.clif Test Case

test run
target x86_64
target aarch64
target s390x

function %a() -> f64 system_v {
    ss0 = explicit_slot 69

block0:
    v0 = iconst.i64 0
    stack_store.i64 v0, ss0+20
    stack_store.i64 v0, ss0+28
    stack_store.i64 v0, ss0+36

    v29 = stack_load.f64x2 ss0+23
    v30 = extractlane v29, 1
    return v30
}

; run: %a() == 0.0

Steps to Reproduce

Expected Results

The test to pass

Actual Results

    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif`
Segmentation fault

This passes on AArch64 and S390x.

Versions and Environment

Cranelift version or commit: main

Operating system: Linux

Architecture: x64

Extra Info

<details>
<summary>Generated assembly:</summary>

Disassembly of 60 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 50             subq    $0x50, %rsp
   8:   4c 8d 54 24 14          leaq    0x14(%rsp), %r10
   d:   4d 31 db                xorq    %r11, %r11
  10:   4d 89 1a                movq    %r11, (%r10)
  13:   4c 8d 5c 24 1c          leaq    0x1c(%rsp), %r11
  18:   48 31 f6                xorq    %rsi, %rsi
  1b:   49 89 33                movq    %rsi, (%r11)
  1e:   48 8d 74 24 24          leaq    0x24(%rsp), %rsi
  23:   48 31 ff                xorq    %rdi, %rdi
  26:   48 89 3e                movq    %rdi, (%rsi)
  29:   48 8d 7c 24 17          leaq    0x17(%rsp), %rdi
  2e:   66 0f 70 07 ee          pshufd  $0xee, (%rdi), %xmm0
  33:   48 83 c4 50             addq    $0x50, %rsp
  37:   48 89 ec                movq    %rbp, %rsp
  3a:   5d                      popq    %rbp
  3b:   c3                      retq

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2023 at 17:08):

afonso360 labeled issue #5922:

:wave: Hey,

Me and @alexcrichton were discussing this via Zulip. The stack_load gets translated into a load.f64x2 notrap aligned v34, and that's why we do the load sinking in the backend.

Is stack_load required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.

.clif Test Case

test run
target x86_64
target aarch64
target s390x

function %a() -> f64 system_v {
    ss0 = explicit_slot 69

block0:
    v0 = iconst.i64 0
    stack_store.i64 v0, ss0+20
    stack_store.i64 v0, ss0+28
    stack_store.i64 v0, ss0+36

    v29 = stack_load.f64x2 ss0+23
    v30 = extractlane v29, 1
    return v30
}

; run: %a() == 0.0

Steps to Reproduce

Expected Results

The test to pass

Actual Results

    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif`
Segmentation fault

This passes on AArch64 and S390x.

Versions and Environment

Cranelift version or commit: main

Operating system: Linux

Architecture: x64

Extra Info

<details>
<summary>Generated assembly:</summary>

Disassembly of 60 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 50             subq    $0x50, %rsp
   8:   4c 8d 54 24 14          leaq    0x14(%rsp), %r10
   d:   4d 31 db                xorq    %r11, %r11
  10:   4d 89 1a                movq    %r11, (%r10)
  13:   4c 8d 5c 24 1c          leaq    0x1c(%rsp), %r11
  18:   48 31 f6                xorq    %rsi, %rsi
  1b:   49 89 33                movq    %rsi, (%r11)
  1e:   48 8d 74 24 24          leaq    0x24(%rsp), %rsi
  23:   48 31 ff                xorq    %rdi, %rdi
  26:   48 89 3e                movq    %rdi, (%rsi)
  29:   48 8d 7c 24 17          leaq    0x17(%rsp), %rdi
  2e:   66 0f 70 07 ee          pshufd  $0xee, (%rdi), %xmm0
  33:   48 83 c4 50             addq    $0x50, %rsp
  37:   48 89 ec                movq    %rbp, %rsp
  3a:   5d                      popq    %rbp
  3b:   c3                      retq

</details>

view this post on Zulip Wasmtime GitHub notifications bot (Mar 03 2023 at 18:03):

bjorn3 commented on issue #5922:

I think cg_clif can emit unaligned stack_load and stack_store instructions. I can change it if the consensus is that unaligned accesses are not allowed.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 00:04):

jameysharp commented on issue #5922:

As far as I can tell, Wasmtime doesn't use Cranelift's explicit stack support. So I think cg_clif gets to have a particularly large say on this question.

My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.

If not that, I think at least that legalization should stop generating the aligned flag: since these instructions allow arbitrary offsets, it makes sense to me that someone might ask for unaligned offsets. I think the notrap flag is okay though?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 07:45):

bjorn3 commented on issue #5922:

I think the notrap flag is okay though?

I agree. The stack should never be unmapped. For stack overflow there is stack probing.

If not that, I think at least that legalization should stop generating the aligned flag: since these instructions allow arbitrary offsets, it makes sense to me that someone might ask for unaligned offsets.

It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 10:29):

afonso360 commented on issue #5922:

My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.

This is also nice for cross endian stack_{load,store}s. They currently have to be manually emitted as stack_addr+{load,store} since we always pick the native endianness. And it removes one source of the native endianness flags.

Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!

It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.

Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.

Adding an alignment field to stack slots seems like a great idea to me! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer. Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 11:40):

afonso360 edited a comment on issue #5922:

My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.

This is also nice for cross endian stack_{load,store}s. They currently have to be manually emitted as stack_addr+{load,store} since we always pick the native endianness. And it removes one source of the native endianness flags.

Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!

It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.

Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.

Adding an alignment field to stack slots seems like a great idea to me! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer (though that has a bunch of issues as well). Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 11:53):

afonso360 edited a comment on issue #5922:

My personal opinion is the StackLoad and StackStore formats ought to have a MemFlags so the frontend can decide this question for itself. I think there's room in both formats without making InstructionData any larger.

This is also nice for cross endian stack_{load,store}s. They currently have to be manually emitted as stack_addr+{load,store} since we always pick the native endianness. And it removes one source of the native endianness flags.

Also, can we do something like #4721 to these instructions? Instead of having them magically translated in Cranelift, just build them as regular loads/stores in the frontend. One less Instruction Format to deal with!

It will need to be possible to specify the alignment of a stack slot. Currently they are 1 aligned, except if you ensure that every stack slot is a multiple of 16, you get 16 aligned stack slots on all current backends. As such I do this in cg_clif as workaround for the lack of alignment specification support. If alignment specification is added to stack slots, Cranelift could infer if a stack_load or stack_store is aligned based on the stack slot alignment and the offset within the stack slot.

Another issue that the fuzzer found is that even with a load without an offset if the stack slot is unaligned it can fault. Since we always align stack slots to 8 bytes, that is not enough for SSE, we get a crash there.

Adding an alignment field to stack slots seems like a great idea to me (though that has a bunch of issues as well)! I think I might have to do that before finishing up the work on the SIMD part of the fuzzer. Otherwise I'll have to start tracking the alignment of the whole stack which I don't really want to do.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 07 2023 at 17:23):

jameysharp commented on issue #5922:

I like the idea of removing the stack_load and stack_store instructions, and instead always pairing stack_addr with a regular load or store, for most of the same reasons as #5386. It nicely resolves the question of whether these instructions need a MemFlags operand, and was already necessary anyway if you wanted to use the narrowing stores or widening loads.

I think cranelift-fuzzgen might need to start remembering which variables hold addresses, along with metadata like whether it's a func_addr or stack_addr and, in the stack case, what minimum alignment and maximum offset it has. I haven't thought through the details. I hope there's a simpler approach than I've thought of so far.

I spent a moment thinking about whether stack_addr should be removed too before remembering that, unlike wasm heaps and tables, Cranelift really is responsible for the stack.

And yeah, adding alignment specifiers to stack slots makes sense to me. As a later memory-usage optimization maybe we can sort the stack slots (both explicit and spill) by alignment to minimize padding.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 15 2023 at 18:39):

afonso360 closed issue #5922:

:wave: Hey,

Me and @alexcrichton were discussing this via Zulip. The stack_load gets translated into a load.f64x2 notrap aligned v34, and that's why we do the load sinking in the backend.

Is stack_load required to be sufficiently aligned? Reading the docs, they don't mention anything. So it might just be a case of lets update the docs and the fuzzer.

.clif Test Case

test run
target x86_64
target aarch64
target s390x

function %a() -> f64 system_v {
    ss0 = explicit_slot 69

block0:
    v0 = iconst.i64 0
    stack_store.i64 v0, ss0+20
    stack_store.i64 v0, ss0+28
    stack_store.i64 v0, ss0+36

    v29 = stack_load.f64x2 ss0+23
    v30 = extractlane v29, 1
    return v30
}

; run: %a() == 0.0

Steps to Reproduce

Expected Results

The test to pass

Actual Results

    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/afonso/git/wasmtime/target/debug/clif-util test ./lmao.clif`
Segmentation fault

This passes on AArch64 and S390x.

Versions and Environment

Cranelift version or commit: main

Operating system: Linux

Architecture: x64

Extra Info

<details>
<summary>Generated assembly:</summary>

Disassembly of 60 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 50             subq    $0x50, %rsp
   8:   4c 8d 54 24 14          leaq    0x14(%rsp), %r10
   d:   4d 31 db                xorq    %r11, %r11
  10:   4d 89 1a                movq    %r11, (%r10)
  13:   4c 8d 5c 24 1c          leaq    0x1c(%rsp), %r11
  18:   48 31 f6                xorq    %rsi, %rsi
  1b:   49 89 33                movq    %rsi, (%r11)
  1e:   48 8d 74 24 24          leaq    0x24(%rsp), %rsi
  23:   48 31 ff                xorq    %rdi, %rdi
  26:   48 89 3e                movq    %rdi, (%rsi)
  29:   48 8d 7c 24 17          leaq    0x17(%rsp), %rdi
  2e:   66 0f 70 07 ee          pshufd  $0xee, (%rdi), %xmm0
  33:   48 83 c4 50             addq    $0x50, %rsp
  37:   48 89 ec                movq    %rbp, %rsp
  3a:   5d                      popq    %rbp
  3b:   c3                      retq

</details>


Last updated: Dec 23 2024 at 12:05 UTC