Stream: git-wasmtime

Topic: wasmtime / issue #8112 Cranelift: different memory access...


view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 04:18):

candymate opened issue #8112:

Test Case

// main.rs
use wasmtime::*;

fn main() -> Result<()> {
    let mut config = Config::default();
    config.strategy(Strategy::Cranelift);
    config.cranelift_opt_level(OptLevel::None);

    let engine1 = Engine::new(&config)?;
    let mut new_config = config.clone();
    new_config.cranelift_opt_level(OptLevel::Speed);
    let engine2 = Engine::new(&new_config)?;
    let wat = r#"
        (module
            (type (;0;) (func (param i32 v128 v128) (result f64)))
            (import "mem" "mem" (memory (;0;) 1))
            (func (;0;) (type 0) (param i32 v128 v128) (result f64)
                (local f64)
                local.get 0
                local.get 1
                local.get 2
                i8x16.extract_lane_s 11
                i16x8.shl
                v128.store16_lane align=1 1
                local.get 0
                i32.load align=1
                f64.load align=1
                f64.const 0x1.fffffffep+31 (;=4.29497e+09;)
                i32.const 1
                select)
            (export "main" (func 0)))
    "#;
    let module1 = Module::new(&engine1, wat)?;
    let module2 = Module::new(&engine2, wat)?;
    let mut store1 = Store::new(&engine1, ());
    let mut store2 = Store::new(&engine2, ());

    let memory_ty = MemoryType::new(1, None);
    let memory1 = Memory::new(&mut store1, memory_ty.clone())?;
    let memory2 = Memory::new(&mut store2, memory_ty)?;

    let instance1 = Instance::new(&mut store1, &module1, &[memory1.into()])?;
    let instance2 = Instance::new(&mut store2, &module2, &[memory2.into()])?;
    let main1 = instance1.get_func(&mut store1, "main")
        .expect("`main` was not an exported function");
    let main2 = instance2.get_func(&mut store2, "main")
        .expect("`main` was not an exported function");
    let params = vec![
        Val::I32(4097),
        Val::V128(0x10203cccdcecf807f7e7dfffefdfc.into()),
        Val::V128(0xd7a950c126c2ce62f786d3c83f7914e.into()),
    ];
    let mut results1 = vec![Val::null()];
    let mut results2 = vec![Val::null()];
    println!("Opt level None: {:?}", main1.call(
        &mut store1,
        &params,
        &mut results1
    ));
    println!("--------------");
    println!("Opt level Speed: {:?}", main2.call(
        &mut store2,
        &params,
        &mut results2
    ));

    Ok(())
}
[package]
name = "wasmtime-wrapper"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
wasmtime = "18.0.3"

Steps to Reproduce

cargo run --release

Expected Results

Both results from none and speed optimizations should show the same result (both in error or successful execution)

Actual Results

Opt level None: Err(error while executing at wasm backtrace:
    0:   0x59 - <unknown>!<wasm function 0>

Caused by:
    0: memory fault at wasm address 0x10000 in linear memory of size 0x10000
    1: wasm trap: out of bounds memory access)
--------------
Opt level Speed: Ok(())

Versions and Environment

Extra Info

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 05:24):

cfallin commented on issue #8112:

I'm able to reproduce on main and confirm that on x86-64, we see the OOB on no-opt and successful execution on opt. On aarch64, we see successful execution in both cases.

The CLIF before opt is

function u0:0(i64 vmctx, i64, i32, i8x16, i8x16) -> f64 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+72
    gv5 = load.i64 notrap aligned readonly checked gv4
    sig0 = (i64 vmctx, i32 uext, i32 uext, i32 uext) -> i32 uext system_v
    sig1 = (i64 vmctx, i32 uext, i32 uext, i32 uext) -> i32 uext system_v
    sig2 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v
    sig3 = (i64 vmctx, i32 uext) -> i32 uext system_v
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32, v3: i8x16, v4: i8x16):
@0033                               v6 = f64const 0.0
@003b                               v7 = extractlane v4, 11
@003b                               v8 = sextend.i32 v7
@003e                               v9 = bitcast.i16x8 little v3
@003e                               v10 = ishl v9, v8
@0041                               v11 = extractlane v10, 1
@0041                               v12 = uextend.i64 v2
@0041                               v13 = global_value.i64 gv5
@0041                               v14 = iadd v13, v12
@0041                               store little heap v11, v14
@0048                               v15 = uextend.i64 v2
@0048                               v16 = global_value.i64 gv5
@0048                               v17 = iadd v16, v15
@0048                               v18 = load.i32 little heap v17
@004b                               v19 = uextend.i64 v18
@004b                               v20 = global_value.i64 gv5
@004b                               v21 = iadd v20, v19
@004b                               v22 = load.f64 little heap v21
@004e                               v23 = f64const 0x1.fffffffe00000p31
@0057                               v24 = iconst.i32 1
@0059                               v25 = select v24, v22, v23  ; v24 = 1, v23 = 0x1.fffffffe00000p31
@005a                               jump block1(v25)

                                block1(v5: f64):
@005a                               return v5
}

and after opt is

function u0:0(i64 vmctx, i64, i32, i8x16, i8x16) -> f64 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    gv4 = load.i64 notrap aligned readonly gv3+72
    gv5 = load.i64 notrap aligned readonly checked gv4
    sig0 = (i64 vmctx, i32 uext, i32 uext, i32 uext) -> i32 uext system_v
    sig1 = (i64 vmctx, i32 uext, i32 uext, i32 uext) -> i32 uext system_v
    sig2 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v
    sig3 = (i64 vmctx, i32 uext) -> i32 uext system_v
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32, v3: i8x16, v4: i8x16):
                                    v27 -> v0
                                    v29 -> v0
                                    v31 -> v0
@003e                               v9 = bitcast.i16x8 little v3
@003b                               v7 = extractlane v4, 11
                                    v32 = ishl v9, v7
@0041                               v11 = extractlane v32, 1
@0041                               v26 = load.i64 notrap aligned readonly v0+72
@0041                               v13 = load.i64 notrap aligned readonly checked v26
@0041                               v12 = uextend.i64 v2
@0041                               v14 = iadd v13, v12
@0041                               store little heap v11, v14
@0048                               v18 = load.i32 little heap v14
@004b                               v19 = uextend.i64 v18
@004b                               v21 = iadd v13, v19
@004b                               v22 = load.f64 little heap v21
@005a                               jump block1

                                block1:
@005a                               return v22
}

The faulting access is the last load. The no-opt x86-64 machine code is

0000000000000000 <wasm[0]::function[0]>:
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   4c 8b 57 08             mov    0x8(%rdi),%r10
       8:   4d 8b 12                mov    (%r10),%r10
       b:   49 39 e2                cmp    %rsp,%r10
       e:   0f 87 67 00 00 00       ja     7b <wasm[0]::function[0]+0x7b>
      14:   c4 c3 79 14 c9 0b       vpextrb $0xb,%xmm1,%r9d
      1a:   41 0f be c9             movsbl %r9b,%ecx
      1e:   48 83 e1 0f             and    $0xf,%rcx
      22:   c5 f9 6e c9             vmovd  %ecx,%xmm1
      26:   c5 f9 f1 d1             vpsllw %xmm1,%xmm0,%xmm2
      2a:   44 8b ca                mov    %edx,%r9d
      2d:   4c 8b 57 48             mov    0x48(%rdi),%r10
      31:   4d 8b 12                mov    (%r10),%r10
      34:   c4 83 79 15 14 0a 01    vpextrw $0x1,%xmm2,(%r10,%r9,1)
      3b:   44 8b d2                mov    %edx,%r10d
      3e:   4c 8b 5f 48             mov    0x48(%rdi),%r11
      42:   4d 8b 1b                mov    (%r11),%r11
      45:   47 8b 14 13             mov    (%r11,%r10,1),%r10d
      49:   4c 8b 5f 48             mov    0x48(%rdi),%r11
      4d:   4d 8b 1b                mov    (%r11),%r11
      50:   49 b9 00 00 e0 ff ff    movabs $0x41efffffffe00000,%r9
      57:   ff ef 41
      5a:   c4 c1 f9 6e c1          vmovq  %r9,%xmm0
      5f:   be 01 00 00 00          mov    $0x1,%esi
      64:   f3 43 0f 6f 0c 13       movdqu (%r11,%r10,1),%xmm1
      6a:   85 f6                   test   %esi,%esi
      6c:   0f 84 04 00 00 00       je     76 <wasm[0]::function[0]+0x76>
      72:   f2 0f 10 c1             movsd  %xmm1,%xmm0
      76:   48 89 ec                mov    %rbp,%rsp
      79:   5d                      pop    %rbp
      7a:   c3                      retq
      7b:   0f 0b                   ud2

and note in particular that the load.f64 is translated to a movdqu. Capturing the SIGSEGV in gdb I see the offset (r10 here) as 0xfff8 confirming that a load.f64 would have succeeded while a 128-bit load indeed goes OOB. It seems that this should only be possible with this lowering rule but that explicitly guards with ty_vec128.

That's as far as I've gotten tonight; cc @abrown maybe due to weird vector lowering issue?

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 05:32):

alexcrichton commented on issue #8112:

Jinx @cfallin! Had this written up when I saw yours come in as well, so I think this can add some extra information:


Thanks for the report! I believe that the opt-level=speed behavior is correct here and I believe that Wasmtime has a bug in its x64 backend which affects the unoptimized lowering. I'm going to place some more details in this issue but @candymate we'll take it from here, this should be an easy fix and my initial thinking is that this shouldn't impact much else in the system.

It appears that with this input:

;; foo.clif
function %a(i32, f64, i64) -> f64 {
block0(v0: i32, v1: f64, v2: i64):
  v3 = load.f64 v2
  v4 = select v0, v3, v1
  return v4
}

Cranelift generates:

$ cargo run compile ./foo.clif --target x86_64 -D
...
Disassembly of 25 bytes:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   f3 0f 6f 26             movdqu  (%rsi), %xmm4 ;; this instruction is a 16-byte load, not 8-byte load
   8:   85 ff                   testl   %edi, %edi
   a:   0f 84 04 00 00 00       je      0x14
  10:   f2 0f 10 c4             movsd   %xmm4, %xmm0
  14:   48 89 ec                movq    %rbp, %rsp
  17:   5d                      popq    %rbp
  18:   c3                      retq

The bug here is at offset 0x4, namely that movdqu is used to load a 64-bit floating-point value. This instruction actually loads 16-bytes instead of 8. This affects the WebAssembly you pasted because the address being loaded from is 8 away from the end of memory, so an 8-byte load is correct but a 16-byte load runs off the edge of memory, triggering a fault.

For runnable reproductions, @candymate's original test case can be turned into:

(module $mem
  (memory (export "mem") 1))

(module
  (import "mem" "mem" (memory 1))
  (func (export "main") (param i32 v128 v128) (result f64)
    (local f64)
    local.get 0             ;; 4097
    local.get 1
    local.get 2
    i8x16.extract_lane_s 11 ;; => 0x83
    i16x8.shl               ;; => param1 << 3
    v128.store16_lane align=1 1 ;; store 16-bits at 4097
    local.get 0                 ;; 4097
    i32.load align=1            ;; load previous 16-bits
    f64.load align=1            ;; load again
    f64.const 0x1.fffffffep+31 (;=4.29497e+09;)
    i32.const 1
    select))

(assert_return
  (invoke "main"
    (i32.const 4097)
    (v128.const i64x2 0x807f7e7dfffefdfc 0x10203_cccd_cecf)
    (v128.const i64x2 0x2f786d3c83f7914e 0xd7a950c126c2ce6)
  )
  (f64.const 0))

and run with:

$ wasmtime wast -O opt-level=0 foo.wast
$ wasmtime wast foo.wast

In WebAssembly this can be minimized to:

(module
  (memory 1)
  (func (export "main") (param i32 f64 i32) (result f64)
    local.get 2
    f64.load
    f64.const 0
    local.get 0
    select
    return))

(assert_return
  (invoke "main"
    (i32.const 1)
    (f64.const 1)
    (i32.const 0xfff8)
  )
  (f64.const 0))

which fails both with and without optimizations as it hits the buggy lowering rule.

I believe the bug is here-ish because XmmCmove which is used for the select has an aligned argument, and the fallback to eagerly load the bytes as they may be unaligned doesn't take size into account and loads the full xmm register.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 05:39):

cfallin commented on issue #8112:

Yep I just found the same :-) About to put up a PR.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 05:45):

cfallin commented on issue #8112:

In a little more detail, my diagnosis is: the XmmCmove pseudo-inst is fine as a 16-byte move; XMMs are 128 bits indeed; the issue is that the cmove_from_values helper, which is used by lowerings of select, has a case here where it creates the XmmCmove while passing the Value through to an XmmMemAligned which allows load fusion via the available implicit conversions, guarding only by is_xmm_type. is_xmm_type is true even for $F64 because floats live in XMMs, though they don't take the full width. So we have an incorrect implicit widening. My plan is to disallow load fusion in this case.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 06:41):

cfallin closed issue #8112:

Test Case

// main.rs
use wasmtime::*;

fn main() -> Result<()> {
    let mut config = Config::default();
    config.strategy(Strategy::Cranelift);
    config.cranelift_opt_level(OptLevel::None);

    let engine1 = Engine::new(&config)?;
    let mut new_config = config.clone();
    new_config.cranelift_opt_level(OptLevel::Speed);
    let engine2 = Engine::new(&new_config)?;
    let wat = r#"
        (module
            (type (;0;) (func (param i32 v128 v128) (result f64)))
            (import "mem" "mem" (memory (;0;) 1))
            (func (;0;) (type 0) (param i32 v128 v128) (result f64)
                (local f64)
                local.get 0
                local.get 1
                local.get 2
                i8x16.extract_lane_s 11
                i16x8.shl
                v128.store16_lane align=1 1
                local.get 0
                i32.load align=1
                f64.load align=1
                f64.const 0x1.fffffffep+31 (;=4.29497e+09;)
                i32.const 1
                select)
            (export "main" (func 0)))
    "#;
    let module1 = Module::new(&engine1, wat)?;
    let module2 = Module::new(&engine2, wat)?;
    let mut store1 = Store::new(&engine1, ());
    let mut store2 = Store::new(&engine2, ());

    let memory_ty = MemoryType::new(1, None);
    let memory1 = Memory::new(&mut store1, memory_ty.clone())?;
    let memory2 = Memory::new(&mut store2, memory_ty)?;

    let instance1 = Instance::new(&mut store1, &module1, &[memory1.into()])?;
    let instance2 = Instance::new(&mut store2, &module2, &[memory2.into()])?;
    let main1 = instance1.get_func(&mut store1, "main")
        .expect("`main` was not an exported function");
    let main2 = instance2.get_func(&mut store2, "main")
        .expect("`main` was not an exported function");
    let params = vec![
        Val::I32(4097),
        Val::V128(0x10203cccdcecf807f7e7dfffefdfc.into()),
        Val::V128(0xd7a950c126c2ce62f786d3c83f7914e.into()),
    ];
    let mut results1 = vec![Val::null()];
    let mut results2 = vec![Val::null()];
    println!("Opt level None: {:?}", main1.call(
        &mut store1,
        &params,
        &mut results1
    ));
    println!("--------------");
    println!("Opt level Speed: {:?}", main2.call(
        &mut store2,
        &params,
        &mut results2
    ));

    Ok(())
}
[package]
name = "wasmtime-wrapper"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
wasmtime = "18.0.3"

Steps to Reproduce

cargo run --release

Expected Results

Both results from none and speed optimizations should show the same result (both in error or successful execution)

Actual Results

Opt level None: Err(error while executing at wasm backtrace:
    0:   0x59 - <unknown>!<wasm function 0>

Caused by:
    0: memory fault at wasm address 0x10000 in linear memory of size 0x10000
    1: wasm trap: out of bounds memory access)
--------------
Opt level Speed: Ok(())

Versions and Environment

Extra Info

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 13:52):

fitzgen commented on issue #8112:

Idle thought: it is worth thinking about what kinds of fuzzing we can do help uncover these sorts of bugs. It intuitively seems like doing various operations on the last word of memory is "likely" to turn up "interesting" things.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 14:27):

alexcrichton commented on issue #8112:

Also, if you don't mind me asking, @candymate how did the test case from above come about? Was it a reduction from a larger application or perhaps fuzz-generated? If fuzz-generated and you're willing we'd be interested to learn how and possibly integrate it with Wasmtime's preexisting fuzzing as well

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 15:26):

candymate commented on issue #8112:

Also, if you don't mind me asking, @candymate how did the test case from above come about? Was it a reduction from a larger application or perhaps fuzz-generated? If fuzz-generated and you're willing we'd be interested to learn how and possibly integrate it with Wasmtime's preexisting fuzzing as well

The test case is generated from the fuzzer we are currently developing. However, I cannot disclose details about it for now. Hopefully, I can disclose the details of the fuzzer within this year, but I'm not sure about the date.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 15:27):

candymate edited a comment on issue #8112:

Also, if you don't mind me asking, @candymate how did the test case from above come about? Was it a reduction from a larger application or perhaps fuzz-generated? If fuzz-generated and you're willing we'd be interested to learn how and possibly integrate it with Wasmtime's preexisting fuzzing as well

The test case is generated from the fuzzer we are currently developing. However, I cannot disclose details about it for now. Hopefully, I can disclose the details within this year, but I'm not sure about the date.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 13 2024 at 15:30):

alexcrichton commented on issue #8112:

Oh nice! Sounds good, and we'll definitely be eager to read more about it once details are available!

view this post on Zulip Wasmtime GitHub notifications bot (Mar 14 2024 at 15:54):

alexcrichton added the fuzz-bug label to Issue #8112.

view this post on Zulip Wasmtime GitHub notifications bot (Mar 14 2024 at 15:54):

alexcrichton added the cranelift:area:x64 label to Issue #8112.


Last updated: Nov 22 2024 at 17:03 UTC