See https://github.com/bytecodealliance/wasmtime/pull/2504#issuecomment-744327408. I get some really weird behaviour: basically `66 0f 57 06 xorpd (%rsi),%xmm0` gives a SIGSEGV, despite the fact that reading from the address in `%rsi` using a debugger works fine. LLDB reports the accessed location as 0 for some reason. I also noticed that the preceding instruction `66 48 0f 6e c0 movq %rax,%xmm0` doesn't cause the value of `%rax` (`0x8000000000000000`) to be loaded into `%xmm0`; it stays 0. If I write a different value to `%xmm0` just before executing the `movq`, the value gets reset to 0 after executing it. For the record, running it in QEMU with user-mode emulation works fine.
Also, when jumping over the first load instruction, the second load instruction doesn't cause a SIGSEGV, but the third does.
@bjorn3 is the value in `%rsi` 16-aligned? I bet it's not.
`0x7ffd802b5678`, nope, it looks like it's not. Unfortunately there is no way to specify the alignment of a stack slot, though.
I will try compiling with the `aligned` memflag removed.
As a random comment, all of those SSE load-op instructions require the memory address to be 16-aligned. (I think.)
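For reference, a minimal sketch of that alignment rule using Rust's `core::arch` intrinsics (the function, type, and values here are illustrative only, not anything from the thread). The aligned load form faults on a misaligned address; the `u` variant accepts any alignment:

```rust
#[cfg(target_arch = "x86_64")]
fn sse_alignment_demo() {
    use core::arch::x86_64::{_mm_load_pd, _mm_loadu_pd};

    // Force 16-byte alignment, the way a compiler-managed stack slot should be.
    #[repr(C, align(16))]
    struct Aligned([f64; 2]);

    let data = Aligned([1.0, 2.0]);
    unsafe {
        // movapd: requires a 16-byte-aligned address; a misaligned one raises
        // #GP, which the OS reports as SIGSEGV (like the xorpd case above).
        let _a = _mm_load_pd(data.0.as_ptr());
        // movupd: accepts any alignment.
        let _u = _mm_loadu_pd(data.0.as_ptr());
    }
}
```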
Removing the `aligned` memflag doesn't have any effect on the compiled code.
Replacing `load.f64` with `load.i64` + `raw_bitcast.f64` fixed this SIGSEGV, but now I get another one.
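For concreteness, a sketch of that workaround through the cranelift-frontend `InstBuilder` API of this era (the helper name is mine, and `raw_bitcast` was later replaced by a memflags-carrying `bitcast`, so treat the exact names as approximate):

```rust
use cranelift_codegen::ir::{immediates::Offset32, types, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

// Instead of emitting `load.f64` directly, load the bits as an i64 and
// reinterpret them as an f64 with a no-op bitcast.
fn load_f64_via_i64(builder: &mut FunctionBuilder<'_>, addr: Value) -> Value {
    let flags = MemFlags::new();
    let bits = builder.ins().load(types::I64, flags, addr, Offset32::new(0));
    builder.ins().raw_bitcast(types::F64, bits)
}
```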
This time the problem seems to be in the binemit code. The vcode contains `movq 0(%rsi), %xmm0`, but the disassembly contains `mov (%rsi), %rax`.
`raw_bitcast` is lowered as a simple move, which isn't correct for GPR->FPR moves.
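A toy illustration (not the Cranelift lowering code) of why a "simple move" is wrong here: the choice of move instruction depends on the register classes at both ends.

```rust
enum RegClass { Gpr, Xmm }

// Picking a move instruction has to look at both register classes;
// a plain `mov` only works GPR-to-GPR.
fn move_mnemonic(src: RegClass, dst: RegClass) -> &'static str {
    match (src, dst) {
        (RegClass::Gpr, RegClass::Gpr) => "mov",
        (RegClass::Xmm, RegClass::Xmm) => "movaps",
        // Crossing classes needs the MOVD/MOVQ encodings (66 [REX.W] 0f 6e/7e),
        // which is exactly what a class-oblivious lowering misses.
        _ => "movq",
    }
}

fn main() {
    assert_eq!(move_mnemonic(RegClass::Gpr, RegClass::Xmm), "movq");
}
```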
I was able to work around the problem, but now I get a SIGFPE on `divb %dh`.
That would surely be a rerun of the `0x40` problem, no?
That is possible. I will take a look at the vcode of that function.
Inst 484: `div %sil`
I think so.
This is with the newBE, right? If so, probably the easiest thing to do is to add case(s) to the relevant `emit_tests.rs` and then fiddle around with `emit.rs` so as to make it work. It's gonna come down to passing a retain-redundant-rex-prefix flag to the low-level emit function (I forget the exact names).
I am currently testing `rex_flags.always_emit()` when `size == 1` on div instructions.
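For the record, the encoding rule behind this, as a standalone toy (the real logic lives in the backend's REX handling; the function name and shape here are mine): the byte registers `%spl`/`%bpl`/`%sil`/`%dil` (hardware encodings 4..=7) are only addressable with a REX prefix present. Without one, those encodings mean `%ah`/`%ch`/`%dh`/`%bh`, which is how `div %sil` turned into `divb %dh`.

```rust
// Encodings 8..=15 (r8b..r15b) force REX.B/REX.R anyway; 4..=7 need a
// "payload-less" REX byte (0x40) when used as byte registers; 0..=3
// (%al/%cl/%dl/%bl) are fine without.
fn needs_empty_rex(hw_enc: u8, operand_size_bytes: u8) -> bool {
    operand_size_bytes == 1 && (4..=7).contains(&hw_enc)
}

fn main() {
    assert!(needs_empty_rex(6, 1));  // %sil
    assert!(!needs_empty_rex(6, 8)); // %rsi: no REX needed just for the reg
    assert!(!needs_empty_rex(0, 1)); // %al is fine without REX
}
```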
That fixed all miscompilations.
We should audit that stuff (and add test cases). Who knows how many more cases there are.
I opened https://github.com/bytecodealliance/wasmtime/issues/2507 and https://github.com/bytecodealliance/wasmtime/issues/2508.
@bjorn3 thanks for this debugging work! I agree with @Julian Seward that we should be somewhat systematic about auditing the behavior of "narrow values"; we have much more confidence in 32/64-bit types because those are exercised by Wasm, but there are possibly other bugs in the 8/16-bit handling.
I wonder if there could be a way to fuzz this -- perhaps compare against the CLIF interpreter...
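One possible shape for that idea, as a hypothetical sketch (the two closures stand in for "interpret the CLIF" and "run the compiled code"; neither is a real wasmtime/cranelift API, and a fuzzer would generate both the function and the inputs):

```rust
// Feed the same inputs to both execution paths and report the first
// disagreement.
fn first_divergence(
    interpret: impl Fn(&[u64]) -> u64,
    execute: impl Fn(&[u64]) -> u64,
    inputs: &[Vec<u64>],
) -> Option<usize> {
    inputs.iter().position(|args| interpret(args) != execute(args))
}

fn main() {
    // Toy stand-ins that diverge on narrow values, mimicking an 8-bit
    // miscompilation of the kind discussed above.
    let interp = |args: &[u64]| args[0] & 0xff;
    let compiled = |args: &[u64]| args[0]; // buggy: forgot to truncate
    assert_eq!(
        first_divergence(interp, compiled, &[vec![1], vec![0x100]]),
        Some(1)
    );
}
```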
simple-raytracer compiled in debug mode works fine. In release mode, however, it panics:
thread 'main' panicked at 'assertion failed: pending_bits <= 8', /home/bjorn/.cargo/registry/src/github.com-1ecc6299db9ec823/deflate-0.7.19/src/huffman_lengths.rs:129:5
stack backtrace:
0: rust_begin_unwind
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/std/src/panicking.rs:568:5
1: core::panicking::panic_fmt
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:92:14
2: core::panicking::panic
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:275:5
3: deflate::huffman_lengths::stored_padding
4: deflate::huffman_lengths::gen_huffman_lengths
5: deflate::compress::compress_data_dynamic_n
6: <deflate::writer::ZlibEncoder<W> as std::io::Write>::write
7: std::io::Write::write_all
8: png::encoder::Writer<W>::write_image_data
9: image::png::PNGEncoder<W>::encode
10: image::dynimage::save_buffer_impl
11: image::dynimage::save_buffer
12: image::buffer::ImageBuffer<P,Container>::save
13: raytracer::scene::Scene::render
14: main::main
15: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
For cg_clif, release mode consists of `set opt_level=speed_and_size` combined with an optimization in cg_clif that does basic store-to-load forwarding and dead store elimination.
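To make the "basic store to load forwarding" concrete, a toy sketch of the idea (not cg_clif's actual pass; the types here are mine): within straight-line code, a load from a stack slot can reuse the value of the most recent store to that slot, and a store that is overwritten before any load is dead.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Op {
    Store { slot: u32, val: i64 },
    Load { slot: u32 },
}

// Returns the values produced by the loads, forwarding from prior stores.
fn forward_loads(ops: &[Op]) -> Vec<Option<i64>> {
    let mut known: HashMap<u32, i64> = HashMap::new();
    let mut out = Vec::new();
    for op in ops {
        match *op {
            // The most recent store to each slot shadows earlier ones; an
            // earlier store with no intervening load is dead.
            Op::Store { slot, val } => { known.insert(slot, val); }
            // A load from a slot with a known value never touches memory.
            Op::Load { slot } => out.push(known.get(&slot).copied()),
        }
    }
    out
}

fn main() {
    let ops = [
        Op::Store { slot: 0, val: 1 }, // dead: overwritten before any load
        Op::Store { slot: 0, val: 2 },
        Op::Load { slot: 0 },          // forwarded: 2
    ];
    assert_eq!(forward_loads(&ops), vec![Some(2)]);
}
```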
It works as of the latest commit.
Benchmark #1: ./target/release/main
Time (mean ± σ): 7.932 s ± 0.017 s [User: 7.925 s, System: 0.006 s]
Range (min … max): 7.911 s … 7.958 s 10 runs
Benchmark #2: ./raytracer_cg_llvm
Time (mean ± σ): 8.037 s ± 0.013 s [User: 8.031 s, System: 0.004 s]
Range (min … max): 8.016 s … 8.062 s 10 runs
Summary
'./target/release/main' ran
1.01 ± 0.00 times faster than './raytracer_cg_llvm'
Huh, wow... 1% faster than the LLVM build (or conservatively, "approximately the same", though the confidence intervals don't overlap)... how does this compare to the old backend?
/me is recompiling cg_clif with the old backend
Looks like the base of your PR doesn't yet include https://github.com/bytecodealliance/wasmtime/pull/2496. I got a SIGSEGV. I will just use main.
Benchmark #1: ./raytracer_cg_clif_newbe_debug
Time (mean ± σ): 8.048 s ± 0.028 s [User: 8.042 s, System: 0.005 s]
Range (min … max): 8.007 s … 8.113 s 10 runs
Benchmark #2: ./raytracer_cg_clif_newbe_release
Time (mean ± σ): 7.940 s ± 0.025 s [User: 7.933 s, System: 0.007 s]
Range (min … max): 7.903 s … 7.984 s 10 runs
Benchmark #3: ./raytracer_cg_clif_oldbe_debug
Time (mean ± σ): 9.331 s ± 0.043 s [User: 9.325 s, System: 0.006 s]
Range (min … max): 9.278 s … 9.425 s 10 runs
Benchmark #4: ./raytracer_cg_clif_oldbe_release
Time (mean ± σ): 7.780 s ± 0.013 s [User: 7.777 s, System: 0.002 s]
Range (min … max): 7.756 s … 7.794 s 10 runs
Benchmark #5: ./raytracer_cg_llvm
Time (mean ± σ): 8.056 s ± 0.021 s [User: 8.052 s, System: 0.003 s]
Range (min … max): 8.034 s … 8.091 s 10 runs
Summary
'./raytracer_cg_clif_oldbe_release' ran
1.02 ± 0.00 times faster than './raytracer_cg_clif_newbe_release'
1.03 ± 0.00 times faster than './raytracer_cg_clif_newbe_debug'
1.04 ± 0.00 times faster than './raytracer_cg_llvm'
1.20 ± 0.01 times faster than './raytracer_cg_clif_oldbe_debug'
@Chris Fallin debug mode got faster with the newBE; release mode got slower.
@bjorn3 interesting, thanks very much for this data!
I'm hoping to spend some time finding poor codegen issues in the near-ish future so hopefully we can improve the release-mode perf a bit
Differential flamegraph between oldBE and newBE: oldbe_newbe_release.diff.svg
One clear inefficiency I found is:
│ 000000000003c583 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
│ _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
32,04 │ push %rbp
3,32 │ mov %rsp,%rbp
4,42 │ cmp %rsi,%rdi
39,78 │ sete %al
6,64 │ movzbl %al,%eax
13,82 │ pop %rbp
│ ← retq
becomes
│ 000000000022c5a0 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
│ _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
26,45 │ push %rbp
2,92 │ mov %rsp,%rbp
│ cmp %rsi,%rdi
31,49 │ sete %sil
2,47 │ movzbl %sil,%esi
│ mov %rsi,%rax
0,85 │ mov %rbp,%rsp
34,30 │ pop %rbp
1,52 │ ← retq
This is worse regalloc. The regalloc regression may also be (partially) responsible for the rest of the slowdown compared to oldBE.
In addition, it emits an unnecessary `mov %rbp, %rsp` even when `%rsp` wasn't modified in the current function at all.
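A toy illustration of that epilogue point (purely illustrative, not the actual ABI code): the `mov %rbp, %rsp` is only needed when the function actually adjusted `%rsp` after the prologue.

```rust
// Emit a frame-pointer epilogue; the stack-pointer restore is redundant
// whenever %rsp was never adjusted after `mov %rsp, %rbp` in the prologue.
fn epilogue(sp_adjusted: bool) -> Vec<&'static str> {
    let mut insts = Vec::new();
    if sp_adjusted {
        insts.push("mov %rbp, %rsp");
    }
    insts.push("pop %rbp");
    insts.push("ret");
    insts
}

fn main() {
    assert_eq!(epilogue(false), ["pop %rbp", "ret"]);
}
```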
@bjorn3 yeah, I've seen similar; @Julian Seward had said something earlier about limiting propagation of "preferred registers" in the move-coalescing to just one step, perhaps for regalloc efficiency reasons? Would be good to reconsider that.
I remember mentioning something about incomplete propagation of constraints ("I prefer to be in real reg %r42" etc) in the coalescer. But that's a bug, not a design decision. Maybe I misunderstand?
@Julian Seward ah, perhaps I'm just assuming too much intentionality -- had figured there must be a reason for it :-) Agree that full propagation is correct -- hopefully the fix isn't too bad!