Stream: cranelift

Topic: weird behaviour (miscompilation?) with x64 newBE


view this post on Zulip bjorn3 (Dec 14 2020 at 13:32):

See https://github.com/bytecodealliance/wasmtime/pull/2504#issuecomment-744327408. I get some really weird behaviour:

Basically 66 0f 57 06 xorpd (%rsi),%xmm0 gives a SIGSEGV. This is despite the fact that reading from the address in %rsi using a debugger works fine. Lldb reports the accessed location as 0 for some reason. I also noticed that the preceding instruction 66 48 0f 6e c0 movq %rax,%xmm0 doesn't cause the value of %rax (0x8000000000000000) to be loaded into %xmm0. It stays 0. If I write a different value to %xmm0 just before executing the movq, the value gets reset to 0 after executing it.

This PR generalizes all of the MachInst framework to reason about SSA Values as being located in multiple registers (one, two or four, currently, in an efficient packed form). This is necessary in ...

view this post on Zulip bjorn3 (Dec 14 2020 at 13:41):

For the record running it in qemu with user mode virtualization works fine.

view this post on Zulip bjorn3 (Dec 14 2020 at 13:43):

Also when jumping over the first load instruction, the second load instruction doesn't cause a SIGSEGV, but the third does.

view this post on Zulip Julian Seward (Dec 14 2020 at 15:00):

@bjorn3 is the value in %rsi 16-aligned? I bet it's not

view this post on Zulip bjorn3 (Dec 14 2020 at 15:22):

0x7ffd802b5678, nope it looks not. There is no way to specify the alignment of a stack slot though unfortunately.

view this post on Zulip bjorn3 (Dec 14 2020 at 15:22):

I will try compiling with the aligned memflag removed.

view this post on Zulip Julian Seward (Dec 14 2020 at 15:25):

As a random comment, all of those SSE load-op instructions require the memory address to be 16-aligned. (I think.)

view this post on Zulip bjorn3 (Dec 14 2020 at 15:26):

Removing the aligned memflag doesn't have any effect on the compiled code.

view this post on Zulip bjorn3 (Dec 14 2020 at 15:50):

Replacing load.f64 with load.i64 + raw_bitcast.f64 fixed this SIGSEGV, but I now get another one.

view this post on Zulip bjorn3 (Dec 14 2020 at 15:54):

This time it seems that there is a problem in the binemit code. The vcode containsmovq 0(%rsi), %xmm0, but the disassembly contains mov (%rsi), %rax

view this post on Zulip bjorn3 (Dec 14 2020 at 15:56):

raw_bitcast is lowered as a simple move which isn't correct for GPR->FPR moves.

view this post on Zulip bjorn3 (Dec 14 2020 at 17:57):

I was able to work around the problem but now I get SIGFPE on divb %dh.

view this post on Zulip Julian Seward (Dec 14 2020 at 18:05):

That would surely be a rerun of the 0x40 problem, no?

view this post on Zulip bjorn3 (Dec 14 2020 at 18:06):

That is possible. I will take a look at the vcode of that function.

view this post on Zulip bjorn3 (Dec 14 2020 at 18:08):

Inst 484:   div     %sil

view this post on Zulip bjorn3 (Dec 14 2020 at 18:08):

I think so.

view this post on Zulip Julian Seward (Dec 14 2020 at 18:12):

This is with the newBE, right? If so probably the easiest thing to do is to add case(s) to the relevant emit_tests.rs and then fiddle round with emit.rs so as to make it work. It's gonna come down to passing a retain-redundant-rex-prefix flag to the low-level emit function (I forget the exact names).

view this post on Zulip bjorn3 (Dec 14 2020 at 18:13):

I am currently testing rex_flags.always_emit() when size == 1 on div instructions.

view this post on Zulip bjorn3 (Dec 14 2020 at 18:15):

That fixed all miscompilations.

view this post on Zulip Julian Seward (Dec 14 2020 at 18:28):

We should audit that stuff (and add test cases). Who knows how many more cases there are.

view this post on Zulip bjorn3 (Dec 14 2020 at 18:31):

I opened https://github.com/bytecodealliance/wasmtime/issues/2507 and https://github.com/bytecodealliance/wasmtime/issues/2508.

GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
For example div %sil requires a rex prefix, otherwise it becomes divb %dh.

view this post on Zulip Chris Fallin (Dec 14 2020 at 18:48):

@bjorn3 thanks for this debugging work! I agree with @Julian Seward that we should be somewhat systematic about auditing behavior of "narrow values"; we have much more confidence in 32/64-bit types because those are exercised by Wasm but there are possibly other bugs in 8/16-bit handling

view this post on Zulip Chris Fallin (Dec 14 2020 at 18:49):

I wonder if there could be a way to fuzz this -- perhaps compare against the CLIF interpreter...

view this post on Zulip bjorn3 (Dec 29 2020 at 10:36):

simple-raytracer compiled in debug mode works fine. In release mode it gives a panic however:

thread 'main' panicked at 'assertion failed: pending_bits <= 8', /home/bjorn/.cargo/registry/src/github.com-1ecc6299db9ec823/deflate-0.7.19/src/huffman_lengths.rs:129:5
stack backtrace:
   0: rust_begin_unwind
             at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/std/src/panicking.rs:568:5
   1: core::panicking::panic_fmt
             at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:92:14
   2: core::panicking::panic
             at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:275:5
   3: deflate::huffman_lengths::stored_padding
   4: deflate::huffman_lengths::gen_huffman_lengths
   5: deflate::compress::compress_data_dynamic_n
   6: <deflate::writer::ZlibEncoder<W> as std::io::Write>::write
   7: std::io::Write::write_all
   8: png::encoder::Writer<W>::write_image_data
   9: image::png::PNGEncoder<W>::encode
  10: image::dynimage::save_buffer_impl
  11: image::dynimage::save_buffer
  12: image::buffer::ImageBuffer<P,Container>::save
  13: raytracer::scene::Scene::render
  14: main::main
  15: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

For cg_clif release mode consists of set opt_level=speed_and_size combined with an optimization in cg_clif that does basic store to load forwarding and dead store elimination.

view this post on Zulip bjorn3 (Dec 30 2020 at 12:20):

It works as of the latest commit.

view this post on Zulip bjorn3 (Dec 30 2020 at 12:24):

Benchmark #1: ./target/release/main
  Time (mean ± σ):      7.932 s ±  0.017 s    [User: 7.925 s, System: 0.006 s]
  Range (min  max):    7.911 s   7.958 s    10 runs

Benchmark #2: ./raytracer_cg_llvm
  Time (mean ± σ):      8.037 s ±  0.013 s    [User: 8.031 s, System: 0.004 s]
  Range (min  max):    8.016 s   8.062 s    10 runs

Summary
  './target/release/main' ran
    1.01 ± 0.00 times faster than './raytracer_cg_llvm'

view this post on Zulip Chris Fallin (Dec 30 2020 at 18:07):

Huh, wow... 1% faster than the LLVM build (or conservatively, "approximately the same", though the confidence intervals don't overlap)... how does this compare to the old backend?

view this post on Zulip bjorn3 (Dec 30 2020 at 18:15):

/me is recompiling cg_clif with the old backend

view this post on Zulip bjorn3 (Dec 30 2020 at 18:29):

Looks like the base of your PR doesn't yet include https://github.com/bytecodealliance/wasmtime/pull/2496. I got a SIGSEGV. I will just use main.

This fixes the cg_clif miscompilation I wrote about at https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/miscompilation.20with.20opt_level.3Dspeed_and_size/near/219461997...

view this post on Zulip bjorn3 (Dec 30 2020 at 18:42):

Benchmark #1: ./raytracer_cg_clif_newbe_debug
  Time (mean ± σ):      8.048 s ±  0.028 s    [User: 8.042 s, System: 0.005 s]
  Range (min  max):    8.007 s   8.113 s    10 runs

Benchmark #2: ./raytracer_cg_clif_newbe_release
  Time (mean ± σ):      7.940 s ±  0.025 s    [User: 7.933 s, System: 0.007 s]
  Range (min  max):    7.903 s   7.984 s    10 runs

Benchmark #3: ./raytracer_cg_clif_oldbe_debug
  Time (mean ± σ):      9.331 s ±  0.043 s    [User: 9.325 s, System: 0.006 s]
  Range (min  max):    9.278 s   9.425 s    10 runs

Benchmark #4: ./raytracer_cg_clif_oldbe_release
  Time (mean ± σ):      7.780 s ±  0.013 s    [User: 7.777 s, System: 0.002 s]
  Range (min  max):    7.756 s   7.794 s    10 runs

Benchmark #5: ./raytracer_cg_llvm
  Time (mean ± σ):      8.056 s ±  0.021 s    [User: 8.052 s, System: 0.003 s]
  Range (min  max):    8.034 s   8.091 s    10 runs

Summary
  './raytracer_cg_clif_oldbe_release' ran
    1.02 ± 0.00 times faster than './raytracer_cg_clif_newbe_release'
    1.03 ± 0.00 times faster than './raytracer_cg_clif_newbe_debug'
    1.04 ± 0.00 times faster than './raytracer_cg_llvm'
    1.20 ± 0.01 times faster than './raytracer_cg_clif_oldbe_debug'

@Chris Fallin debug mode got faster with the newBE, release mode got slower.

view this post on Zulip Chris Fallin (Dec 30 2020 at 18:44):

@bjorn3 interesting, thanks very much for this data!

view this post on Zulip Chris Fallin (Dec 30 2020 at 18:44):

I'm hoping to spend some time finding poor codegen issues in the near-ish future so hopefully we can improve the release-mode perf a bit

view this post on Zulip bjorn3 (Dec 30 2020 at 20:15):

Differential flamegraph between oldBE and newBE: oldbe_newbe_release.diff.svg

view this post on Zulip bjorn3 (Dec 30 2020 at 20:26):

One clear inefficiency I found is:

          000000000003c583 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
          _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
 32,04      push   %rbp
  3,32      mov    %rsp,%rbp
  4,42      cmp    %rsi,%rdi
 39,78      sete   %al
  6,64      movzbl %al,%eax
 13,82      pop    %rbp
           retq

becomes

           000000000022c5a0 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
           _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
 26,45       push   %rbp
  2,92       mov    %rsp,%rbp
             cmp    %rsi,%rdi
 31,49       sete   %sil
  2,47       movzbl %sil,%esi
             mov    %rsi,%rax
  0,85       mov    %rbp,%rsp
 34,30       pop    %rbp
  1,52      retq

This is worse regalloc. The regalloc regression may also be (partially) responsible for the rest of the slowdown compared to oldBE.

view this post on Zulip bjorn3 (Dec 30 2020 at 20:52):

In addition it does an unnecessary mov %rbp, %rsp even when %rsp wasn't modified in the current function at all.

view this post on Zulip Chris Fallin (Dec 31 2020 at 00:32):

@bjorn3 yeah I've seen similar; @Julian Seward had said something earlier about limiting propagation of "preferred registers" in the move-coalescing to just one step, perhaps for regalloc efficiency reasons? Would be good to reconsider that

view this post on Zulip Julian Seward (Dec 31 2020 at 08:28):

I remember mentioning something about incomplete propagation of constraints ("I prefer to be in real reg %r42" etc) in the coalescer. But that's a bug, not a design decision. Maybe I misunderstand?

view this post on Zulip Chris Fallin (Dec 31 2020 at 09:12):

@Julian Seward ah, perhaps I'm just assuming too much intentionality -- had figured there must be a reason for it :-) Agree that full propagation is correct -- hopefully the fix isn't too bad!


Last updated: Nov 22 2024 at 16:03 UTC