See https://github.com/bytecodealliance/wasmtime/pull/2504#issuecomment-744327408. I get some really weird behaviour:
Basically `66 0f 57 06 xorpd (%rsi),%xmm0` gives a SIGSEGV. This is despite the fact that reading from the address in `%rsi` using a debugger works fine. LLDB reports the accessed location as 0 for some reason. I also noticed that the preceding instruction `66 48 0f 6e c0 movq %rax,%xmm0` doesn't cause the value of `%rax` (`0x8000000000000000`) to be loaded into `%xmm0`; it stays 0. If I write a different value to `%xmm0` just before executing the `movq`, the value gets reset to 0 after executing it.
For the record, running it under QEMU user-mode emulation works fine.
Also, when jumping over the first load instruction, the second load doesn't cause a SIGSEGV, but the third does.
@bjorn3 is the value in `%rsi` 16-aligned? I bet it's not.
`0x7ffd802b5678`, nope, it looks like it's not. There is unfortunately no way to specify the alignment of a stack slot, though.
I will try compiling with the `aligned` memflag removed.
As a random comment, all of those SSE load-op instructions require the memory address to be 16-aligned. (I think.)
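A minimal sketch of why this surfaces as SIGSEGV (assuming the usual Intel SDM rules; this is an illustration, not code from the session): legacy-SSE packed instructions with a memory operand, such as `xorpd (%rsi), %xmm0`, fault unless the address is 16-byte aligned, while `movupd` (or a load through a GPR) has no alignment requirement.

```rust
// The faulting address from above is indeed not 16-byte aligned, which is
// enough for the fused load-op form of xorpd to raise #GP (seen as SIGSEGV).
fn main() {
    let rsi: u64 = 0x7ffd_802b_5678;
    assert_ne!(rsi % 16, 0); // 0x...5678 % 16 == 8: misaligned for xorpd m128
}
```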
Removing the `aligned` memflag doesn't have any effect on the compiled code.
Replacing `load.f64` with `load.i64` + `raw_bitcast.f64` fixed this SIGSEGV, but I now get another one.
This time it seems that there is a problem in the binemit code. The vcode contains `movq 0(%rsi), %xmm0`, but the disassembly contains `mov (%rsi), %rax`.
`raw_bitcast` is lowered as a simple move, which isn't correct for GPR->FPR moves.
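A sketch of the bug just described (the names here are illustrative, not Cranelift's actual lowering API): a bitcast lowered as a register move has to choose the move instruction based on the register classes of source and destination, since a plain `mov` cannot cross the GPR/XMM boundary.

```rust
#[derive(Clone, Copy)]
enum RegClass {
    Gpr,
    Xmm,
}

// Pick the x86-64 move mnemonic for a raw bitcast between register classes.
fn move_opcode(from: RegClass, to: RegClass) -> &'static str {
    match (from, to) {
        (RegClass::Gpr, RegClass::Gpr) => "mov",    // plain integer move
        (RegClass::Gpr, RegClass::Xmm) => "movq",   // 66 REX.W 0F 6E: GPR -> XMM
        (RegClass::Xmm, RegClass::Gpr) => "movq",   // 66 REX.W 0F 7E: XMM -> GPR
        (RegClass::Xmm, RegClass::Xmm) => "movaps", // XMM -> XMM
    }
}

fn main() {
    // The miscompile above: a GPR -> FPR bitcast must not emit a plain `mov`.
    assert_eq!(move_opcode(RegClass::Gpr, RegClass::Xmm), "movq");
}
```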
I was able to work around the problem, but now I get a SIGFPE on `divb %dh`.
That would surely be a rerun of the 0x40 problem, no?
That is possible. I will take a look at the vcode of that function.
`Inst 484: div %sil`
I think so.
This is with the newBE, right? If so, probably the easiest thing to do is to add case(s) to the relevant `emit_tests.rs` and then fiddle around with `emit.rs` so as to make it work. It's gonna come down to passing a retain-redundant-REX-prefix flag to the low-level emit function (I forget the exact names).
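To spell out the 0x40 problem (encodings per the Intel SDM; an illustration, not emitter code): in a byte-sized `div`, register number 6 means `%dh` when no REX prefix is present, but `%sil` when one is. Dropping a "redundant" empty REX prefix (0x40) therefore silently changes which register the instruction divides by.

```rust
fn main() {
    let div_sil: [u8; 3] = [0x40, 0xF6, 0xF6]; // div %sil (empty REX retained)
    let div_dh: [u8; 2] = [0xF6, 0xF6];        // div %dh  (no REX prefix)
    // Stripping the 0x40 byte leaves an encoding that addresses %dh instead,
    // matching the `divb %dh` SIGFPE seen above.
    assert_eq!(&div_sil[1..], &div_dh[..]);
}
```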
I am currently testing `rex_flags.always_emit()` when `size == 1` on div instructions.
That fixed all miscompilations.
We should audit that stuff (and add test cases). Who knows how many more cases there are.
I opened https://github.com/bytecodealliance/wasmtime/issues/2507 and https://github.com/bytecodealliance/wasmtime/issues/2508.
@bjorn3 thanks for this debugging work! I agree with @Julian Seward that we should be somewhat systematic about auditing behavior of "narrow values"; we have much more confidence in 32/64-bit types because those are exercised by Wasm, but there are possibly other bugs in 8/16-bit handling.
I wonder if there could be a way to fuzz this -- perhaps compare against the CLIF interpreter...
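A hypothetical shape for that differential check (both "engines" below are stand-ins; the real comparison would run the CLIF interpreter against Cranelift-compiled code): for narrow types, the whole input space is small enough to enumerate exhaustively.

```rust
// Reference semantics for 8-bit division (None on divide-by-zero).
fn reference(a: u8, b: u8) -> Option<u8> {
    a.checked_div(b)
}

// Stand-in for the compiled artifact under test; imagine this is JIT output.
fn under_test(a: u8, b: u8) -> Option<u8> {
    a.checked_div(b)
}

fn main() {
    // 8-bit operands: all 65536 input pairs can be checked exhaustively.
    for a in 0..=u8::MAX {
        for b in 0..=u8::MAX {
            assert_eq!(reference(a, b), under_test(a, b), "divergence at {a}/{b}");
        }
    }
}
```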
simple-raytracer compiled in debug mode works fine. In release mode, however, it panics:
thread 'main' panicked at 'assertion failed: pending_bits <= 8', /home/bjorn/.cargo/registry/src/github.com-1ecc6299db9ec823/deflate-0.7.19/src/huffman_lengths.rs:129:5
stack backtrace:
0: rust_begin_unwind
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/std/src/panicking.rs:568:5
1: core::panicking::panic_fmt
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:92:14
2: core::panicking::panic
at /home/bjorn/Documenten/cg_clif/build_sysroot/sysroot_src/library/core/src/panicking.rs:275:5
3: deflate::huffman_lengths::stored_padding
4: deflate::huffman_lengths::gen_huffman_lengths
5: deflate::compress::compress_data_dynamic_n
6: <deflate::writer::ZlibEncoder<W> as std::io::Write>::write
7: std::io::Write::write_all
8: png::encoder::Writer<W>::write_image_data
9: image::png::PNGEncoder<W>::encode
10: image::dynimage::save_buffer_impl
11: image::dynimage::save_buffer
12: image::buffer::ImageBuffer<P,Container>::save
13: raytracer::scene::Scene::render
14: main::main
15: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
For cg_clif, release mode consists of `set opt_level=speed_and_size` combined with an optimization in cg_clif that does basic store-to-load forwarding and dead-store elimination.
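For illustration, a toy sketch of those two optimizations (assuming a single basic block and non-aliasing slots; cg_clif's actual pass works on CLIF stack slots, not this miniature IR):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Inst {
    Store { slot: u32, val: i64 },
    Load { slot: u32 },
    Const { val: i64 }, // what a forwarded load is rewritten into
}

fn optimize(insts: Vec<Inst>) -> Vec<Inst> {
    let mut out: Vec<Option<Inst>> = insts.into_iter().map(Some).collect();
    // slot -> (index of the last store, its value, whether it has been read)
    let mut last: HashMap<u32, (usize, i64, bool)> = HashMap::new();
    for i in 0..out.len() {
        match out[i] {
            Some(Inst::Store { slot, val }) => {
                // Dead-store elimination: the previous store to this slot was
                // overwritten before anything loaded it.
                if let Some((prev, _, read)) = last.insert(slot, (i, val, false)) {
                    if !read {
                        out[prev] = None;
                    }
                }
            }
            Some(Inst::Load { slot }) => {
                // Store-to-load forwarding: replace the load with the value.
                if let Some(entry) = last.get_mut(&slot) {
                    entry.2 = true;
                    out[i] = Some(Inst::Const { val: entry.1 });
                }
            }
            _ => {}
        }
    }
    out.into_iter().flatten().collect()
}

fn main() {
    let before = vec![
        Inst::Store { slot: 0, val: 1 }, // dead: overwritten before any load
        Inst::Store { slot: 0, val: 2 },
        Inst::Load { slot: 0 },          // forwarded to Const { val: 2 }
    ];
    assert_eq!(
        optimize(before),
        vec![Inst::Store { slot: 0, val: 2 }, Inst::Const { val: 2 }]
    );
}
```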
It works as of the latest commit.
Benchmark #1: ./target/release/main
Time (mean ± σ): 7.932 s ± 0.017 s [User: 7.925 s, System: 0.006 s]
Range (min … max): 7.911 s … 7.958 s 10 runs
Benchmark #2: ./raytracer_cg_llvm
Time (mean ± σ): 8.037 s ± 0.013 s [User: 8.031 s, System: 0.004 s]
Range (min … max): 8.016 s … 8.062 s 10 runs
Summary
'./target/release/main' ran
1.01 ± 0.00 times faster than './raytracer_cg_llvm'
Huh, wow... 1% faster than the LLVM build (or conservatively, "approximately the same", though the confidence intervals don't overlap)... how does this compare to the old backend?
/me is recompiling cg_clif with the old backend
Looks like the base of your PR doesn't yet include https://github.com/bytecodealliance/wasmtime/pull/2496. I got a SIGSEGV. I will just use main.
Benchmark #1: ./raytracer_cg_clif_newbe_debug
Time (mean ± σ): 8.048 s ± 0.028 s [User: 8.042 s, System: 0.005 s]
Range (min … max): 8.007 s … 8.113 s 10 runs
Benchmark #2: ./raytracer_cg_clif_newbe_release
Time (mean ± σ): 7.940 s ± 0.025 s [User: 7.933 s, System: 0.007 s]
Range (min … max): 7.903 s … 7.984 s 10 runs
Benchmark #3: ./raytracer_cg_clif_oldbe_debug
Time (mean ± σ): 9.331 s ± 0.043 s [User: 9.325 s, System: 0.006 s]
Range (min … max): 9.278 s … 9.425 s 10 runs
Benchmark #4: ./raytracer_cg_clif_oldbe_release
Time (mean ± σ): 7.780 s ± 0.013 s [User: 7.777 s, System: 0.002 s]
Range (min … max): 7.756 s … 7.794 s 10 runs
Benchmark #5: ./raytracer_cg_llvm
Time (mean ± σ): 8.056 s ± 0.021 s [User: 8.052 s, System: 0.003 s]
Range (min … max): 8.034 s … 8.091 s 10 runs
Summary
'./raytracer_cg_clif_oldbe_release' ran
1.02 ± 0.00 times faster than './raytracer_cg_clif_newbe_release'
1.03 ± 0.00 times faster than './raytracer_cg_clif_newbe_debug'
1.04 ± 0.00 times faster than './raytracer_cg_llvm'
1.20 ± 0.01 times faster than './raytracer_cg_clif_oldbe_debug'
@Chris Fallin debug mode got faster with the newBE; release mode got slower.
@bjorn3 interesting, thanks very much for this data!
I'm hoping to spend some time finding poor codegen issues in the near-ish future so hopefully we can improve the release-mode perf a bit
Differential flamegraph between oldBE and newBE: oldbe_newbe_release.diff.svg
One clear inefficiency I found is:
│ 000000000003c583 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
│ _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
32,04 │ push %rbp
3,32 │ mov %rsp,%rbp
4,42 │ cmp %rsi,%rdi
39,78 │ sete %al
6,64 │ movzbl %al,%eax
13,82 │ pop %rbp
│ ← retq
becomes
│ 000000000022c5a0 <core::ptr::const_ptr::<impl *const T>::guaranteed_eq>:
│ _ZN4core3ptr9const_ptr33_$LT$impl$u20$$BP$const$u20$T$GT$13guaranteed_eq17h8751da776ec0026eE():
26,45 │ push %rbp
2,92 │ mov %rsp,%rbp
│ cmp %rsi,%rdi
31,49 │ sete %sil
2,47 │ movzbl %sil,%esi
│ mov %rsi,%rax
0,85 │ mov %rbp,%rsp
34,30 │ pop %rbp
1,52 │ ← retq
This is worse regalloc. The regalloc regression may also be (partially) responsible for the rest of the slowdown compared to oldBE.
In addition, it does an unnecessary `mov %rbp, %rsp` even when `%rsp` wasn't modified in the current function at all.
@bjorn3 yeah I've seen similar; @Julian Seward had said something earlier about limiting propagation of "preferred registers" in the move-coalescing to just one step, perhaps for regalloc efficiency reasons? Would be good to reconsider that
I remember mentioning something about incomplete propagation of constraints ("I prefer to be in real reg %r42" etc) in the coalescer. But that's a bug, not a design decision. Maybe I misunderstand?
@Julian Seward ah, perhaps I'm just assuming too much intentionality -- had figured there must be a reason for it :-) Agree that full propagation is correct -- hopefully the fix isn't too bad!