Because LICM has finally been fixed with jump tables, I tried to use it for cg_clif when enabling optimizations. It compiled fine and basic programs work, but simple-raytracer gave a SIGSEGV once and panics the rest of the time. According to valgrind there are several invalid reads at different locations, all of memory that was freed at the same place. I will investigate further today or tomorrow.
==21057== Invalid read of size 1
==21057== at 0x8BF885: core::str::validations::run_utf8_validation (validations.rs:143)
==21057== by 0x8BF368: core::str::converts::from_utf8 (converts.rs:86)
==21057== by 0x6FCCCF: std::sys_common::os_str_bytes::Slice::to_str (os_str_bytes.rs:165)
==21057== by 0x6FB6BC: std::ffi::os_str::OsStr::to_str (os_str.rs:543)
==21057== by 0x28351F: image::dynimage::save_buffer_impl::{{closure}} (dynimage.rs:787)
==21057== by 0x33F3AC: core::option::Option<T>::and_then (option.rs:692)
==21057== by 0x2828B7: image::dynimage::save_buffer_impl (dynimage.rs:786)
==21057== by 0x1497E6: image::dynimage::save_buffer (dynimage.rs:775)
==21057== by 0x148382: image::buffer::ImageBuffer<P,Container>::save (buffer.rs:441)
==21057== by 0x147C51: raytracer::scene::Scene::render (scene.rs:42)
==21057== by 0x143FBF: main::main (main.rs:138)
==21057== by 0x14220B: core::ops::function::FnOnce::call_once (function.rs:227)
==21057== Address 0x4c10067 is 39 bytes inside a block of size 552 free'd
==21057== at 0x48369AB: free (vg_replace_malloc.c:530)
==21057== by 0x4ABB880: fclose@@GLIBC_2.2.5 (iofclose.c:77)
==21057== by 0x48AAE1C: pthread_getattr_np (pthread_getattr_np.c:161)
==21057== by 0x75D4D1: std::sys::unix::thread::guard::get_stack_start (thread.rs:288)
==21057== by 0x75D8A0: std::sys::unix::thread::guard::get_stack_start_aligned (thread.rs:305)
==21057== by 0x75DA33: std::sys::unix::thread::guard::init (thread.rs:336)
==21057== by 0x7465D0: std::rt::lang_start_internal (rt.rs:37)
==21057== by 0x145143: std::rt::lang_start (rt.rs:65)
==21057== by 0x14429C: main (in /home/bjorn/Documenten/cg_clif3/simple-raytracer/raytracer_cg_clif)
==21057== Block was alloc'd at
==21057== at 0x483577F: malloc (vg_replace_malloc.c:299)
==21057== by 0x4ABC0FA: __fopen_internal (iofopen.c:65)
==21057== by 0x48AACCA: pthread_getattr_np (pthread_getattr_np.c:81)
==21057== by 0x75D4D1: std::sys::unix::thread::guard::get_stack_start (thread.rs:288)
==21057== by 0x75D8A0: std::sys::unix::thread::guard::get_stack_start_aligned (thread.rs:305)
==21057== by 0x75DA33: std::sys::unix::thread::guard::init (thread.rs:336)
==21057== by 0x7465D0: std::rt::lang_start_internal (rt.rs:37)
==21057== by 0x145143: std::rt::lang_start (rt.rs:65)
==21057== by 0x14429C: main (in /home/bjorn/Documenten/cg_clif3/simple-raytracer/raytracer_cg_clif)
@bjorn3 One thing you could do is back out the LICM fix, but leave LICM enabled. Then there are several possible scenarios.
I too would like to know whether this fix is really OK, since I'm trying to stabilise a branch of CL to ship in Firefox.
Before the LICM fix, opt_level=speed and opt_level=speed_and_size would simply panic when compiling. The LICM fix is what makes it possible to enable optimizations at all. I am pretty sure that the LICM fix didn't cause the issue.
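(For reference, a minimal sketch of how those opt levels are selected through cranelift-codegen's settings API — my own illustration, not code from this thread:)

use cranelift_codegen::settings::{self, Configurable};

fn main() {
    // "opt_level" accepts "none", "speed" and "speed_and_size"; the two
    // optimized levels are the ones that used to panic before the LICM fix.
    let mut builder = settings::builder();
    builder.set("opt_level", "speed").unwrap();
    let flags = settings::Flags::new(builder);
    assert!(matches!(flags.opt_level(), settings::OptLevel::Speed));
}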
@bjorn3 thanks for this -- eek. In addition to the questions from @Julian Seward above, it should be possible to instrument the licm pass with println's to work out where it's running; the CLIF for all functions on which LICM did any code motion would be super-helpful for us to debug
@Chris Fallin hi! But note also, this doesn't actually imply that LICM is buggy. It's just "some part of the optimising pipeline".
yep, indeed
@bjorn3 another thing to try would be to enable optimizations but explicitly remove LICM
then we avoid the panic that hid this before but hopefully expose whatever is causing the issue
I normally debug miscompilations by removing code and adding println!() to show where the corruption happens.
I had wondered from the backtraces if this was some kind of threading problem. And in particular whether supposedly atomic loads/stores are not being labelled correctly (or some such) by @bjorn3's front end, and hence are getting moved around (GVNd, LICMd) when they shouldn't.
I use plain loads/stores protected by a global mutex. I am using an empty MemFlags, so GVN and LICM should leave them alone. In addition, the mutex lock/unlock should itself act as a fence, since it is a call instruction.
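(For illustration — a minimal sketch of such a load through cranelift-codegen's public API; emit_load and the i64 type are placeholders of mine, not actual cg_clif code:)

use cranelift_codegen::ir::{types, InstBuilder, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

// MemFlags::new() sets no flag bits: the load is not marked `notrap`,
// `aligned` or `readonly`, so optimization passes must assume it can
// trap and can alias other memory operations.
fn emit_load(builder: &mut FunctionBuilder, addr: Value) -> Value {
    builder.ins().load(types::I64, MemFlags::new(), addr, 0)
}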
The pthread_getattr_np call is used by libstd as part of adding a stack guard page. It is done at the start of all threads before passing execution to the user. This includes the main thread.
This looks to me more like a standard use-after-free (file read tries to read FILE* after fclose), maybe from a control-flow corruption somewhere?
I'll build a full Fx with this fix in, so I can test it with really large inputs (eg Google Earth), but per @bjorn3's comments above, it seems like the fix itself isn't at fault.
@bjorn3 I assume this is on x64; is this the new or old BE?
A pointer to the block alloc'd inside libc shouldn't have been passed to user code, even if it got corrupted. I think it is more likely that somewhere a pointer is created that just so happens to overlap with the malloc'd block.
x86_64 with the old backend. The new backend doesn't support 128bit ints yet.
(tangent, that's on my roadmap and I hope to fix it soon :-) )
Reduced to:

fn main() {
    let ext = std::path::Path::new("result.png").extension().unwrap().to_str();
    assert_eq!(ext, Some("png"));
}
This doesn't SIGSEGV anymore, but ext incorrectly becomes None.
Printing the extension with println!() before the .to_str() call shows varying garbage bytes.
Valgrind doesn't trigger anymore though.
Does V complain even with the println! in place?
It doesn't complain with or without println!().
But you get different garbage bytes on different runs, or are they always the same?
Different bytes
Path::new("result.png").as_os_str().as_bytes()
doesn't show any corruption. Path::new("result.png").extension().unwrap().as_bytes()
does.
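(Spelled out as a runnable probe for anyone following along — my expansion, not code from the thread:)

use std::os::unix::ffi::OsStrExt;
use std::path::Path;

fn main() {
    // On a correct build both lines print clean ASCII byte slices;
    // under the miscompile only the extension() result showed garbage.
    println!("{:?}", Path::new("result.png").as_os_str().as_bytes());
    println!("{:?}", Path::new("result.png").extension().unwrap().as_bytes());
}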
@bjorn3 as a side thought, can you work around the lack of 128 bit int support in the newBE by just translating it all into 64 bit arithmetic at your level?
Just on the basis that then you'd be using the pipeline under active development.
That is hard. cg_clif expects that all primitive Rust types correspond to Cranelift types. I do translate some 128-bit operations to 64-bit operations, but I really need one Rust scalar to correspond to one Value.
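(For a flavour of the splitting involved — a minimal sketch in plain Rust, not cg_clif code: a 128-bit add becomes two 64-bit adds plus a carry, which already requires two values per Rust scalar.)

fn add_u128_via_u64(lo_a: u64, hi_a: u64, lo_b: u64, hi_b: u64) -> (u64, u64) {
    // Add the low halves, then propagate the carry into the high halves.
    let (lo, carry) = lo_a.overflowing_add(lo_b);
    let hi = hi_a.wrapping_add(hi_b).wrapping_add(carry as u64);
    (lo, hi)
}

fn main() {
    let (a, b) = (1u128 << 64, u64::MAX as u128);
    let (lo, hi) = add_u128_via_u64(a as u64, (a >> 64) as u64, b as u64, (b >> 64) as u64);
    assert_eq!(((hi as u128) << 64) | lo as u128, a + b);
}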
I reduced it to:

use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

fn os_str_as_u8_slice(s: &OsStr) -> &[u8] {
    unsafe { &*(s as *const OsStr as *const [u8]) }
}

unsafe fn u8_slice_as_os_str(s: &[u8]) -> &OsStr {
    // SAFETY: see the comment of `os_str_as_u8_slice`
    unsafe { &*(s as *const [u8] as *const OsStr) }
}

fn split_file_at_dot(file: &OsStr) -> (Option<&OsStr>, Option<&OsStr>) {
    if os_str_as_u8_slice(file) == b".." {
        return (Some(file), None);
    }

    // The unsafety here stems from converting between &OsStr and &[u8]
    // and back. This is safe to do because (1) we only look at ASCII
    // contents of the encoding and (2) new &OsStr values are produced
    // only from ASCII-bounded slices of existing &OsStr values.
    let mut iter = os_str_as_u8_slice(file).rsplitn(2, |b| *b == b'.');
    let after = iter.next();
    let before = iter.next();
    if before == Some(b"") {
        (Some(file), None)
    } else {
        unsafe { (before.map(|s| u8_slice_as_os_str(s)), after.map(|s| u8_slice_as_os_str(s))) }
    }
}

fn main() {
    let ext = Some(OsStr::new("result.png")).map(split_file_at_dot).and_then(|(_, after)| after).unwrap();
    assert_eq!(ext.as_bytes(), b"png");
}
It is very sensitive to changes now.
Removed all unsafe code and the OsStr dependency:
fn main() {
    let ext = Some((Some(b"" as &[u8]), Some(b"png" as &[u8]))).and_then(|(_, after)| after).unwrap();
    assert_eq!(ext, b"png");
}
No unwrap:
fn main() {
    let ext = Some((Some(b"" as &[u8]), Some(b"png" as &[u8]))).and_then(|(_, after)| after);
    assert_eq!(ext, Some(b"png" as &[u8]));
}
I can't find any relevant changes in Option::and_then between unoptimized and optimized. Only icmp + brnz -> ifcmp + brif, folding some pointer arithmetic into memory operations, rex prefix removal and some regalloc changes.
I noticed that on multiple occasions GVN resulted in worse regalloc, as it merges multiple stack_addr instructions, which then need to be spilled. Still searching for a relevant change.
I found the optimization pass that causes the problem: shrink_instructions.
I was trying to diff a normalized disassembly of the correct and miscompiled binaries. I found the following:
-push %rdi
+(bad)
+stc
And a little bit later:
-dec %ebx
+callq *%rsi
Are you sure you're disassembling as if it was 64-bit, not 32-bit? I ask because IIRC dec %ebx in 32-bit code can be interpreted as a REX prefix in 64-bit code. (The byte 0x4B decodes as dec %ebx in 32-bit mode but as a REX.WXB prefix in 64-bit mode.)
I use objdump -d.
Excerpt from _ZN4core6option15Option$LT$T$GT$8and_then17hde19356b2b9e6feeE:
182c0: 44 8b bc 24 3c 00 00 mov 0x3c(%rsp),%r15d
182c7: 00
182c8: 41 0f b6 c7 movzbl %r15b,%eax
182cc: 85 c0 test %eax,%eax
182ce: 74 ce je 1829e <_ZN4core6option15Option$LT$T$GT$8and_then17hde19356b2b9e6feeE+0xd8>
182d0: eb cc jmp 1829e <_ZN4core6option15Option$LT$T$GT$8and_then17hde19356b2b9e6feeE+0xd8>
182d2: 39 ff cmp %edi,%edi
182d4: ff (bad)
182d5: ff 5c ff ff lcall *-0x1(%rdi,%rdi,8)
182d9: ff .byte 0xff
Err, that is just padding I think.
@Julian Seward xor %dh,%dh and xor %sil,%sil zero different registers, right? (the x86 register overlapping is a bit confusing)
Well, those are different both because they reference different underlying regs (rdx vs rsi) and because they reference different bit indices (15:8 vs 7:0).
I think it's the case that if you add an 0x40 prefix to the first then you get the second.
That is to say .. this is one of those few places where a REX prefix of 0x40 isn't redundant.
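(Concretely: xor %dh,%dh encodes as 30 F6; prepending the otherwise-empty REX prefix gives 40 30 F6, i.e. xor %sil,%sil — once any REX prefix is present, 8-bit register encodings 4-7 select spl/bpl/sil/dil instead of ah/ch/dh/bh.)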
Ok, thanks. It is [RexOp1u_id_z#30,%rsi] v53 = iconst.i8 0 before shrinking and [Op1u_id_z#30,%rsi] v53 = iconst.i8 0 after. Are there any other exceptions for this recipe where the rex prefix needs to be preserved?
Pass. I know close to zero about the oldBE.
If anyone can tell you, it might be @Andrew Brown.
To phrase it differently: are there any other cases where xor needs the rex prefix?
This isn't specifically about xor. It relates to any basic integer insn that involves an 8-bit register.
I guess I could try to disallow the REX-less version for all 8-bit integer operations.
Opened https://github.com/bytecodealliance/wasmtime/pull/2496 with a fix.
Nice find -- thanks for debugging this!