Mrmaxmeier opened issue #4000:
Hey,
I'm seeing crashes during `finalize_definitions` calls related to x86_64 `call` relocations:

```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', cranelift/jit/src/compiled_blob.rs:55:80
```
Cranelift emits 32-bit relocations for calls on x86_64, and thus can "only" address targets within a relative ±2GB range. Code memory is allocated with the normal system allocator, which might place different allocations in distant parts of the address space.
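For illustration, here is a simplified sketch (not the literal `compiled_blob.rs` code) of the patching step that fails: the relocation stores a 32-bit PC-relative displacement, and narrowing the full 64-bit distance back down is what raises `TryFromIntError` once the call site and its target end up more than 2GB apart.

```rust
// Simplified sketch of patching an X86CallPCRel4-style relocation; the exact
// cranelift-jit code differs, but the failing conversion is the same idea.
fn patch_pcrel4(call_site: *mut u8, target: *const u8, addend: i64) {
    let distance = (target as i64)
        .wrapping_add(addend)
        .wrapping_sub(call_site as i64);
    // This conversion panics with TryFromIntError when |distance| >= 2^31,
    // i.e. when the two allocations are further than ±2GB apart.
    let distance = i32::try_from(distance).unwrap();
    unsafe { std::ptr::write_unaligned(call_site as *mut i32, distance) };
}
```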
I'm seeing irregular crashes in a heavily multithreaded program, but the problem can be reproduced with this abridged `jit-minimal.rs` example:

```rust
use cranelift::prelude::*;
use cranelift_codegen::settings;
use cranelift_jit::{JITBuilder, JITModule};
use cranelift_module::{default_libcall_names, Linkage, Module};

fn main() {
    let isa_builder = cranelift_native::builder().unwrap();
    let isa = isa_builder
        .finish(settings::Flags::new(settings::builder()))
        .unwrap();
    let mut m = JITModule::new(JITBuilder::with_isa(isa, default_libcall_names()));
    let mut ctx = m.make_context();
    let mut func_ctx = FunctionBuilderContext::new();

    let func_a = m
        .declare_function("a", Linkage::Local, &m.make_signature())
        .unwrap();
    let func_b = m
        .declare_function("b", Linkage::Local, &m.make_signature())
        .unwrap();

    // Define a dummy function `func_a`
    ctx.func.name = ExternalName::user(0, func_a.as_u32());
    {
        let mut bcx: FunctionBuilder = FunctionBuilder::new(&mut ctx.func, &mut func_ctx);
        let block = bcx.create_block();
        bcx.switch_to_block(block);
        bcx.ins().return_(&[]);
        bcx.seal_all_blocks();
        bcx.finalize();
    }
    m.define_function(func_a, &mut ctx).unwrap();
    m.clear_context(&mut ctx);

    // Allocate a bunch (~4GB) to stretch address space
    let mut allocations: Vec<Vec<u8>> = Vec::new();
    for _ in 0..999999 {
        allocations.push(Vec::with_capacity(4096));
    }

    // Define `func_b` in a new allocation and reference `func_a`
    ctx.func.name = ExternalName::user(0, func_b.as_u32());
    {
        let mut bcx: FunctionBuilder = FunctionBuilder::new(&mut ctx.func, &mut func_ctx);
        let block = bcx.create_block();
        bcx.switch_to_block(block);
        let local_func = m.declare_func_in_func(func_a, &mut bcx.func);
        // Emit a call with a relocation for func_a
        bcx.ins().call(local_func, &[]);
        // Make sure that this function's body is larger than page_size and will require a new allocation.
        for _ in 0..1024 {
            bcx.ins().call(local_func, &[]);
        }
        bcx.ins().return_(&[]);
        bcx.seal_all_blocks();
        bcx.finalize();
    }
    m.define_function(func_b, &mut ctx).unwrap();
    m.clear_context(&mut ctx);

    // Perform linking
    m.finalize_definitions();
}
```
It might be possible to trigger this from small-ish WebAssembly modules via glibc's mmap threshold, which places allocations larger than 128 KiB outside the heap, though I haven't had any luck reproducing that because glibc's dynamic threshold scaling raises the limit before code is emitted.
Possible approaches:
- Determine the total size of the finalized code pages before allocating; allocate one large chunk. It seems like an implementation of this should be doable, though I'm not sure if the current behavior is by design. (This would be incompatible with features like hot function replacement.)
- Don't allocate on the heap. Cranelift's `selinux-fix` feature uses mmap allocations. The underlying issue still persists, but since mmap allocations are separate from the heap, they're mostly sequential and would need >2GB of generated machine code to cause problems.
- (Change relocation style? There's no 64-bit relative jump on x86_64, and blowing up code size for this seems like a bad idea.)
AArch64 runs into a related issue with 26-bit relative jumps: https://github.com/bytecodealliance/wasmtime/issues/3277
I'm not sure veneers are applicable to x86_64, but they seem like an interesting and more general approach to relative jump range limits.
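For context, a hypothetical x86_64 veneer could be a small stub placed within ±2GB of the caller that forwards to an arbitrary 64-bit target, so only the stub has to be reachable by the near call. A sketch of such a stub (this is not existing Cranelift code):

```rust
// Hypothetical 13-byte far-jump veneer: the near `call` targets this stub,
// and the stub jumps on to the real (arbitrarily distant) address.
fn write_far_jump_veneer(buf: &mut [u8; 13], target: u64) {
    // movabs r11, target    -- 49 BB imm64 (r11 is a scratch register in the SysV ABI)
    buf[0] = 0x49;
    buf[1] = 0xBB;
    buf[2..10].copy_from_slice(&target.to_le_bytes());
    // jmp r11               -- 41 FF E3
    buf[10] = 0x41;
    buf[11] = 0xFF;
    buf[12] = 0xE3;
}
```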
Mrmaxmeier labeled issue #4000:
Mrmaxmeier labeled issue #4000:
bjorn3 commented on issue #4000:
> Don't allocate on the heap. Cranelift's `selinux-fix` feature uses mmap allocations. The underlying issue still persists, but since mmap allocations are separate from the heap, they're mostly sequential and would need >2GB of generated machine code to cause problems.
I think this is the best fix, possibly in combination with reserving the full 2GB as `PROT_NONE`. Allowing the GOT to be split between each such 2GB chunk should also allow more code to be used when PIC is enabled.
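A minimal sketch of that idea, assuming a Unix target and the `libc` crate (the helper names here are made up for illustration): reserve one contiguous region up front with `PROT_NONE` so all code allocations are carved out of it and stay within ±2GB of each other, and commit pages with `mprotect` only as they are actually used.

```rust
use std::ptr;

// Reserve 2GB of address space without committing any memory.
const RESERVATION: usize = 2 << 30;

unsafe fn reserve_code_region() -> *mut u8 {
    let base = libc::mmap(
        ptr::null_mut(),
        RESERVATION,
        libc::PROT_NONE, // reserved, not yet usable
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(base, libc::MAP_FAILED);
    base as *mut u8
}

// Commit a page-aligned slice of the reservation when code is actually placed there.
unsafe fn commit(base: *mut u8, offset: usize, len: usize) -> *mut u8 {
    let addr = base.add(offset);
    let rc = libc::mprotect(addr.cast(), len, libc::PROT_READ | libc::PROT_WRITE);
    assert_eq!(rc, 0);
    addr // later flipped to PROT_READ | PROT_EXEC before execution
}
```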
cfallin commented on issue #4000:
This is covered, I think, by the `colocated` flag on external function definitions: the intent is to denote that a function is in the same module (hence can use near calls) or elsewhere (hence needs an absolute 64-bit relocation). This flag in the `ExtFuncData` controls which kind of call is generated. It looks like this may not be surfaced in the `JITModule` API; we'd be happy to take a PR to fix that if so!
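To make that concrete, here is a hedged sketch (building on the repro above, not on an existing `JITModule` convenience API) of constructing the function reference by hand so the flag is visible; `declare_func_in_func` normally fills in the `ExtFuncData` for you based on linkage.

```rust
use cranelift_codegen::ir::{ExtFuncData, ExternalName, InstBuilder};
use cranelift_frontend::FunctionBuilder;
use cranelift_jit::JITModule;
use cranelift_module::{FuncId, Module};

// Hypothetical helper: reference `callee` with `colocated: false`, asking the
// backend for an absolute (far) call instead of a ±2GB PC-relative one.
fn call_far(m: &JITModule, bcx: &mut FunctionBuilder<'_>, callee: FuncId) {
    let sig = bcx.func.import_signature(m.make_signature());
    let func_ref = bcx.func.import_function(ExtFuncData {
        name: ExternalName::user(0, callee.as_u32()),
        signature: sig,
        colocated: false, // "may be far away": don't assume a near call reaches it
    });
    bcx.ins().call(func_ref, &[]);
}
```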
bjorn3 commented on issue #4000:
That is not the problem here. The problem is that a function and the GOT or PLT it accesses may end up more than 2GB from each other due to memory fragmentation. All calls already go through the GOT and PLT anyway, so as long as those are within 2GB it doesn't matter where the function is, independent of the `colocated` flag.
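For illustration, the shape of such a PLT entry and GOT slot on x86_64 (a sketch of the general technique, not the literal cranelift-jit encoding): the call reaches a nearby PLT stub, which jumps through a GOT slot holding a full 64-bit address, so only the stub-to-slot distance is range-limited.

```rust
// PLT entry: an indirect jump through a RIP-relative GOT slot.
// `disp` is the signed distance from the end of this 6-byte instruction
// to the GOT slot, and is the only part limited to ±2GB.
fn write_plt_entry(plt: &mut [u8; 6], disp: i32) {
    plt[0] = 0xFF; // jmp qword ptr [rip + disp32]
    plt[1] = 0x25;
    plt[2..6].copy_from_slice(&disp.to_le_bytes());
}

// GOT slot: a plain 64-bit absolute address, patched at finalize time,
// so the callee itself can live anywhere in the address space.
fn write_got_slot(got: &mut [u8; 8], target: u64) {
    got.copy_from_slice(&target.to_le_bytes());
}
```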