Stream: git-wasmtime

Topic: wasmtime / issue #4000 Cranelift: JIT relocations depend ...


view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 13:44):

Mrmaxmeier opened issue #4000:

Hey,

I'm seeing crashes during finalize_definitions calls related to x86_64 call relocations:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', cranelift/jit/src/compiled_blob.rs:55:80

Cranelift emits 32-bit relocations for calls on x86_64, so calls can "only" reach targets within a relative ±2GB range. Code memory is allocated with the normal system allocator, which may place different allocations in distant parts of the address space.
I'm seeing irregular crashes in a heavily multithreaded program, but the problem can be reproduced with this abridged jit-minimal.rs example:

use cranelift::prelude::*;
use cranelift_codegen::settings;
use cranelift_jit::{JITBuilder, JITModule};
use cranelift_module::{default_libcall_names, Linkage, Module};

fn main() {
    let isa_builder = cranelift_native::builder().unwrap();
    let isa = isa_builder
        .finish(settings::Flags::new(settings::builder()))
        .unwrap();

    let mut m = JITModule::new(JITBuilder::with_isa(isa, default_libcall_names()));

    let mut ctx = m.make_context();
    let mut func_ctx = FunctionBuilderContext::new();

    let func_a = m
        .declare_function("a", Linkage::Local, &m.make_signature())
        .unwrap();
    let func_b = m
        .declare_function("b", Linkage::Local, &m.make_signature())
        .unwrap();

    // Define a dummy function `func_a`
    ctx.func.name = ExternalName::user(0, func_a.as_u32());
    {
        let mut bcx: FunctionBuilder = FunctionBuilder::new(&mut ctx.func, &mut func_ctx);
        let block = bcx.create_block();

        bcx.switch_to_block(block);
        bcx.ins().return_(&[]);
        bcx.seal_all_blocks();
        bcx.finalize();
    }
    m.define_function(func_a, &mut ctx).unwrap();
    m.clear_context(&mut ctx);

    // Allocate a bunch (~4GB) to stretch address space
    let mut allocations: Vec<Vec<u8>> = Vec::new();
    for _ in 0..999999 {
        allocations.push(Vec::with_capacity(4096));
    }

    // Define `func_b` in a new allocation and reference `func_a`
    ctx.func.name = ExternalName::user(0, func_b.as_u32());
    {
        let mut bcx: FunctionBuilder = FunctionBuilder::new(&mut ctx.func, &mut func_ctx);
        let block = bcx.create_block();

        bcx.switch_to_block(block);
        let local_func = m.declare_func_in_func(func_a, &mut bcx.func);
        // Emit a call with a relocation for func_a
        bcx.ins().call(local_func, &[]);
        // Make sure that this function's body is larger than page_size and will require a new allocation.
        for _ in 0..1024 {
            bcx.ins().call(local_func, &[]);
        }
        bcx.ins().return_(&[]);

        bcx.seal_all_blocks();
        bcx.finalize();
    }

    m.define_function(func_b, &mut ctx).unwrap();
    m.clear_context(&mut ctx);

    // Perform linking
    m.finalize_definitions();
}

It might be possible to trigger this from small-ish WebAssembly modules via glibc's mmap threshold, which places allocations larger than 128 kB outside of the heap; I haven't had any luck reproducing that, though, because glibc's dynamic threshold scaling raises the limit before code is emitted.

Possible approaches:

AArch64 runs into a related issue with its 26-bit relative jumps: https://github.com/bytecodealliance/wasmtime/issues/3277
I'm not sure veneers are applicable to x86_64, but they seem like an interesting and more general approach to relative jump range limits.
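The failing conversion boils down to a PC-relative displacement that no longer fits in an i32. A standalone sketch of that range check (not the actual compiled_blob.rs code; the function name and signature are illustrative):

```rust
// Illustrative model of a rel32 fixup: patching succeeds only if the
// target lies within ±2GB of the relocation site.
fn rel32_displacement(
    at: usize,
    base: usize,
    addend: isize,
) -> Result<i32, std::num::TryFromIntError> {
    // PC-relative displacement from the relocation site to the target.
    let diff = (base as isize) - (at as isize) + addend;
    i32::try_from(diff)
}

fn main() {
    // A nearby target encodes fine.
    assert!(rel32_displacement(0x1000, 0x2000, 0).is_ok());
    // A target 3GB away overflows i32 — unwrapping this Err is the
    // TryFromIntError panic reported above.
    assert!(rel32_displacement(0x1000, 0x1000 + (3usize << 30), 0).is_err());
}
```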

view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 13:44):

Mrmaxmeier labeled issue #4000:

(issue body repeated)

view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 13:44):

Mrmaxmeier labeled issue #4000:

(issue body repeated)

view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 13:50):

bjorn3 commented on issue #4000:

Don't allocate code memory on the heap: Cranelift's selinux-fix feature uses mmap allocations instead. The underlying issue still persists, but since mmap allocations are separate from the heap they are mostly sequential, and it would take >2GB of generated machine code to cause problems.

I think this is the best fix, possibly in combination with reserving the full 2GB as PROT_NONE up front. Allowing the GOT to be split between each such 2GB chunk should also allow more code to be used when PIC is enabled.
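A minimal sketch of the bookkeeping such a reserved region implies (purely illustrative: in a real implementation `base..base+capacity` would come from a single mmap with PROT_NONE and pages would be mprotect'ed to RW/RX on demand; here the reservation is only simulated so the bump logic can run anywhere):

```rust
// Hypothetical single reserved code region. Because every allocation
// comes out of one contiguous 2GB window, any two pieces of generated
// code are guaranteed to be within rel32 range of each other.
struct CodeRegion {
    base: usize,
    capacity: usize, // e.g. 2GB
    used: usize,
}

impl CodeRegion {
    fn new(base: usize, capacity: usize) -> Self {
        Self { base, capacity, used: 0 }
    }

    /// Bump-allocate `size` bytes at the given power-of-two alignment.
    /// Returns None when the reservation is exhausted — the point at
    /// which a new region (with its own GOT/PLT) would be started.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let start = (self.used + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.capacity {
            return None;
        }
        self.used = end;
        Some(self.base + start)
    }
}

fn main() {
    let mut region = CodeRegion::new(0x7000_0000_0000, 2usize << 30);
    let a = region.alloc(4096, 16).unwrap();
    let b = region.alloc(4096, 16).unwrap();
    // Both blobs fall inside one 2GB window, so a rel32 call
    // between them always fits.
    assert!((b as i64 - a as i64).abs() <= i32::MAX as i64);
    // Asking for more than the remaining reservation fails cleanly.
    assert!(region.alloc(2usize << 30, 16).is_none());
}
```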

view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 17:01):

cfallin commented on issue #4000:

I think this is covered by the colocated flag on external function definitions: the intent is to denote whether a function is in the same module (and hence can use near calls) or elsewhere (and hence needs an absolute 64-bit relocation). This flag in the ExtFuncData controls which kind of call is generated. It looks like this may not be surfaced in the JITModule API; if so, we'd be happy to take a PR to fix that!

view this post on Zulip Wasmtime GitHub notifications bot (Apr 06 2022 at 17:44):

bjorn3 commented on issue #4000:

That is not the problem here. The problem is that a function and the GOT or PLT it accesses may end up more than 2GB apart due to memory fragmentation. All calls already go through the GOT and PLT anyway, so as long as those are within 2GB it doesn't matter where the function itself is, independent of the colocated flag.

view this post on Zulip Wasmtime GitHub notifications bot (Jan 21 2026 at 15:40):

zackw commented on issue #4000:

Right now, the machine code emitted for CallKnown on x86 is simply E8 xx xx yy yy, i.e. a near call with a 32-bit signed displacement. If you emitted an 8-byte long NOP after each CALL,

    E8 00 00 00 00            call <placeholder>
    0f 1f 84 00 00 00 00 00   nop.8

the relocation phase would have enough room to patch in a sequence like

    49 bb xx xx yy yy zz zz ww ww        mov r11, 0xwwwwzzzzyyyyxxxx
    41 ff d3                             call r11

when the displacement doesn't fit in i32. This does require a scratch register, but r11 should always be available, since we're making a call and it's a call-clobbered register that isn't an outgoing argument register.

Similar tactics should be applicable for other supported ISAs.

(Bonus points for squishing out the long NOP when the space isn't needed, but that's significantly harder since it changes the address of everything afterward.)
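The choice between the two encodings above can be sketched in plain Rust (an illustrative patcher, not Cranelift code; the byte sequences are exactly the ones shown in the comment):

```rust
/// Patch a call site, picking the short or long encoding. `site` is the
/// address of the E8 byte; the 5-byte call plus the 8-byte NOP give a
/// 13-byte window to rewrite.
fn patch_call(buf: &mut [u8; 13], site: u64, target: u64) {
    // rel32 is measured from the end of the 5-byte call instruction.
    let disp = target.wrapping_sub(site + 5) as i64;
    if let Ok(rel32) = i32::try_from(disp) {
        buf[0] = 0xE8; // call rel32
        buf[1..5].copy_from_slice(&rel32.to_le_bytes());
        // 8-byte NOP: 0f 1f 84 00 00 00 00 00
        buf[5..13].copy_from_slice(&[0x0F, 0x1F, 0x84, 0x00, 0, 0, 0, 0]);
    } else {
        buf[0..2].copy_from_slice(&[0x49, 0xBB]); // mov r11, imm64
        buf[2..10].copy_from_slice(&target.to_le_bytes());
        buf[10..13].copy_from_slice(&[0x41, 0xFF, 0xD3]); // call r11
    }
}

fn main() {
    let mut buf = [0u8; 13];
    // Target within rel32 range: short encoding plus NOP padding.
    patch_call(&mut buf, 0x1000, 0x2000);
    assert_eq!(buf[0], 0xE8);
    // Target 4GB away: falls back to mov r11 / call r11.
    patch_call(&mut buf, 0x1000, 0x1000 + (4u64 << 30));
    assert_eq!(&buf[0..2], &[0x49, 0xBB]);
    assert_eq!(&buf[10..13], &[0x41, 0xFF, 0xD3]);
}
```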

view this post on Zulip Wasmtime GitHub notifications bot (Feb 14 2026 at 16:22):

zackw commented on issue #4000:

I don't have time to work up a full test case, but I believe this outline describes a program which will hit the troublesome limit at least 10% of the time:

jemalloc creates a 64MB address-space reservation for each additional thread, but (it appears) does _not_ make the same large reservation for the main thread. 64MB * 32 = 2GB, so the overall effect of spawning a few dozen threads is to make it very likely that the memory allocated by the main thread to hold the JIT-compiled code will be more than 2GB away from the main executable, and thus a simple CALL instruction cannot reach it.


Last updated: Feb 24 2026 at 05:28 UTC