I did a more complex test with pulley, wasmi, and cranelift.
```rust
use bevy_app::{App, Startup, Update};
use bevy_ecs::prelude::{Commands, Component, Query, Res, Resource};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;

#[derive(Component)]
pub struct Test<const N: usize> {
    pub data: [f32; N],
}

impl<const N: usize> Test<N> {
    pub fn new() -> Self {
        Self { data: [0.0; N] }
    }
}

fn setup(mut commands: Commands, count: Res<EntityCount>) {
    for _ in 0..count.0 {
        commands.spawn((
            Test::<1>::new(),
            Test::<2>::new(),
            Test::<3>::new(),
            Test::<4>::new(),
            Test::<5>::new(),
            Test::<6>::new(),
            Test::<7>::new(),
            Test::<8>::new(),
            Test::<9>::new(),
            Test::<10>::new(),
        ));
    }
}

// Wrapper so the Bevy `App` can live in a `static` and be initialized lazily
// from the exported `init` function.
pub struct Engine {
    app: UnsafeCell<MaybeUninit<App>>,
}

unsafe impl Sync for Engine {}

impl Engine {
    pub const fn new() -> Self {
        Self {
            app: UnsafeCell::new(MaybeUninit::uninit()),
        }
    }
    pub fn app(&self) -> &App {
        unsafe { { &*self.app.get() }.assume_init_ref() }
    }
    pub fn app_mut(&self) -> &mut App {
        unsafe { { &mut *self.app.get() }.assume_init_mut() }
    }
    pub fn init(&self) {
        unsafe { &mut *self.app.get() }.write(App::new());
    }
}

fn tick(mut query: Query<(&Test<3>, &Test<5>, &mut Test<8>)>) {
    for (a, b, mut c) in query.iter_mut() {
        for i in 0..3 {
            c.data[i] = a.data[i];
        }
        for i in 0..5 {
            c.data[i + 3] = b.data[i];
        }
    }
}

#[derive(Resource)]
pub struct EntityCount(pub usize);

static ENGINE: Engine = Engine::new();

// Entry points exported to the host: `init` is called once, `update` once per frame.
#[unsafe(no_mangle)]
pub extern "system" fn init(count: u32) {
    ENGINE.init();
    let bevy = ENGINE.app_mut();
    bevy.insert_resource(EntityCount(count as _));
    bevy.add_systems(Startup, setup);
    bevy.add_systems(Update, tick);
}

#[unsafe(no_mangle)]
pub extern "system" fn update() {
    let bevy = ENGINE.app_mut();
    bevy.update();
}
```
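(The host side is not shown above; roughly, a wasmtime embedding would instantiate the module and drive the two exports each frame. A minimal sketch, with a made-up file name and ignoring any WASI imports the real build may need:)

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn run(count: u32, frames: u32) -> anyhow::Result<()> {
    let engine = Engine::default();
    // Hypothetical path to the compiled guest module.
    let module = Module::from_file(&engine, "bench.wasm")?;
    let mut store = Store::new(&engine, ());
    // Assumes the module has no imports; a real build may need WASI wired up via a Linker.
    let instance = Instance::new(&mut store, &module, &[])?;

    let init = instance.get_typed_func::<u32, ()>(&mut store, "init")?;
    let update = instance.get_typed_func::<(), ()>(&mut store, "update")?;

    init.call(&mut store, count)?;
    for _ in 0..frames {
        update.call(&mut store, ())?;
    }
    Ok(())
}
```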
When count = 10000, the results are as follows:
| type | memory | fps | wasm memory |
|---|---|---|---|
| native | 9600 | 23400 | - |
| cranelift | 94372 | 23100 | 11392 |
| pulley | 47116 | 1010 | - |
| wasmi | 19896 | 1110 | - |
The FPS is great, actually. I wonder why cranelift and pulley cost so much memory.
What is the unit of the memory column? bytes? kilobytes?
Kilobytes. And varying count from 1 to 10000 does not affect the cranelift memory much; it stays at roughly 80000 and up.
Some questions:
```toml
[profile.release]
lto = "fat"
codegen-units = 1
```
1. no
2. wsl + ubuntu 20.04 + i9
3. wasmtime 31.0, wasmi 0.44
wsl environments have wildly varying constraints; if you ran the same thing somewhere else (bare metal ubuntu) I'd be interested.
wsl works fine! but you need to be very skeptical of telemetry that comes out of it as it's really a dev platform optimized for client interruptions and so on.
that said, if you find it is related to wsl, I'd be very interested anyway
I don't THINK that it should have any effect here, but I am frequently surprised by subtle differences
@hoping please benchmark again with the proper optimization settings by copying the above TOML into the Cargo.toml of the benchmark source. Otherwise, Wasmi is not used to its full potential. This may also benefit Pulley execution.
I will do the benchmark on bare metal, but my main concern is the memory.
@Robin Freyler @Ralph
I tested it again on bare metal, with the Cargo.toml parameters added (they were older machines though).
centos7.1 Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
| type | total memory(kb) | wasm memory(kb) | fps |
|---|---|---|---|
| native | 8168 | - | 8433 |
| cranelift | 92040 | 11392 | 7791 |
| pulley | 41676 | 11392 | 211 |
| wasmi | 16344 | 11392 | 382 |
centos7.9 Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
| type | total memory(kb) | wasm memory(kb) | fps |
|---|---|---|---|
| native | 8148 | - | 13374 |
| cranelift | 104468 | 11392 | 12360 |
| pulley | 40768 | 11392 | 424 |
| wasmi | 16296 | 11392 | 1045 |
The conclusion: the memory did not change much on any of the machines.
I am trying to use wasm as a hot-fix mechanism for mobile phones, so memory is my primary concern. I wonder why cranelift costs so much memory, and is there any way to reduce it? Is it going to increase as the complexity of the code goes up?
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
hoping said:
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
I suspect you are either measuring peak memory usage or you are measuring memory usage from the OS side rather than the actually allocated amount. Setting parallel_compilation to false will only reduce memory usage while compiling the wasm module. Once compilation is done, all this memory is freed either way. It is possible, however, that the memory allocator chooses to retain part or all of this memory to more quickly serve future allocation requests. If you are trying to measure memory usage you should use a heap profiler like valgrind's dhat or massif, or for example bytehound. The dhat crate is also an option.
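(For reference, a minimal sketch of the dhat-crate option mentioned above: it routes heap allocations through dhat's allocator and writes a dhat-heap.json profile when the profiler is dropped.)

```rust
// Minimal dhat heap-profiling setup (sketch): route all allocations through dhat
// and keep the profiler alive for the region being measured.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap(); // writes dhat-heap.json on drop

    // ... run the compile + update benchmark here ...
}
```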
@bjorn3 The memory is the RES field from the top command, so it's not peak memory.
I tried valgrind and bytehound; both crashed on my system, I don't know why.
My thinking was: if the memory is reserved by the allocator, then when I increase the count the wasm consumes more memory, the reserved memory gets used, and the memory gap should decrease.
| count | parallel_compilation | total memory (kb) | wasm memory (kb) | memory gap (kb) |
|---|---|---|---|---|
| 10000 | No | 48516 | 11392 | 37124 |
| 100000 | No | 115044 | 86592 | 28452 |
| 10000 | Yes | 102056 | 11392 | 90664 |
| 100000 | Yes | 163712 | 86592 | 77120 |
| 1000000 | Yes | 759792 | 685120 | 74672 |
I don't think the result supports the allocator reservation assumption.
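(For context, parallel_compilation above refers to the wasmtime Config knob; a sketch of how the "No" rows were presumably set up, assuming the standard wasmtime::Config API:)

```rust
use wasmtime::{Config, Engine};

// Sketch: build an Engine with parallel compilation disabled (assumed to match
// the "No" rows above). This only affects threads/memory used during compilation.
fn single_threaded_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    config.parallel_compilation(false);
    Engine::new(&config)
}
```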
hoping has marked this topic as unresolved.
hoping said:
I think I got it. Code compilation did not release the memory. If I use Module::deserialize instead of Module::from_file, the memory is fine, about as low as with wasmi.
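(A sketch of that cached-module path, with made-up file names: compile once ahead of time, then deserialize at runtime. Module::deserialize/deserialize_file are unsafe because the artifact must be trusted and produced by a compatible Engine.)

```rust
use wasmtime::{Engine, Module};

fn load_cached(engine: &Engine) -> anyhow::Result<Module> {
    // Ahead of time (build/deploy step): compile once and cache the artifact.
    let cwasm = engine.precompile_module(&std::fs::read("bench.wasm")?)?;
    std::fs::write("bench.cwasm", &cwasm)?;

    // At runtime: skip compilation entirely by deserializing the cached artifact.
    // SAFETY: bench.cwasm must be trusted and produced by a compatible Engine.
    let module = unsafe { Module::deserialize_file(engine, "bench.cwasm")? };
    Ok(module)
}
```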
someone ensure that puppy is documented!!!
do you have the results from a run after this change?
I'd love to see the table again.
This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. @hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.
If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.
hoping said:
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
Also, please don't rely on AI tools to give you advice about our project -- we're happy to answer questions ourselves, and our experience is that AI tools (which are at their heart statistical randomness machines) sometimes hallucinate or give bad advice.
(For example, turning off parallel compilation is probably a bad idea in a user-facing app because it will also significantly increase startup latency when you have a new uncached module)
Ralph said:
do you have the results run after this change?
| count | total memory (kb) | wasm memory (kb) |
|---|---|---|
| 10000 | 17072 | 11392 |
| 100000 | 83580 | 86592 |
| 1000000 | 678504 | 685120 |
I think the total memory taken from RES is not accurate, which is why the last two rows show total memory < wasm memory.
Chris Fallin said:
This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.
If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.
I agree. I don't think it's fair to call it a memory leak, actually. The parallel compilation uses rayon, and the global thread pool isn't torn down when compilation is done (I don't think rayon provides an API for that). So the right way to use wasmtime when memory is the priority is to load from a cached module.
an even better approach for keeping memory (and latency) overheads low would be to pre-compile the .wasms into .cwasms and avoid compiling on-device or in the data plane at all.
this is relevant for both pulley and cranelift-native (and winch too, but you don't seem interested in winch, afaict)
resources:
with ^ you can also disable the compiler from your embedding's wasmtime build, meaning that code size shrinks a ton and you don't have memory overhead associated with the compiler's executable code either
see https://docs.wasmtime.dev/examples-minimal.html for more details about building a minimal, runtime-only wasmtime build that doesn't include the compiler
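(Rough sketch of what a runtime-only dependency looks like; check the feature names against the examples-minimal docs for the wasmtime version in use.)

```toml
[dependencies]
# Runtime-only wasmtime: no Cranelift/Winch compiler built into the embedding,
# so only precompiled .cwasm modules can be loaded (via Module::deserialize*).
wasmtime = { version = "31", default-features = false, features = ["runtime"] }
```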
@fitzgen (he/him) Thanks for the advice. This is what I need.