Stream: general

Topic: pulley vs wasmi vs cranelift, more detail test


view this post on Zulip Hoping White (Apr 11 2025 at 11:09):

I did a more complex test with pulley, wasmi, and cranelift.

use bevy_app::{App, Startup, Update};
use bevy_ecs::prelude::{Commands, Component, Query, Res, Resource};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;

#[derive(Component)]
pub struct Test<const N: usize> {
    pub data: [f32; N],
}

impl<const N: usize> Test<N> {
    pub fn new() -> Self {
        Self { data: [0.0; N] }
    }
}

fn setup(mut commands: Commands, count: Res<EntityCount>) {
    for _ in 0..count.0 {
        commands.spawn((
            Test::<1>::new(),
            Test::<2>::new(),
            Test::<3>::new(),
            Test::<4>::new(),
            Test::<5>::new(),
            Test::<6>::new(),
            Test::<7>::new(),
            Test::<8>::new(),
            Test::<9>::new(),
            Test::<10>::new(),
        ));
    }
}

pub struct Engine {
    app: UnsafeCell<MaybeUninit<App>>,
}

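// SAFETY: assumes the embedder only ever calls init()/update() from a single
// thread; under concurrent access this unsynchronized interior mutability
// would be unsound.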
unsafe impl Sync for Engine {}

impl Engine {
    pub const fn new() -> Self {
        Self {
            app: UnsafeCell::new(MaybeUninit::uninit()),
        }
    }

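    // SAFETY contract for both accessors: init() must have been called
    // first, and callers must not create overlapping &App / &mut App borrows.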
    pub fn app(&self) -> &App {
        unsafe { (*self.app.get()).assume_init_ref() }
    }

    pub fn app_mut(&self) -> &mut App {
        unsafe { (*self.app.get()).assume_init_mut() }
    }

    pub fn init(&self) {
        unsafe { &mut *self.app.get() }.write(App::new());
    }
}

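// Per-frame system: copy Test<3>'s three floats and Test<5>'s five floats
// into the eight-float buffer of Test<8>.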
fn tick(mut query: Query<(&Test<3>, &Test<5>, &mut Test<8>)>) {
    for (a, b, mut c) in query.iter_mut() {
        for i in 0..3 {
            c.data[i] = a.data[i];
        }
        for i in 0..5 {
            c.data[i + 3] = b.data[i];
        }
    }
}

#[derive(Resource)]
pub struct EntityCount(pub usize);

static ENGINE: Engine = Engine::new();

#[unsafe(no_mangle)]
pub extern "system" fn init(count: u32) {
    ENGINE.init();
    let bevy = ENGINE.app_mut();
    bevy.insert_resource(EntityCount(count as _));
    bevy.add_systems(Startup, setup);
    bevy.add_systems(Update, tick);
}

#[unsafe(no_mangle)]
pub extern "system" fn update() {
    let bevy = ENGINE.app_mut();
    bevy.update();
}

When count = 10000, the results are as follows:

| type      | memory | fps   | wasm memory |
|-----------|--------|-------|-------------|
| native    | 9600   | 23400 | -           |
| cranelift | 94372  | 23100 | 11392       |
| pulley    | 47116  | 1010  | -           |
| wasmi     | 19896  | 1110  | -           |

The FPS is great, actually. I wonder why cranelift and pulley use so much memory.

view this post on Zulip bjorn3 (Apr 11 2025 at 11:29):

What is the unit of the memory column? bytes? kilobytes?

view this post on Zulip Hoping White (Apr 11 2025 at 11:33):

Kilobytes. And varying count from 1 to 10000 does not affect the Cranelift memory much; it basically stays at 80000 and up.

view this post on Zulip Robin Freyler (Apr 11 2025 at 13:03):

Some questions:

  1. Did you use proper Cargo optimization settings required by Wasmi? This can affect Wasmi performance by up to 50%:

[profile.release]
lto = "fat"
codegen-units = 1

  2. What platform are you running this on (CPU + OS)?
  3. What Wasmi and Pulley/Wasmtime versions have you used?

view this post on Zulip hoping (Apr 11 2025 at 13:11):

  1. No.
  2. WSL + Ubuntu 20.04 + i9.
  3. Wasmtime 31.0, Wasmi 0.44.

view this post on Zulip Ralph (Apr 11 2025 at 13:27):

wsl environments have wildly varying constraints; if you ran the same thing somewhere else (bare metal ubuntu) I'd be interested......

view this post on Zulip Ralph (Apr 11 2025 at 13:27):

wsl works fine! but you need to be very skeptical of telemetry that comes out of it as it's really a dev platform optimized for client interruptions and so on.

view this post on Zulip Ralph (Apr 11 2025 at 13:28):

that said, if you find it is related to wsl, I'd be very interested anyway

view this post on Zulip Ralph (Apr 11 2025 at 13:28):

I don't THINK that it should have any effect here, but I am frequently surprised by subtle differences

view this post on Zulip Robin Freyler (Apr 11 2025 at 13:48):

@hoping please benchmark again with the proper optimization settings by copying the above TOML into the Cargo.toml of your benchmark source. Otherwise, Wasmi is not run at its full potential. This may also benefit Pulley execution.

view this post on Zulip hoping (Apr 12 2025 at 01:41):

I will do the benchmark on the bare metal, but my main concern is the memory

view this post on Zulip hoping (Apr 14 2025 at 04:39):

@Robin Freyler @Ralph
I tested it again on bare metal with the Cargo.toml parameters added (older machines, though).

CentOS 7.1, Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz

| type      | total memory (kb) | wasm memory (kb) | fps  |
|-----------|-------------------|------------------|------|
| native    | 8168              | -                | 8433 |
| cranelift | 92040             | 11392            | 7791 |
| pulley    | 41676             | 11392            | 211  |
| wasmi     | 16344             | 11392            | 382  |

CentOS 7.9, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

| type      | total memory (kb) | wasm memory (kb) | fps   |
|-----------|-------------------|------------------|-------|
| native    | 8148              | -                | 13374 |
| cranelift | 104468            | 11392            | 12360 |
| pulley    | 40768             | 11392            | 424   |
| wasmi     | 16296             | 11392            | 1045  |

The conclusion is that memory usage did not change much across the machines.

I am trying to use wasm as a hot-fix method for mobile phones, so memory is my primary concern. I wonder why Cranelift uses so much memory, and is there any way to reduce it? Will it grow as the code gets more complex?

view this post on Zulip hoping (Apr 14 2025 at 05:57):

Deepseek tells me to set parallel_compilation to false, which reduced the memory from 98 MB to 48 MB. Are there any other configuration options like this?
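For reference, a minimal sketch of that setting, assuming Wasmtime 31's Config API:

use wasmtime::{Config, Engine};

fn main() -> anyhow::Result<()> {
    // Slower compilation, but a smaller transient footprint while compiling.
    let mut config = Config::new();
    config.parallel_compilation(false);
    let _engine = Engine::new(&config)?;
    Ok(())
}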

view this post on Zulip bjorn3 (Apr 14 2025 at 07:58):

hoping said:

Deepseek tells me to set parallel_compilation to false, which reduced the memory from 98 MB to 48 MB. Are there any other configuration options like this?

I suspect you are either measuring peak memory usage, or you are measuring memory usage from the OS side rather than the amount actually allocated. Setting parallel_compilation to false will only reduce memory usage while compiling the wasm module; once compilation is done, all of this memory is freed either way. It is possible, however, that the memory allocator chooses to retain part or all of it to serve future allocation requests more quickly. If you are trying to measure memory usage you should use a heap profiler like Valgrind's DHAT or Massif, or for example bytehound. The dhat crate is also an option.
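A minimal sketch of the dhat-crate setup, following its documented global-allocator pattern (the workload in main is whatever you want to profile):

use dhat::{Alloc, Profiler};

// Route all heap allocations through dhat's tracking allocator.
#[global_allocator]
static ALLOC: Alloc = Alloc;

fn main() {
    // Writes dhat-heap.json when dropped; inspect it with dhat's viewer.
    let _profiler = Profiler::new_heap();
    // ... run the benchmark workload here ...
}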


view this post on Zulip hoping (Apr 14 2025 at 08:48):

@bjorn3 The memory is the RES field from the top command, so it's not peak memory.

I tried valgrind and bytehound; both crashed on my system. I don't know why.

My reasoning: if the memory is retained by the allocator, then when I increase the count the wasm consumes more memory, the retained memory should get reused, and the memory gap should shrink.

| count   | parallel_compilation | total memory | wasm memory | memory gap |
|---------|----------------------|--------------|-------------|------------|
| 10000   | No                   | 48516        | 11392       | 37124      |
| 100000  | No                   | 115044       | 86592       | 28452      |
| 10000   | Yes                  | 102056       | 11392       | 90664      |
| 100000  | Yes                  | 163712       | 86592       | 77120      |
| 1000000 | Yes                  | 759792       | 685120      | 74672      |

I don't think these results support the allocator-retention assumption.

view this post on Zulip Notification Bot (Apr 14 2025 at 12:55):

hoping has marked this topic as unresolved.

view this post on Zulip Ralph (Apr 14 2025 at 13:42):

hoping said:

I think I got it. Code compilation did not release the memory. If I use Module::deserialize instead of Module::from_file, the memory is fine, as low as wasmi's.

someone ensure that puppy is documented!!!
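(For context, the flow being described is roughly this, assuming Wasmtime's Module::serialize / Module::deserialize_file APIs; file names are illustrative:)

use wasmtime::{Engine, Module};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();

    // One-time step: compile and persist the artifact.
    let module = Module::from_file(&engine, "app.wasm")?;
    std::fs::write("app.cwasm", module.serialize()?)?;

    // Later startups: load the precompiled artifact, skipping the compiler.
    // SAFETY: the .cwasm must come from a trusted source and match the
    // wasmtime version/configuration that produced it.
    let _module = unsafe { Module::deserialize_file(&engine, "app.cwasm")? };
    Ok(())
}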

view this post on Zulip Ralph (Apr 14 2025 at 13:42):

do you have the results run after this change?

view this post on Zulip Ralph (Apr 14 2025 at 13:42):

love to see the table again.....

view this post on Zulip Chris Fallin (Apr 14 2025 at 20:00):

This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. @hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.

If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.

view this post on Zulip Chris Fallin (Apr 14 2025 at 20:01):

hoping said:

Deepseek tells me to set parallel_compilation to false, which reduced the memory from 98 MB to 48 MB. Are there any other configuration options like this?

Also, please don't rely on AI tools to give you advice about our project -- we're happy to answer questions ourselves, and our experience is that AI tools (which are at their heart statistical randomness machines) sometimes hallucinate or give bad advice.

view this post on Zulip Chris Fallin (Apr 14 2025 at 20:02):

(For example, turning off parallel compilation is probably a bad idea in a user-facing app because it will also significantly increase startup latency when you have a new uncached module)

view this post on Zulip hoping (Apr 15 2025 at 01:37):

Ralph said:

do you have the results run after this change?

| count   | total memory | wasm memory |
|---------|--------------|-------------|
| 10000   | 17072        | 11392       |
| 100000  | 83580        | 86592       |
| 1000000 | 678504       | 685120      |

I think the total memory taken from RES is not accurate, which is why the last two rows show total memory < wasm memory.

view this post on Zulip hoping (Apr 15 2025 at 01:44):

Chris Fallin said:

This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.

If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.

I agree; I don't think it's fair to call it a memory leak, actually. Parallel compilation uses rayon, and the global thread pool isn't torn down when compilation finishes (I don't think rayon provides an API for that). So the right way to use wasmtime when memory is the priority is to load from a cached module.

view this post on Zulip fitzgen (he/him) (Apr 15 2025 at 16:49):

an even better approach for keeping memory (and latency) overheads low would be to pre-compile the .wasms into .cwasms and avoid compiling on-device or in the data plane at all.

this is relevant for both pulley and cranelift-native (and winch too, but you don't seem interested in winch, afaict)

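A rough sketch of loading a precompiled artifact on Pulley, assuming Config::target accepts the "pulley64" triple and the .cwasm was precompiled for that target:

use wasmtime::{Config, Engine, Module};

fn main() -> anyhow::Result<()> {
    // Run on the Pulley interpreter instead of native code; the artifact
    // must have been precompiled for the same pulley64 target.
    let mut config = Config::new();
    config.target("pulley64")?;
    let engine = Engine::new(&config)?;
    let _module = unsafe { Module::deserialize_file(&engine, "app.cwasm")? };
    Ok(())
}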

view this post on Zulip fitzgen (he/him) (Apr 15 2025 at 16:53):

with ^ you can also disable the compiler from your embedding's wasmtime build, meaning that code size shrinks a ton and you don't have memory overhead associated with the compiler's executable code either
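Roughly, in the embedding's Cargo.toml (feature names as documented for recent Wasmtime releases; the "pulley" feature is only needed if targeting the interpreter):

[dependencies]
wasmtime = { version = "31", default-features = false, features = [
    "runtime",  # execution only, no compiler
    "pulley",   # interpreter backend, if targeting Pulley
] }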

view this post on Zulip fitzgen (he/him) (Apr 15 2025 at 16:54):

see https://docs.wasmtime.dev/examples-minimal.html for more details about building a minimal, runtime-only wasmtime build that doesn't include the compiler

view this post on Zulip hoping (Apr 17 2025 at 02:09):

@fitzgen (he/him) Thanks for the advice. This is what I need.

