I did a more complex test with pulley, wasmi, and cranelift.
```rust
use bevy_app::{App, Startup, Update};
use bevy_ecs::prelude::{Commands, Component, Query, Res, Resource};
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;

#[derive(Component)]
pub struct Test<const N: usize> {
    pub data: [f32; N],
}

impl<const N: usize> Test<N> {
    pub fn new() -> Self {
        Self { data: [0.0; N] }
    }
}

fn setup(mut commands: Commands, count: Res<EntityCount>) {
    for _ in 0..count.0 {
        commands.spawn((
            Test::<1>::new(),
            Test::<2>::new(),
            Test::<3>::new(),
            Test::<4>::new(),
            Test::<5>::new(),
            Test::<6>::new(),
            Test::<7>::new(),
            Test::<8>::new(),
            Test::<9>::new(),
            Test::<10>::new(),
        ));
    }
}

// Wrapper so the Bevy `App` can live in a `static` and be initialized lazily
// from the exported `init` function.
pub struct Engine {
    app: UnsafeCell<MaybeUninit<App>>,
}

unsafe impl Sync for Engine {}

impl Engine {
    pub const fn new() -> Self {
        Self {
            app: UnsafeCell::new(MaybeUninit::uninit()),
        }
    }
    pub fn app(&self) -> &App {
        unsafe { { &*self.app.get() }.assume_init_ref() }
    }
    pub fn app_mut(&self) -> &mut App {
        unsafe { { &mut *self.app.get() }.assume_init_mut() }
    }
    pub fn init(&self) {
        unsafe { &mut *self.app.get() }.write(App::new());
    }
}

fn tick(mut query: Query<(&Test<3>, &Test<5>, &mut Test<8>)>) {
    for (a, b, mut c) in query.iter_mut() {
        for i in 0..3 {
            c.data[i] = a.data[i];
        }
        for i in 0..5 {
            c.data[i + 3] = b.data[i];
        }
    }
}

#[derive(Resource)]
pub struct EntityCount(pub usize);

static ENGINE: Engine = Engine::new();

// Entry points exported to the host: `init` is called once, `update` once per frame.
#[unsafe(no_mangle)]
pub extern "system" fn init(count: u32) {
    ENGINE.init();
    let bevy = ENGINE.app_mut();
    bevy.insert_resource(EntityCount(count as _));
    bevy.add_systems(Startup, setup);
    bevy.add_systems(Update, tick);
}

#[unsafe(no_mangle)]
pub extern "system" fn update() {
    let bevy = ENGINE.app_mut();
    bevy.update();
}
```
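(The host side is not shown above; roughly, a wasmtime embedding would instantiate the module and drive the two exports each frame. A minimal sketch, with a made-up file name and ignoring any WASI imports the real build may need:)

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn run(count: u32, frames: u32) -> anyhow::Result<()> {
    let engine = Engine::default();
    // Hypothetical path to the compiled guest module.
    let module = Module::from_file(&engine, "bench.wasm")?;
    let mut store = Store::new(&engine, ());
    // Assumes the module has no imports; a real build may need WASI wired up via a Linker.
    let instance = Instance::new(&mut store, &module, &[])?;

    let init = instance.get_typed_func::<u32, ()>(&mut store, "init")?;
    let update = instance.get_typed_func::<(), ()>(&mut store, "update")?;

    init.call(&mut store, count)?;
    for _ in 0..frames {
        update.call(&mut store, ())?;
    }
    Ok(())
}
```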
When count = 10000, the results are as follows:
| type | memory | fps | wasm memory |
|---|---|---|---|
| native | 9600 | 23400 | - |
| cranelift | 94372 | 23100 | 11392 |
| pulley | 47116 | 1010 | - |
| wasmi | 19896 | 1110 | - |
The FPS is great, actually. I wonder why cranelift and pulley cost so much memory.
What is the unit of the memory column? bytes? kilobytes?
Kilobytes. And varying count from 1 to 10000 does not affect the cranelift memory much; it stays at roughly 80000 and up.
Some questions:
```toml
[profile.release]
lto = "fat"
codegen-units = 1
```
1. no
2. wsl + ubuntu 20.04 + i9
3. wasmtime 31.0, wasmi 0.44
wsl environments have wildly varying constraints; if you ran the same thing somewhere else (bare metal ubuntu) I'd be interested.
wsl works fine! but you need to be very skeptical of telemetry that comes out of it as it's really a dev platform optimized for client interruptions and so on.
that said, if you find it is related to wsl, I'd be very interested anyway
I don't THINK that it should have any effect here, but I am frequently surprised by subtle differences
@hoping please benchmark again with the proper optimization settings by copying the above TOML into the Cargo.toml of the benchmark source. Otherwise, Wasmi is not used to its full potential. This may also benefit Pulley execution.
I will do the benchmark on bare metal, but my main concern is the memory.
@Robin Freyler @Ralph
I tested it again on bare metal, with the Cargo.toml parameters added (they were older machines though).
centos7.1 Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
| type | total memory(kb) | wasm memory(kb) | fps |
|---|---|---|---|
| native | 8168 | - | 8433 |
| cranelift | 92040 | 11392 | 7791 |
| pulley | 41676 | 11392 | 211 |
| wasmi | 16344 | 11392 | 382 |
centos7.9 Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
| type | total memory(kb) | wasm memory(kb) | fps |
|---|---|---|---|
| native | 8148 | - | 13374 |
| cranelift | 104468 | 11392 | 12360 |
| pulley | 40768 | 11392 | 424 |
| wasmi | 16296 | 11392 | 1045 |
The conclusion: the memory did not change much on any of the machines.
I am trying to use wasm as a hot-fix mechanism for mobile phones, so memory is my primary concern. I wonder why cranelift costs so much memory, and is there any way to reduce it? Is it going to increase as the complexity of the code goes up?
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
hoping said:
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
I suspect you are either measuring peak memory usage or you are measuring memory usage from the OS side rather than the actually allocated amount. Setting parallel_compilation to false will only reduce memory usage while compiling the wasm module. Once compilation is done, all this memory is freed either way. It is possible, however, that the memory allocator chooses to retain part or all of this memory to more quickly serve future allocation requests. If you are trying to measure memory usage you should use a heap profiler like valgrind's dhat or massif, or for example bytehound. The dhat crate is also an option.
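(For reference, a minimal sketch of the dhat-crate option mentioned above: it routes heap allocations through dhat's allocator and writes a dhat-heap.json profile when the profiler is dropped.)

```rust
// Minimal dhat heap-profiling setup (sketch): route all allocations through dhat
// and keep the profiler alive for the region being measured.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap(); // writes dhat-heap.json on drop

    // ... run the compile + update benchmark here ...
}
```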
@bjorn3 The memory is the RES field from the top command, so it's not peak memory.
I tried valgrind and bytehound; both crashed on my system, I don't know why.
My thinking was: if the memory is reserved by the allocator, then when I increase the count the wasm consumes more memory, the reserved memory gets used, and the memory gap should decrease.
| count | parallel_compilation | total memory (kb) | wasm memory (kb) | memory gap (kb) |
|---|---|---|---|---|
| 10000 | No | 48516 | 11392 | 37124 |
| 100000 | No | 115044 | 86592 | 28452 |
| 10000 | Yes | 102056 | 11392 | 90664 |
| 100000 | Yes | 163712 | 86592 | 77120 |
| 1000000 | Yes | 759792 | 685120 | 74672 |
I don't think the result supports the allocator reservation assumption.
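(For context, parallel_compilation above refers to the wasmtime Config knob; a sketch of how the "No" rows were presumably set up, assuming the standard wasmtime::Config API:)

```rust
use wasmtime::{Config, Engine};

// Sketch: build an Engine with parallel compilation disabled (assumed to match
// the "No" rows above). This only affects threads/memory used during compilation.
fn single_threaded_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    config.parallel_compilation(false);
    Engine::new(&config)
}
```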
hoping has marked this topic as unresolved.
hoping said:
I think I got it. Code compilation did not release the memory. If I use Module::deserialize instead of Module::from_file, the memory is fine, about as low as with wasmi.
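(A sketch of that cached-module path, with made-up file names: compile once ahead of time, then deserialize at runtime. Module::deserialize/deserialize_file are unsafe because the artifact must be trusted and produced by a compatible Engine.)

```rust
use wasmtime::{Engine, Module};

fn load_cached(engine: &Engine) -> anyhow::Result<Module> {
    // Ahead of time (build/deploy step): compile once and cache the artifact.
    let cwasm = engine.precompile_module(&std::fs::read("bench.wasm")?)?;
    std::fs::write("bench.cwasm", &cwasm)?;

    // At runtime: skip compilation entirely by deserializing the cached artifact.
    // SAFETY: bench.cwasm must be trusted and produced by a compatible Engine.
    let module = unsafe { Module::deserialize_file(engine, "bench.cwasm")? };
    Ok(module)
}
```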
someone ensure that puppy is documented!!!
do you have the results from a run after this change?
I'd love to see the table again.
This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. @hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.
If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.
hoping said:
Deepseek tells me to set parallel_compilation to false, reducing the memory from 98 to 48. Are there any other configurations like this to use?
Also, please don't rely on AI tools to give you advice about our project -- we're happy to answer questions ourselves, and our experience is that AI tools (which are at their heart statistical randomness machines) sometimes hallucinate or give bad advice.
(For example, turning off parallel compilation is probably a bad idea in a user-facing app because it will also significantly increase startup latency when you have a new uncached module)
Ralph said:
do you have the results run after this change?
| count | total memory (kb) | wasm memory (kb) |
|---|---|---|
| 10000 | 17072 | 11392 |
| 100000 | 83580 | 86592 |
| 1000000 | 678504 | 685120 |
I think the total memory taken from RES is not accurate, which is why the last two rows show total memory < wasm memory.
Chris Fallin said:
This is more an issue of methodology I think: measuring resident memory as viewed by the OS is not an accurate way to get the exact size of the heap. If Cranelift leaks memory, that would be an issue we'd need to resolve, but it's written in fully safe Rust and frees all of its data structures after compilation, so that shouldn't be the case. hoping I don't think the memory reservation works the way you think it does: there are likely per-thread free pools and other things going on in a modern high-performance parallel allocator that make this hard to reason about.
If you run under Valgrind you can get per-byte leak tracking; if you actually see a memory leak while running that way we'd be interested to hear about it.
I agree. I don't think it's fair to call it a memory leak, actually. The parallel compilation uses rayon, and the global thread pool isn't torn down when compilation is done (I don't think rayon provides an API for that). So the right way to use wasmtime when memory is the priority is to load from a cached module.
an even better approach for keeping memory (and latency) overheads low would be to pre-compile the .wasms into .cwasms and avoid compiling on-device or in the data plane at all.
this is relevant for both pulley and cranelift-native (and winch too, but you don't seem interested in winch, afaict)
resources:
with ^ you can also disable the compiler from your embedding's wasmtime build, meaning that code size shrinks a ton and you don't have memory overhead associated with the compiler's executable code either
see https://docs.wasmtime.dev/examples-minimal.html for more details about building a minimal, runtime-only wasmtime build that doesn't include the compiler
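(Rough sketch of what a runtime-only dependency looks like; check the feature names against the examples-minimal docs for the wasmtime version in use.)

```toml
[dependencies]
# Runtime-only wasmtime: no Cranelift/Winch compiler built into the embedding,
# so only precompiled .cwasm modules can be loaded (via Module::deserialize*).
wasmtime = { version = "31", default-features = false, features = ["runtime"] }
```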
@fitzgen (he/him) Thanks for the advice. This is what I need.