Stream: git-wasmtime

Topic: wasmtime / issue #10239 Memory leak w/ parallel component...


view this post on Zulip Wasmtime GitHub notifications bot (Feb 18 2025 at 01:05):

jadamcrain added the bug label to Issue #10239.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 18 2025 at 01:05):

jadamcrain opened issue #10239:

We're embedding wasmtime to execute plugins within our application. These plugins are defined in WIT. The application instantiates multiple instances of the plugin and drives each instance on its own Tokio task. We started seeing slow memory growth in production. This was surprising because our application is carefully designed to have very flat memory usage: it has a fixed number of Tokio tasks that only communicate with each other using bounded queues.

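Roughly, the embedding looks like the sketch below (hypothetical names, not our actual code; the real application uses wasmtime::component::bindgen! bindings, registers host imports such as `publish` on the linker, and calls a method on a guest resource rather than a bare export).

```rust
// Minimal sketch of the embedding shape: one Engine and one compiled
// component, N instances each driven on its own Tokio task.
use std::{sync::Arc, time::Duration};
use wasmtime::component::{Component, Linker};
use wasmtime::{Config, Engine, Store};

#[tokio::main]
async fn main() -> wasmtime::Result<()> {
    let mut config = Config::new();
    config.async_support(true); // guest calls are driven from Tokio tasks
    let engine = Engine::new(&config)?;

    // One compiled component, shared by every instance.
    let component = Component::from_file(&engine, "plugin.wasm")?;
    let linker = Arc::new(Linker::<()>::new(&engine));

    // N independent instances, each with its own Store, on its own task.
    let mut tasks = Vec::new();
    for _ in 0..4 {
        let (engine, component, linker) = (engine.clone(), component.clone(), linker.clone());
        tasks.push(tokio::spawn(async move {
            let mut store = Store::new(&engine, ());
            let instance = linker.instantiate_async(&mut store, &component).await?;
            // The exported function name is made up for the sketch.
            let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
            // Periodically drive the plugin, mimicking the steady-state load.
            for _ in 0..60 {
                run.call_async(&mut store, ()).await?;
                run.post_return_async(&mut store).await?;
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
            Ok::<(), wasmtime::Error>(())
        }));
    }
    for task in tasks {
        task.await.unwrap()?;
    }
    Ok(())
}
```
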
When only 1 plugin executes, no memory is leaked, and the application has a flat steady-state memory usage:

![Image](https://github.com/user-attachments/assets/8e77919d-e7ca-4b5e-baa0-05a76d7dfdc1)

When more than 1 plugin executes, the application leaks memory in WASM => HOST callbacks. For example, 2 instances on 2 tasks were actually pretty steady for a period of time and then started leaking:

![Image](https://github.com/user-attachments/assets/7e55cf71-aef3-4a8d-8277-7a50ceb68ea1)

Zoomed in view of last chart:

![Image](https://github.com/user-attachments/assets/746616c7-e2f8-4272-b588-5b6465e4c724)

It will sometimes start leaking right away, and other times take a while to start, like the trace above. Once it starts leaking, it always continues to leak. At 4 instances (4 plugins on 4 parallel tasks) the leak blows up immediately:

![Image](https://github.com/user-attachments/assets/37f6e98a-95a4-4767-b6be-5257dbfe1d6b)

The higher the level of parallelism, the faster the leak... It feels like there's something shared between instances here that isn't thread-safe.

The leaked allocations are the reallocs that occur when "lifting" a list type from WASM => host in a host callback.

![Image](https://github.com/user-attachments/assets/4f224ad2-097e-4bdf-8611-2b8fdacaee50)

Test Case

Not easy to reproduce outside of our application at the moment.

Steps to Reproduce

I'd like to upload the zstd-compressed heaptrack traces, but GitHub is blocking even a zip of them. They're large, 40-60 MB.

Versions and Environment

Wasmtime version or commit: 29.0.1

Operating system: Linux

Architecture: x86_64

view this post on Zulip Wasmtime GitHub notifications bot (Feb 18 2025 at 17:41):

alexcrichton commented on issue #10239:

Thanks for the report! I'm going to ask some questions about the shape of your embedding to help get some more information and hopefully assist in debugging as well. It's understandable if you can't share the whole application, but it may take some more back-and-forth in the absence of a reproduction.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 18 2025 at 17:57):

jadamcrain commented on issue #10239:

Hi @alexcrichton. Before I make anyone guess in the dark about our proprietary application, I'm trying to make this leak occur in a minimal application that mimics our embedding, which I can just shove in a public repo. Fingers crossed.

Some initial responses below while I work in the background on a full host/guest example I can hand you:

Are you using the pooling allocator? Or the default OnDemand allocation strategy?

We've not explicitly selected any allocator, so I assume it's the default.

Or, more generally, are you able to share a snippet/gist of your creation of wasmtime::Config? Understanding the configuration settings may be helpful in determining what possible leak scenarios there are.

We're using all of the default settings.

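For reference, this is roughly what choosing between the two strategies looks like in an embedding (a sketch only; our code just calls Config::new() and leaves everything at its defaults):

```rust
// Sketch of selecting an allocation strategy explicitly; leaving Config
// untouched gives the OnDemand behavior. Illustrative only.
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn make_engine(use_pooling: bool) -> wasmtime::Result<Engine> {
    let mut config = Config::new();
    if use_pooling {
        // Pooling pre-allocates slots for instances/memories up front.
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(
            PoolingAllocationConfig::default(),
        ));
    } else {
        // The default: each instantiation allocates and frees on demand.
        config.allocation_strategy(InstanceAllocationStrategy::OnDemand);
    }
    Engine::new(&config)
}
```
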
Is there a legend/key for the colors of the stripes in the graphs above?

Yes, there is. I'll get you this in a bit if I fail to give you a reproducible example.

Would you be able to share the signature of the WIT function that looks like it's leaking? Or are you able to pin down which host function is triggering the leak?

It's a pretty simple host function:

// publish a set of samples
publish: func(samples: list<sample>);

A sample is nothing special... it doesn't contain any dynamically allocated types and should all just be laid out on the stack.

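For context, this is roughly the host-side shape that wasmtime::component::bindgen! produces for such an import. Only the publish signature below comes from our WIT; the package, world, and record fields are invented for the sketch, and the exact generated signature can vary with bindgen! options.

```rust
// Hypothetical WIT plus host impl illustrating how `publish` is received
// on the host side.
use wasmtime::component::bindgen;

bindgen!({
    world: "plugin",
    inline: r#"
        package example:plugin;

        interface host {
            record sample {
                timestamp: u64,
                value: f64
            }

            // publish a set of samples
            publish: func(samples: list<sample>);
        }

        world plugin {
            import host;
        }
    "#,
});

struct HostState;

impl example::plugin::host::Host for HostState {
    // Depending on bindgen! options (async, trappable_imports, ...) the
    // generated signature may differ, e.g. returning wasmtime::Result<()>.
    fn publish(&mut self, samples: Vec<example::plugin::host::Sample>) {
        // `samples` is an owned Vec lifted out of guest memory; the host
        // allocator frees it when it goes out of scope here.
        println!("received {} samples", samples.len());
    }
}
```
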
Can you talk more about the lifecycle of a plugin? Is it instantiated for a long time? Or only a short period of time before it's thrown away?

It lives forever... as long as the application. We actually use the plugin to create a single Resource type during initialization. We then periodically call a single method on the guest resource, which can call back to host functions like "publish" above.

Can you speak more as to what/how statistics are being gathered here? Is it instrumentation of malloc/free with LD_PRELOAD? Or something lower level perhaps?

My understanding is that Heaptrack uses LD_PRELOAD to insert its own .so between the application and the allocator. We're not using jemalloc here, just the default Rust global allocator.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 19 2025 at 00:13):

alexcrichton commented on issue #10239:

Ok thanks for the info!

Everything seems pretty reasonable to me, and from what I can tell from the screenshots it looks like the Vec<Sample> that's allocated on the host is what's leaking. I've double-checked the various bits and pieces I could in Wasmtime, though, and I can't find anything awry. In the final screenshot you've expanded a chain of 18.1 MB of leaked bytes, but just above that (highlighted in the screenshot) is a leak of 35.7 MB. Does the trace there look similar?

I also assume you're using wasmtime::component::bindgen!-generated bindings for this API? If so, you should get the Vec<Sample>, and that should naturally get deallocated when it falls out of scope in Rust. Basically I'm as stumped as you are :) (I'll keep digging once you've got more info, though.)

view this post on Zulip Wasmtime GitHub notifications bot (Feb 19 2025 at 15:16):

jadamcrain commented on issue #10239:

Yes, I'm using wasmtime::component::bindgen!. I agree that this doesn't make any sense. I just tried this using Valgrind's massif and I'm getting a flat trace there even with high parallelism... this kinda makes me think that this might be a bug in heaptrack rather than a leak in the application.

Massif is really slow compared to heaptrack, though, so to rule out some kind of heisenbug I'm going to try a couple more heap profiling tools, like jemalloc, to be certain that's the case.

Heaptrack did allow me to find a leak in our code (me being stupid and growing an endless HashMap), and then I kept going with it assuming it was reporting correct results, but it might just be wrong here.

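One way to cross-check the numbers from inside the process, assuming the tikv-jemallocator and tikv-jemalloc-ctl crates (a sketch, not what our application currently does; we use the default global allocator): swap in jemalloc and periodically log its own view of live allocations. If heaptrack reports growth that jemalloc's statistics don't, that points at the profiler rather than the application.

```rust
// Swap in jemalloc as the global allocator and read its statistics.
use tikv_jemalloc_ctl::{epoch, stats};

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn log_heap_stats() {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().unwrap();
    let allocated = stats::allocated::read().unwrap(); // bytes in live allocations
    let resident = stats::resident::read().unwrap();   // bytes of physical memory
    println!("jemalloc: allocated={allocated} resident={resident}");
}

fn main() {
    let data: Vec<u8> = vec![0; 16 * 1024 * 1024];
    log_heap_stats();
    drop(data);
    log_heap_stats();
}
```
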
view this post on Zulip Wasmtime GitHub notifications bot (Feb 21 2025 at 18:29):

jadamcrain closed issue #10239.

view this post on Zulip Wasmtime GitHub notifications bot (Feb 21 2025 at 18:29):

jadamcrain commented on issue #10239:

I've used Valgrind's massif and jemalloc. There are no heap leaks. Heaptrack appears to just have a bug, under some unknown set of conditions, that leads to those nonsensical profiles and spurious leak stack traces. It was a red herring that adding wasmtime to the mix triggered the bug. Who knows why... depth of the stack traces, anonymous stack frames, I have no idea, but the other heap profiling tools had no issue.

The reason I first thought there was a leak in production was that we were running the application as a systemd service, and the memory "usage" reported by systemctl status apparently includes file data cached by Linux! In the same redeployment, I added both the WASM plugin stuff and some historical logging directly to files... the growing memory usage was just Linux caching this written file data and accounting for it when reporting the memory usage. If you look at the RSS memory usage using ps/top/etc., you actually see that that part of the usage is stable.

So, a bug in heaptrack combined w/ me being a systemd noob resulted in a wild goose chase =).

![Image](https://github.com/user-attachments/assets/7ee37c49-4118-4751-99d4-e54be81a021f)

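For anyone else bitten by this, a small sketch of watching the process's own RSS (Linux-specific, reading /proc/self/status) rather than relying on systemd's accounting, which also counts cached file data charged to the unit:

```rust
// Hypothetical helper: report the resident set size of the current process.
use std::fs;

fn rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kib| kib.parse().ok())
}

fn main() {
    println!("RSS: {} KiB", rss_kib().unwrap_or(0));
}
```
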
view this post on Zulip Wasmtime GitHub notifications bot (Feb 21 2025 at 18:53):

alexcrichton commented on issue #10239:

Oh wow, that's wild! Regardless, thanks for investigating and tracking that down!


Last updated: Feb 28 2025 at 01:30 UTC