cfallin opened issue #3830:
In #3815, we saw a case where a fuzzer came up with a Wasm heap consisting of (i) a byte at address 0, and (ii) a byte at address 1 GiB. With memfd enabled, this produces a .cwasm with a data segment, ready to mmap, that takes a full 1 GiB.
We probably don't want that, so in #3819 we added some heuristics based on "density" of the data segments: if less than half the bytes are set, then the image is sparse and we don't do memfd.
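As a rough illustration of the density heuristic described above, here is a minimal sketch in Rust. The `Segment` type and both function names are hypothetical and are not Wasmtime's actual code; it also assumes the image extent is measured from the lowest to the highest initialized address and that segments do not overlap.

```rust
/// Hypothetical data segment: an offset into linear memory plus its bytes.
/// (Illustrative only; not Wasmtime's real representation.)
pub struct Segment {
    pub offset: u64,
    pub data: Vec<u8>,
}

/// The 50% rule: the image counts as dense when at least half of the
/// bytes it would span are actually set by data segments.
pub fn density_ok(data_size: u64, memory_init_size: u64) -> bool {
    data_size.saturating_mul(2) >= memory_init_size
}

/// Computes both sizes from a segment list and applies the rule.
/// Assumes segments do not overlap (overlaps would be double-counted).
pub fn is_dense(segments: &[Segment]) -> bool {
    // Bytes explicitly provided by data segments.
    let data_size: u64 = segments.iter().map(|s| s.data.len() as u64).sum();
    // Lowest and highest addresses touched by any segment.
    let min_addr = segments.iter().map(|s| s.offset).min().unwrap_or(0);
    let max_addr = segments
        .iter()
        .map(|s| s.offset.saturating_add(s.data.len() as u64))
        .max()
        .unwrap_or(0);
    // Size of the contiguous image that would have to be stored.
    let memory_init_size = max_addr - min_addr;
    density_ok(data_size, memory_init_size)
}
```

With the fuzzer's input (one byte at address 0 and one byte at address 1 GiB), `data_size` is 2 while `memory_init_size` is just over 1 GiB, so `is_dense` returns false and the image would not be built.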
Unfortunately this creates a really subtle performance cliff: if the user's toolchain happens to produce a Wasm module whose initial heap image is more than 50% zero bytes, the whole instantiation slows down significantly (often going from the microsecond to the millisecond range). This is unfortunate both because it's a cliff and because it affects all memory contents, not just those that exceed e.g. some bound. (In other words it's a "non-graduated" penalty: adding just one more thing penalizes all of the things, instead of just the new thing.)
I was able to get close to the 50% limit with a "real" Wasm toolchain (specifically, one based on Wizer and SpiderMonkey that produces wasm binaries for services).
An alternative proposed in #3815 was a simpler static limit: all data segments below some bound go into a memfd image, and those above some bound go into a list of segments to eagerly initialize on instantiation. This was rejected because it is also a cliff, but IMHO a static limit is easier to understand than a ratio, and can often also be aligned to other limits in the system (e.g. if a platform already bounds the maximum module or heap size, then it can set the memfd-image limit to that size and get a guarantee that it will never see the slow case).
In addition, we could perhaps generate the memfd image for all data segments that are within a limit, then only keep a sparse list of those that are out of bound. This bounds our image size to O(kMaxDenseSize) + O(|wasm|), and importantly, gives a "graduated" cost: if the module grows to slightly exceed the limit someday, then the new bytes cost more, but the whole thing doesn't get 100x slower to instantiate.
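The "graduated" scheme above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Segment`, `MemoryInit`, `split_segments`, and the `max_dense_size` parameter (standing in for kMaxDenseSize) are hypothetical names, the image is assumed to start at address 0, and segments are assumed not to overlap (otherwise the original application order would have to be preserved).

```rust
/// Hypothetical data segment: an offset into linear memory plus its bytes.
pub struct Segment {
    pub offset: u64,
    pub data: Vec<u8>,
}

/// Result of splitting: a dense image plus a sparse list of leftovers.
pub struct MemoryInit {
    /// Dense image covering addresses [0, image.len()); never longer
    /// than `max_dense_size`, so it is cheap to store and mmap.
    pub image: Vec<u8>,
    /// Segments that fall (even partly) beyond the limit; these would be
    /// copied eagerly at instantiation time.
    pub leftovers: Vec<Segment>,
}

/// Places every segment that ends at or below `max_dense_size` into the
/// image and keeps the rest as leftovers.
/// Assumes non-overlapping segments.
pub fn split_segments(segments: Vec<Segment>, max_dense_size: u64) -> MemoryInit {
    let mut image: Vec<u8> = Vec::new();
    let mut leftovers = Vec::new();
    for seg in segments {
        // checked_add guards against offset + len overflowing u64.
        match seg.offset.checked_add(seg.data.len() as u64) {
            Some(end) if end <= max_dense_size => {
                let (start, end) = (seg.offset as usize, end as usize);
                // Grow the image with zeros up to the end of this segment.
                if image.len() < end {
                    image.resize(end, 0);
                }
                image[start..end].copy_from_slice(&seg.data);
            }
            // Out of bound (or overflowing): keep it in the sparse list.
            _ => leftovers.push(seg),
        }
    }
    MemoryInit { image, leftovers }
}
```

In this sketch the image never exceeds `max_dense_size` bytes and the leftovers never exceed the bytes present in the module, which matches the O(kMaxDenseSize) + O(|wasm|) bound. A module that grows slightly past the limit only pays the eager-copy cost for the segments that crossed it.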
cc @alexcrichton
cfallin commented on issue #3830:
Here's a proof-of-concept of the issue: I produced this Wasm module by Wizening a SpiderMonkey-based JS project that includes a markdown renderer and does some pre-rendering at the toplevel (which runs at wizening time), and simulates some more init-time alloc/free by allocating a large array. (I played with the sizes until it got just past the threshold, for full disclosure, which is why this is "suspiciously close", but the point I was trying to convince myself of was that this is indeed possible to hit.) Source is here.
I observe the following stats:
[crates/environ/src/module.rs:372] memory_init_size = 11993088
[crates/environ/src/module.rs:372] data_size = 5963776
which is just under the 50%-dense threshold (5963776 / 11993088 is about 49.7%), so this module would fail to use memfd with the new heuristics.
alexcrichton commented on issue #3830:
cc https://github.com/bytecodealliance/wasmtime/pull/3831 to have the link here
I'm all for a fancier mmap scheme where we either have mmap+leftovers or something a bit fancier like up to N mmap images plus leftovers. I'm not sure off the top of my head how we'd determine what's appropriate for what module but I'm all for making this a more intelligent decision in Wasmtime.
alexcrichton labeled issue #3830:
alexcrichton closed issue #3830:
alexcrichton commented on issue #3830:
I think this was more-or-less done at the time and appears to have served us well in the meantime, so I'm going to close.
Last updated: Jan 24 2025 at 00:11 UTC