Stream: git-wasmtime

Topic: wasmtime / issue #5879 Use boxed slices instead of Vec wh...


Wasmtime GitHub notifications bot (Feb 24 2023 at 19:22):

jameysharp opened issue #5879:

Feature

When we store a Vec<T> inside another data structure, but never change its length, we can instead store a Box<[T]>.

Benefit

The size of Vec in memory is three machine words, but a boxed slice is only two machine words long. This doesn't really matter if the array is only stored on the stack. But if it's part of another data structure, switching to boxed slices can reduce the amount of heap memory we allocate. That may improve cache locality and have other performance benefits.
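For concreteness, a minimal sketch of the layout difference. The wrapper structs below are made up for illustration (they are not Cranelift types), and the byte counts in the comment assume a 64-bit target:

```rust
use std::mem::size_of;

// A Vec<T> field carries pointer + length + capacity; a boxed slice
// carries only pointer + length.
#[allow(dead_code)]
struct WithVec {
    elems: Vec<u32>,
}

#[allow(dead_code)]
struct WithBoxedSlice {
    elems: Box<[u32]>,
}

fn main() {
    // On a 64-bit target this is 24 bytes vs 16 bytes per instance.
    assert_eq!(size_of::<WithVec>(), 3 * size_of::<usize>());
    assert_eq!(size_of::<WithBoxedSlice>(), 2 * size_of::<usize>());
    println!(
        "Vec field: {} bytes, boxed-slice field: {} bytes",
        size_of::<WithVec>(),
        size_of::<WithBoxedSlice>()
    );
}
```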

Implementation

I did a quick experiment along these lines several months ago in commit 68d2ac2a1471a096d021be7f2cb2734843648ab8. That commit has merge conflicts with main now, but it illustrates the basics of how to make this change for one type, Vec<ArgPair>. I suspect there are many other uses of Vec in Cranelift that could have equally simple changes. There may be some in Wasmtime as well, though I'm personally focused on cases in Cranelift.
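A sketch of what such a change usually looks like at the type definition and its construction site. SigData and its field are hypothetical stand-ins; the commit mentioned above touches a Vec<ArgPair> field inside Cranelift, not this struct:

```rust
// Hypothetical stand-in type, for illustration only.
struct SigData {
    params: Box<[u32]>, // was: Vec<u32>
}

impl SigData {
    fn new(params: Vec<u32>) -> Self {
        // The construction site hands the buffer over as a boxed slice,
        // dropping the now-unused capacity field.
        Self {
            params: params.into_boxed_slice(),
        }
    }

    fn params(&self) -> &[u32] {
        // Read-only call sites are unchanged: both Vec<u32> and Box<[u32]>
        // deref to &[u32].
        &self.params
    }
}

fn main() {
    let sig = SigData::new(vec![1, 2, 3]);
    assert_eq!(sig.params(), &[1, 2, 3]);
}
```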

Pull requests addressing this issue should ideally include some measurements showing what effect the change has, because it's theoretically possible that this could make performance worse. I think it's more likely that there won't be much visible effect at all, but these changes shouldn't make the implementation any more complicated and could potentially improve performance. We won't know unless we try it and measure the results.

There are two kinds of measurements that would be great to do, with Wasmtime built in --release mode.

  1. Run valgrind --tool=dhat target/release/wasmtime compile on some WebAssembly module, both with the version of main you started from and with your patched version. I like to use our pulldown-cmark benchmark program for this because it's neither too small nor too big, so even with the overhead of DHAT's instrumentation it doesn't take too long.

  2. Run our Sightglass benchmark suite to compare the speed of compiling some benchmark programs with and without your changes. See the instructions in that project's README. For testing this issue, the --stop-after compilation option is useful; these changes should have no effect on the generated code or execution time, so you don't need to measure those.

Alternatives

It's fine to keep using Vec.

There might be cases where either SmallVec or our ListPool type from cranelift-entity is more appropriate than a boxed slice. SmallVec is an especially good choice if most of the time the array would fit in about two machine words, but it occasionally needs to be bigger. ListPool may be a good choice if we have a lot of the same type of array allocated at the same time, and the array elements implement EntityRef, which generally means they're 32-bit indices into an array somewhere else.
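For comparison, a hedged sketch of the SmallVec alternative, assuming the smallvec crate is available. The element type and inline capacity are made up, and ListPool is only described in a comment since its pool-based API differs from both:

```rust
use smallvec::{smallvec, SmallVec};

// Illustrative element type and inline capacity: up to two elements live
// inline in the SmallVec itself; longer arrays spill to the heap.
type ArgList = SmallVec<[u32; 2]>;

fn main() {
    let mut args: ArgList = smallvec![7];
    args.push(9); // still inline, no heap allocation yet
    assert!(!args.spilled());

    args.push(11); // exceeds the inline capacity and spills to the heap
    assert!(args.spilled());

    // ListPool/EntityList from cranelift-entity take a different approach:
    // all lists of one element type share a single backing vector and are
    // referred to by compact 32-bit handles; see that crate's docs.
}
```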

Wasmtime GitHub notifications bot (Feb 24 2023 at 19:22):

jameysharp labeled issue #5879

Wasmtime GitHub notifications bot (Feb 24 2023 at 19:27):

bjorn3 commented on issue #5879:

Note that converting from a Vec to a boxed slice may reallocate to shrink the capacity, which could dwarf the benefit.
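A small illustration of the hazard being described, assuming the Vec was grown incrementally and ended up with spare capacity:

```rust
fn main() {
    // A Vec grown incrementally usually ends up with spare capacity.
    let mut v = Vec::with_capacity(64);
    v.extend(0..10u32);
    assert!(v.capacity() >= 64);

    // into_boxed_slice must shrink the allocation to exactly `len` first,
    // which may reallocate and copy the elements.
    let b: Box<[u32]> = v.into_boxed_slice();
    assert_eq!(b.len(), 10);
}
```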

Wasmtime GitHub notifications bot (Feb 24 2023 at 19:46):

jameysharp commented on issue #5879:

Yeah, that's one reason why measuring the results is important.

Whether the reallocation occurs depends on how the boxed slice is constructed. I believe that using Iterator::collect to produce a boxed slice won't reallocate as long as the iterator gives exact results from size_hint, which I think is fairly common in Cranelift. If someone picks up this issue, I'll look at specific cases and help figure out whether boxed slices are likely to be a good fit on a case-by-case basis.
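A sketch of the two paths described here: collecting an iterator with an exact size_hint straight into a boxed slice (the standard library implements FromIterator for Box<[T]>), versus one whose length isn't known up front:

```rust
fn main() {
    // Ranges and `map` adapters report an exact size_hint, so `collect`
    // can allocate the final buffer at its exact size up front.
    let squares: Box<[u64]> = (0..16u64).map(|i| i * i).collect();
    assert_eq!(squares.len(), 16);

    // An iterator whose length isn't known up front (e.g. after `filter`)
    // has to go through a growable buffer first, so the final conversion
    // may need to shrink it.
    let odds: Box<[u64]> = (0..16u64).filter(|i| i % 2 == 1).collect();
    assert_eq!(odds.len(), 8);
}
```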

Also, I don't know how expensive it is to reallocate to a smaller layout with Rust's default allocator. My mental model of general-purpose heap allocators is that realloc to a smaller size doesn't copy the memory, just updates metadata in O(1) time. But that's not based on looking at any modern allocators and could be completely wrong. Do you know off-hand?

Wasmtime GitHub notifications bot (Feb 24 2023 at 19:58):

bjorn3 commented on issue #5879:

Most allocators divide the heap into size classes and use mmap directly for very large allocations. If shrinking moves an allocation into a smaller size class, I wouldn't be surprised if the data has to be copied to a chunk in the new size class. (At least one allocator I looked at stores no size information in an allocation header; it derives the size from the size class of the chunk the allocation lives in.) If the shrink doesn't cross into a smaller size class, it saves no memory at all, apart from dropping the capacity field, because allocation sizes are rounded up to the nearest size class anyway. For mmap-backed allocations, shrinking performs an expensive munmap syscall. And when an iterator with a correct size hint is collected directly into a boxed slice, the only space saved compared to collecting into a Vec is the capacity field.

Wasmtime GitHub notifications bot (Aug 21 2023 at 22:16):

OLUWAMUYIWA commented on issue #5879:

Hi @jameysharp, I'd like to pick this up. First time here, but I'm up to it.


Last updated: Nov 22 2024 at 16:03 UTC