alexcrichton requested wasmtime-compiler-reviewers for a review on PR #10404.
alexcrichton requested abrown for a review on PR #10404.
alexcrichton opened PR #10404 from alexcrichton:stack-maps-out-of-line
to bytecodealliance:main
:
This commit moves the storage of stack maps from being embedded within serde-encoded information to instead being stored in a separate ELF section in the final executable. The motivation for this is to make this more easily debuggable with a
wasmtime objdump
command in the future but this additionally should have the nice side effect of making non-stack-maps modules have smaller encoded information (no need to encode an empty list) and additionally make stack-maps-using-modules faster to decode (no serde decoding, it's already "decoded").This implements a scheme similar to the address map section where there's a "builder" for the section and then a separate half to decode the section. The same basic encoding, a bit map, is used. This is likely going to make accessing stack maps slightly slower, but if that's an issue we can tweak the representation and align things and/or use
usize
or such.<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
alexcrichton requested pchickey for a review on PR #10404.
alexcrichton requested wasmtime-core-reviewers for a review on PR #10404.
github-actions[bot] commented on PR #10404:
Subscribe to Label Action
cc @saulecabrera
<details>
This issue or pull request has been labeled: "cranelift", "wasmtime:api", "winch"Thus the following users have been cc'd because of the following labels:
- saulecabrera: winch
To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.
Learn more.
</details>
pchickey submitted PR review.
abrown submitted PR review:
Makes sense to me! I just noted some doubts I had while reading this for those who are more familiar with stack maps and their usage.
abrown created PR review comment:
Are we sure we want to re-export
U32Bytes
in the public API? I suspect the answer is "yes" but just wanted to note this.
abrown created PR review comment:
/// * Stack map data as 4-byte little endian integers.
Without looking below to see what this should be... is this correct?
alexcrichton updated PR #10404.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Whoops, yes!
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Yeah it's fine here in that
wasmtime-environ
is a "private crate" where we don't consider it part of Wasmtime's public API, just an internal implementation detail. We can probably do more to document that at the crate itself though.
fitzgen submitted PR review:
Overall, I really like this approach. Have opinions on some of the details tho.
fitzgen created PR review comment:
The idea here is that you want to use something of fixed size, rather than
usize
, for the bitsets serialized into the stack map section?It shouldn't actually matter, since we can only ever run
.cwasm
s that have the same sizeusize
as the host (whether native or pulley) right? Given that, is it worth the complications that become necessary here to support non-usize
scalars inside a compound bit set?
fitzgen created PR review comment:
Ah so the
* Stack map data
above is just a bag of bytes of the rest of the section, and when you get an offset into that bag of bytes, it is guaranteed to point at one of these? Could you clarify these binary format docs that the trailing data is a bunch of variably-sized data and that the initial PC and offset fields are essentially an index for searching through it without decoding every single stack map?
fitzgen created PR review comment:
This is kind of a strange API for this data structure to support. I'm not sure what it really is supposed to "mean". It feels like, if you want this API, you should actually just be working with a
Vec<ScalarBitSet<T>>
, because the underlying scalars used by this data structure really are just an implementation and data-representation detail and don't have any meaning on which values are vs are not in the set itself.
fitzgen created PR review comment:
- Maybe avoid pushing to
section
ifoffsets
is empty?- The name
offsets
was pretty confusing to me, since I was initially thinking of instruction offsets (i.e. theoffset
(necode_offset
) variable) and not frame offsets when reading this. Can we either keep the namecode_offset
foroffset
and/or renameoffsets
to{frame,stack}_offsets
?
fitzgen created PR review comment:
Rename this
BITS_PER_SCALAR
?
fitzgen created PR review comment:
Something to consider for the future: if we frequently have multiple sequential entries for different PCs but which have the same stack slots, eg
... 0x1dc: offset of [8, 12] 0x124: offset of [8, 12, 24] copy 1 0x142: offset of [8, 12, 24] copy 2 0x15a: offset of [8, 12, 24] copy 3 0x166: offset of [8] ...
then it may make sense for each entry in the index to store non-overlapping PC ranges, rather than exact PCs, and we could effectively dedupe the index entries and the stack map data. That is, the previous example would become
... 0x1dc..0x1dd: offset of [8, 12] 0x124..0x15b: offset of [8, 12, 24] (only copy) 0x166..0x167: offset of [8] ...
The downsides are that
- We would need to change Cranelift to actually emit empty stack maps for safepoints without any live GC refs, otherwise if we have
(pc=0x1234, [8]); (pc=0x1238, []); (pc=0x123b, [8])
and we don't see that middle entry in this builder, then we risk using[8]
as our stack map at pc0x1238
, which is extending a dead gc ref's lifetime at best and is giving the collector uninitialized data at worst.- Relatedly, we lose our ability to catch bugs where the return address PC we are tracing isn't an exact match for a stack map entry.
These are actually pretty scary, so maybe we don't want to do this, even if it would let us make these binary search indices much smaller.
All that said, we can actually already dedupe the stack map _data_ if we want to, and have multiple index entries point to the same stack map data (even if they aren't contiguous!) with the encoding scheme already in use in this PR. We just need to hash cons and cache stack-map-data to encoded offset in this builder. This doesn't have any of the downsides from above. Seems like it would be a pure win.
fitzgen created PR review comment:
In addition to what Andrew commented, how many integers?
N
doesn't seem right, it seems like we should logically haveN
compound bit sets, but each one can be variably sized. So something isn't clear here.
fitzgen created PR review comment:
Suggest adding "because our GC ref's stack slots are always 4-byte aligned" somewhere in this sentence.
fitzgen created PR review comment:
So back to
iter_words
: the usage here makes me think we should instead just expose a method likeimpl CompoundBitSet { pub fn as_raw(&self) -> &[u32] { ... } }
method or something, have it always operate on
u32
, and maybe (I haven't looked at the updated usage yet) add something like the following to allow zero-copy usage:impl CompoundBitSet { pub fn ref_from_raw<F, T>(raw: &[u32], f: F) -> T where F: for<'a> FnOnce(&'a CompoundBitSet) -> T { // unsafely create a CompoundBitSet wrapping the raw data, // pass a ref to the bitset into the closure, // mem::forget the bitset to avoid Vec dtors } }
rather than making it generic over the internal storage and doing a whole
iter_words
thing.
fitzgen created PR review comment:
Or alternatively, should we just fix the compound bit set's internal scalar bit set to
u64
oru32
scalars, instead ofusize
?
fitzgen created PR review comment:
The
CompoundBitSet::ref_from_raw
thing would help tidy up some of this stuff too, so that it doesn't have to have a copy of the bitset logic inline here.
fitzgen created PR review comment:
Since we have
range.end
handy, it probably makes sense to debug assert thatrange.start <= range.start + code_offset < range.end
fitzgen submitted PR review.
fitzgen created PR review comment:
Could avoid the branded lifetime (so that the thing could be stored in the usage site's iterator) by doing a
std::cell::Ref
-esque thing instead:pub struct BorrowedCompoundBitSet<'a> { bitset: core::mem::ManuallyDrop<CompoundBitSet>, _borrow: core::marker::PhantomData<&'a [u32]>, } impl core::ops::Deref for BorrowedCompoundBitSet<'_> { type Target = CompoundBitSet; // ... }
fitzgen edited PR review comment.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
This is an attempt to multiplex the (IMO correct) default behavior of pointer-sized-by-default with the cross-compilation behavior of "always use fixed-size things as it's easier". I don't want to use
usize
in the compiled ELF as that makes cross-compilation and debugging cross-compiled artifacts a bit harder.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I'm not sure I understand your confusion, but I'm happy to rename this and/or document it better. IMO
CompoundBitSet
is by definition aVec<ScalarBitSet<T>>
and the point of it is to do all the vector indexing for you so you don't have to think about it. In that sense I wouldn't want to work withVec<ScalarBitSet<T>>
raw and it also makes sense to me to expose an iterator or slice accessor as that's the definition of the representation.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
For (1) agreed and that happens within
StackMapSection
building. For (2) I'll add some assertions, good point!
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I like your latter idea, having a "pool of sets" and when we insert a stack map we shove it in a hash map and dedupe based on that. Agreed it's a pure win with no real degree of complexity overhead. Are you thinking that should be done here? (I'd sort of prefer to defer that to analyze a "real world module" with a lot of stack maps)
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I realize I can probably use
ScalarBitSet::iter
to simplify a lot of this. I'd prefer to avoid too wonky of an abstraction to useCompoundBitSet
literally somehow, but do you think that's an acceptable complexity tradeoff?
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Er, sorry.
To answer your original thoughts on this comment, I'd sort of defer to my response above. My literal response just now above this one is actually a response to your thought below
fitzgen submitted PR review.
fitzgen created PR review comment:
Deduping the stack map data can definitely be a separate PR
alexcrichton updated PR #10404.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
I've tried to document this better with some more art/boxes, but let me know if I can improve anything
fitzgen submitted PR review.
fitzgen created PR review comment:
How about this:
- we call this method
iter_scalars
- the iterator yields
ScalarBitSet<T>
items instead ofT
directly
alexcrichton updated PR #10404.
alexcrichton commented on PR #10404:
@fitzgen did you want to review this over again after the recent changes? Or have more thoughts?
fitzgen submitted PR review:
Thanks
alexcrichton updated PR #10404.
alexcrichton has enabled auto merge for PR #10404.
alexcrichton merged PR #10404.
Last updated: Apr 16 2025 at 22:03 UTC