wasmtime / PR #11768 Cranelift: add debug tag infrastruct... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / PR #11768 Cranelift: add debug tag infrastruct...

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:36):

cfallin opened PR #11768 from cfallin:cranelift-debug-tags to bytecodealliance:main:

This PR adds debug tags, a kind of metadata that can attach to CLIF instructions and be lowered to VCode instructions and as metadata on the produced compiled code. It also adds opaque descriptor blobs carried with stackslots. Together, these two features allow decorating IR with first-class debug instrumentation that is properly preserved by the compiler, including across optimizations and inlining. (Wasmtime's use of these features will come in followup PRs.)

The key idea of a "debug tag" is to allow the Cranelift embedder to express whatever information it needs to, in a format that is opaque to Cranelift itself, except for the parts that need translation during lowering. In particular, the DebugTag::StackSlot variant gets translated to a physical offset into the stackframe in the compiled metadata output. So, for example, the embedder can emit a tag referring to a stackslot, and another describing an offset in that stackslot.

The debug tags exist as a sequence on any given instruction; the meaning of the sequence is known only to the embedder, except that during inlining, the tags for the inlining call instruction are prepended to the tags of inlined instructions. In this way, a canonical use-case of tags as describing original source-language frames can preserve the source-language view even when multiple functions are inlined into one.

The descriptor on a stackslot may look a little odd at first, but its purpose is to allow serializing some description of stackslot-contained runtime user-program data, in a way that is firmly attached to the stackslot. In particular, in the face of inlining, this descriptor is copied into the inlining (parent) function from the inlined function when the stackslot entity is copied; no other metadata outside Cranelift needs to track the identity of stackslots and know about that motion. This fits nicely with the ability of tags to refer to stackslots; together, the embedder can annotate instructions as having certain state in stackslots, and describe the format of that state per stackslot.

This infrastructure is tested with some compile-tests now; testing of the interpretation of the metadata output will come with end-to-end debug instrumentation tests in a followup PR.

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:36):

cfallin requested alexcrichton for a review on PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:36):

cfallin requested wasmtime-compiler-reviewers for a review on PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:38):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:41):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 03:47):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 04:05):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 04:17):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 09:20):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Oct 01 2025 at 09:20):

bjorn3 created PR review comment:

Why are you removing this? If a value is in a fixed stack slot, then using this map would be more efficient than using debug tags that you need to attach to every instruction for native debugging.

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton submitted PR review:

I have some assorted thoughts below, but I'd second @bjorn3's question a bit too. Overall I'm at least not personally quite following why we need to plumb so much through Cranelift vs reading the result of Cranelift's compilation. I mentioned below that it felt like we should be using a newtype-u32 rather than Vec<DebugTag> in places, but additionally if we're already plumbing around StackSlot-to-offset as the result of compilation that seems like it's a big part of what we want already? The only other major missing piece would be Inst-to-offset-in-output which looks related to MachDebugTagPos logic here for example.

I may not be the best reviewer for this though perhaps. Or perhaps I should read the next PR to see how this is all used.

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

Should this have an else that does self.insts.remove(&to)?

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

How come this is a Vec<u8> vs a newtype-u32? (or other fixed-width integer) I would naively expect the latter where the embedder would be responsible for mapping the integer back to substantial data (which may or may not make sense to represent as Vec<u8>)

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

At a high level after skimming this PR I'm surprised in that I expected more handling of infrastructure or something internally. This is currently a map of Inst to Vec<DebugTag>, but that appears to be mostly it. I'm surprised by that where I was expecting some sort of integration with egraphs or something like that. Because otherwise wouldn't egraphs basically destroy the usefulness of the debug tags where instructions are remapped/moved/etc? I realize for debugging in Wasmtime we'd disable egraphs, but if the handling is so light in Cranelift I'm also not certain why this can't be an embedder-only thing vs needing to be baked into Cranelift.

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

I'm a bit surprised by this in the sense that most of Cranelift is centered around costly clones/etc of vectors. I would otherwise expect some sort of step up-front to turn Vec<DebugTag> into a DebugList(u32) descriptor and then that would be passed around here. Is it worth doing that sort of compile-time optimization at this point? Or are these infrequent/short enough to not really matter?

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

I realize that this is mostly repeating what's already here, but I'm getting a bit more worried over time that a MachBuffer is kilobytes-large and we sort of keep naively adding more SmallVecs which may result in a lot of wasted empty space. At what point is it better to use a Vec<T> vs SmallVec<[T; N]>?

Wasmtime GitHub notifications bot (Oct 01 2025 at 15:28):

alexcrichton created PR review comment:

Is it worth having any sort of interning here where tags is deduplicated with previous lists? I'm not far enough in my understanding yet to know whether it's going to be common to have the same set of debug tags frequently throughout a function

Wasmtime GitHub notifications bot (Oct 01 2025 at 16:49):

alexcrichton requested fitzgen for a review on PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 16:49):

alexcrichton requested alexcrichton for a review on PR #11768.

Wasmtime GitHub notifications bot (Oct 01 2025 at 17:41):

fitzgen submitted PR review:

Thanks Chris!

Also from the discussion in the cranelift meeting today:

we should re-add the sequence_point CLIF instruction

we should assert that the only instructions with debug tags are sequence_point intstructions

Wasmtime GitHub notifications bot (Oct 01 2025 at 17:41):

fitzgen created PR review comment:

Not every stack slot needs a descriptor, so it would be nice if this was a secondary map, and then all the existing callers for creating new stack slots wouldn't need to change.

Even if a descriptor becomes a u32/entity, I think this would still be nice to put in a side table.

Wasmtime GitHub notifications bot (Oct 01 2025 at 17:41):

fitzgen created PR review comment:

Basically would be this side table, in fact.

Wasmtime GitHub notifications bot (Oct 01 2025 at 17:41):

fitzgen created PR review comment:

But they could be? We are using indices and flat storage anyways, so why not give those indices a newtype as an entity and do
struct DebugTag(u32);
entity_impl!(DebugTag);

enum DebugTagData {
    User(u32),
    StackSlot(StackSlot),
}

pub struct DebugTags {
    tags: PrimaryMap<DebugTag, DebugTagData>,
    pool: ListPool<DebugTag>,
    insts: SecondaryMap<Inst, EntityList<DebugTag>>,
}
?

I guess this adds an indirection, but in return means that you recycle memory when creating new lists (prepending an existing list with new tags means that the existing list is "leaked" in the current data structure) and you have size classes and all that.

Not clear to me which side of the trade off is ultimately a better choice, but at minimum this comment about not being able to use EntityList is clearly a bit distracting :)

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:07):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:07):

cfallin created PR review comment:

This information isn't actually removed, just moved here as something accessible from the MachBuffer for slightly neater plumbing.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:13):

cfallin commented on PR #11768:

I have some assorted thoughts below, but I'd second @bjorn3's question a bit too. Overall I'm at least not personally quite following why we need to plumb so much through Cranelift vs reading the result of Cranelift's compilation. I mentioned below that it felt like we should be using a newtype-u32 rather than Vec<DebugTag> in places, but additionally if we're already plumbing around StackSlot-to-offset as the result of compilation that seems like it's a big part of what we want already? The only other major missing piece would be Inst-to-offset-in-output which looks related to MachDebugTagPos logic here for example.

To recap a bit what we discussed in today's Cranelift meeting for any other readers (or posterity):

The design here is more or less necessitated by supporting preservation across inlining (and we fully preserve debugging correctness across other optimizations too, so handling inlining as well doesn't seem out of place; moreover, excluding it seemed to me to be an unnecessary limit). In particular, inlining concatenates tags from the inlining call instruction to those of the inlined instructions (see this doc comment); this is what means we need a list of tags, rather than a single tag, and the need for these tags to stand alone (rather than be an indirect index into some other data) is forced by inlining as well.

The correctness across optimizations is ensured by placing tags on side-effectful instructions, which don't get removed. In the sibling PR for Wasmtime (#11769) I add them on all trapping instructions, and on callsites; an earlier iteration of this work put them on sequence-point instructions that acted as explicit immovable markers instead. I explained the appeal of that latter design for conceptual clarity in the meeting and we agreed to go back to sequence-point instructions (calls will also need tags, so the inlining mechanism works).

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:16):

cfallin commented on PR #11768:

Also:

Attaching tags to instruction indices and then recovering information at the output is error-prone for a number of reasons, but inlining again is what forces us to have tags as a first-class object that flow through the IR rather than live alongside it.

Cranelift needs to handle the translation of stackslot references from the IR level to the frame-layout level in the compilation metadata, so this part of the mechanism also needs to be first-class in the IR, and can't be indirected to the outside (wasmtime-cranelift or other embedder).

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:18):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:18):

cfallin created PR review comment:

Addressed in top-level comment but the tl;dr is: (i) things that are put in the IR here need to be first-class either because inlining pulls them between functions, and needs to operate semantically on them (concatenate list of virtual frames), or because Cranelift needs to translate/lower them somehow (stackslot refs); (ii) we do run and support full optimizations, and this works because tags are attached to side-effecting insts.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:20):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:20):

cfallin created PR review comment:

Unfortunately the tag lists for each point will be unique (they include the Wasm PC, for example), so deduplication won't really do much here. Separating the intern-into-list-pool step from the attachment isn't the worst idea, and would allow sharing in theory, but e.g. the wasmtime-cranelift embedder won't do that (each list is unique).

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:21):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 01 2025 at 21:21):

cfallin created PR review comment:

More that they are unique at each instance, so there isn't much value to sharing. That said, if we revert to my earlier sequence-point design, there is no need to attach tags to every instruction inserted by a cursor as this does here; rather they will be explicitly attached on one particular instruction; so we can get rid of this field and the clones from it.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:32):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:32):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:32):

cfallin created PR review comment:

Yes, wasn't needed before but is now (with stricter checking of tag presence on non-sequence-point insts) -- fixed!

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:33):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:33):

cfallin created PR review comment:

Removed this now, since tags can only be on sequence points and need not be spread across all inserted insts.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:42):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:42):

cfallin created PR review comment:

I poked at this a bit but I think it's probably not the right answer -- indirection would be a win if it led to substantial savings by sharing one copy of a large object, but in this case, DebugTagData is going to be 64 bits, so we already have a 50% overhead storing a separate u32 to index each one in the entity list, nevermind the pointer into that list, and the dependent load / extra hop to get there.

FWIW, while we do abandon the one tag list for a call instruction when doing inlining (three tags in practice in Wasmtime), that's the only case where a freelist inside the pool would help -- every other tag list is constructed once in its final form.

I'll go ahead and resolve this but others are free to poke at it further of course :-)

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:43):

cfallin commented on PR #11768:

I've re-added sequence points in the latest push, and I'll remove the built-in stackslot descriptor blobs (in favor of a u64 that in Wasmtime's case can hold a FuncKey) next.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:46):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:46):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:46):

cfallin created PR review comment:

That's a great question; way back in 2020 I found that smallvecs helped avoid realloc/copy traffic for the code buffer itself for small functions, but we've carried that pattern through to all the tags, some of which may be much more sparse. I'll switch these back to Vecs.

Wasmtime GitHub notifications bot (Oct 02 2025 at 00:47):

cfallin edited PR review comment.

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:00):

cfallin updated PR #11768.

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:02):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:02):

cfallin created PR review comment:

I ended up keeping this as a field in StackSlotData but

making it an Option<StackSlotKey> and

keeping StackSlotData::new()'s signature unchanged, while adding new_with_key

This feels a little nicer/more functional than having to set a side-table IMHO, and is less error-prone (the key is still part of the definition, not something that optimizations have to remember to move as well).

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:04):

cfallin created PR review comment:

Very good point -- after discussions in the Cranelift meeting today it seems a u64 is sufficient (concretely, Wasmtime needs the FuncKey associated with the stackslot so it can go look up the frame shape descriptor later); so I've defined a newtype StackSlotKey for this.

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:04):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 02 2025 at 04:04):

cfallin commented on PR #11768:

All comments are addressed now, I think -- thanks!

Wasmtime GitHub notifications bot (Oct 02 2025 at 05:41):

cfallin commented on PR #11768:

Alright, actually, now that I'm working through rebasing the Wasmtime half on this, I'm not convinced that removing the "blob of bytes descriptor" from the stackslot is actually the right move:

It's horribly complex to keep the side table and plumb it to the right places on the wasmtime-cranelift side, in particular around inlining. Inlining "just works" when the StackSlotData is self-describing. Now we need to carry the side-table on the wasmtime-cranelift side, let cranelift-codegen query callees and decide to inline via the Inline trait, but insert a side-effect somehow enacted by that trait's implementation to move descriptors behind the scenes back on the wasmtime-cranelift side. That seems very error-prone in comparison. (Note that because we implement a per-function compiler, there is no place for a single module-global side-table to exist and be updated by the parallel compilation.)

It's not any more efficient (we still are copying data from FuncEnvironment into the side-table, then between functions' side-tables when inlining). IIRC this was the original concern -- "too Vec-heavy" in some sense; but the descriptors still exist, just at arm's length and in a remote place that has to remain synchronized.

My wip branch is here for anyone curious but I didn't finish wiring up the part that actually queries/uses the descriptors; stopped when I realized what I would have to do in the InliningCompiler.

I pretty strongly feel that the way it worked before was simpler, more self-contained, and more maintainable; this approach is pushed in a bad direction by high-level feedback that may not have foreseen all the implications. Also, handling inlining is becoming the crux of the complexity, but only because of the review feedback (a sort of self-fulfilling prophecy) -- the original works fine.

@alexcrichton @fitzgen let me know what you think! I'll revert the last commit if you're OK with the above.

Wasmtime GitHub notifications bot (Oct 02 2025 at 05:48):

cfallin commented on PR #11768:

(To add a little more reasoning by analogy: everything else that goes into metadata in the compiled artifact comes directly from Cranelift, e.g. exception tables, source locations, unwind info, and the like; even the name_map, which is carried separately in CompiledFunction, comes directly from the CLIF, and the CLIF is updated by inlining and any other transforms. Following the precedent here, we should keep everything that describes what we'll emit in the IR itself; doing otherwise is constructing a parallel universe of info on the Wasmtime side that has to be maintained and updated carefully whenever we do interprocedural opts. The spirit of our RFC is to work with the compiler, not against it, and let it manage and translate our debugging instrumentation + metadata like any other part of the program.)

Wasmtime GitHub notifications bot (Oct 02 2025 at 14:57):

alexcrichton commented on PR #11768:

At a high level I'm not trying to make life more difficult here, but rather I'm trying to maintain a separation between Wasmtime and Cranelift to avoid having the "wasm mode" in Cranelift with a bunch of custom features just for Wasmtime. For example you say:

simpler, more self-contained, and more maintainable

but following this to the extreme means that we throw away the ability to embed Cranelift anywhere else and fuse it to Wasmtime specifically. That will objectively simplify Cranelift and make it more self-contained for Wasmtime's use case for example. Despite this though I don't think it makes sense to do, so this is part of a gray area of sorts where I thinks it's worth exploring solutions that avoid tacking on Wasmtime-specific features into Cranelift.

The reason I feel this is Wasmtime-specific is how this relates to other Cranelift features feels different. For example UserExternalName is heavily used by Wasmtime and is just an opaque 64-bit integer effectively from Cranelift's point of view. It would objectively be simpler if we didn't have to deal with this and just moved FuncKey directly into Cranelift, but I view it as valuable that we don't do that. Having a tight API boundary is also nice for future compatibility/extensibility in that it makes working on Cranelift, outside the context of Wasmtime, easier.

This is my high-level motivation at least which I wanted to explain. I'm not intentionally trying to make this harder to work with but basically trying to maintain a separation between Wasmtime and Cranelift. This is of course, as with everything, a balance to figure out what works and what doesn't. I don't want to over-rotate on "they must be separate" but I also don't want to over-rotate on "this works let's just ship it".

For this problem specifically, I could very well be missing details, but I was under the impression that the idea was to tack a 64-bit integer/value/etc onto a Descriptor. That'd be the serialization of FuncKey which means, at the end of compilation, we know what origin function a stack slot corresponds to. What I was then thinking would happen is that CompiledFunction in Cranelift would retain metadata describing the stack slot (locals, etc). During the linking phase we'd need to build the metadata section in the ELF, and we'd use Cranelift's MachBuffer output to find what stack slots a function has. We'd then use this FuncKey to look up in the map of all functions to go to the metadata that the function wants, and then the section would be emitted.

It's horribly complex to keep the side table and plumb it to the right places on the wasmtime-cranelift side

Along the lines of my initial comment here, I want to reiterate that I'm not trying to make things horribly complex. If they are then they are and we should go a different route.

For now though I'm not sure myself at least why inlining would come into play given that there's a stack-slot-per-function that's significant so the descriptor is the FuncKey for that function. I'm also not sure why carrying around the side table would be difficult if the per-function metadata were placed into a CompiledFunction and then looked up during the final linking phase.

It's not any more efficient

In some sense we've long since given up on this level of efficiency. There's a ton of inefficiencies in the "don't clone this" or "don't move this" category related to Cranelift compilation and it's something we've never really focused on. I don't mean to claim "if we remove this Vec we're retain our lightning-fast compilation" but moreso "if we can I think it'd be good to avoid taking this debt on". For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots. That naively seems like it would clone per-call within a function. Now I have no idea how big the metadata is and it's always solvable by also slapping a few Arcs around.

Basically efficiency isn't a huge point for me, it's just something I noticed. The higher-level point to me is that it feels like a wart in Cranelift's design where everywhere else embedder-related things are integers but here it happens to be a Vec<u8> because that happens to be what we want for this exact use case in Wasmtime

Wasmtime GitHub notifications bot (Oct 02 2025 at 15:30):

cfallin commented on PR #11768:

So, a few things:

I don't think I'm arguing to make Cranelift Wasmtime-specific at all. I agree that's a bad end-goal; I chose Vec<u8> for the stack-slot descriptor specifically because it was an opaque representation.

I do think this case is a bit different. Your mention of the UserExternalName example is illustrative: CLIF itself contains a list of all external names referenced; this list is legible to and kept updated by the inliner. That's a good example of "IR contains all relevant info", and we're assured that we'll get it right. In contrast, here, externalizing the information about the frame slot causes some really subtle and surprising possible footguns. To dig into your sketch a bit:

For this problem specifically, I could very well be missing details, but I was under the impression that the idea was to tack a 64-bit integer/value/etc onto a Descriptor. That'd be the serialization of FuncKey which means, at the end of compilation, we know what origin function a stack slot corresponds to. What I was then thinking would happen is that CompiledFunction in Cranelift would retain metadata describing the stack slot (locals, etc). During the linking phase we'd need to build the metadata section in the ELF, and we'd use Cranelift's MachBuffer output to find what stack slots a function has. We'd then use this FuncKey to look up in the map of all functions to go to the metadata that the function wants, and then the section would be emitted.

That's fine until we get to the point of looking up the metadata. Note that the metadata (the stackslot format) is an artifact of the compilation of the original function; it's not computed separately or just part of the source or whatnot. So we need a merge step. The simple answer here is to let each function emit only its own original source-level-function metadata as it's appended to the text section.

But there are at least two problems with this:

What happens if we do function-level DCE? (We can't; or we need a special "we still need your metadata as a donation to the cause, but you don't get to go in .text" mode.)

We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices. (In today's setup all descriptors are local to the function so we just lazily create them as needed during translation of progpoints.)

And this refactors the whole trait setup in Compiler -- right now it's explicitly one pass with an "append to text section" method.

This is part of what I mean by new complexity -- the need to keep things in sync, and rely on the invariant that someone else will later emit that metadata, and the need to compute the mapping that the layer of indirection introduces, makes me a bit squeamish. In contrast, IR that is fully self-describing gives us solutions to all of the above for free.

Wasmtime GitHub notifications bot (Oct 02 2025 at 15:32):

cfallin commented on PR #11768:

For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots. That naively seems like it would clone per-call within a function.

I don't think that's true -- descriptors were copied as part of the compilation pipeline (same as serialized data would be in a hypothetical refactor), but callsites only need to create tags to refer to them.

Wasmtime GitHub notifications bot (Oct 02 2025 at 16:07):

bjorn3 commented on PR #11768:

We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices.

This could be derived from the FuncKey the same way ExternalName is, can't it?

Wasmtime GitHub notifications bot (Oct 02 2025 at 16:25):

cfallin commented on PR #11768:

We actually need another pre-pass to create the mapping from FuncKeys to frame-table descriptor indices.

This could be derived from the FuncKey the same way ExternalName is, can't it?

Unfortunately not: ExternalName is built by directly embedding the FuncKey (here), which is a bitpacked u64 not necessarily in a contiguous sequence; whereas we need indices into a descriptor table in the frame-table section that are contiguous.

Wasmtime GitHub notifications bot (Oct 02 2025 at 20:00):

cfallin commented on PR #11768:

I've pushed this a little further still and the next impedance mismatch is that frame slots can be at different FP offsets when inlined; so one can build a global table from FuncKey to descriptor, but part of the descriptor varies depending on the use-site, so one needs another layer of descriptors to reference those descriptors and add an FP offset.

Lest this be accounted under "complexity of inlining", I'll note again that everything just works when the IR is self-describing; trying to spread the concept of a descriptor across two layers, when the thing it is attached to is a builtin IR entity that is freely manipulated by the compiler, is the source of the problems...

Wasmtime GitHub notifications bot (Oct 02 2025 at 20:04):

alexcrichton commented on PR #11768:

Well at this point it seems like the way forward is Vec<u8> as a descriptor entry in stack slots. That being said I still feel like it's an unresolved point to me. It's not the end of the world to paper over and just see how it works out, so I don't feel the need to block anything. That being said I'll still put my thoughts down:

... chose Vec<u8> for the stack-slot descriptor specifically because it was an opaque representation.

... Your mention of the UserExternalName example is illustrative: ...

Personally I don't know how to grapple with these two things. On one hand it's ok for external symbols to be two 32-bit integers when they're more naturally represented in many other places as strings. This then works well enough during inlining that there's no reason to change this. For stack slots, though, you're saying it makes more sense to use a bag-of-bytes descriptor despite symbols, for example, intentionally (I assume?) not using that. I also don't understand why a bag-of-bytes or integer-on-stack-slot is any different w.r.t. inlining, both strategies simply copy whatever's there to somewhere else. The only difference is the interpretation later on and then it's either the bytes are there or a lookup table is needed.

Now I agree that looking up the metadata by key is slightly more tricky. I don't understand why it's so tricky it's not worth pursuing, though. The merge step already exists in the form of linking everything together and we already have to have mapping logic related to handling relocations. This is certainly a tricky step, but I don't understand why building a map at the beginning would make it any more meaningfully trickier.

This is part of what I mean by new complexity -- the need to keep things in sync, and rely on the invariant that someone else will later emit that metadata, and the need to compute the mapping that the layer of indirection introduces,

I don't fully grasp the increase in complexity here. My assumption is that a descriptor for a stack slot is more-or-less a function of a FuncKey. Given that there's not really anything to keep in sync other than something needs to produce a map of FuncKey to descriptor and each function compilation would produce its descriptor as part of the CompiledFunction. While DCE seems like something to keep in mind that doesn't seem like an insurmountable amount of complexity (e.g. the function is passed in with a flag saying "don't put this in the text section").

Overall I agree it's different than what's already here today, but I again don't fully understand the rationale about the drastic increase in complexity.

In contrast, IR that is fully self-describing gives us solutions to all of the above for free.

...

For example in the old iteration it looked like whenever a Callee is made it would clone the descriptor field of stack slots.

I don't think that's true ...

I meant this line from the previous version of this PR.

Wasmtime GitHub notifications bot (Oct 02 2025 at 22:46):

cfallin commented on PR #11768:

Alright, I managed to cobble a solution together (update after rebase on the Wasmtime side here). Apologies for the pushback -- it's a subjective design thing and I'm still not super-happy with the coupling, versus fully self-describing IR, but at least this is possible. It did require some small changes to the signatures in the compiler traits and you can see the extra pass over functions but at least it's enabled only when the tunable is set.

(I think that by redesigning the serialization format a bit to separate out the slot-offset-from-FP from the offset-within-slot, I can share descriptors between inlined copies of functions rather than emitting them separately for each function that uses them; I'll poke at that on the Wasmtime side.)

Wasmtime GitHub notifications bot (Oct 03 2025 at 17:03):

alexcrichton commented on PR #11768:

This all looks reasonable to me, thanks! A skim over #11769 all looks pretty reasonable to me too. I'll defer to @fitzgen for final approval on this though

Wasmtime GitHub notifications bot (Oct 03 2025 at 20:17):

bjorn3 submitted PR review.

Wasmtime GitHub notifications bot (Oct 03 2025 at 20:17):

bjorn3 created PR review comment:

Maybe use PackedOption?

Wasmtime GitHub notifications bot (Oct 06 2025 at 19:07):

fitzgen commented on PR #11768:

I'll defer to @fitzgen for final approval on this though

I looked over the usage in the PR stacked on top of this and things looked pretty reasonable, didn't seem like there needed to be a lot of bookkeeping of extra info in Wasmtime that otherwise would have been inside of Cranelift, so LGTM :+1:

Wasmtime GitHub notifications bot (Oct 06 2025 at 19:16):

fitzgen submitted PR review.

Wasmtime GitHub notifications bot (Oct 06 2025 at 19:37):

cfallin submitted PR review.

Wasmtime GitHub notifications bot (Oct 06 2025 at 19:37):

cfallin created PR review comment:

I don't necessarily want to take any bites out of the domain of the arbitrary u64 -- it's already used by Wasmtime to carry a FuncKey which has its own bit-level encoding (rather than a serial index). Hopefully though stackslots are sparse enough (vs. e.g. instructions or blocks) that this is OK...

Wasmtime GitHub notifications bot (Oct 06 2025 at 20:01):

cfallin merged PR #11768.

Last updated: Feb 24 2026 at 05:28 UTC