Stream: git-wasmtime

Topic: wasmtime / issue #2459 Implement wasmtime GC tracing with...

Wasmtime GitHub notifications bot (Jul 28 2022 at 22:57):

@fitzgen now that #4431 is merged (:tada: ) is this issue subsumed as well? I.e. I wonder if GC tracing works with unwinding info omitted?

Wasmtime GitHub notifications bot (Jul 28 2022 at 22:58):

cfallin commented on issue #2459:

(the other part of this issue is I guess discussing ways to avoid the need for stackmaps entirely, but the toplevel problem description is basically just "make do without libunwind")

Wasmtime GitHub notifications bot (Aug 01 2022 at 17:32):

fitzgen commented on issue #2459:

Yes, this issue should be resolved, but I'll leave it open until I can verify whether we can disable dynamically registering unwind info without breaking perf or what have you.

Wasmtime GitHub notifications bot (Aug 04 2022 at 16:34):

fitzgen commented on issue #2459:

I'm actually going to close this as complete, and handle the unwind info and all that in https://github.com/bytecodealliance/wasmtime/issues/4554

Wasmtime GitHub notifications bot (Aug 04 2022 at 16:34):

fitzgen closed issue #2459:

In wasmtime, GC tracing of reftype pointers currently works by using libunwind to iterate over stack frames, fetching a stackmap for each relevant PC and finding the stack slots with live pointers.

This works perfectly fine, but is potentially slower than we would like, because libunwind relies on DWARF info to understand stack frames. It is also more complex -- it relies on DWARF generation and interpretation to be correct -- which increases risk a little because GC-tracing bugs can lead to various security issues.
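To make the current approach concrete, here is a minimal Rust sketch of stackmap-based tracing. It is an illustration under stated assumptions, not Wasmtime's actual code: `Frame`, `Stackmap`, `StackmapRegistry`, and `trace_frames` are hypothetical names, the frames are assumed to have already been produced by an unwinder, and the machine stack is modeled as a slice of words with SP as an index into it.

```rust
use std::collections::BTreeMap;

/// Hypothetical stackmap: bit `i` set means the word at `sp + i` holds a live reference.
struct Stackmap {
    bits: u64,
}

/// Hypothetical registry mapping a safepoint PC to its stackmap.
struct StackmapRegistry {
    by_pc: BTreeMap<usize, Stackmap>,
}

impl StackmapRegistry {
    fn lookup(&self, pc: usize) -> Option<&Stackmap> {
        self.by_pc.get(&pc)
    }
}

/// A frame as an unwinder would report it (assumption: PC plus SP as a word index).
struct Frame {
    pc: usize,
    sp: usize,
}

/// Collect every live reference found in the given frames.
fn trace_frames(frames: &[Frame], registry: &StackmapRegistry, stack: &[u64]) -> Vec<u64> {
    let mut roots = Vec::new();
    for frame in frames {
        // Frames without a stackmap (e.g. host frames) hold no traced references.
        if let Some(map) = registry.lookup(frame.pc) {
            let mut bits = map.bits;
            while bits != 0 {
                let slot = bits.trailing_zeros() as usize;
                roots.push(stack[frame.sp + slot]);
                bits &= bits - 1; // clear the lowest set bit
            }
        }
    }
    roots
}
```

In this sketch the expensive and risky step is the one left out: producing `frames` in the first place, which is where libunwind and DWARF come in.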

In contrast, many other high-performance JITs use explicit data structures of some sort on the stack so that tracing stack roots boils down to walking a linked list of some sort. For example, SpiderMonkey has a strict JIT-frame discipline allowing fast iteration (different from the system ABI), and V8 indirects object references through InstanceHandles that link themselves into a list on the thread context.

We should look into designing a mechanism that maintains a stack of frames reachable from the vmctx and walkable without any metadata (aside from the stackmaps). Two options that come to mind are:

Option 1: Shadow Stack of (SP, Stackmap) Tuples

Maintain a shadow stack (with top and limit pointers in vmctx) of (SP, stackmap) tuples. On function entry, allocate a tuple. At every safepoint, ensure that the stack pointer and stackmap for that safepoint are up-to-date in the tuple. Walking the stack for GC roots then simply requires (i) looping over these tuples, and (ii) tracing references at offsets indicated by the stackmap.

An advantage of this scheme is:

It is a relatively small delta from today's implementation. It requires inserting code at prologue and epilogue/returns to alloc/dealloc the tuple, and at every safepoint to store the stackmap pointer and SP value.

Some disadvantages of this scheme are:

The stackmap pointer would have to be indirected somehow; we cannot bake the raw pointer value into the code if the code is cached on disk.

It will slightly increase memory traffic at every safepoint, as in addition to the spills of all references inserted by the register allocator, we have to store stackmap and SP pointers.
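Option 1 could be sketched roughly as follows in Rust. This is a simulation under assumptions, not a proposed implementation: all names (`ShadowStack`, `FrameEntry`, `StackmapId`, `trace_roots`) are hypothetical, the machine stack is modeled as a slice of words with SP as an index, a stackmap is modeled as a list of word offsets from SP, and the stackmap is referenced by a table index rather than a raw pointer to reflect the indirection concern above.

```rust
/// Hypothetical index into a stackmap table, so no raw pointer is baked into code.
#[derive(Clone, Copy, Debug)]
struct StackmapId(usize);

/// Hypothetical stackmap: word offsets from SP that hold live references.
struct Stackmap {
    live_slots: Vec<usize>,
}

/// One (SP, stackmap) tuple per active frame.
#[derive(Clone, Copy, Debug)]
struct FrameEntry {
    sp: usize,
    stackmap: Option<StackmapId>, // None until the frame reaches a safepoint
}

struct ShadowStack {
    entries: Vec<FrameEntry>,
    limit: usize,
}

impl ShadowStack {
    fn new(limit: usize) -> Self {
        ShadowStack { entries: Vec::with_capacity(limit), limit }
    }

    /// Prologue: allocate a tuple, checking against the limit.
    fn enter(&mut self) -> Result<(), &'static str> {
        if self.entries.len() == self.limit {
            return Err("shadow stack overflow");
        }
        self.entries.push(FrameEntry { sp: 0, stackmap: None });
        Ok(())
    }

    /// Safepoint: record the current SP and the stackmap for this safepoint.
    fn safepoint(&mut self, sp: usize, stackmap: StackmapId) {
        let top = self.entries.last_mut().expect("safepoint outside any frame");
        top.sp = sp;
        top.stackmap = Some(stackmap);
    }

    /// Epilogue/return: deallocate the tuple.
    fn leave(&mut self) {
        self.entries.pop();
    }
}

/// GC root walk: loop over the tuples and read the slots each stackmap names.
fn trace_roots(shadow: &ShadowStack, maps: &[Stackmap], stack: &[u64]) -> Vec<u64> {
    let mut roots = Vec::new();
    for entry in &shadow.entries {
        if let Some(StackmapId(id)) = entry.stackmap {
            for &slot in &maps[id].live_slots {
                roots.push(stack[entry.sp + slot]);
            }
        }
    }
    roots
}
```

The two stores in `safepoint` correspond to the extra memory traffic listed as a disadvantage above.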

Option 2: Shadow Stack of References

Maintain a shadow stack that actually stores spilled references. On function entry, bounds-check that there is enough shadow-stack space for the maximal live-set of reftyped values at any safepoint in the function (this is statically known). At any safepoint, push all live reftyped values to the shadow stack; after the safepoint, restore them (if we implement a moving GC that may edit pointers) or bulk-pop them by bumping the top pointer.

Some advantages of this scheme are:

It is independent of reftypes support in Cranelift and regalloc.rs; in other words, it is very simple and easy to verify. While we are pretty confident in the reftypes implementation at least in the backtracking allocator at this point, less complexity is always good; and it lowers the bar for adopting other register allocators in the future, if we choose to do that.

Tracing will be as fast as possible; we literally provide the GC with a &[PointerT] (slice of contiguous live pointers). This is even better than walking a potentially sparse stackmap looking for set bits.

It adds no memory traffic relative to the status quo unless a reftyped value was already spilled; we are simply replacing regalloc spills with stores to our shadow stack.

A disadvantage of this scheme is:

It adds a little memory traffic when a reftyped value was already spilled at a safepoint: it will be loaded from the spillslot and then pushed onto the shadow stack. Note that this does not need to be explicitly handled (shadow-stack code is inserted before regalloc, so regalloc will just Do The Right Thing and reload the spilled value), but it is suboptimal.
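Option 2 could be sketched roughly as follows in Rust. Again this is a simulation under assumptions rather than a proposed implementation: `RefShadowStack` and its methods are hypothetical names, references are modeled as plain `u64` words, and the code that would be emitted around a safepoint is written here as ordinary method calls.

```rust
/// Hypothetical shadow stack that stores the spilled references themselves.
struct RefShadowStack {
    slots: Vec<u64>, // fixed-capacity backing storage
    top: usize,      // number of slots currently in use
}

impl RefShadowStack {
    fn new(capacity: usize) -> Self {
        RefShadowStack { slots: vec![0; capacity], top: 0 }
    }

    /// Function entry: bounds-check against the function's maximal live set,
    /// which is statically known.
    fn check_space(&self, max_live: usize) -> Result<(), &'static str> {
        if self.slots.len() - self.top < max_live {
            Err("shadow stack overflow")
        } else {
            Ok(())
        }
    }

    /// Before a safepoint: push all live references. Returns the base index
    /// needed to pop them again. Assumes `check_space` succeeded on entry.
    fn push_live(&mut self, live: &[u64]) -> usize {
        let base = self.top;
        self.slots[base..base + live.len()].copy_from_slice(live);
        self.top += live.len();
        base
    }

    /// After a safepoint: reload the values (a moving GC may have edited
    /// them) and bulk-pop by resetting the top index.
    fn pop_live(&mut self, base: usize, live: &mut [u64]) {
        live.copy_from_slice(&self.slots[base..base + live.len()]);
        self.top = base;
    }

    /// What the GC sees: one contiguous slice of live references.
    fn roots(&mut self) -> &mut [u64] {
        &mut self.slots[..self.top]
    }
}
```

With a non-moving GC the reload in `pop_live` could be skipped, leaving only the reset of `top`.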

I tentatively prefer Option 2, but I can see both options as viable. Thoughts?

cc @fitzgen

Last updated: Apr 17 2025 at 07:03 UTC