Stream: git-wasmtime

Topic: wasmtime / Issue #1832 externref: implement stack map-bas...


view this post on Zulip Wasmtime GitHub notifications bot (Jun 05 2020 at 23:37):

github-actions[bot] commented on Issue #1832:

Subscribe to Label Action

cc @peterhuene

<details>
This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jun 06 2020 at 12:27):

tschneidereit commented on Issue #1832:

Very exciting to see this! \o/

One question: does this have any implications for deterministic behavior? ISTM that as long as we don't have weak references anywhere in the system it probably doesn't, right?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 08 2020 at 17:07):

fitzgen commented on Issue #1832:

One question: does this have any implications for deterministic behavior? ISTM that as long as we don't have weak references anywhere in the system it probably doesn't, right?

This has deterministic (albeit perhaps surprising if you don't know the implementation) behavior. We only GC when either the embedder calls Store::gc or when the VMExternRefActivationsTable is full (which happens after passing in N externrefs to Wasm).

FWIW. VMExternRef doesn't currently support weak references. Shouldn't be too hard to add if we find we need it. It wouldn't change the determinism of GC though.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 08 2020 at 18:16):

fitzgen commented on Issue #1832:

Very nice! Before I dig too closely into the internal details I wanted to get a grasp of how this is all organized and such. The main thoughts I have is that this is adding a new registry and a lot of separate points where we pass around registries and such. I'm hoping we can perhaps unify all these with existing registries and such we have? Internally it feels better if we don't have an arc and rwlock per-thing that we need to keep track of.

I originally did add this to the existing global frame info, but after our discussions about not wanting implicit global context, I moved it out to its own registry.

One other thing I'm remembering now which I think would be good to happen here, I think this should include an implementation of WasmTy for ExternRef. We'll want to enable closures with Func that take ExternRef as arguments and such, and it'd be cool to see what the monomorphize logic looks like for functions that take a number of ExternRef or produce them.

Yeah, I have this on my TODO list for a follow up PR. Felt like this was big enough as is.

Finally I wanted to write some thoughts about the GC aspect. It feels a bit weird to me that we're using reference counting but still have to have explicit GC points. To make sure I understand, GC only happens right now automatically when you call into a function, right? Or are there other auto-inserted points that GC happens? At a minimum I think we need to make sure that all long-running code is eventually GC'd, so I think both entry and exit needs GC'ing (calling into a host and returning back to wasm may already do the GC, I likely missed it!).

I wanted to dig a bit more into the decision though to use deferred reference counting rather than explicit. Is it possible to somewhat easily get performance number comparisons? For example do we have a handle on what we predict the overhead will be? I'd imagine there are possible optimizations where local.get/local.set don't do reference counting but calling a function does.

Regarding when GC happens see my reply to Till: https://github.com/bytecodealliance/wasmtime/pull/1832#issuecomment-640756824

For long-running Wasm, in the absence of explicit Store::gc calls, GC will happen as the VMExternRefActivationsTable fills up. I don't think it makes sense to GC on entry/exit from every single host<-->wasm function boundary, because it will be too expensive for simple getter/setter host functions that you expect to generally be fast. It might makes sense for the embedder to call Store::gc after the "main" call into Wasm (e.g. after servicing an HTTP request, but not after each wasm -> host call to read an HTTP header; however in this case there should be zero Wasm frames on the stack, so we could provide an unsafe method to just clear the VMExternRefActivationsTable rather than even bothering to walk the stack and do the GC).

Regarding the decision to go with deferred reference counting, let's look at the alternative: we would have to do explicit reference count increments and decrements for references inside Wasm frames, rather than deferring them. First, to get this working at all, this would require extending cranelift-wasm to add ref count increments every time we pushed a reference onto the spec-defined Wasm execution stack (e.g. every local.get), and ref count decrements every time it is popped off the Wasm execution stack (e.g. every local.set). But doing this many reference counting operations is the thing kills reference counting's throughput. So to alleviate that we could either do deferred reference counting, or we could add a static analysis to cranelift-wasm that sums the ref count increments and decrements in a basic block and does only the final effect once (hand waving a bit, would need to think about this a bit more). Then, if coalescing ref count operations within a basic block isn't enough, this static analysis could perhaps further be generalized across basic blocks (really hand waving now).

Note that we can't only do reference counting at function call boundaries because we need to handle the case where a Wasm function takes a reference and drops it. We need to decrement the reference count at the drop site, which is what instrumenting the spec stack pops does.

So that is a bunch of infrastructural work on Cranelift to get something that will perform worse but at least works, followed by more work to start moving towards fewer ref count operations (but never getting as few as deferred reference counting: zero, albeit with occasional stack walking pauses).

On the other hand, we already have stack maps produced by Cranelift, which are the necessary bits required for implementing deferred reference counting. We don't need to build any Cranelift infrastructure and Wasmtime's integration, just the latter. However, yes, this does come with occasional stack-walking pauses and reliance on libunwind or some other way of walking the stack (but also we already require it for unwinding host functions after traps).

So that was basically the calculus: an easier path to getting a better implementation (or at least an easier path to getting a good implementation, if we are comparing against the hypothetical best version of the ref counting coalescing static analysis).

Unfortunately, I don't have benchmarks or performance numbers. It would be hard to get this information without implementing both approaches. And it is hard to know how many reference counting operations we could coalesce with the hypothetical static analysis. But I don't really have any doubt that naively doing all the increments and decrements for on-stack references would be quite slow: this is Well Known(tm) in the memory management world.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 09 2020 at 11:08):

tschneidereit commented on Issue #1832:

This has deterministic (albeit perhaps surprising if you don't know the implementation) behavior.

Great, thank you for the explanation!

One additional question: will it be possible to use the same infrastructure for stack tracing once we start implementing one of the GC proposals? Seems like that should be the case, but I might well be missing something.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 09 2020 at 16:45):

fitzgen commented on Issue #1832:

One additional question: will it be possible to use the same infrastructure for stack tracing once we start implementing one of the GC proposals? Seems like that should be the case, but I might well be missing something.

Yep, doing that should be a lot easier after this lands.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 14:45):

alexcrichton commented on Issue #1832:

Sorry I haven't had a chance to read and fully digest your response yet @fitzgen (will do later today), but I wanted to comment here with a thought before I forget it. In https://github.com/bytecodealliance/wasmtime/issues/1845 it was found that we can't actually backtrace through other host JIT code (e.g. the CLR) on all platforms. I think that this GC implementation requires getting a full backtrace at all times, right? If, for example, a host call in the CLR triggered a GC it would get a smaller view of the world than actually exists and could cause a use-after-free?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 16:41):

alexcrichton commented on Issue #1832:

Ok now to respond in full! One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was sort of largely do we expect wasmtime to grow some other time it automatically calls it? For example if an application never calls Store::gc is it guaranteed that, regardless of what wasm runs, memory won't be leaked?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hingest on the previous question of it should never be required to call Store::gc to prevent leaking unbounded amounts of memory.

For alternative strategies of reference counting, to clarify I don't think that we should be reverting this and switching to explicit reference counting. I'm mostly probing because the rationale against reference counting feels a bit hand-wavy to me and given the cost of relying on stack unwinding and stack maps I'd just want to make sure we're set.

To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree local.{set,get} should not do reference counting at all. What I'm imagining is that there's a set of "live anyref" in a function frame that cranelift keeps track of. Instructions like call and {global,table}.{get,set} will increment the reference count for received anyref values and passed away anyref values. When the function returns cranelift would then inject a decrement of the reference count for all active anyref values in the frame.

I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that local.{get,set} are the lion's share of reference count activities so by deferring that we'd get the lion's share of the benefit.

Thinking more on this though unwinds are a really important thing to account for here too. I believe the stack walking strategy perfectly handles unwinds (since you'll just gc later and realize that rooted things aren't on the stack any more), but anything with explicit code generation will still need to do something on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

Again though to be clear I'm trying to get it straight in my head why we're using deferred reference counting rather than explicit reference counting. I don't mean to seem like I'm challenging the conclusion, it's mostly that I just want to feel confident that we don't hand-wave too much by accident. For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 16:55):

bjorn3 commented on Issue #1832:

Again though to be clear I'm trying to get it straight in my head why we're using deferred reference counting rather than explicit reference counting.

Deferred reference counting is more performant.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 17:43):

fitzgen commented on Issue #1832:

Deferred reference counting is more performant.

Being this short and this absolute is not helpful. It isn't a 100% cut and dry issue, as the lengthy discussion weighing its trade offs shows. Please engage with nuance in the future. Thanks.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 17:46):

bjorn3 commented on Issue #1832:

Ok, sorry.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 19:15):

alexcrichton commented on Issue #1832:

How does SpiderMonkey unwind frames within a single JitActivation? Do they have a custom unwinder which can sort of start halfway down the stack?

Otherwise another possible alternative is we could perhaps have a flag in Store of "am I in wasm?" and some sort of "blessed" frame on the stack, where if we don't find that frame then we know the stack trace is incomplete and skip GC entirely?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:06):

fitzgen commented on Issue #1832:

How does SpiderMonkey unwind frames within a single JitActivation? Do they have a custom unwinder which can sort of start halfway down the stack?

Yes, they have their own unwinder that understands only their own JIT'd frames.

Otherwise another possible alternative is we could perhaps have a flag in Store of "am I in wasm?" and some sort of "blessed" frame on the stack, where if we don't find that frame then we know the stack trace is incomplete and skip GC entirely?

This would avoid the unsoundness, but wouldn't let us clean up the garbage, so it isn't super attractive...

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:14):

peterhuene commented on Issue #1832:

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code and hence complete walks with the system unwinder is possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder.

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:20):

peterhuene edited a comment on Issue #1832:

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code and hence complete walks with the system unwinder is possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for unwinding the wasm frames as we support libunwind and Windows).

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:20):

peterhuene edited a comment on Issue #1832:

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code and hence complete walks with the system unwinder is possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for walking the wasm frames as we support libunwind and Windows).

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:23):

peterhuene edited a comment on Issue #1832:

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code and hence complete walks with the system unwinder is possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for walking the wasm frames as we support libunwind and Windows; on Windows this is possible with RtlVritualUnwind, but I'm ignorant of the libunwind equivalent).

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 20:24):

peterhuene edited a comment on Issue #1832:

For .net, presumably msft's stack walker knows how to walk CLR frames. Perhaps we could just use their stack walker on windows? It isn't clear to me how to deal with .net on linux/macos if they refuse to integrate with the standard unwinding conventions of the platform...

.NET Core's walker isn't readily available to use outside of the CLR and would be burdensome to liberate. Additionally, this actually isn't a problem on Windows because .NET Core uses the Windows unwind information format to internally represent its JIT code and hence complete walks with the system unwinder is possible (but only on Windows). That said, there might be other JIT-based runtimes out there that don't register their code with system unwinders, so a general solution would probably be warranted.

I think your other proposed solution makes the most sense, similar to SpiderMonkey, to record enough context at wasm-to-host and host-to-wasm transition points to enable a wasm-specific walk within Wasmtime itself rather than relying on a system unwinder for a complete trace (we could still rely on a system unwinder for walking the wasm frames as we support libunwind and Windows; on Windows this is possible with RtlVirtualUnwind, but I'm ignorant of the libunwind equivalent).

This would also solve the trap backtrace issue more generally such that we can then guarantee correct wasm traces regardless of encountering a frame that has no system unwind information registered.

It might also be useful to enhance our trap traces so we can clearly indicate to users where such transitions occur. For instance, that would be useful to show that a trap came from the host rather than user wasm code (e.g. show a top frame of [host frames omitted] rather than the wasm frame that called into the host).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:01):

fitzgen commented on Issue #1832:

Ok now to respond in full! One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was sort of largely do we expect wasmtime to grow some other time it automatically calls it? For example if an application _never_ calls Store::gc is it guaranteed that, regardless of what wasm runs, memory won't be leaked?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hingest on the previous question of it should never be required to call Store::gc to prevent leaking unbounded amounts of memory.

We could add a timer-based GC, but this does make the timing of when destructors are called non-deterministic.

The only situation where there could be long-running wasm without any GC is if the Wasm goes into a long-running loop that doesn't use any references. If it did use references (i.e. put them in tables/globals) then the VMExternRefActivationsTable would eventually fill up and trigger a GC.

I don't think recommending a Store::gc call on exit of the "main" wasm call is too onerous for embedders (especially given that they may not want to do it that way if there isn't a single "main" call, and instead let it come naturally) but we could also provide an RAII class or somehow do it by default by counting number of active calls into wasm there are at any given time for the thread.

To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree local.{set,get} should not do reference counting at all. What I'm imagining is that there's a set of "live anyref" in a function frame that cranelift keeps track of. Instructions like call and {global,table}.{get,set} will increment the reference count for received anyref values and passed away anyref values. When the function returns cranelift would then inject a decrement of the reference count for all active anyref values in the frame.

I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that local.{get,set} are the lion's share of reference count activities so by deferring that we'd get the lion's share of the benefit.

I see more where you are going with this now.

Cranelift already does this to some degree when generating stack maps for safepoints: it asks the register allocator which values are still live after this point, filters for just the references, and generates a stack map for them based on their locations.

This is really late in the compilation pipeline though: it requires cooperation with the register allocator. I don't think we can do exactly that same approach for ref counting operations, because it is probably too late to insert new instructions (let alone whole new blocks for checking if the refcount reached zero inline, and only calling out to a VM function if so).

I think we could do something similar in cranelift-wasm, before the IR enters Cranelift's pipeline properly. (Once it goes into Cranelift, it is essentially "too late" to insert these hooks, as Cranelift doesn't ever call back out into user code at that point, and introducing it would be a bunch of work and also not necessarily wanted.) Still feel a bit hand wavy about this, and the more I think about it, the more similar it feels to the final form of the hypothetical static analysis.

but anything with explicit code generation will still need to do _something_ on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?

For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.

Yes, if we use non-deferred reference counting for on-stack references, traps will have to decrement reference counts as they unwind through Wasm frames. I hadn't thought about this either. This does end up looking very similar to stack maps, but also with personality routines throne into the mix. The difference is that failure to unwind stacks properly leads to leaks in this case, rather than unsafety.


Ultimately, I'm not 100% sure whether it makes more sense to keep investing in this stack maps-based approach to deferred reference counting, or to start fresh and try non-deferred reference counting.

I had been leaning towards exploring non-deferred reference counting, but I hadn't thought of the need to decrement reference counts when unwinding through wasm. When considering the effort required to do that properly, I'm less sure now.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:10):

fitzgen commented on Issue #1832:

Oh one other thing I wanted to mention: if we do keep going with this PR's approach and introduce a JitActivation analog, it isn't clear to me yet whether we can keep using libunwind (via the backtrace crate) for unwinding through a single JitActivation's wasm frames, or if we would need to implement our own stack walker that only handles the kinds of frames generated by Cranelift.

In order to keep using libunwind, we would need a way to initialize its unwind context manually ourselves, and not just with the current youngest stack frame's register values. We would also need to plumb that through the backtrace crate. I am not sure whether libunwind supports this or not. I'm going to look into it.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:14):

fitzgen edited a comment on Issue #1832:

Ok now to respond in full! One thing I'm still a bit murky and/or uncomfortable on is when GC is expected to happen. I understand that it automatically happens when tables fill up or if you explicitly call it, but my question was sort of largely do we expect wasmtime to grow some other time it automatically calls it? For example if an application _never_ calls Store::gc is it guaranteed that, regardless of what wasm runs, memory won't be leaked?

Another question is how we'd document this. For embedders using anyref, what would be the recommended way to call this? I think "call it after the 'main call'" is pretty sound advice, but that hingest on the previous question of it should never be required to call Store::gc to prevent leaking unbounded amounts of memory.

We could add a timer-based GC, but this does make the timing of when destructors are called non-deterministic.

The only situation where there could be long-running wasm without any GC is if the Wasm goes into a long-running loop that doesn't use any references. If it did use references (i.e. put them in tables/globals) then the VMExternRefActivationsTable would eventually fill up and trigger a GC.

I don't think recommending a Store::gc call on exit of the "main" wasm call is too onerous for embedders (especially given that they may not want to do it that way if there isn't a single "main" call, and instead let it come naturally) but we could also provide an RAII class or somehow do it by default by counting number of active calls into wasm there are at any given time for the thread.

To play devil's advocate a bit, I'm not convinced that function-boundary reference counting is impossible. I agree local.{set,get} should not do reference counting at all. What I'm imagining is that there's a set of "live anyref" in a function frame that cranelift keeps track of. Instructions like call and {global,table}.{get,set} will increment the reference count for received anyref values and passed away anyref values. When the function returns cranelift would then inject a decrement of the reference count for all active anyref values in the frame.

I agree that this may need some intrinsic work in Cranelift, but I don't think that this would require optimizations about coalescing reference counts. The main idea would be that local.{get,set} are the lion's share of reference count activities so by deferring that we'd get the lion's share of the benefit.

I see more where you are going with this now.

Cranelift already does this to some degree when generating stack maps for safepoints: it asks the register allocator which values are still live after this point, filters for just the references, and generates a stack map for them based on their locations.

This is really late in the compilation pipeline though: it requires cooperation with the register allocator. I don't think we can do exactly that same approach for ref counting operations, because it is probably too late to insert new instructions (let alone whole new blocks for checking if the refcount reached zero inline, and only calling out to a VM function if so).

I think we could do something similar in cranelift-wasm, before the IR enters Cranelift's pipeline properly. (Once it goes into Cranelift, it is essentially "too late" to insert these hooks, as Cranelift doesn't ever call back out into user code at that point, and introducing it would be a bunch of work and also not necessarily wanted.) Still feel a bit hand wavy about this, and the more I think about it, the more similar it feels to the final form of the hypothetical static analysis.

but anything with explicit code generation will still need to do _something_ on unwinding. And that 'something' is arguably just as hard to get right as stack maps themselves.

I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?

For me, though, the nail in the coffin is that in a world of explicit reference counting unwinding needs to be handled somehow, and the solution seems like it'd be very similar if not just stack maps in one form or another.

Yes, if we use non-deferred reference counting for on-stack references, traps will have to decrement reference counts as they unwind through Wasm frames. I hadn't thought about this either. This does end up looking very similar to stack maps, but also with personality routines thrown into the mix. The difference is that failure to unwind stacks properly leads to leaks in this case, rather than unsafety.


Ultimately, I'm not 100% sure whether it makes more sense to keep investing in this stack maps-based approach to deferred reference counting, or to start fresh and try non-deferred reference counting.

I had been leaning towards exploring non-deferred reference counting, but I hadn't thought of the need to decrement reference counts when unwinding through wasm. When considering the effort required to do that properly, I'm less sure now.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:35):

peterhuene commented on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:41):

peterhuene edited a comment on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

If there are no wasm-to-host transitions (i.e. the top frame is wasm or a Wasmtime function such as the signal or GC handler), then we can assume the walk from the current context perhaps? There shouldn't be any foreign unregistered frames that would prevent the walk in that case.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:43):

peterhuene edited a comment on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

If there are no wasm-to-host transitions (i.e. there's only Wasm frames on top or a Wasmtime function such as the signal or GC handler), then we can assume the walk from the current context perhaps? There shouldn't be any foreign unregistered frames that would prevent the walk in that case.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:43):

peterhuene edited a comment on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

If there are no wasm-to-host transitions (i.e. there's only Wasm frames on top or a Wasmtime function such as the signal or GC handler), then we can assume the walk from the current context, skipping any initial non-wasm frames, perhaps? There shouldn't be any foreign unregistered frames that would prevent the walk in that case.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 22:45):

peterhuene edited a comment on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

If the wasm-to-host transitions stack is empty (i.e. there's only Wasm frames on top or a Wasmtime function such as the signal or GC handler), then we can assume the walk from the current context, skipping any initial non-wasm frames, perhaps? There shouldn't be any foreign unregistered frames that would prevent the walk in that case.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 23:04):

fitzgen commented on Issue #1832:

I assume we can capture the context at the wasm-to-host transition and use unw_init_local to initialize the unwind cursor at that context. We could then step the cursor until we encounter a non-wasm frame.

Yes, we could do this. Adding an out-of-line function call (unw_getcontext) to every wasm->host transition isn't great though. (On arm, it is actually a macro of some inline assembly, but since Rust doesn't have inline assembly, we would probably wrap that in a C function that we call anways...)

AFAICT, there is no blessed way of re-initializing an existing unw_context_t from just a PC and SP, which is all that should really be required for stack walking in principle. On both x86-64 and arm, unw_context_t is just a define of a ucontext_t, so maybe we could just poke the ucontext_t's register values directly...

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 23:06):

peterhuene commented on Issue #1832:

Agreed, if we go with such a design, I would want to limit the context capturing only to host functions not defined by Wasmtime (and perhaps only via the C API, where there is a chance of another JIT runtime being used). Perhaps it could even be something embedders opt-in to.

The bottom line is that we shouldn't have to pay the cost for capturing context when calling into Wasmtime's WASI implementation especially, as we are guaranteed to be able to unwind through those frames for both libunwind and Windows.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 23:08):

peterhuene edited a comment on Issue #1832:

Agreed, if we go with such a design, I would want to limit the context capturing only to host functions not defined by Wasmtime (and perhaps only via the C API, where there is a chance of another JIT runtime being used). Perhaps it could even be something embedders opt-in to when defining a function.

The bottom line is that we shouldn't have to pay the cost for capturing context when calling into Wasmtime's WASI implementation especially, as we are guaranteed to be able to unwind through those frames for both libunwind and Windows.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 23:11):

peterhuene commented on Issue #1832:

All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 10 2020 at 23:16):

peterhuene edited a comment on Issue #1832:

All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts (the .NET API doesn't support reference types yet anyway).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 07:03):

bjorn3 commented on Issue #1832:

Maybe there is some way for the .NET implementation to insert a frame whose unwind info doesn't do normal unwinding, but sets the sp to the value before entering the .NET JITtes code, such that all .NET JITted code is skipped. Or has a special personality function that invokes the .NET unwinder and then sets the register state as the .NET unwinder gives back when unwinding past the JITted code.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 07:03):

bjorn3 edited a comment on Issue #1832:

Maybe there is some way for the .NET Wasmtime implementation to insert a frame whose unwind info doesn't do normal unwinding, but sets the sp to the value before entering the .NET JITtes code, such that all .NET JITted code is skipped. Or has a special personality function that invokes the .NET unwinder and then sets the register state as the .NET unwinder gives back when unwinding past the JITted code.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 09:13):

tschneidereit commented on Issue #1832:

All that said, I'm comfortable with landing a stack-walking-based GC before we fix the issue that currently only affects .NET hosts (the .NET API doesn't support reference types yet anyway).

While I think that might be fine to do, I'm actually quite glad we have this embedding and thus caught this issue.

What's more, even if we find a workaround specifically for .NET, I think it might make sense to take a closer look at whether we can then have sufficient confidence in our ability to make this approach work in all cases that might become relevant. I.e., would landing this begin painting ourselves into a corner that's increasingly hard to get out of at some later time when we realize we need to?

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 14:28):

alexcrichton commented on Issue #1832:

@fitzgen

I'm not sure what you're talking about here. What explicit code generation? What code generation do we do at all that isn't handled by Cranelift?

Oh by "explicit code generation" I mean "the thing that isn't deferred reference counting".


So overall I feel like my thinking above is still somewhat attractive. That scheme is where on a call to Store::gc it doesn't actually do anything if we're missing some blessed marker on the stack (or some other mechanism to detect that our stack trace is likely incomplete).

The downside of that is that runtimes like .NET will GC less, but that seems like it's not really that much of an issue given that our "GC" here is very small. The GC'd aspect is you gave a value to wasm and then it became unreachable during the execution of wasm. If we don't eagerly clean those up it doesn't seem like it's the end of the world, especially because we'll want to eventually fix this.

I think that explicit reference counting is a defunkt strategy given our unwinding strategy. I don't really see how implementing unwinding correctly with explicit reference counting is going to be any less hazardous or any less work than this already was. If that's taken as an assumption then the final appearance of this feature will be exactly this PR except with more changes to the gc function and codegen (to manage JitActivation entries and such). @fitzgen brings up a good point though that the backtrace crate will likely no longer be what we use.

Personally I see a few directions to go from here:

  1. Land this PR, work afterwards to fix .NET
  2. Don't land this PR, fix .NET, then land this PR. (e.g. JitActivation or similar)
  3. Switch this PR to explicit reference counting.

I don't think (3) is feasible and I think 1/2 are pretty close. I think it might be good to sketch out in more detail how JitActivation could work though.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 16:02):

tschneidereit commented on Issue #1832:

Personally I see a few directions to go from here

I find this convincing, and support (1), given that @peterhuene also voiced support for it.

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 16:34):

fitzgen commented on Issue #1832:

Ok so we talked about this in today's wasmtime meeting. The conclusion we came to was to implement a canary stack frame in this PR for now, so we can detect situations where libunwind cannot walk the full stack, and then we can skip GC rather than get potential unsoundness. But long term, we want to do the precise JitActivation approach (it would allow other additional benefits, such as getting wasm stack traces for .net traps, which aren't currently available).

Thanks everyone for the discussion and design input!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 21:51):

fitzgen commented on Issue #1832:

@alexcrichton ok, I've implemented the stack canary and rebased, so this should be ready for another round of review.

We will probably want a new release of backtrace before we merge this (or at least we'll need one before we publish again after merging this).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 21:54):

alexcrichton commented on Issue #1832:

Wow that was fast!

I'll take a look tomorrow!

view this post on Zulip Wasmtime GitHub notifications bot (Jun 11 2020 at 22:33):

github-actions[bot] commented on Issue #1832:

Subscribe to Label Action

cc @bnjbvr

<details>
This issue or pull request has been labeled: "cranelift"

Thus the following users have been cc'd because of the following labels:

To subscribe or unsubscribe from this label, edit the <code>.github/subscribe-to-label.json</code> configuration file.

Learn more.
</details>

view this post on Zulip Wasmtime GitHub notifications bot (Jun 12 2020 at 23:12):

fitzgen commented on Issue #1832:

Apart from the inline comments the only other thing I'd say is that I think it'd be good to look into not having two registrations per module (GlobalFrameInfoRegistration and StackMapRegistration) and trying to lump that all into one (also deduplicating the BTreeMap lookups and such)

Ah, I just remembered why it has to be this way: the GlobalFrameInfo is in the wasmtime crate, but we need the stack map stuff inside the wasmtime-runtime crate, so it can't be added as part of GlobalFrameInfo. The split between wasmtime and wasmtime-runtime crates is a bit annoying...

view this post on Zulip Wasmtime GitHub notifications bot (Jun 15 2020 at 23:10):

fitzgen commented on Issue #1832:

I had to cfg(target_arch = "x86_64") the reference types-related tests because Cranelift doesn't support safepoints/stack maps for any other targets (notably aarch64).

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2020 at 00:48):

fitzgen commented on Issue #1832:

Confused why the aarch64 CI job is running the reference types tests despite them being ignored if !cfg!(target_arch = "x86_64"). Feel like I must be missing some simple typo or something like that, but I just don't see it...

view this post on Zulip Wasmtime GitHub notifications bot (Jun 16 2020 at 00:51):

fitzgen commented on Issue #1832:

Gah, its because the build.rs is being compiled for x64.


Last updated: Oct 23 2024 at 20:03 UTC