xtuc opened issue #5732:
Feature
When the Wasm instance traps, it's sometimes difficult to understand what happened. Post-mortem debugging using coredumps (which is extensively used in native environments) would be helpful for investigating and fixing crashes.
Wasm coredumps are especially useful for serverless environments where production binaries are stripped and/or have limited access to logging.
Implementation
Implement Wasm coredumps as specified by https://github.com/WebAssembly/tool-conventions/blob/main/Coredump.md.
Note that the spec is early and subject to change. Feedback very welcome! cc @fitzgen
bjorn3 commented on issue #5732:
Reading the linear memory after a crash is already possible. As for getting the locals and stack values, this is much more complicated. Wasmtime uses the Cranelift optimizing compiler, which can eliminate locals and stack values entirely, and places those that remain at whichever location it likes. It would be necessary to somehow prevent optimizing locals away, at least at points where a trap could happen. There is debugger support that generates debuginfo recording the locations of locals and stack values which aren't optimized away, but I'm not sure if it is 100% accurate. By the way, https://github.com/bytecodealliance/wasmtime/issues/5537 is somewhat relevant to this.
xtuc commented on issue #5732:
I don't think Wasm coredump should prevent optimizations, given that ideally it's enabled by default.
It's not uncommon to see coredumps in native environments with missing values because they were optimized away. They are usually not very helpful for debugging.
bjorn3 commented on issue #5732:
The wasm coredump format doesn't seem to allow omitting values that are optimized away, but if it were allowed, then it should be possible to implement without too many changes to Cranelift. I think it would need some changes to the unwind table generation code to store the location of callee-saved registers, but that will need to be done anyway for handling exceptions. After that, I guess it would be a matter of telling Cranelift to generate debuginfo and then, during a crash, unwinding the stack and recording all preserved locals and stack values for every frame from Wasmtime.
xtuc commented on issue #5732:
The wasm coredump format doesn't seem to allow omitting values that are optimized away
Correct, at the moment it doesn't. I'm going to add it, thanks for your input!
jameysharp commented on issue #5732:
This is an area I haven't dug into much, but doesn't Cranelift's support for GC already support tracking the information we need for this? I think we would need to mark potentially-trapping instructions as "safe points" and then request stack maps from Cranelift. And my impression was that calls are already considered safe points. But this is all conjecture based on a CVE that I was peripherally paying attention to last year, so I could have it all wrong.
fitzgen commented on issue #5732:
This is an area I haven't dug into much, but doesn't Cranelift's support for GC already support tracking the information we need for this? I think we would need to mark potentially-trapping instructions as "safe points" and then request stack maps from Cranelift. And my impression was that calls are already considered safe points. But this is all conjecture based on a CVE that I was peripherally paying attention to last year, so I could have it all wrong.
Stack maps only track reference values (r32/r64), and only say which stack slots have live references in them. They do not supply any kind of info to help tie that back to wasm locals or even clif SSA variables.
I don't think we would want to use stack maps for this stuff.
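To illustrate the gap fitzgen describes, here is a sketch in Rust of the two shapes of information involved. These types are purely illustrative (they are not Cranelift's actual API): a stack map records only which slots hold live references at a safe point, while a coredump needs a per-local location map.

```rust
// Illustrative types only, not Cranelift's real data structures.

/// Roughly what a stack map provides: which stack slots contain live
/// r32/r64 references at a safe point. No connection to Wasm locals.
struct StackMapInfo {
    /// Offsets (from the frame base) of slots holding live references.
    live_ref_slots: Vec<u32>,
}

/// Roughly what coredump generation would need instead: a location
/// (or "optimized away") for every Wasm local in the frame.
struct FrameDebugInfo {
    /// Indexed by Wasm local index; `None` means the value was eliminated.
    locals: Vec<Option<ValueLoc>>,
}

/// Where a surviving value lives at the snapshot point.
enum ValueLoc {
    Reg(u8),        // machine register number
    StackSlot(u32), // offset from the frame base
}
```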
cfallin commented on issue #5732:
On the flip side, if you're proposing altering the generated code to assist debugging observability @jameysharp, there is a large design space that we haven't really explored. A relatively simple change would be to define a pseudoinstruction that takes all locals as inputs, with "any" constraints to regalloc (stack slot or register), and to insert these wherever a crash could happen. This "state snapshot" instruction would then guarantee observability of all values, at the cost of hindering optimization.
This goes somewhat against the "don't alter what you're observing" principle that is common in debug infrastructure, but I'll note that we do already have some hacks to keep important values alive (in this case, the vmctx, which makes all other wasm state reachable) for the whole function body.
There's also the "recovery instruction" approach, used in IonMonkey at least: whenever a value is optimized out, generate a side-sequence of instructions that can recompute it. That's a much larger compiler-infrastructure undertaking but in principle we could do it, if perfect debug observability were a goal.
xtuc commented on issue #5732:
https://github.com/WebAssembly/tool-conventions/issues/198 has been closed. The coredump format now allows marking local/stack values as missing.
xtuc commented on issue #5732:
I made a change to add initial/basic coredump generation: https://github.com/bytecodealliance/wasmtime/pull/5868. Could you please have a look and let me know if this is the right direction?
It uses WasmBacktrace for information about frames.
xtuc commented on issue #5732:
Basic coredump generation has been merged (thanks!).
Now, to have the complete debugger experience, we need to collect the following information:
- Wasm locals of each stack frame
- A snapshot of the Wasm linear memory (sounds relatively easy; it's not clear to me where the coredump code should live, though).
From @cfallin:
A relatively simple change would be to define a pseudoinstruction that takes all locals as inputs, with "any" constraints to regalloc (stack slot or register), and insert these wherever a crash could happen. This "state snapshot" instruction would then guarantee observability of all values, at the cost of hindering optimization.
This "state snapshot" instruction could be translated from Wasm's
(unreachable)
instruction. I'm curious about the performance impact. Since the coredump feature is behind a flag, would it make sense to experiment with that approach?
Is my understanding correct that it won't help identifiying the values?
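For context on what the merged basic support looks like from an embedder's side, here is a minimal sketch of capturing and serializing a core dump on trap. This is not code from the thread: it assumes the `Config::coredump_on_trap` and `WasmCoreDump` APIs found in later Wasmtime releases, and exact method signatures may differ between versions.

```rust
// Minimal sketch, assuming a recent Wasmtime with coredump support enabled.
use wasmtime::{Config, Engine, Instance, Module, Store, WasmCoreDump};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    config.coredump_on_trap(true); // attach a core dump to trapping errors

    let engine = Engine::new(&config)?;
    let module = Module::new(
        &engine,
        r#"(module (func (export "boom") unreachable))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let boom = instance.get_typed_func::<(), ()>(&mut store, "boom")?;

    if let Err(trap) = boom.call(&mut store, ()) {
        // The trap error carries the core dump; downcast it and serialize
        // to the tool-conventions format for post-mortem tooling.
        if let Some(dump) = trap.downcast_ref::<WasmCoreDump>() {
            let bytes = dump.serialize(&mut store, "boom.coredump");
            std::fs::write("boom.coredump", bytes)?;
        }
    }
    Ok(())
}
```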
RyanTorok commented on issue #5732:
Is there a chance we could revive this thread? I'm working on cloud infrastructure research, and being able to take a stack snapshot in wasmtime would allow us to get some sophisticated cold-start optimizations for Function-as-a-Service (FaaS) functions.
There has been a plethora of academic papers published about using execution snapshots to speed up the cold-start (startup) time in FaaS, especially when heavyweight VMs are involved. Starting up a Module in wasmtime tends to be faster than VMs by 2-3 orders of magnitude, but recent papers have also explored how to snapshot the state of the function after some initialization runs, which has a lot in common with what Wizer does.
I am trying to extend this idea with a construction called _Nondeterministic Generators_, which will allow FaaS functions to be snapshotted at any point in the execution. Generators rely on the observation that a function whose execution has not performed any invocation-specific computation (i.e. anything using the function arguments or any nondeterministic functions imported from the host) can be unconditionally snapshotted and used to fast-forward future invocations of the same function.
In addition, we can create conditional snapshots that let application developers optimize for common patterns, such as functions that want to check that their arguments are valid before they perform their expensive initialization, which traditional "init function"-based cold-start speedup techniques cannot optimize without breaking the function semantics if the invocation-specific invariant is violated (e.g. our argument validation fails).
I was looking into Wizer quite a bit and the design decisions it makes, and I was hoping to get some insight about the requirements Wizer lists on its docs.rs page, under "Caveats":
- The initialization function may not call any imported functions. Doing so will trigger a trap and wizer will exit.
Is this just a lint against the produced module being potentially non-portable (the snapshot would rely on the outcome of a particular host's implementation of the imported function), or is there a more fundamental reason this is not possible? I imagine my generator design having the potential to snapshot any time just before a generator is polled (polling calls an import function, so the host can record the outcome of the generator function), which would necessitate snapshotting after code that has already called into the host at least once if we have multiple generators.
- The Wasm module may not import globals, tables, or memories.
I don't anticipate the application code running on my system needing any of these, but I'd like some clarification about why this applies to the entire module and not just the init function, as with host functions.
- Reference types are not supported yet. This is tricky because it would allow the Wasm module to mutate tables, and we would need to be able to snapshot the new table state, but funcrefs and externrefs don't have identity and aren't comparable in the Wasm spec, which makes snapshotting difficult.
This makes sense. Application code in my system should not need to use these.
More fundamentally, the major roadblock to my design working with WebAssembly modules is wasmtime's current inability to snapshot the WebAssembly _stack_. Since my design allows the execution to snapshot at any point, not just after some initialization function runs (as Wizer supports), my design would require all the application's local state to be moved to a Memory before we snapshot, which would slow down function execution and be a very awkward paradigm to program in.
My main question (and I apologize for taking a page to get there) is: what roadblocks would need to be overcome in order to make stack snapshots possible in wasmtime? Since it will be relevant below, I should point out that the requirements for my use case are actually a bit looser than Wizer's in two ways:
I don't necessarily care that a snapshot is actually in the form of a new WebAssembly Module that can be instantiated and run on its own. I just want my host to be able to store _something_ that lets it fast-forward a module to the point where a snapshot occurred, possibly by instantiating the original module and overwriting the globals, memory, and stack. Likewise, I'm not concerned about portability of the snapshot. We can assume that the snapshot will be loaded on the same Engine (and therefore the same version of wasmtime) it was produced on.
I don't require the host to necessarily do all the snapshotting work on its own. If we can invoke a callback that allows the application, through a library it links to, to, say, copy the stack to a Memory object so it can be snapshotted, that should suffice.
I had the intuition that the application library could just run some WebAssembly code that copies the locals on the stack into a Memory object, but I was concerned about how wasmtime would behave when we restored such a stack. Unlike the core-dumping use case, I'm less concerned about the actual contents of the stack in relation to cranelift's dead-code elimination (DCE); however, I am concerned about the following: if, during the run that produced the snapshot, cranelift decides by DCE to eliminate an unnecessary value from the stack, is it possible that when we restore that stack in a new instantiation of the module that skips to the snapshot, cranelift won't perform the same optimization and it will try to pop a value off the stack that isn't there? If I had one reason for writing this comment, it's that I would really appreciate some clarification on how this compilation process works, what guarantees are in place, and how that might affect our endeavor to produce restorable stack snapshots.
Thanks everyone for reading. You all do great work, and I'd love to contribute going forward.
bjorn3 commented on issue #5732:
Something that may work: if you reuse the exact same compiled machine code, then you could take a snapshot of the part of the native stack that contains the wasm frames and restore it later. You would have to fix up pointers (which probably requires emitting extra metadata and maybe some changes to avoid keeping pointers alive across function calls) and make sure that no native frames are on the stack, as those can't safely be snapshotted. By keeping the same compiled machine code you know that the stack layout is identical. Wasmtime already allows emitting compiled wasm modules (.cwasm extension) and loading them again. You would only need to implement the stack snapshotting and pointer fixups. This is still not exactly trivial, but likely much easier than perfectly reconstructing the wasm vm state.
The initialization function may not call any imported functions. Doing so will trigger a trap and wizer will exit.
I would guess this is a combination of there being no way to hook up any imported functions from the host to wizer, and this limitation ensuring that there is no native state that wizer can't snapshot. But I'm not a contributor to it, so it is nothing but a guess.
cfallin commented on issue #5732:
@RyanTorok there are a lot of interesting ideas in your comment (I have to admit that I skimmed it in parts; I'd encourage a "tl;dr" of points for comments this long!). A few thoughts:
The fundamental issue that would have to be solved to snapshot and restore an active stack is relocation of mappings in the address space. In principle one could copy an image of the whole stack and all data segments, current PC and all registers, map them into a new execution at a later time and restart as if nothing changed... except that the heap and the stack will be at different locations than before.
In order to make that work, one has to prevent "host" addresses from escaping, or else precisely track where they escape to, or some combination. An example of the latter is the frame-pointer chain: one has addresses that point to the stack on the stack itself, but that's OK because one can precisely traverse the linked list and rewrite saved FPs if the stack moves. Likewise for return addresses. An example of the former is handling Wasm heap accesses. If we somehow ensure that only Wasm-level addresses (offsets to the heap) are "live" at snapshot points, and the only live address is the vmctx, except ephemerally when addresses for all other accessed memory are derived from it, then that could work. But that requires some compiler support, I think.
Restoring a native-level snapshot after optimizing the code a different way is a complete non-starter, I think. (I believe this is what you're referring to when speaking of Cranelift DCE working differently in a different run.) Many incidental details of the compiled code can change if the input changes: the layout of blocks, the registers and stackslots that the register allocator assigns for particular values, existence of some values in the function causing optimization of different values to go differently, etc.
Another option that I think you refer to is a Wasm-level snapshot. This is interesting, but requires mapping Wasm-level state to machine state precisely at possible snapshot points. We have a little bit of plumbing for that kind of thing with our debug support, but it's incomplete. The other side of the coin -- restoring the snapshot -- then requires "multi-entry functions" (something like "on-stack replacement" when a JIT tiers up) to enter into the middle of the IR with known values.
So I think some form of this is possible but it's a deep research project and requires a bunch of intimate knowledge of the compiler and runtime. We likely don't have the resources to help you design this in detail, but I'm personally curious to see what you come up with...
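To make the frame-pointer-chain idea above concrete, here is a minimal conceptual sketch (an illustration, not Wasmtime code) of rewriting saved FPs after a stack image has been copied to a new base address. It assumes an x86-64-style layout with frame pointers enabled; `relocate_fp_chain` and its parameters are hypothetical names.

```rust
/// Conceptual sketch: each frame stores its caller's saved FP at `*fp`,
/// forming a linked list. After copying a stack image that occupied
/// [old_lo, old_hi) to a region `delta` bytes away, every saved FP in the
/// copy still holds an old address and must be rewritten. Return addresses
/// point into immutable code, so they need no fixup here.
unsafe fn relocate_fp_chain(old_lo: usize, old_hi: usize, delta: isize, mut fp: *mut usize) {
    loop {
        let saved = *fp; // caller's saved FP, still in old-address terms
        // Stop once the chain leaves the snapshotted wasm frames.
        if saved < old_lo || saved >= old_hi {
            break;
        }
        let relocated = (saved as isize + delta) as usize;
        *fp = relocated;              // rewrite the link to point into the copy
        fp = relocated as *mut usize; // follow it to the next frame
    }
}
```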
fitzgen commented on issue #5732:
@RyanTorok,
The Wasm stack doesn't really exist anymore by the time Cranelift is done emitting machine code (it is erased very early in the pipeline, basically the first thing to go). Instead you would need to capture the actual native stack. This has the issues that @bjorn3 mentioned around native frames in between Wasm frames, but even if it is just Wasm there will be pointers on the stack to things malloc'ed by the host, namely the vm context and associated data structures. Each new process will have new ASLR and new malloc allocations, and new FaaS requests/invocations will have new stores (and their associated vm contexts). These structures will ultimately end up at different addresses in memory. So either (a) restoring a snapshot will require having a list of places to go and update pointers, not dissimilar to relocs or a moving GC, or (b) we take extreme care that codegen only emits indirect references to these structures (somehow? you need an actual handle to be the "root" at some point, or else a host call or something). Option (a) is a ton of work for Wasmtime/Cranelift to keep track of these things, and option (b) is also a ton of work but also makes Wasm execution much slower. In both cases, if we get anything wrong (miss a stack slot or register that has a native pointer when saving a snapshot, or accidentally emit a direct pointer reference rather than an indirection) then we have security vulnerabilities. Supporting all this would be a large refactoring of much of Wasmtime and Cranelift, and I'm pessimistic that it would ever happen. This is the kind of thing that you ideally need to build in from the very start, and Wasmtime and Cranelift have not been built with this in mind.
Backing up a bit: this topic would be better discussed in a dedicated issue or on zulip, since this issue is specifically about implementing the proposed standard Wasm coredump format, which won't help with this feature since it is strictly about the Wasm level. I suggest filing a new issue or starting a thread on zulip if you have further questions.
RyanTorok commented on issue #5732:
Thank you to everyone for the quick responses and insightful comments!
TL;DR: Issues with ASLR and the level of introspection into the runtime that would be required make stack snapshots pretty much a non-starter, and in fact they alerted me to limitations in the existing work on cold-starts I wasn't aware of.
Based on @fitzgen's comments about ASLR, I took another look back at the existing literature on cold-starts, and it turns out that the traditional method of snapshotting the entire state of the VM or language runtime is not compatible with ASLR _at all_, and for the exact reason @fitzgen pointed out.
A summary of the problem is that language runtimes (e.g. JVM, Python, Node.js, wasmtime, ...) inherently need to compile code using native addresses, thereby making the VM state not portable to different addresses. Traditionally, the way to deal with this portability issue would be to introduce another level of indirection (i.e. position-independent addresses), but @fitzgen, @cfallin, and @bjorn3 all pointed out that any such scheme would require very deep introspection into the language runtime to convert the indirect addresses to direct addresses, which would be an enormous endeavor, to the point you'd be better off redesigning the entire runtime to support this indirection. Otherwise, you're really walking a tightrope on both performance and security (mess up the indirection once, and the tenant can read memory their program doesn't own).
The existing literature on cold-starts essentially punts on this issue; it requires all memory owned by the VM or runtime to be loaded at the same address every time. While I don't see any major reasons wasmtime couldn't support this from an implementation standpoint, I don't recommend this as a direction for multiple reasons:
- Disabling ASLR is potentially bad for security. While I'm not aware of any features of language runtimes that fundamentally depend on ASLR to ensure security, disabling it would make any memory bugs much easier for the tenant to exploit, because the attacker could just hard-code addresses in their code, or, short of that, memorize them from a previous run using the same snapshot.
- Security aside, in the cloud space, requiring code to always occupy the same address ranges every time would add unwanted contention to multi-tenant systems (i.e. cloud infrastructure). If two functions each had even a single (native) memory page that required the same fixed address, the host could not run both functions in parallel. One possible mitigation would be to spawn multiple processes, so the functions would not compete for the same virtual addresses. But not only does this introduce the overhead of interprocess communication (IPC); in wasmtime's case, it would force us to choose between reverting back to OS-based lazy loading of pages (with mmap) rather than preallocating pages using userfaultfd, or becoming a _serious_ memory hog by preallocating a userspace page cache for all N processes, neither of which would be worth the performance wins of more flexible snapshots.
To summarize (in research paper speak), there are several open problems that have to be addressed with language runtimes in general, not just wasmtime, in order for generalized snapshots to be a practical solution for the cloud. I'm going to continue looking into how we might provide a subset of this feature set via library abstractions that work with the designs of existing language runtimes.
Thanks for all your help everyone!
RyanTorok commented on issue #5732:
As an aside, I think this question from my original comment:
is it possible that when we restore that stack in a new instantiation of the module that skips to the snapshot, cranelift won't perform the same optimization and it will try to pop a value off the stack that isn't there?
was a simple misunderstanding on my part about the mechanics of cranelift. Clearly everything has to be compiled in order to run; it's just a matter of when that happens (AOT or JIT). My last project was in browser security, and in JavaScript engines we actually have to worry about code running at multiple optimization levels; my confusion stemmed from there. This doesn't change anything about the issues with ASLR or introspection, however.
whitequark commented on issue #5732:
What tools can I use to inspect the coredumps?
fitzgen commented on issue #5732:
@whitequark unfortunately there isn't much off-the-shelf at the moment.
There was https://github.com/xtuc/wasm-coredump/tree/main/bin/wasmgdb but as far as I know it only works with an old version of the format.
There are plans to build support for inspecting them via the debug adapter protocol in Wasmtime itself, as a stepping stone towards fuller debugging capabilities. See https://github.com/bytecodealliance/rfcs/pull/34 for more details. Unfortunately, that doesn't exist yet.
In the meantime, Wasm core dumps are just wasm modules themselves, so you can use any tool that you might inspect a wasm module with to get at the information inside a core dump, e.g. wasm-tools print or wasm-objdump.
I know this isn't a great answer. I wish I had a better one. But we are planning on getting there!
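As a small illustration of that suggestion: since a core dump in the tool-conventions format is itself a Wasm module, its payload lives in custom sections (e.g. "core" and "corestack" per the spec linked at the top of the thread), which the `wasmparser` crate can enumerate. This is a hypothetical helper, not an existing tool:

```rust
// Hypothetical helper: list the custom sections of a core dump, which is
// itself a Wasm module per the tool-conventions format. Section names like
// "core" and "corestack" come from the spec linked earlier in the thread.
use wasmparser::{Parser, Payload};

fn list_coredump_sections(bytes: &[u8]) -> anyhow::Result<()> {
    for payload in Parser::new(0).parse_all(bytes) {
        if let Payload::CustomSection(section) = payload? {
            println!(
                "custom section `{}`: {} bytes",
                section.name(),
                section.data().len()
            );
        }
    }
    Ok(())
}
```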
whitequark commented on issue #5732:
Thanks! I'll keep it in mind -- I have to use wasm-objdump a lot already so, cursed as it is, this does fit into my workflow...
xtuc commented on issue #5732:
There was https://github.com/xtuc/wasm-coredump/tree/main/bin/wasmgdb but as far as I know it only works with an old version of the format.
Sorry about that. I'm planning to update wasmgdb to the latest spec but haven't had the time yet.