jameysharp opened PR #6282 from jameysharp:guest-profiler
to bytecodealliance:main
:
This isn't ready for detailed review but the basic functionality seems to work so I'm putting it up in case anyone wants to play with it.
Unlike the existing profiling options, this works on all platforms and does not rely on any external profiling tools like perf or VTune. On the other hand, it can only profile time spent in the WebAssembly guest, not in Wasmtime itself or other host code. Also it can't measure time as precisely as platform-native tools can.
<!--
Please make sure you include the following information:
If this work has been discussed elsewhere, please include a link to that
conversation. If it was discussed in an issue, just mention "issue #...".Explain why this change is needed. If the details are in an issue already,
this can be brief.Our development process is documented in the Wasmtime book:
https://docs.wasmtime.dev/contributing-development-process.htmlPlease ensure all communication follows the code of conduct:
https://github.com/bytecodealliance/wasmtime/blob/main/CODE_OF_CONDUCT.md
-->
jameysharp updated PR #6282.
jameysharp has marked PR #6282 as ready for review.
jameysharp requested fitzgen for a review on PR #6282.
jameysharp requested wasmtime-core-reviewers for a review on PR #6282.
jameysharp requested wasmtime-default-reviewers for a review on PR #6282.
bjorn3 created PR review comment:
You can use
self.start.elapsed()
.
alexcrichton submitted PR review.
alexcrichton submitted PR review.
alexcrichton created PR review comment:
Could this be moved into a CLI argument as well?
alexcrichton created PR review comment:
I think that the error handling here may want to be a bit more nuanced perhaps. For example if
invoke_func
failed and this finalization also fails, the original failure will be masked by the writing of the profile.This concern of mine is additionally shared with the writing of a core dump after a trap, which IIRC will print out an error but won't return the core dump writing error further up the stack. Could the logic there and here be unified perhaps into a sort of "perform a fallible operation post-wasm-execution"?
alexcrichton created PR review comment:
The interning and various allocations here make me wonder: would it be better to defer this until after execution? Should this instead buffer up a
Vec<WasmBacktrace>
which is then all serialized out at the end of execution?Or perhaps amortizing the cost by collecting N backtraces and then flushing them all at once perhaps.
alexcrichton created PR review comment:
Are there any flags or configuration we could perhaps pass to indicate which module a frame comes from? Or perhaps which instance a frame comes from? We don't currently have a means of learning that from a wasm frame but it's something we could theoretically add.
jameysharp updated PR #6282.
jameysharp submitted PR review.
jameysharp submitted PR review.
jameysharp created PR review comment:
Nice, I didn't know about
Instant::elapsed
. Thanks!
jameysharp created PR review comment:
I've now spent a bit of time looking at the coredump path and I don't see a nice way to unify this with that. So instead I've changed
setup_epoch_handler
so the closure it returns doesn't have a return value, forcing it to report and suppress errors internally.
jameysharp created PR review comment:
Oh yeah, I intended to do that and forgot. Thanks!
jameysharp created PR review comment:
The Firefox profile format supports a lot of metadata that I'm not currently filling in. Much of it isn't currently exposed by the
fxprof-processed-profile
crate, but there's definitely more available in the current crate that we could be using.One way we could express module/instance tags today is by creating a
CategoryHandle
for each one, rather than using theOTHER
category.There are other things I'd like to add too: notably, thread-local CPU time consumed between samples, and a call-hook to add markers to the profile for exit from and re-entry into wasm.
I'd prefer not to add any of these features in this PR though. So I plan to add them as notes in comments while I'm writing documentation for this profiler, and I'll mention module/instance tags too.
jameysharp created PR review comment:
I want to move the
GuestProfiler
type into thewasmtime
crate primarily because I want it to be usable by embedders, but an additional reason is because then it's reasonable for it to usewasmtime-runtime
's low-level stack-walking API directly. I think there's a lot of work that can be avoided that way.I don't think batching up
WasmBacktrace
objects is going to help much, because just collecting one of those involves looking up which function each PC comes from, along with information like the function name. At that point I'd guess three hash-table lookups per function for interning aren't going to make much difference.I'd like to merge this with the current approach, of eagerly consuming the high-level
WasmBacktrace
, as-is for now. I don't think the public API ofGuestProfiler
will need to change to support more efficient collection of backtraces, so I think we can iterate on performance improvements afterward without impacting any users.I definitely want to improve this over time though. I think eventually we're going to want API changes in
fxprof-processed-profile
for the highest performance sampling we can do in-process.
jameysharp updated PR #6282.
jameysharp updated PR #6282.
jameysharp updated PR #6282.
alexcrichton submitted PR review:
I'm ok landing this personally, but I think we should be clear in the documentation of the type that it's not really at a state of implementation quality to be a high-fidelity general purpose guest profiler. For example some drawbacks I can think of are:
- Requires epochs as the driver for stack collection, meaning the granularity of samples is function entries and loop headers. Additionally this means that you're not necessarily measuring the "raw" performance since it's code-based instrumentation.
- Stacks are always collected from wasm meaning that any time spent in the host is not accounted for. This means that samples can be lost while in the host and additionally whether or not the host blocked or did compute I think is unclear. (although perhaps this is fixable)
I suspect that a large number of use cases are totally ok with these drawbacks, but at the same time I think it's important to be clear that we're not looking to get into the business of a "real" profiler at this time with off-thread stack sampling, host integration, etc.
Additionally would you also be up for updating the documentation on
docs/*.md
? Currently we haveexamples-profiling.md
which delegates toperf
andvtune
and I think it'd be good to add an entry for this strategy of profiling. Ideally there could be at least a small paragraph of words explaining why one might choose one profiling strategy over another (e.g.perf
captures everything but is Linux-specific and not always easy to interpret, while this is cross-platform but not as "high fidelity" asperf
)
alexcrichton submitted PR review:
I'm ok landing this personally, but I think we should be clear in the documentation of the type that it's not really at a state of implementation quality to be a high-fidelity general purpose guest profiler. For example some drawbacks I can think of are:
- Requires epochs as the driver for stack collection, meaning the granularity of samples is function entries and loop headers. Additionally this means that you're not necessarily measuring the "raw" performance since it's code-based instrumentation.
- Stacks are always collected from wasm meaning that any time spent in the host is not accounted for. This means that samples can be lost while in the host and additionally whether or not the host blocked or did compute I think is unclear. (although perhaps this is fixable)
I suspect that a large number of use cases are totally ok with these drawbacks, but at the same time I think it's important to be clear that we're not looking to get into the business of a "real" profiler at this time with off-thread stack sampling, host integration, etc.
Additionally would you also be up for updating the documentation on
docs/*.md
? Currently we haveexamples-profiling.md
which delegates toperf
andvtune
and I think it'd be good to add an entry for this strategy of profiling. Ideally there could be at least a small paragraph of words explaining why one might choose one profiling strategy over another (e.g.perf
captures everything but is Linux-specific and not always easy to interpret, while this is cross-platform but not as "high fidelity" asperf
)
alexcrichton created PR review comment:
I think this'd be a good place to have a doc-block example if you're up for it about how to configure a
Store
and such.
alexcrichton created PR review comment:
This was actually a question I thought of, what happens if you block in a hostcall for a second?
I think you'd get a stack sample, much later, in probably the wrong location (e.g. the next function call after the host import or the loop that's calling the host import). I'm not sure how those really-far-apart stack samples are interpreted by the firefox profiler, but perhaps this CPU time measurement would help?
alexcrichton created PR review comment:
API question: instead of requiring an
Arc
andMutex
here, which technically aren't necessary since there aren't any threads, could this instead be stored in theT
ofStore<T>
? As-is theGuestProfiler::sample
method would not work since it would be stored inT
but additionally takes a store as a parameter, but I think that could be fixed by taking a&WasmBacktrace
as an argument since that's all the store is used for internally.
alexcrichton created PR review comment:
Could this be expanded with more information about what to do with the output? For example explaining that it's a JSON blob to upload to https://profiler.firefox.com I think would be good.
alexcrichton created PR review comment:
Is it worth perhaps providing a CLI flag to auto-upload the profile and printing a URL that can be clicked on?
fitzgen submitted PR review:
Really excited for this!
When profiling is enabled in the CLI, it probably makes sense to write some output after running the Wasm that repeats where the profile file was written to and how to upload it to
https://profiler.firefox.com/
fitzgen submitted PR review:
Really excited for this!
When profiling is enabled in the CLI, it probably makes sense to write some output after running the Wasm that repeats where the profile file was written to and how to upload it to
https://profiler.firefox.com/
fitzgen created PR review comment:
Should this method take
self
? Would it be valid to add more samples after callingfinish
?
fitzgen created PR review comment:
It could be good to have a full example here, if we are just going to expose building blocks rather than a single enable-profiling-and-get-dump-profiles-to-this-file kind of API.
fitzgen created PR review comment:
Doesn't matter a ton in this case, but one pattern I've used to make APIs like this slightly harder to get wrong is to return an
impl Drop
:struct FinishEpochHandler<F: FnOnce()>(Option<F>); impl<F: FnOnce()> Drop for FinishEpochHandler<F> { fn drop(&mut self) { if let Some(f) = self.0.take() { f(); } } } return FinishEpochHandler(Some(move || { // what you have now... }));
This way, callers don't need to actually remember to call the resulting function, at most they just need to add a scope.
But I guess if you do Alex's suggestion of moving the profiler into
Store
, then this result function would require a store argument, and the RAII/impl Drop
approach won't really work anymore...
jameysharp created PR review comment:
"Wait, I could just keep it in the Store," I thought, as I fell asleep last night; then wake up to find this comment. :laughing:
I don't want to make
WasmBacktrace
part of the public API here because I want to re-implement this in terms of the lower-levelwasmtime-runtime
stack-walking API. That doesn't directly need access to the store at all, but I do need some information from the store at some point to figure out how native PCs map to individual modules and to functions within those modules.It might work though to pass in the store when constructing a
GuestProfiler
, collect just the necessary information about all the modules and symbols from it, and then not need it insample
.Otherwise, the only alternative I've thought of is to define
fn sample(store: impl AsContextMut, get_self: impl FnOnce(&mut Store) -> &mut Self)
. Not the nicest interface but I think it should work, right?
jameysharp created PR review comment:
I've only been using the "local file" interface, so I didn't realize there was a way to upload a profile and share a link. I guess the way to do that is documented at https://github.com/firefox-devtools/profiler/blob/main/docs-developer/loading-in-profiles.md.
I don't see any HTTP client usage anywhere in Wasmtime except in wasi-http right now. So that seems like a relatively large change that I'd prefer to think about separately from the basic profiling implementation. It would definitely be nice for usability, though.
jameysharp created PR review comment:
If you call
finish
again it'll just update the thread end-time and write out the whole profile again, including any new samples. So it's probably not useful to add more samples after callingfinish
, but it should be harmless.That said, I only did it this way because there's still an extra reference to the
GuestProfiler
in the epoch callback after the instance finishes, and I didn't want to have to clone the profile. But if I can make Alex's suggestion work, of keeping theGuestProfiler
in the store instead of in anArc
, then that won't be an issue and I could make the change you're suggesting too.
alexcrichton created PR review comment:
Ah ok yeah if it's not easy to reuse whatever is in wasi-http I'm happy to leave this as-is
alexcrichton created PR review comment:
How about:
impl<T> Store<T> { pub fn sample(&mut self, get_profiler: impl FnOnce(&mut T) -> &mut GuestProfiler) { ... } }
with similar impls for
StoreContextMut
?which is more-or-less the same thing, but just as a method instead of a function on
GuestProfiler
now that I reread your comment
jameysharp updated PR #6282.
jameysharp updated PR #6282.
jameysharp updated PR #6282.
alexcrichton submitted PR review:
Sounds like a good plan to me :+1:
jameysharp merged PR #6282.
Last updated: Jan 24 2025 at 00:11 UTC