wasmtime / Issue #2681 newBE: value_labels_ranges is very... · git-wasmtime

Stream: git-wasmtime

Topic: wasmtime / Issue #2681 newBE: value_labels_ranges is very...

Wasmtime GitHub notifications bot (Feb 23 2021 at 21:05):

It literally took more time than the actual compilation on one profile.

![image](https://user-images.githubusercontent.com/17426603/108908024-257fd680-7623-11eb-92ff-405eb4f6fe27.png)

Wasmtime GitHub notifications bot (Feb 23 2021 at 21:24):

cfallin commented on Issue #2681:

It would be interesting to know more about this workload: why was the label-location dataflow analysis particularly slow in this case? Was there a higher than usual density of labels? Many basic blocks?

Considering alternative approaches (against the baseline "regalloc tells us everything directly" approach which regalloc.rs does not currently support):

The current approach, adding label ghost-instructions and doing a post-hoc analysis, is general but potentially involves slow analysis. It imposes no additional overhead on generated code, however (i.e. it can describe any generated code). This might be called a "passive" debug-info design.

There is an alternative where we explicitly reserve a stackslot for every label and store to this slot whenever the labeled value changes. Then the DWARF info is simple and constant (value label N always lives at BP-offset) but the runtime overhead is potentially considerable. However maybe this isn't so bad if we're already executing unoptimized debug-mode code. Let's call this an "active" debug-info design: rather than describing locations, we set a location and generate code to move data we care about observing.

I actually kind of favor the latter, all other things being equal. As mentioned here (and in the same spirit as the "simpler GC without stackmaps" proposal), I'd like to bias toward better factorization of complexity and less reliance on complex analyses and maintenance of metadata; the "post-hoc analysis" is partway there (the core compiler pipeline only sees blackbox value-label instructions) but this would be further so. Thoughts?

Wasmtime GitHub notifications bot (Feb 23 2021 at 21:25):

cfallin edited a comment on Issue #2681:

It would be interesting to know more about this workload: why was the label-location dataflow analysis particularly slow in this case? Was there a higher than usual density of labels? Many basic blocks?

Considering alternative approaches (against the baseline "regalloc tells us everything directly" approach which regalloc.rs does not currently support):

The current approach, adding label ghost-instructions and doing a post-hoc analysis, is general but potentially involves slow analysis. It imposes no additional overhead on generated code, however (i.e. it can describe any generated code). This might be called a "passive" debug-info design.

There is an alternative where we explicitly reserve a stackslot for every label and store to this slot whenever the labeled value changes. Then the DWARF info is simple and constant (value label N always lives at BP-offset) but the runtime overhead is potentially considerable. However maybe this isn't so bad if we're already executing unoptimized debug-mode code. Let's call this an "active" debug-info design: rather than describing locations, we set a location and generate code to move data we care about observing.

I actually kind of favor the latter, all other things being equal. As mentioned here (and in the same spirit as the "simpler GC without stackmaps" proposal #2459), I'd like to bias toward better factorization of complexity and less reliance on complex analyses and maintenance of metadata; the "post-hoc analysis" is partway there (the core compiler pipeline only sees blackbox value-label instructions) but this would be further so. Thoughts?

Wasmtime GitHub notifications bot (Feb 23 2021 at 21:36):

bjorn3 commented on Issue #2681:

This particular workload is compiling simple-raytracer with all of of its dependencies.

I would prefer not allocting a stackslot for every value for three reasons: I don't think it is acceptible to regress the already poor debugmode performance of rust even more. I don't want the choice to generate debuginfo or jot to influence the generated code. Gcc also doesn't let it influence the generated code. This has the advantage that enabling debuginfo doesn't change the behaviour of a program in case of UB or miscompilations, thus making them easier to debug. Finally value debuginfo may be useful for on stack replacement in case of a tiered JIT. Regressing performance in this case is unacceptable.

Wasmtime GitHub notifications bot (Feb 23 2021 at 21:43):

cfallin commented on Issue #2681:

Yeah, perhaps not, though I'd be interested to measure how large the regression would be.

The "right" answer here, I think, is to rely on regalloc.rs to provide us location info per vreg per program-point. Unfortunately without that we're forced to do an analysis of some sort to recover the info.

It's possible the analysis data structures could be improved: I notice that a lot of time is spent in cloning HashMaps for example; perhaps a delta-based scheme with some structure sharing could be designed. If someone has the time to experiment with this (I unfortunately don't at the moment, first priority is completeness for the switchover) I'd be happy to discuss or review...

Wasmtime GitHub notifications bot (Feb 23 2021 at 22:39):

froydnj commented on Issue #2681:

FWIW, GCC does (or at least used to) go with the second approach suggested above of giving everything a fixed stack slot.

Last updated: Apr 18 2025 at 14:03 UTC