alexcrichton commented on issue #3612:
I updated a few simple tests here as an example, but I figured it'd be good to get some review/feedback on this before getting too eager about anything else. AFAIK there's been some desire for this but no real discussion around the design, so I don't want to presume too much here.
cfallin commented on issue #3612:
Huh, wow, this is certainly a wild idea! Unit tests that update themselves to pass (with permission)!
I do actually think this is a really cool solution to the "golden outputs are super-tedious to update" problem. The golden tests were never meant (or at least, I never meant them) to get to the volume and specificity that they have, where they trip over all sorts of incidental cross-cutting changes, like regalloc perturbations. They were mostly meant to be a stopgap; we really should pay down the techdebt soon. That said, a few reservations:
- The golden-output tests were kind of a stopgap, and I actually think not relying more on runtests was a mistake. IMHO (now, with hindsight), a golden-output test is useful to ensure certain properties, such as that a lowering is or isn't using a certain instruction, that a certain optimization is or isn't happening, or e.g. that the ABI code is using a certain stackframe layout. But the vast majority of the tests are actually asserting a particular output as an indirect way of asserting correct execution, because that particular output in theory executes correctly. (Even that isn't always verified, though.) So my "fix the fundamental problem" solution would be to delete most golden-output tests once we ensure we have coverage with runtests (a minimal runtest is sketched just below this comment).
- While it's a huge pain, the process of updating golden outputs is at least in principle supposed to be a human check -- yes, the output changed, does the change make sense? While one can look at deltas before doing an automated thing, in practice having a "just mash the big button to make it pass" option is the kind of thing that will encourage us to get sloppy about the golden-test updates at some point. As far as the human check goes, though, tediously reformatting the output line-by-line into filecheck format and removing block markers and such is probably more than is actually needed; the ideal "manual update" step is (IMHO) just copy/pasting or piping a blob from one place to another. Maybe we can do something with the filecheck infra to improve this; not sure.
- The parsing here is a bit brittle, which worries me; one more thing to maintain if our output format changes (the "load-bearing `Debug` format" problem, basically. EDIT: of course it's load-bearing today because it's captured as golden, but it's even more load-bearing if we write parsing code against it!).

I'm open to more discussion here of course; curious what others think as well. I do agree that our time is not well-spent now, so we have to do something; maybe it's just time to move to runtests :-)
alexcrichton commented on issue #3612:
FWIW I've used this technique to what feels like good success in many other projects, and I know rust-lang/rust has been using this format of tests quite successfully for quite some time for compiler error messages.
I do agree that `run` tests are better than `compile` tests, but I also don't want to ignore what's already there today. I think both styles of test have advantages and disadvantages. For example, if we're testing that a particular special case in ISLE is used, then `run` won't be too useful and `compile` is better, although all `compile` tests should arguably be coupled with `run` tests as well. In the meantime, though, we've got tons of `compile` tests that are already tedious to update, and it will presumably be just as tedious to migrate them all to `run` tests.

Otherwise, the point about getting sloppy about updates is probably subjective. Subjectively, I have never seen this be a problem in other projects I've worked on. While it's theoretically possible, this is sort of what code review is for, and if a code reviewer is ignoring code then I'm not sure there's much we can do about that.

I think I'd also disagree that this is brittle. This is testing a 100% match, which means there's little room for deviation. There are some arguably weird heuristics applied to the `Debug` output to make the assertions prettier in the test files, but if the `Debug` output ever changes then all tests will instantly fail. It's pretty easy to either update the output-massaging code or update all tests en masse when the output changes in this regard. If we expect rapid changes to `Debug` over time, which would require tedious maintenance of the "map the program output to the test expectation" code, then we should invest in a more proper "map the program output to a string" implementation that doesn't use `Debug`. Basically, I've found that it's of little consequence to be "sloppy" when creating the textual what's-expected-from-the-test, because if you mess up, tests fail, and if you succeed, you can see the diff of what the output actually is.
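(For context: the expectations being updated here are the per-line filecheck directives in the existing `compile` tests, which restate the whole lowered body, roughly in the shape below. The directive names come from the filecheck crate; the instruction sequence is illustrative rather than copied from a real test.)

```
test compile
target x86_64

function %add(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    return v2
}

; check: pushq %rbp
; nextln: movq %rsp, %rbp
; nextln: addl %esi, %edi
; nextln: movq %rdi, %rax
; nextln: movq %rbp, %rsp
; nextln: popq %rbp
; nextln: ret
```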
froydnj commented on issue #3612:
Lurking in: I don't know the history here, but has something like LLVM's `FileCheck` tool been considered? (https://llvm.org/docs/CommandGuide/FileCheck.html) It's probably not reasonable to use `FileCheck` directly, but it's not that complicated a tool to write yourself if you need to (and `FileCheck` itself doesn't need any clif-specific knowledge).

Assuming you write the patterns well, checks should be resistant to things like register-allocation changes. It also makes explicit what the test is actually testing (e.g. these particular instructions are generated in this order): it's not always easy to remember/tell exactly what a test is testing when reviewing auto-updates. Any changes to the checks themselves do have to be manual, but they're also much more easily reviewable by virtue of being small and targeted.
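(A targeted check in this style pins down only the property under test. Using the existing filecheck directives, a register-allocation-insensitive test might look roughly like the sketch below; the instruction names and the exact lowering are assumptions for illustration.)

```
test compile
target aarch64

function %widen_add(i32, i32) -> i64 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    v3 = sextend.i64 v2
    return v3
}

; Only the ordering of the interesting instructions is asserted; which
; registers the allocator picks, and any surrounding code, is ignored.
; check: add
; check: sxtw
; check: ret
```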
cfallin commented on issue #3612:
@alexcrichton, it's interesting (and good to know) that this sort of thing is used successfully elsewhere, thanks!
Specific thoughts below; my last reservation (specifics of how the parsing works) was the most concrete and I think there's a reasonable way to address it.

> Otherwise the point about getting sloppy about updates is probably subjective. Subjectively I have never seen this be a problem in other projects I've worked on. While theoretically possible this is sort of what code review is for and if a code reviewer is ignoring code then I'm not sure that there's much we can do about that.

Yeah, that's fair!
> I think I'd also disagree that this is brittle. This is testing a 100% match which means there's little room for deviation. There's some arguably weird heuristics applied to the `Debug` output to make the assertions prettier in the test files, but if the `Debug` output ever changes then all tests will instantly fail. It's pretty easy to either update the output-massaging code or to update all tests en masse when the output changes in this regard. [ ... ] Basically I've found that it's of little consequence to be "sloppy" with creating the textual what's-expected-from-the-test because if you mess up tests fail and if you succeed then you can see the diff of what the output actually is.

I think that much of my reservation actually came just from the detailed parsing that's happening, e.g. here. That feels like unnecessary coupling. I actually agree that a stable canonical "string representation of VCode" with less output meant for human readability is the better answer here: just output like

```
block0:
  add ra, rb, rc
  cmp rd, re
  jne block1, block2
block1:
  ...
```

would lead to clearer diffs, I think. If we implement this as a method on `VCode` and then use it unconditionally for filetests, rather than only in a special "precise mode", then I think it's a great idea!

> Lurking in: I don't know the history here, but has something like LLVM's FileCheck tool been considered? [ ... ] It also makes explicit what the test is actually testing (e.g. these particular instructions are generated in this order)

The `filetest` infra was written by @sunfishcode (EDIT: I'm told actually largely by Jakob Stoklund, the original Cranelift author, so it goes way back) with direct inspiration from LLVM's FileCheck, IIRC (crate). It actually has more features than we give it credit for / use regularly at the moment -- full regexes, and the ability to flexibly match lines in certain sequences while skipping others. (The last bit is used to skip over extraneous VCode debug output, but the rest, mostly not, at least in tests I've written.)

It's a good point that a lot of this pain goes away if we write more targeted tests: if we want to test that instruction X is emitted after Y, or that lowering an operator results in specifically instruction Z, we can test that without asserting on the whole function body. That's a different kind of test than we have today, but it's what I'd prefer we migrate to, for the compile tests that can't just become runtests.
So I think this is what I'm converging on, at least:
- We can land something like this, maybe just without the post-fixup-on-VCode-text stuff but rather with a clean alternate output mode always used for compile tests
- We can work over time to move to more specific and fewer compile tests, just for those tests that are actually trying to test a property of the output beyond "it works"; this can be a gradual effort and doesn't need to block things here
Thoughts?
abrown commented on issue #3612:
No real opinion here, except that in most of those compile tests I would like to retain the "meaning" of the current tests -- whether that means they need to use `filecheck` better so as to remove, e.g., `%xmm4`, or whether the tests change by blessing them, is not too important to me. I am thinking specifically of test cases where we check that a certain instruction sequence is emitted.
alexcrichton commented on issue #3612:
I'm happy to assert on any output myself; I only did the current custom filtering to keep the diff smaller. @cfallin would you prefer that the output isn't massaged at all and is instead asserted raw? I think it's basically this output, which is generated as part of `set_disasm(true)` on the compilation context. There's not a ton of infrastructure for configuring that output, but I can hack stuff in if you'd prefer to assert on a cleaner output too.
cfallin commented on issue #3612:
@alexcrichton maybe as a transitional step we can just land something that records the whole VCode pretty-print output and asserts on it; that's the effect we were going for anyway, and "precise mode" might as well assert complete equality. When we eventually have more tests that look for just certain features then we can go back to regexes, partial line matches, etc., I think.
I don't mind a big diff that updates all the tests as long as it's just a mechanical update (maybe in a separate commit just so we can see clean tests before/after)!
We can think about ways to clean up the pretty-print in a separate issue; the `PrettyPrint` trait is part of regalloc.rs, which makes it a bit intertwined, so maybe I can just do this as part of the eventual regalloc2 overhaul. Anyway, on further thought, it's definitely out of scope here. Thanks!
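(To make the converged-on shape concrete: under this scheme a compile test would carry the whole VCode pretty-print as its expectation, and the update tooling would rewrite that trailing block in place whenever the output changes. A rough sketch of what such a test file could look like follows; the `precise-output` marker and the exact disassembly are assumptions for illustration, not the final format.)

```
test compile precise-output
target aarch64

function %add(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    return v2
}

; The entire expected output below is compared for exact equality;
; re-running the tests in update mode regenerates this block.
; block0:
;   add w0, w0, w1
;   ret
```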
alexcrichton commented on issue #3612:
Sure thing, updated with that! I also added some logic I had forgotten from earlier: if multiple edits to one file are needed, they have to be coordinated.
alexcrichton commented on issue #3612:
I realize now, though, that it's probably best to make sure others are on board with this, so I'll bring it up in the next Cranelift meeting before merging.