alexcrichton commented on issue #3612:
I updated a few simple tests here as an example, but I figured it'd be good to get some review/feedback on this before getting to eager about anything else. AFAIK there's been some desire for this but not discussion around the design, so I don't want to presume too much here.
cfallin commented on issue #3612:
Huh, wow, this is certainly a wild idea! Unit tests that update themselves to pass (with permission)!
I do actually think this is a really cool solution to the "golden outputs are super-tedious to update" problem. The golden tests were never meant (or at least, I never meant them) to get to the volume and specificity that they have, where they trip over all sorts of incidental cross-cutting changes, like regalloc perturbations. They were mostly meant to be a stopgap; we really should pay down the techdebt soon. That said, a few reservations:
- The golden-output tests were kind of a stopgap, and I actually think not relying more on runtests was a mistake. IMHO (now, with hindsight), a golden-output test is useful to ensure some properties, such as that a lowering is or isn't using a certain instruction, a certain optimization is or isn't happening, or e.g. the ABI code is using a certain stackframe layout. But the vast majority of the tests are actually asserting a particular output as an indirect way of asserting correct execution, because that particular output in theory executes correctly. (Even that isn't always verified though.)
So my "fix the fundamental problem" solution would be to delete most golden-output tests once we ensure we have coverage with runtests.
- While it's a huge pain, the process of updating golden outputs is at least in principle supposed to be a human check -- yes, the output changed, does the change make sense? While one can look at deltas before doing an automated thing, in practice having a "just mash the big button to make it pass" option is the kind of thing that will encourage us to get sloppy about the golden test updates at some point.
As a human check goes, though, tediously reformatting the output line-by-line into filecheck format and removing block markers and such is probably more than is actually needed; the ideal "manual update" step is (IMHO) just copy/pasting or piping a blob from one place to another. Maybe we can do something with the filecheck infra to improve this; not sure.
- The parsing here is a bit brittle, which worries me; one more thing to maintain if our output format changes. (the "load-bearing
Debug
format" problem, basically.)I'm open to more discussion here of course; curious what others think as well. I do agree that our time is not well-spent now so we have to do something; maybe it's just time to move to runtests :-)
cfallin edited a comment on issue #3612:
Huh, wow, this is certainly a wild idea! Unit tests that update themselves to pass (with permission)!
I do actually think this is a really cool solution to the "golden outputs are super-tedious to update" problem. The golden tests were never meant (or at least, I never meant them) to get to the volume and specificity that they have, where they trip over all sorts of incidental cross-cutting changes, like regalloc perturbations. They were mostly meant to be a stopgap; we really should pay down the techdebt soon. That said, a few reservations:
- The golden-output tests were kind of a stopgap, and I actually think not relying more on runtests was a mistake. IMHO (now, with hindsight), a golden-output test is useful to ensure some properties, such as that a lowering is or isn't using a certain instruction, a certain optimization is or isn't happening, or e.g. the ABI code is using a certain stackframe layout. But the vast majority of the tests are actually asserting a particular output as an indirect way of asserting correct execution, because that particular output in theory executes correctly. (Even that isn't always verified though.)
So my "fix the fundamental problem" solution would be to delete most golden-output tests once we ensure we have coverage with runtests.
While it's a huge pain, the process of updating golden outputs is at least in principle supposed to be a human check -- yes, the output changed, does the change make sense? While one can look at deltas before doing an automated thing, in practice having a "just mash the big button to make it pass" option is the kind of thing that will encourage us to get sloppy about the golden test updates at some point.
As a human check goes, though, tediously reformatting the output line-by-line into filecheck format and removing block markers and such is probably more than is actually needed; the ideal "manual update" step is (IMHO) just copy/pasting or piping a blob from one place to another. Maybe we can do something with the filecheck infra to improve this; not sure.
The parsing here is a bit brittle, which worries me; one more thing to maintain if our output format changes. (the "load-bearing
Debug
format" problem, basically.)I'm open to more discussion here of course; curious what others think as well. I do agree that our time is not well-spent now so we have to do something; maybe it's just time to move to runtests :-)
cfallin edited a comment on issue #3612:
Huh, wow, this is certainly a wild idea! Unit tests that update themselves to pass (with permission)!
I do actually think this is a really cool solution to the "golden outputs are super-tedious to update" problem. The golden tests were never meant (or at least, I never meant them) to get to the volume and specificity that they have, where they trip over all sorts of incidental cross-cutting changes, like regalloc perturbations. They were mostly meant to be a stopgap; we really should pay down the techdebt soon. That said, a few reservations:
The golden-output tests were kind of a stopgap, and I actually think not relying more on runtests was a mistake. IMHO (now, with hindsight), a golden-output test is useful to ensure some properties, such as that a lowering is or isn't using a certain instruction, a certain optimization is or isn't happening, or e.g. the ABI code is using a certain stackframe layout. But the vast majority of the tests are actually asserting a particular output as an indirect way of asserting correct execution, because that particular output in theory executes correctly. (Even that isn't always verified though.)
So my "fix the fundamental problem" solution would be to delete most golden-output tests once we ensure we have coverage with runtests.
While it's a huge pain, the process of updating golden outputs is at least in principle supposed to be a human check -- yes, the output changed, does the change make sense? While one can look at deltas before doing an automated thing, in practice having a "just mash the big button to make it pass" option is the kind of thing that will encourage us to get sloppy about the golden test updates at some point.
As a human check goes, though, tediously reformatting the output line-by-line into filecheck format and removing block markers and such is probably more than is actually needed; the ideal "manual update" step is (IMHO) just copy/pasting or piping a blob from one place to another. Maybe we can do something with the filecheck infra to improve this; not sure.
The parsing here is a bit brittle, which worries me; one more thing to maintain if our output format changes. (the "load-bearing
Debug
format" problem, basically.)I'm open to more discussion here of course; curious what others think as well. I do agree that our time is not well-spent now so we have to do something; maybe it's just time to move to runtests :-)
cfallin edited a comment on issue #3612:
Huh, wow, this is certainly a wild idea! Unit tests that update themselves to pass (with permission)!
I do actually think this is a really cool solution to the "golden outputs are super-tedious to update" problem. The golden tests were never meant (or at least, I never meant them) to get to the volume and specificity that they have, where they trip over all sorts of incidental cross-cutting changes, like regalloc perturbations. They were mostly meant to be a stopgap; we really should pay down the techdebt soon. That said, a few reservations:
The golden-output tests were kind of a stopgap, and I actually think not relying more on runtests was a mistake. IMHO (now, with hindsight), a golden-output test is useful to ensure some properties, such as that a lowering is or isn't using a certain instruction, a certain optimization is or isn't happening, or e.g. the ABI code is using a certain stackframe layout. But the vast majority of the tests are actually asserting a particular output as an indirect way of asserting correct execution, because that particular output in theory executes correctly. (Even that isn't always verified though.)
So my "fix the fundamental problem" solution would be to delete most golden-output tests once we ensure we have coverage with runtests.
While it's a huge pain, the process of updating golden outputs is at least in principle supposed to be a human check -- yes, the output changed, does the change make sense? While one can look at deltas before doing an automated thing, in practice having a "just mash the big button to make it pass" option is the kind of thing that will encourage us to get sloppy about the golden test updates at some point.
As a human check goes, though, tediously reformatting the output line-by-line into filecheck format and removing block markers and such is probably more than is actually needed; the ideal "manual update" step is (IMHO) just copy/pasting or piping a blob from one place to another. Maybe we can do something with the filecheck infra to improve this; not sure.
The parsing here is a bit brittle, which worries me; one more thing to maintain if our output format changes. (the "load-bearing
Debug
format" problem, basically. EDIT: of course it's load-bearing today because it's captured as golden; but it's even more load-bearing if we write parsing code to it!)I'm open to more discussion here of course; curious what others think as well. I do agree that our time is not well-spent now so we have to do something; maybe it's just time to move to runtests :-)
alexcrichton commented on issue #3612:
FWIW I've used this technique to what feels like good success in many other projects, and I know rust-lang/rust has been using this format of tests quite successfully for quite some time for compiler error messages.
I do agree that
run
tests are better thancompile
tests, but I also don't want to ignore what's already there today. I think both styles of tests have advantages and disadvantages. For example if we're testing that a particular special case in ISLE is used thenrun
won't be too useful andcompile
is better, although allcompile
tests should arguably be coupled withrun
tests as well. In the meantime though we've got tons ofcompile
tests that are already tedious to update and it will presumably be just as tedious to migrate them all torun
tests as well.Otherwise the point about getting sloppy about updates is probably subjective. Subjectively I have never seen this be a problem in other projects I've worked on. While theoretically possible this is sort of what code review is for and if a code reviewer is ignoring code then I'm not sure that there's much we can do about that.
I think I'd also disagree that this is brittle. This is testing a 100% match which means there's little room for deviation. There's some arguably weird heuristics applied to the
Debug
output to make the assertions prettier in the test files, but if theDebug
output ever changes then all tests will instantly fail. It's pretty easy to either update the output-massaging code or to update all tests en-masse when the output changes in this regard. If we expect rapid changes toDebug
to happen over time which would require tedious maintenance of "map the program output to the test expectation" then we should invest in a more proper "map the program output to a string" implementation that doesn't useDebug
. Basically I've found that it's little consequence to be "sloppy" with creating the textual what's-expected-from-the-test because if you mess up tests fail and if you succeed then you can see the diff of what the output actually is.
froydnj commented on issue #3612:
Lurking in: I don't know the history here, but has something like LLVM's
FileCheck
tool been considered? (https://llvm.org/docs/CommandGuide/FileCheck.html) It's probably not reasonable to useFileCheck
directly, but it's not that complicated of a tool to write yourself if you need to (andFileCheck
itself doesn't need any clif-specific knowledge).Assuming you write patterns well, checks should be resistant to things like register allocation changes. It also makes explicit what the test is actually testing (e.g. these particular instructions are generated in this order): it's not always easy to remember/tell exactly what a test is actually testing when reviewing auto updates. Any changes to the checks themselves do have to be manual, but they're also much more easily reviewable by virtue of being small and targeted.
cfallin commented on issue #3612:
@alexcrichton , it's interesting (and good to know) that this sort of thing is used successfully elsewhere, thanks!
Specific thoughts below; my last reservation (specifics of how the parsing works) was the most concrete and I think there's a reasonable way to address it.
Otherwise the point about getting sloppy about updates is probably subjective. Subjectively I have never seen this be a problem in other projects I've worked on. While theoretically possible this is sort of what code review is for and if a code reviewer is ignoring code then I'm not sure that there's much we can do about that.
Yeah, that's fair!
I think I'd also disagree that this is brittle. This is testing a 100% match which means there's little room for deviation. There's some arguably weird heuristics applied to the
Debug
output to make the assertions prettier in the test files, but if theDebug
output ever changes then all tests will instantly fail. It's pretty easy to either update the output-massaging code or to update all tests en-masse when the output changes in this regard. If we expect rapid changes toDebug
to happen over time which would require tedious maintenance of "map the program output to the test expectation" then we should invest in a more proper "map the program output to a string" implementation that doesn't useDebug
. Basically I've found that it's little consequence to be "sloppy" with creating the textual what's-expected-from-the-test because if you mess up tests fail and if you succeed then you can see the diff of what the output actually is.I think that much of my reservation actually came just from the detailed parsing that's happening e.g. here. That feels like unnecessary coupling. I actually agree that a stable canonical "string representation of VCode" with less output meant for human readability is the better answer here: just output like
block0: add ra, rb, rc cmp rd, re jne block1, block2 block1: ...
would lead to clearer diffs I think. If we implement this as a method on
VCode
and then use this unconditionally for filetests rather than only in a special "precise mode", then I think it's a great idea!Lurking in: I don't know the history here, but has something like LLVM's FileCheck tool been considered? [ ... ] It also makes explicit what the test is actually testing (e.g. these particular instructions are generated in this order)
The
filetest
infra was written by @sunfishcode with direct inspiration from LLVM's filecheck, IIRC (crate). It does actually have more features than we give it credit for / use regularly currently -- full regexes, and ability to flexibly match lines in certain sequences while skipping others. (The last bit is used to skip over extraneous VCode debug output but the rest, mostly not, at least in tests I've written.)It's a good point that a lot of this pain goes away if we write more targeted tests: if we want to test that instruction X is emitted after Y, or that lowering an operator results in specifically instruction Z, we can test that without asserting on the whole function body. That's a different kind of test than we have today but it's what I'd prefer we migrate to, for the compile tests that can't just become runtests.
So I think this is what I'm converging on, at least:
- We can land something like this, maybe just without the post-fixup-on-VCode-text stuff but rather with a clean alternate output mode always used for compile tests
- We can work over time to move to more specific and fewer compile tests, just for those tests that are actually trying to test a property of the output beyond "it works"; this can be a gradual effort and doesn't need to block things here
Thoughts?
cfallin edited a comment on issue #3612:
@alexcrichton , it's interesting (and good to know) that this sort of thing is used successfully elsewhere, thanks!
Specific thoughts below; my last reservation (specifics of how the parsing works) was the most concrete and I think there's a reasonable way to address it.
Otherwise the point about getting sloppy about updates is probably subjective. Subjectively I have never seen this be a problem in other projects I've worked on. While theoretically possible this is sort of what code review is for and if a code reviewer is ignoring code then I'm not sure that there's much we can do about that.
Yeah, that's fair!
I think I'd also disagree that this is brittle. This is testing a 100% match which means there's little room for deviation. There's some arguably weird heuristics applied to the
Debug
output to make the assertions prettier in the test files, but if theDebug
output ever changes then all tests will instantly fail. It's pretty easy to either update the output-massaging code or to update all tests en-masse when the output changes in this regard. If we expect rapid changes toDebug
to happen over time which would require tedious maintenance of "map the program output to the test expectation" then we should invest in a more proper "map the program output to a string" implementation that doesn't useDebug
. Basically I've found that it's little consequence to be "sloppy" with creating the textual what's-expected-from-the-test because if you mess up tests fail and if you succeed then you can see the diff of what the output actually is.I think that much of my reservation actually came just from the detailed parsing that's happening e.g. here. That feels like unnecessary coupling. I actually agree that a stable canonical "string representation of VCode" with less output meant for human readability is the better answer here: just output like
block0: add ra, rb, rc cmp rd, re jne block1, block2 block1: ...
would lead to clearer diffs I think. If we implement this as a method on
VCode
and then use this unconditionally for filetests rather than only in a special "precise mode", then I think it's a great idea!Lurking in: I don't know the history here, but has something like LLVM's FileCheck tool been considered? [ ... ] It also makes explicit what the test is actually testing (e.g. these particular instructions are generated in this order)
The
filetest
infra was written by @sunfishcode (EDIT: I'm told actually largely Jakob Stoklund, the original Cranelift author, so it goes way back) with direct inspiration from LLVM's filecheck, IIRC (crate). It does actually have more features than we give it credit for / use regularly currently -- full regexes, and ability to flexibly match lines in certain sequences while skipping others. (The last bit is used to skip over extraneous VCode debug output but the rest, mostly not, at least in tests I've written.)It's a good point that a lot of this pain goes away if we write more targeted tests: if we want to test that instruction X is emitted after Y, or that lowering an operator results in specifically instruction Z, we can test that without asserting on the whole function body. That's a different kind of test than we have today but it's what I'd prefer we migrate to, for the compile tests that can't just become runtests.
So I think this is what I'm converging on, at least:
- We can land something like this, maybe just without the post-fixup-on-VCode-text stuff but rather with a clean alternate output mode always used for compile tests
- We can work over time to move to more specific and fewer compile tests, just for those tests that are actually trying to test a property of the output beyond "it works"; this can be a gradual effort and doesn't need to block things here
Thoughts?
abrown commented on issue #3612:
No real opinion here except that in most of those compile tests I would like to retain the "meaning" of the current tests--whether that means that they need to use
filecheck
better to remove, e.g.,%xmm4
or whether the tests change by blessing them is not too important to me.
abrown edited a comment on issue #3612:
No real opinion here except that in most of those compile tests I would like to retain the "meaning" of the current tests--whether that means that they need to use
filecheck
better to remove, e.g.,%xmm4
or whether the tests change by blessing them is not too important to me. I am thinking specifically of test cases where we check that a certain instruction sequence is emitted.
alexcrichton commented on issue #3612:
I'm happy to assert on any output myself, I only did the current custom filtering to make it be a smaller diff. @cfallin would you prefer that the output isn't massaged at all and instead it's asserted raw? I think it's basically this output which is generated as part of the
set_disasm(true)
on the compilation context. There's not a ton of infrastructure for configuring that output but I can hack stuff in if you'd prefer to assert on a cleaner output too.
cfallin commented on issue #3612:
@alexcrichton maybe as a transitional step we can just land something that records the whole VCode pretty-print output and asserts on it; that's the effect we were going for anyway, and "precise mode" might as well assert complete equality. When we eventually have more tests that look for just certain features then we can go back to regexes, partial line matches, etc., I think.
I don't mind a big diff that updates all the tests as long as it's just a mechanical update (maybe in a separate commit just so we can see clean tests before/after)!
We can think about ways to clean up the pretty-print in a separate issue; the
PrettyPrint
trait is part of regalloc.rs which makes it a bit intertwined so maybe I can just do this as part of the eventual regalloc2 overhaul. Anyway, on further thought, it's definitely out-of-scope here. Thanks!
alexcrichton commented on issue #3612:
Sure thing, updated with that! I also added some more logic necessary that I forgot from earlier where if multiple edits to one file are needed they'll need to be coordinated.
alexcrichton commented on issue #3612:
I realize now though it's probably best to make sure others are on-board with this, so I'll bring this up in the next cranelift meeting before merging.
Last updated: Jan 24 2025 at 00:11 UTC