feat(runner): add support for running and repairing tests #62
Conversation
Force-pushed af95494 to 877d195
Overall this looks great, but I have a couple of comments/discussion points.
runner/configuration/constants.ts (outdated)

/**
 * Number of times we'll try to ask the LLM to repair a test failure,
 * providing the test output and the code that causes the problem.
 */
export const DEFAULT_MAX_TEST_REPAIR_ATTEMPTS = 1;
This is interesting. Do we actually want to repair test failures? Or would it be better to repair only if the test code can't be built?
Personally, I do think it's useful. It's pretty similar to build, where there is something verifiably wrong. I would argue that allowing a repair on a test failure is just as relevant, if not more so, than an Axe failure (which is also a test, and we do allow repairs for Axe failures).
It might also be relatively difficult to discern a build failure from a test failure, since I think both would return non-zero exit codes.
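For context, a minimal sketch of what such a run-and-repair loop could look like; only `DEFAULT_MAX_TEST_REPAIR_ATTEMPTS` comes from this PR, and the helper functions are hypothetical stand-ins:

```ts
// Hedged sketch of the test-run-and-repair flow, assuming hypothetical helpers.
import { DEFAULT_MAX_TEST_REPAIR_ATTEMPTS } from './configuration/constants';

interface TestRunResult {
  passed: boolean;
  output: string;
}

// Hypothetical: runs the configured testCommand in the generated app's directory.
declare function runTestCommand(command: string, cwd: string): Promise<TestRunResult>;
// Hypothetical: asks the LLM to patch the code, given the failing test output.
declare function requestLlmRepair(testOutput: string, cwd: string): Promise<void>;

export async function runTestsWithRepair(
  command: string,
  cwd: string,
): Promise<'passed' | 'passed-after-repair' | 'failed'> {
  let result = await runTestCommand(command, cwd);
  if (result.passed) {
    return 'passed';
  }
  // Retry up to the configured number of repair attempts.
  for (let attempt = 0; attempt < DEFAULT_MAX_TEST_REPAIR_ATTEMPTS; attempt++) {
    await requestLlmRepair(result.output, cwd);
    result = await runTestCommand(command, cwd);
    if (result.passed) {
      return 'passed-after-repair';
    }
  }
  return 'failed';
}
```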
We recently stopped repairs by default for Axe. Re: it being useful, isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.
Overall, agreed. Sounds good to me, especially if the tests aren't LLM-generated themselves, presumably (though they could be prompted to be generated, I think).
cc. @crisbeto do you have any thoughts here?
I don't mind having it, but IMO it should be opt-in.
> Isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.
Yes, indeed that is something I am somewhat concerned about as well. Sometimes, though, it would be appropriate to edit the tests themselves, e.g. when the original prompt was to "add tests for X component" or something (which we don't have coverage for, but I think we should look into at some point).
> I don't mind having it, but IMO it should be opt-in.
SGTM. I have bundled this into the same rerun as Axe testing. Since both fall into a test bucket, I figured it should be okay to have "test reruns" cover both a11y and the custom `testCommand`. Since you can omit either of these individually (Axe can be skipped with `--skip-axe-testing`), I think this should be fine. WDYT?
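To illustrate the opt-in shape, a hedged sketch of such an environment configuration; only the `testCommand` field and the `--skip-axe-testing` flag are named in this PR, everything else here is illustrative:

```ts
// Illustrative environment configuration; the overall shape and the example
// command are assumptions, only `testCommand` is from this PR.
const environmentConfig = {
  // Executed after a successful build; failures trigger LLM repair attempts.
  testCommand: 'npm test -- --watch=false',
};

// Axe checks can still be skipped independently at the CLI (flag is from this PR;
// the CLI invocation itself is illustrative):
//   <runner-cli> --skip-axe-testing
```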
Force-pushed 5419d02 to 4332b39
Force-pushed ae8a24d to 3bcaca6
This commit introduces the ability to run tests against the generated code as part of the evaluation process.

A new optional `testCommand` can be set in the environment configuration. If provided, this command will be executed after a successful build. If the tests fail, the tool will attempt to repair the code using the LLM, similar to how build failures are handled. The number of repair attempts is configurable.

The report has been updated to display the test results for each run, including whether the tests passed, failed, or passed after repair. The summary view also includes aggregated statistics about the test results.
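As a rough sketch, the per-run test outcome and the summary aggregation described above could be modeled along these lines; names and shapes are illustrative, not the PR's actual types:

```ts
// Illustrative model of per-run test outcomes and summary aggregation.
type TestOutcome = 'passed' | 'failed' | 'passed-after-repair';

interface RunTestResult {
  outcome: TestOutcome;
  repairAttempts: number; // LLM repair attempts used for this run
}

// Aggregates counts per outcome for the summary view.
export function summarizeTestResults(results: RunTestResult[]): Record<TestOutcome, number> {
  const counts: Record<TestOutcome, number> = {
    'passed': 0,
    'failed': 0,
    'passed-after-repair': 0,
  };
  for (const result of results) {
    counts[result.outcome]++;
  }
  return counts;
}
```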