
Conversation

@atscott (Contributor) commented on Sep 29, 2025:

This commit introduces the ability to run tests against the generated code as part of the evaluation process.

A new optional `testCommand` can be specified in the environment configuration. If provided, this command is executed after a successful build.

If the tests fail, the tool will attempt to repair the code using the LLM, similar to how build failures are handled. The number of repair attempts is configurable.

The report has been updated to display the test results for each run, including whether the tests passed, failed, or passed after repair. The summary view also includes aggregated statistics about the test results.
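
For illustration, here is a minimal sketch of the test-and-repair flow described above; the types and function names (`EnvironmentConfig`, `runCommand`, `requestLlmRepair`, `maxTestRepairAttempts`) are hypothetical stand-ins, not the actual API introduced by this PR:

```ts
// Hypothetical shape of the relevant environment config fields.
interface EnvironmentConfig {
  /** Optional command, e.g. 'ng test --browsers ChromeHeadless --watch=false'. */
  testCommand?: string;
  /** Hypothetical knob for the configurable number of repair attempts. */
  maxTestRepairAttempts?: number;
}

type TestOutcome = 'skipped' | 'passed' | 'passed-after-repair' | 'failed';

// Stubs standing in for the real command runner and LLM client.
declare function runCommand(cmd: string): Promise<{exitCode: number; output: string}>;
declare function requestLlmRepair(code: string, testOutput: string): Promise<string>;

async function runTestsWithRepair(config: EnvironmentConfig, code: string): Promise<TestOutcome> {
  // Tests only run when a testCommand is configured (and only after a successful build).
  if (!config.testCommand) {
    return 'skipped';
  }

  let result = await runCommand(config.testCommand);
  if (result.exitCode === 0) {
    return 'passed';
  }

  // On failure, hand the test output and the offending code back to the LLM
  // for a bounded number of repair attempts.
  const maxAttempts = config.maxTestRepairAttempts ?? 1;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    code = await requestLlmRepair(code, result.output);
    result = await runCommand(config.testCommand);
    if (result.exitCode === 0) {
      return 'passed-after-repair';
    }
  }
  return 'failed';
}
```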

@atscott force-pushed the main branch 10 times, most recently from af95494 to 877d195 on September 30, 2025 at 20:20
@atscott (Contributor, Author) commented on Sep 30, 2025:

Screenshots of an environment config that defines `testCommand: 'ng test --browsers ChromeHeadless --watch=false'`:

(Two screenshots attached, taken 2025-09-30 at 1:53 PM.)
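
In text form, such a config entry might look roughly like the following; only the `testCommand` value comes from this comment, the surrounding shape is assumed for illustration:

```ts
// Hypothetical environment config excerpt; only testCommand is taken from this thread.
export const environment = {
  // ...other environment settings...
  testCommand: 'ng test --browsers ChromeHeadless --watch=false',
};
```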

@devversion (Member) left a comment:

Overall this looks great, but a couple of comments/discussions

```ts
/**
 * Number of times we'll try to ask LLM to repair a test failure,
 * providing the test output and the code that causes the problem.
 */
export const DEFAULT_MAX_TEST_REPAIR_ATTEMPTS = 1;
```
Member:

This is interesting. Do we actually want to repair test failures? Or would it be better to only repair when the test code can't be built?

@atscott (Contributor, Author) replied:

Personally, I do think it's useful. It's pretty similar to a build failure, where there is something verifiably wrong. I would argue that allowing a repair on a test failure is at least as relevant as allowing one on an Axe failure (which is also a test, and we do allow repairs for Axe failures).

It might also be relatively difficult to discern a build failure from a test failure, since I think both would return non-zero exit codes.

@devversion (Member) commented on Oct 2, 2025:

We recently stopped repairs by default for Axe. Re: it being useful, isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.

Overall, agreed. Sounds good to me. Especially if the tests aren't LLM-generated themselves, presumably (they could be prompted to be generated, I think).

cc. @crisbeto do you have any thoughts here?

Member:

I don't mind having it, but IMO they should be opt-in.

@atscott (Contributor, Author) replied:

> Isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.

Yes, indeed, that is something I am somewhat concerned about as well. Sometimes, though, it would be appropriate to edit the tests themselves, for example when the original prompt was to "add tests for X component" (which we don't have coverage for, but I think we should look into it at some point).

> I don't mind having it, but IMO they should be opt-in.

SGTM. I have bundled this into the same rerun as Axe testing. Since both fall into a test bucket, I figured it should be okay to have "test reruns" cover both a11y and the custom `testCommand`. Since you can omit either of these individually (Axe can be skipped with `--skip-axe-testing`), I think this should be fine. WDYT?

@atscott force-pushed the main branch 9 times, most recently from 5419d02 to 4332b39 on October 1, 2025 at 20:38
@atscott changed the title from "feat(runner): add support for running and repairing tests" to "feat(runner): add support for running tests" on Oct 1, 2025
@atscott marked this pull request as a draft on October 2, 2025 at 00:09
@atscott changed the title from "feat(runner): add support for running tests" to "feat(runner): add support for running and repairing tests" on Oct 2, 2025
@atscott force-pushed the main branch 3 times, most recently from ae8a24d to 3bcaca6 on October 2, 2025 at 23:11
@atscott marked this pull request as ready for review on October 2, 2025 at 23:11