
Conversation

@atscott (Contributor) commented on Sep 29, 2025:

This commit introduces the ability to run tests against the generated code as part of the evaluation process.

A new optional `testCommand` can be specified in the environment configuration. If provided, this command is executed after a successful build.

If the tests fail, the tool will attempt to repair the code using the LLM, similar to how build failures are handled. The number of repair attempts is configurable.

The report has been updated to display the test results for each run, including whether the tests passed, failed, or passed after repair. The summary view also includes aggregated statistics about the test results.
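
For illustration, here is a minimal sketch of the test-and-repair flow described above; the types and function names (`EnvironmentConfig`, `runCommand`, `requestLlmRepair`, `maxTestRepairAttempts`) are hypothetical stand-ins, not the actual API introduced by this PR:

```ts
// Hypothetical shape of the relevant environment config fields.
interface EnvironmentConfig {
  /** Optional command, e.g. 'ng test --browsers ChromeHeadless --watch=false'. */
  testCommand?: string;
  /** Hypothetical knob for the configurable number of repair attempts. */
  maxTestRepairAttempts?: number;
}

type TestOutcome = 'skipped' | 'passed' | 'passed-after-repair' | 'failed';

// Stubs standing in for the real command runner and LLM client.
declare function runCommand(cmd: string): Promise<{exitCode: number; output: string}>;
declare function requestLlmRepair(code: string, testOutput: string): Promise<string>;

async function runTestsWithRepair(config: EnvironmentConfig, code: string): Promise<TestOutcome> {
  // Tests only run when a testCommand is configured (and only after a successful build).
  if (!config.testCommand) {
    return 'skipped';
  }

  let result = await runCommand(config.testCommand);
  if (result.exitCode === 0) {
    return 'passed';
  }

  // On failure, hand the test output and the offending code back to the LLM
  // for a bounded number of repair attempts.
  const maxAttempts = config.maxTestRepairAttempts ?? 1;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    code = await requestLlmRepair(code, result.output);
    result = await runCommand(config.testCommand);
    if (result.exitCode === 0) {
      return 'passed-after-repair';
    }
  }
  return 'failed';
}
```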

@atscott force-pushed the main branch 10 times, most recently from af95494 to 877d195 on September 30, 2025 at 20:20
@atscott (Contributor, Author) commented on Sep 30, 2025:

Screenshots of an environment config that defines `testCommand: 'ng test --browsers ChromeHeadless --watch=false'`:

(Two screenshots attached, taken 2025-09-30 at 1:53 PM.)
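
In text form, such a config entry might look roughly like the following; only the `testCommand` value comes from this comment, the surrounding shape is assumed for illustration:

```ts
// Hypothetical environment config excerpt; only testCommand is taken from this thread.
export const environment = {
  // ...other environment settings...
  testCommand: 'ng test --browsers ChromeHeadless --watch=false',
};
```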

@devversion (Member) left a comment:

Overall this looks great, but a couple of comments/discussions

```ts
/**
 * Number of times we'll try to ask LLM to repair a test failure,
 * providing the test output and the code that causes the problem.
 */
export const DEFAULT_MAX_TEST_REPAIR_ATTEMPTS = 1;
```
Member:

This is interesting. Do we actually want to repair test failures? Or would it be better to only repair when the test code can't be built?

@atscott (Contributor, Author) replied:

Personally, I do think it's useful. It's pretty similar to a build failure, where there is something verifiably wrong. I would argue that allowing a repair on a test failure is at least as relevant as allowing one on an Axe failure (which is also a test, and we do allow repairs for Axe failures).

It might also be relatively difficult to discern a build failure from a test failure, since I think both would return non-zero exit codes.

@devversion (Member) commented on Oct 2, 2025:

We recently stopped repairs by default for Axe. Re: it being useful, isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.

Overall, agreed. Sounds good to me. Especially if the tests aren't LLM-generated themselves, presumably (they could be prompted to be generated, I think).

cc. @crisbeto do you have any thoughts here?

Member:

I don't mind having it, but IMO they should be opt-in.

@atscott (Contributor, Author) replied:

> Isn't there a risk it would rewrite test assertions to just pass? Asking a few questions to make sure we think about it and align.

Yes, indeed, that is something I am somewhat concerned about as well. Sometimes, though, it would be appropriate to edit the tests themselves, for example when the original prompt was to "add tests for X component" (which we don't have coverage for, but I think we should look into it at some point).

> I don't mind having it, but IMO they should be opt-in.

SGTM. I have bundled this into the same rerun as Axe testing. Since both fall into a test bucket, I figured it should be okay to have "test reruns" cover both a11y and the custom `testCommand`. Since you can omit either of these individually (Axe can be skipped with `--skip-axe-testing`), I think this should be fine. WDYT?

@atscott force-pushed the main branch 9 times, most recently from 5419d02 to 4332b39 on October 1, 2025 at 20:38
@atscott changed the title from "feat(runner): add support for running and repairing tests" to "feat(runner): add support for running tests" on Oct 1, 2025
@atscott marked this pull request as a draft on October 2, 2025 at 00:09
@atscott changed the title from "feat(runner): add support for running tests" to "feat(runner): add support for running and repairing tests" on Oct 2, 2025
@atscott force-pushed the main branch 3 times, most recently from ae8a24d to 3bcaca6 on October 2, 2025 at 23:11
@atscott marked this pull request as ready for review on October 2, 2025 at 23:11