[Proposal] AI-Powered Flaky Test Fixer — Automated Diagnosis and Fix  #509

@prudhvigodithi

Description

Summary

Proposing an AI-powered tool to proactively diagnose and fix flaky tests in OpenSearch. The tool is an MCP server that integrates with AI coding agents (Kiro CLI, Claude CLI) to automate the full lifecycle: from parsing GitHub issues and Jenkins failures to reproducing locally, applying fixes, and creating PRs for maintainer review.

Motivation

Flaky Tests

We currently have 200+ open flaky test issues with the >test-failure label. These are auto-filed by opensearch-ci-bot via AUTOCUT when Gradle Check detects flaky tests in post-merge actions or timer-triggered runs on the main branch.

In the past, we have collectively done great work establishing mechanisms to prevent flaky tests (opensearch-project/OpenSearch#17974), but the backlog of existing flaky issues continues to grow.

This proposal is an attempt to proactively fix the existing backlog by automating the tedious parts of flaky test diagnosis and resolution.

Tool

Repository: https://github.com/prudhvigodithi/flaky-test-agent/

This is a tool I've been working on and experimenting with to explore how AI agents can help with flaky test resolution. It's an MCP server that provides specialized tools to any AI coding agent. The agent (LLM) handles the reasoning — analyzing root causes and generating fixes — while the MCP server handles the mechanical operations.

Important: The tool creates PRs that are reviewed by maintainers and collaborators like any other contribution. It does not merge anything automatically — all fixes go through the standard review process.

Sample PR

Here is a sample PR created by the tool for a flaky ClusterStatsIT test: prudhvigodithi/OpenSearch#31

Architecture

GitHub Issue (AUTOCUT)
 → parse_issue: Extract Jenkins build URLs, test class, test methods

Jenkins Build (authenticated)
 → fetch_jenkins_failure: Stack traces, REPRODUCE WITH commands, test seeds

Local OpenSearch Checkout
 → read_local_file / patch_local_file: Read and modify test source
 → run_reproduce_command: Run gradle reproduce command with the exact seed

GitHub API
 → create_pr_from_local: Commit (with DCO sign-off), push, and open PR for review
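To make the first stage of this pipeline concrete, here is a minimal sketch of what parsing an AUTOCUT issue body could look like. The regexes, field names, and the sample body format are illustrative assumptions, not the tool's actual implementation:

```python
import re

def parse_issue(body: str) -> dict:
    """Hypothetical sketch of parse_issue: pull Jenkins build URLs and
    test identifiers out of an AUTOCUT-style issue body."""
    build_urls = re.findall(r"https://build\.ci\.opensearch\.org/\S+", body)
    test_class = re.search(r"Test class:\s*(\S+)", body)
    test_methods = re.findall(r"Test method:\s*(\S+)", body)
    return {
        "jenkins_builds": build_urls,
        "test_class": test_class.group(1) if test_class else None,
        "test_methods": test_methods,
    }

# Example AUTOCUT-style body (format is illustrative only):
body = """
Test class: org.opensearch.action.admin.cluster.stats.ClusterStatsIT
Test method: testClusterStats
Build: https://build.ci.opensearch.org/job/gradle-check/12345/
"""
parsed = parse_issue(body)
```

The downstream stages (fetch_jenkins_failure, run_reproduce_command) would consume this structured output rather than re-scraping the issue text.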

Workflow

  1. Parse the GitHub issue to extract Jenkins build details
  2. Fetch stack traces, reproduce commands, and test seeds from Jenkins
  3. Read the test source code from a local OpenSearch checkout
  4. Analyze the root cause (done by the LLM, not the tool)
  5. Reproduce the failure locally using the exact seed from Jenkins
  6. Apply the fix via search-and-replace
  7. Verify the fix passes locally with the same seed
  8. Create a PR with DCO sign-off for maintainer review
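Steps 5 and 7 above hinge on re-running the test pinned to the exact seed Jenkins reported. A rough sketch of how such a command could be assembled — the Gradle task and flag names follow the typical REPRODUCE WITH line, but treat them as illustrative and verify against the actual Jenkins output:

```python
import shlex

def build_reproduce_command(gradle_task, test_class, test_method, seed):
    """Assemble a Gradle reproduce command pinned to a specific test seed.
    Flag names mirror the typical REPRODUCE WITH line from Gradle Check."""
    return " ".join([
        "./gradlew",
        shlex.quote(gradle_task),
        "--tests",
        shlex.quote(f"{test_class}.{test_method}"),
        f"-Dtests.seed={seed}",
    ])

# Hypothetical values for the ClusterStatsIT example:
cmd = build_reproduce_command(
    ":server:internalClusterTest",
    "org.opensearch.action.admin.cluster.stats.ClusterStatsIT",
    "testClusterStats",
    "5A1B2C3D4E5F6A7B",
)
```

Running the same command before and after the patch (steps 5 and 7) is what gives the PR its "reproduced and verified locally" evidence.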

Available Tools

Tool                     Description
fetch_flaky_issues       List open flaky-test issues from GitHub
parse_issue              Extract Jenkins URLs, test names, commits from an issue
fetch_jenkins_failure    Get stack traces, reproduce commands, seeds from Jenkins
read_local_file          Read source files from local OpenSearch checkout
patch_local_file         Apply fix via search-and-replace on local files
run_reproduce_command    Run gradle reproduce command locally
search_repo_code         Search OpenSearch repo for related code
create_pr_from_local     Commit (GPG sign + DCO), push, open PR for review
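One way to picture how an MCP server exposes these tools to an agent is a simple name-to-handler registry with a dispatch step. This is a schematic illustration only — the handler bodies are stubs, and the dispatch shape is an assumption, not the server's actual code:

```python
# Schematic tool registry: the agent selects a tool by name, the server
# dispatches to the matching handler and returns a JSON-serializable result.

def fetch_flaky_issues(label=">test-failure"):
    # Stub: a real handler would query the GitHub API for open issues.
    return {"label": label, "issues": []}

def read_local_file(path):
    # Stub: a real handler would read from the local OpenSearch checkout.
    return {"path": path, "content": ""}

TOOLS = {
    "fetch_flaky_issues": fetch_flaky_issues,
    "read_local_file": read_local_file,
    # ...remaining tools registered the same way
}

def dispatch(tool_name, **kwargs):
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**kwargs)

result = dispatch("fetch_flaky_issues")
```

The key design point is the split the proposal describes: the LLM decides *which* tool to call and *why*, while the server's handlers do only mechanical work.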

Compatibility

The MCP server works with any MCP-compatible AI agent:

  • Kiro CLI — via steering files for automatic workflow guidance
  • Claude CLI — via Agent Skills (SKILL.md)

The goal isn't to replace human reviewers — it's to work together to reduce the flaky test load so the community can focus on building features instead of chasing intermittent failures.

Limitations

This tool can make mistakes — the LLM may misidentify root causes or generate incorrect fixes. That's expected and by design. Every PR it creates goes through the standard review process by maintainers and collaborators. The value is in automating the tedious parts (parsing Jenkins logs, finding seeds, reproducing locally, creating the PR) so that reviewers can focus on validating the fix rather than doing the diagnosis from scratch.

Even when the fix isn't perfect, the PR provides a starting point with the root cause analysis, stack traces, and reproduce commands already gathered — saving significant time for whoever picks up the issue.

Next Steps

  • Gather community feedback on this approach
  • Batch mode: iterate through multiple flaky test issues automatically
  • Explore CI integration for auto-triggering on new AUTOCUT issues
  • Track fix success rate across different flaky test patterns

How Can We Leverage This?

The tool works today as a manual workflow — point it at an issue and it produces a PR. But there are several ways we could scale it to work through the backlog of 200+ open flaky test issues. I'm open to thoughts on approaches like:

  • Nightly batch runs: Automatically iterate through new AUTOCUT issues each night, create fix PRs, and tag maintainers for review
  • CI integration: Trigger the agent when a new >test-failure issue is filed, so a potential fix PR appears alongside the issue within hours
  • Issue input mode: Feed a list of priority flaky test issues and let the agent work through them sequentially
  • Triage assistant: Even when it can't fix the issue, have it add a comment with the root cause analysis, reproduce command, and relevant source code to accelerate manual debugging
  • Pattern learning: Track which flaky patterns the tool fixes successfully and use that to prioritize which issues to attempt
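The nightly batch-run idea above could start as a simple capped loop over open issues. Purely a sketch — the issue fields, the `attempt_fix` callback, and the per-run cap are all assumptions:

```python
def run_batch(issues, attempt_fix, max_per_run=5):
    """Sketch of a nightly batch: attempt a fix for each open flaky issue,
    stopping after max_per_run attempts so reviewers aren't flooded."""
    results = []
    for issue in issues[:max_per_run]:
        outcome = attempt_fix(issue)  # would drive the full MCP workflow
        results.append({"issue": issue["number"], "outcome": outcome})
    return results

# Stubbed example: pretend every attempt produces a draft PR.
issues = [{"number": n} for n in (101, 102, 103)]
report = run_batch(issues, attempt_fix=lambda issue: "pr-created")
```

Capping attempts per run is deliberate: it keeps the review load predictable and makes the fix success rate easy to track batch over batch.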
