Summary
Proposing an AI-powered tool to proactively diagnose and fix flaky tests in OpenSearch. The tool is an MCP server that integrates with AI coding agents (Kiro CLI, Claude CLI) to automate the full lifecycle: from parsing GitHub issues and Jenkins failures to reproducing locally, applying fixes, and creating PRs for maintainer review.
Motivation

We currently have 200+ open flaky test issues with the `>test-failure` label. These are auto-filed by `opensearch-ci-bot` via AUTOCUT when Gradle Check detects flaky tests in post-merge actions or timer-triggered runs on the main branch.
In the past, we have collectively done great work establishing mechanisms to prevent flaky tests (opensearch-project/OpenSearch#17974), but the backlog of existing flaky issues continues to grow.
This proposal is an attempt to proactively fix the existing backlog by automating the tedious parts of flaky test diagnosis and resolution.
Tool
Repository: https://github.com/prudhvigodithi/flaky-test-agent/
This is a tool I've been working on and experimenting with to explore how AI agents can help with flaky test resolution. It's an MCP server that provides specialized tools to any AI coding agent. The agent (LLM) handles the reasoning — analyzing root causes and generating fixes — while the MCP server handles the mechanical operations.
Important: The tool creates PRs that are reviewed by maintainers and collaborators like any other contribution. It does not merge anything automatically — all fixes go through the standard review process.
Sample PR
Here is a sample PR created by the tool for a flaky ClusterStatsIT test: prudhvigodithi/OpenSearch#31
Architecture
GitHub Issue (AUTOCUT)
→ parse_issue: Extract Jenkins build URLs, test class, test methods
Jenkins Build (authenticated)
→ fetch_jenkins_failure: Stack traces, REPRODUCE WITH commands, test seeds
Local OpenSearch Checkout
→ read_local_file / patch_local_file: Read and modify test source
→ run_reproduce_command: Run gradle reproduce command with the exact seed
GitHub API
→ create_pr_from_local: Commit (with DCO sign-off), push, and open PR for review
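To illustrate the first stage of this pipeline, here is a minimal sketch of the kind of extraction `parse_issue` performs on an AUTOCUT issue body. The regexes and the `ParsedIssue` shape below are hypothetical, not the tool's actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedIssue:
    # Hypothetical result shape; the real tool's output may differ.
    jenkins_urls: list
    test_class: str
    test_methods: list

def parse_issue_body(body: str) -> ParsedIssue:
    # Jenkins build links embedded in the AUTOCUT issue body
    urls = re.findall(r"https://build\.ci\.opensearch\.org/\S+", body)
    # Fully qualified integration-test class, e.g. org.opensearch...ClusterStatsIT
    cls = re.search(r"\b((?:\w+\.)+\w+IT)\b", body)
    # Failing method names referenced as SomethingIT.testMethod
    methods = re.findall(r"IT\.(\w+)", body)
    return ParsedIssue(urls, cls.group(1) if cls else "", methods)
```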
Workflow
- Parse the GitHub issue to extract Jenkins build details
- Fetch stack traces, reproduce commands, and test seeds from Jenkins
- Read the test source code from a local OpenSearch checkout
- Analyze the root cause (done by the LLM, not the tool)
- Reproduce the failure locally using the exact seed from Jenkins
- Apply the fix via search-and-replace
- Verify the fix passes locally with the same seed
- Create a PR with DCO sign-off for maintainer review
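To make the reproduce-and-verify steps concrete, here is a minimal sketch of how that loop could look. The "REPRODUCE WITH" command shape is assumed from typical Gradle check output, and the run count is an arbitrary choice; the tool's actual behavior may differ:

```python
import subprocess

def build_reproduce_command(module: str, test_class: str,
                            method: str, seed: str) -> list:
    # Assumed shape of the "REPRODUCE WITH" gradle invocation from CI logs
    return [
        "./gradlew", f"{module}:internalClusterTest",
        "--tests", test_class,
        f"-Dtests.method={method}",
        f"-Dtests.seed={seed}",
    ]

def verify_with_seed(cmd: list, checkout: str, runs: int = 5) -> bool:
    # The seed pins the randomized-testing state, but timing-dependent
    # flakiness can still pass by luck, so require several clean runs.
    for _ in range(runs):
        result = subprocess.run(cmd, cwd=checkout, capture_output=True)
        if result.returncode != 0:
            return False
    return True
```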
Available Tools
| Tool | Description |
|------|-------------|
| `fetch_flaky_issues` | List open flaky-test issues from GitHub |
| `parse_issue` | Extract Jenkins URLs, test names, commits from an issue |
| `fetch_jenkins_failure` | Get stack traces, reproduce commands, seeds from Jenkins |
| `read_local_file` | Read source files from local OpenSearch checkout |
| `patch_local_file` | Apply fix via search-and-replace on local files |
| `run_reproduce_command` | Run gradle reproduce command locally |
| `search_repo_code` | Search OpenSearch repo for related code |
| `create_pr_from_local` | Commit (GPG sign + DCO), push, open PR for review |
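As an illustration of the search-and-replace contract behind `patch_local_file`, a sketch could look like this. The require-exactly-one-match safety check is an assumed behavior, not confirmed from the repo:

```python
from pathlib import Path

def patch_local_file(path: str, search: str, replace: str) -> bool:
    # Refuse ambiguous patches: the search anchor must occur exactly
    # once, so the agent cannot silently modify the wrong site.
    p = Path(path)
    text = p.read_text()
    if text.count(search) != 1:
        return False
    p.write_text(text.replace(search, replace))
    return True
```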
Compatibility
The MCP server works with any MCP-compatible AI agent:
- Kiro CLI — via steering files for automatic workflow guidance
- Claude CLI — via Agent Skills (`SKILL.md`)
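For illustration, registering the server with an MCP-compatible client typically means adding an entry along these lines to the client's MCP config; the command, module path, and environment variable names below are hypothetical placeholders, not the repo's documented setup:

```json
{
  "mcpServers": {
    "flaky-test-agent": {
      "command": "python",
      "args": ["-m", "flaky_test_agent.server"],
      "env": {
        "GITHUB_TOKEN": "<token>",
        "JENKINS_USER": "<user>"
      }
    }
  }
}
```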
The goal isn't to replace human reviewers — it's to work together to reduce the flaky test load so the community can focus on building features instead of chasing intermittent failures.
Limitations
This tool can make mistakes — the LLM may misidentify root causes or generate incorrect fixes. That's expected and by design. Every PR it creates goes through the standard review process by maintainers and collaborators. The value is in automating the tedious parts (parsing Jenkins logs, finding seeds, reproducing locally, creating the PR) so that reviewers can focus on validating the fix rather than doing the diagnosis from scratch.
Even when the fix isn't perfect, the PR provides a starting point with the root cause analysis, stack traces, and reproduce commands already gathered — saving significant time for whoever picks up the issue.
Next Steps
How Can We Leverage This?
The tool works today as a manual workflow — point it at an issue and it produces a PR. But there are several ways we could scale this to reduce the 200+ open flaky test backlog. I'm open to thoughts on approaches like:
- Nightly batch runs: Automatically iterate through new AUTOCUT issues each night, create fix PRs, and tag maintainers for review
- CI integration: Trigger the agent when a new `>test-failure` issue is filed, so a potential fix PR appears alongside the issue within hours
- Issue input mode: Feed a list of priority flaky test issues and let the agent work through them sequentially
- Triage assistant: Even when it can't fix the issue, have it add a comment with the root cause analysis, reproduce command, and relevant source code to accelerate manual debugging
- Pattern learning: Track which flaky patterns the tool fixes successfully and use that to prioritize which issues to attempt.
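The pattern-learning idea could be as simple as ranking open issues by the agent's historical success with each failure pattern. A minimal sketch, where the `pattern` field and its categories are hypothetical labels rather than anything the tool tracks today:

```python
from collections import Counter

def prioritize(open_issues: list, fix_history: list) -> list:
    # fix_history: failure-pattern labels (e.g. "timeout",
    # "assertion-race") from previously merged fix PRs. Issues whose
    # pattern the agent has fixed successfully are attempted first.
    wins = Counter(fix_history)
    return sorted(open_issues,
                  key=lambda i: wins[i["pattern"]], reverse=True)
```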