[Proposal] AI-Powered Flaky Test Fixer — Automated Diagnosis and Fix  #509

@prudhvigodithi

Description

Summary

Proposing an AI-powered tool to proactively diagnose and fix flaky tests in OpenSearch. The tool is an MCP server that integrates with AI coding agents (Kiro CLI, Claude CLI) to automate the full lifecycle: from parsing GitHub issues and Jenkins failures to reproducing locally, applying fixes, and creating PRs for maintainer review.

Motivation

Flaky Tests

We currently have 200+ open flaky test issues with the >test-failure label. These are auto-filed by opensearch-ci-bot via AUTOCUT when Gradle Check detects flaky tests in post-merge actions or timer-triggered runs on the main branch.

In the past, we have collectively done great work establishing mechanisms to prevent flaky tests (opensearch-project/OpenSearch#17974), but the backlog of existing flaky issues continues to grow.

This proposal is an attempt to proactively fix the existing backlog by automating the tedious parts of flaky test diagnosis and resolution.

Tool

Repository: https://github.com/prudhvigodithi/flaky-test-agent/

This is a tool I've been working on and experimenting with to explore how AI agents can help with flaky test resolution. It's an MCP server that provides specialized tools to any AI coding agent. The agent (LLM) handles the reasoning — analyzing root causes and generating fixes — while the MCP server handles the mechanical operations.

Important: The tool creates PRs that are reviewed by maintainers and collaborators like any other contribution. It does not merge anything automatically — all fixes go through the standard review process.

Sample PR

Here is a sample PR created by the tool for a flaky ClusterStatsIT test: prudhvigodithi/OpenSearch#31

Architecture

GitHub Issue (AUTOCUT)
 → parse_issue: Extract Jenkins build URLs, test class, test methods

Jenkins Build (authenticated)
 → fetch_jenkins_failure: Stack traces, REPRODUCE WITH commands, test seeds

Local OpenSearch Checkout
 → read_local_file / patch_local_file: Read and modify test source
 → run_reproduce_command: Run gradle reproduce command with the exact seed

GitHub API
 → create_pr_from_local: Commit (with DCO sign-off), push, and open PR for review
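To make the first stage of this pipeline concrete, here is a minimal sketch of what parsing an AUTOCUT issue body could look like. The regexes, field names, and the sample body format are illustrative assumptions, not the tool's actual implementation:

```python
import re

def parse_issue(body: str) -> dict:
    """Hypothetical sketch of parse_issue: pull Jenkins build URLs and
    test identifiers out of an AUTOCUT-style issue body."""
    build_urls = re.findall(r"https://build\.ci\.opensearch\.org/\S+", body)
    test_class = re.search(r"Test class:\s*(\S+)", body)
    test_methods = re.findall(r"Test method:\s*(\S+)", body)
    return {
        "jenkins_builds": build_urls,
        "test_class": test_class.group(1) if test_class else None,
        "test_methods": test_methods,
    }

# Example AUTOCUT-style body (format is illustrative only):
body = """
Test class: org.opensearch.action.admin.cluster.stats.ClusterStatsIT
Test method: testClusterStats
Build: https://build.ci.opensearch.org/job/gradle-check/12345/
"""
parsed = parse_issue(body)
```

The downstream stages (fetch_jenkins_failure, run_reproduce_command) would consume this structured output rather than re-scraping the issue text.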

Workflow

  1. Parse the GitHub issue to extract Jenkins build details
  2. Fetch stack traces, reproduce commands, and test seeds from Jenkins
  3. Read the test source code from a local OpenSearch checkout
  4. Analyze the root cause (done by the LLM, not the tool)
  5. Reproduce the failure locally using the exact seed from Jenkins
  6. Apply the fix via search-and-replace
  7. Verify the fix passes locally with the same seed
  8. Create a PR with DCO sign-off for maintainer review
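Steps 5 and 7 above hinge on re-running the test pinned to the exact seed Jenkins reported. A rough sketch of how such a command could be assembled — the Gradle task and flag names follow the typical REPRODUCE WITH line, but treat them as illustrative and verify against the actual Jenkins output:

```python
import shlex

def build_reproduce_command(gradle_task, test_class, test_method, seed):
    """Assemble a Gradle reproduce command pinned to a specific test seed.
    Flag names mirror the typical REPRODUCE WITH line from Gradle Check."""
    return " ".join([
        "./gradlew",
        shlex.quote(gradle_task),
        "--tests",
        shlex.quote(f"{test_class}.{test_method}"),
        f"-Dtests.seed={seed}",
    ])

# Hypothetical values for the ClusterStatsIT example:
cmd = build_reproduce_command(
    ":server:internalClusterTest",
    "org.opensearch.action.admin.cluster.stats.ClusterStatsIT",
    "testClusterStats",
    "5A1B2C3D4E5F6A7B",
)
```

Running the same command before and after the patch (steps 5 and 7) is what gives the PR its "reproduced and verified locally" evidence.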

Available Tools

Tool                     Description
fetch_flaky_issues       List open flaky-test issues from GitHub
parse_issue              Extract Jenkins URLs, test names, commits from an issue
fetch_jenkins_failure    Get stack traces, reproduce commands, seeds from Jenkins
read_local_file          Read source files from local OpenSearch checkout
patch_local_file         Apply fix via search-and-replace on local files
run_reproduce_command    Run gradle reproduce command locally
search_repo_code         Search OpenSearch repo for related code
create_pr_from_local     Commit (GPG sign + DCO), push, open PR for review
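One way to picture how an MCP server exposes these tools to an agent is a simple name-to-handler registry with a dispatch step. This is a schematic illustration only — the handler bodies are stubs, and the dispatch shape is an assumption, not the server's actual code:

```python
# Schematic tool registry: the agent selects a tool by name, the server
# dispatches to the matching handler and returns a JSON-serializable result.

def fetch_flaky_issues(label=">test-failure"):
    # Stub: a real handler would query the GitHub API for open issues.
    return {"label": label, "issues": []}

def read_local_file(path):
    # Stub: a real handler would read from the local OpenSearch checkout.
    return {"path": path, "content": ""}

TOOLS = {
    "fetch_flaky_issues": fetch_flaky_issues,
    "read_local_file": read_local_file,
    # ...remaining tools registered the same way
}

def dispatch(tool_name, **kwargs):
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**kwargs)

result = dispatch("fetch_flaky_issues")
```

The key design point is the split the proposal describes: the LLM decides *which* tool to call and *why*, while the server's handlers do only mechanical work.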

Compatibility

The MCP server works with any MCP-compatible AI agent:

  • Kiro CLI — via steering files for automatic workflow guidance
  • Claude CLI — via Agent Skills (SKILL.md)

The goal isn't to replace human reviewers — it's to work together to reduce the flaky test load so the community can focus on building features instead of chasing intermittent failures.

Limitations

This tool can make mistakes — the LLM may misidentify root causes or generate incorrect fixes. That's expected and by design. Every PR it creates goes through the standard review process by maintainers and collaborators. The value is in automating the tedious parts (parsing Jenkins logs, finding seeds, reproducing locally, creating the PR) so that reviewers can focus on validating the fix rather than doing the diagnosis from scratch.

Even when the fix isn't perfect, the PR provides a starting point with the root cause analysis, stack traces, and reproduce commands already gathered — saving significant time for whoever picks up the issue.

Next Steps

  • Gather community feedback on this approach
  • Batch mode: iterate through multiple flaky test issues automatically
  • Explore CI integration for auto-triggering on new AUTOCUT issues
  • Track fix success rate across different flaky test patterns

How Can We Leverage This?

The tool works today as a manual workflow — point it at an issue and it produces a PR. But there are several ways we could scale it to work through the backlog of 200+ open flaky test issues. I'm open to thoughts on approaches like:

  • Nightly batch runs: Automatically iterate through new AUTOCUT issues each night, create fix PRs, and tag maintainers for review
  • CI integration: Trigger the agent when a new >test-failure issue is filed, so a potential fix PR appears alongside the issue within hours
  • Issue input mode: Feed a list of priority flaky test issues and let the agent work through them sequentially
  • Triage assistant: Even when it can't fix the issue, have it add a comment with the root cause analysis, reproduce command, and relevant source code to accelerate manual debugging
  • Pattern learning: Track which flaky patterns the tool fixes successfully and use that to prioritize which issues to attempt
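The nightly batch-run idea above could start as a simple capped loop over open issues. Purely a sketch — the issue fields, the `attempt_fix` callback, and the per-run cap are all assumptions:

```python
def run_batch(issues, attempt_fix, max_per_run=5):
    """Sketch of a nightly batch: attempt a fix for each open flaky issue,
    stopping after max_per_run attempts so reviewers aren't flooded."""
    results = []
    for issue in issues[:max_per_run]:
        outcome = attempt_fix(issue)  # would drive the full MCP workflow
        results.append({"issue": issue["number"], "outcome": outcome})
    return results

# Stubbed example: pretend every attempt produces a draft PR.
issues = [{"number": n} for n in (101, 102, 103)]
report = run_batch(issues, attempt_fix=lambda issue: "pr-created")
```

Capping attempts per run is deliberate: it keeps the review load predictable and makes the fix success rate easy to track batch over batch.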
