AI-powered test failure analysis and automated remediation for GitHub Actions
Meet Artemis, your CI watchdog
Claude Code Watchdog automatically analyzes test failures in your CI/CD pipeline, providing intelligent insights and automated fixes. Instead of getting overwhelmed by flaky test notifications, get actionable analysis that helps you focus on real issues.
Key capabilities:
- Intelligent Analysis: AI-powered test failure analysis with pattern recognition
- Failure Classification: Distinguishes chronic issues from flaky tests based on failure rates
- Automated Issues: Creates detailed GitHub issues with context and actionable recommendations
- Self-Healing: Implements fixes for common problems automatically
- Smart Notifications: Provides severity-based alerts to reduce noise
Add this step to your workflow after your tests:
```yaml
- name: Test failure analysis
  if: failure()
  uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'  # Adjust to your test output location
```

When tests fail, the action will:
- Analyze test outputs and failure patterns
- Determine severity based on failure frequency
- Create or update GitHub issues with detailed analysis
- Optionally implement automated fixes via pull requests
- Pattern Recognition: Distinguishes chronic failures (80%+ fail rate) from isolated incidents
- Root Cause Detection: Correlates failures with recent commits and changes
- Test Output Parsing: Understands JUnit XML, JSON reports, and log files
- Historical Context: Analyzes the last 20 workflow runs for trends
- No Duplicates: Updates existing issues instead of creating spam
- Consistent Naming: `Watchdog [Workflow Name]: Description` for easy filtering
- Rich Context: Includes failure patterns, recent commits, and actionable recommendations
- Smart Labels: Automatically tags with severity and failure type
- Safe Fixes: Only implements changes it's confident about
- Common Patterns: Fixes timeouts, flaky selectors, deprecated APIs
- PR Creation: Creates branches and PRs with clear descriptions
- Test Verification: Can re-run tests to verify fixes work
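Self-healing is controlled by two inputs, documented in full in the configuration reference below. A minimal sketch of switching it on:

```yaml
- name: Artemis with self-healing
  if: failure()
  uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'
    create_fixes: 'true'  # Attempt fixes for patterns it is confident about
    rerun_tests: 'true'   # Re-run tests to verify the fix works
```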
| Pattern | Failure Rate | Artemis Response |
|---|---|---|
| 🔴 Chronic | 80%+ | Upgrades severity, immediate attention |
| 🟡 Frequent | 50-79% | Creates high-priority issues |
| 🟠 Intermittent | 20-49% | Standard monitoring and analysis |
| 🟢 Isolated | <20% | May downgrade severity, likely flaky |
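The `severity_threshold` input (see the configuration reference below) lets you map this classification onto your own noise tolerance. For example, to act only on the most serious patterns:

```yaml
- uses: cardscan-ai/claude-code-watchdog@v0.2
  if: failure()
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'
    severity_threshold: 'high'  # Skip failures classified below 'high'
```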
Perfect for most CI workflows:
```yaml
name: CI with Watchdog
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # For creating fix PRs
      issues: write         # For creating issues
      pull-requests: write  # For creating PRs
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '18'

      - name: Install and test
        id: tests
        run: |
          npm ci
          npm test
        continue-on-error: true  # Keep the job running so Artemis can analyze failures

      - name: Artemis failure analysis
        # continue-on-error above masks failure(), so check the step outcome instead
        if: steps.tests.outcome == 'failure'
        id: watchdog
        uses: cardscan-ai/claude-code-watchdog@v0.2
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          test_results_path: 'test-results/**/*.xml'

      - name: Notify team on critical failures
        if: steps.tests.outcome == 'failure' && steps.watchdog.outputs.severity == 'critical'
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          channel: '#critical-alerts'
          title: '🚨 Critical Test Failure'
          message: |
            Severity: ${{ steps.watchdog.outputs.severity }}
            Action: ${{ steps.watchdog.outputs.action_taken }}
            Issue: #${{ steps.watchdog.outputs.issue_number }}
          mention: 'channel'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

Perfect for health checks and integration tests:
```yaml
name: API Health Check
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  health-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      issues: write
    steps:
      - uses: actions/checkout@v4

      - name: Run API tests
        id: api-tests
        run: |
          # Your API tests (Postman, curl, etc.)
          newman run api-tests.json --reporters json --reporter-json-export results.json
        continue-on-error: true

      - name: Artemis analysis
        # continue-on-error masks failure(), so check the step outcome instead
        if: steps.api-tests.outcome == 'failure'
        uses: cardscan-ai/claude-code-watchdog@v0.2
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          test_results_path: 'results.json'  # Newman output file
          create_fixes: 'false'              # Just analysis for API tests
          severity_threshold: 'low'          # Monitor everything
```

Maximum automation - Artemis tries to fix and verify:
```yaml
- name: Full auto-healing
  if: failure()
  id: watchdog  # Needed so the next step can read the outputs
  uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'
    create_fixes: 'true'       # Try to implement fixes
    rerun_tests: 'true'        # Verify fixes work
    severity_threshold: 'low'  # Handle all failures

- name: Auto-merge if fixed
  if: steps.watchdog.outputs.tests_passing == 'true'
  run: gh pr merge ${{ steps.watchdog.outputs.pr_number }} --squash
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # The gh CLI needs a token in Actions
```

| Input | Description | Default |
|---|---|---|
| `anthropic_api_key` | Anthropic API key for Claude | Required |
| `test_results_path` | Path or glob pattern to test result files (e.g., `test-results/**/*.xml`, `cypress/reports/*.json`) | Required |
| `severity_threshold` | Minimum severity to process (`ignore`/`low`/`medium`/`high`/`critical`) | `medium` |
| `create_issues` | Create GitHub issues for failures | `true` |
| `create_fixes` | Attempt to implement fixes automatically | `true` |
| `rerun_tests` | Re-run tests to verify fixes work | `false` |
| `debug_mode` | Upload debugging artifacts and detailed logs | `false` |
| `safe_mode` | Skip potentially risky external content (GitHub issues, PRs, commit messages) | `false` |
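For reference, a step that spells out every input (a sketch with defaults shown explicitly, so only the first two `with:` lines are strictly needed):

```yaml
- uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}  # Required
    test_results_path: 'test-results/**/*.xml'           # Required
    severity_threshold: 'medium'                         # Default
    create_issues: 'true'                                # Default
    create_fixes: 'true'                                 # Default
    rerun_tests: 'false'                                 # Default
    debug_mode: 'false'                                  # Default
    safe_mode: 'false'                                   # Default
```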
| Output | Description |
|---|---|
| `severity` | Failure severity (`ignore`/`low`/`medium`/`high`/`critical`) |
| `action_taken` | What Artemis did (`issue_created`/`issue_updated`/`pr_created`/etc.) |
| `issue_number` | GitHub issue number if created/updated |
| `pr_number` | PR number if fixes were created |
| `tests_passing` | `true` if re-run tests passed after fixes |
Use the severity output to control notifications:
```yaml
- name: Critical failure alerts
  if: failure() && steps.watchdog.outputs.severity == 'critical'
  uses: 8398a7/action-slack@v3
  with:
    status: failure
    channel: '#critical-alerts'
    title: '🚨 Critical Test Failure'
    message: |
      Severity: ${{ steps.watchdog.outputs.severity }}
      Action: ${{ steps.watchdog.outputs.action_taken }}
      Issue: #${{ steps.watchdog.outputs.issue_number }}
    mention: 'channel'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

- name: Auto-fix success notifications
  if: steps.watchdog.outputs.tests_passing == 'true'
  uses: 8398a7/action-slack@v3
  with:
    status: success
    title: '✅ Tests Auto-Fixed'
    message: 'Watchdog automatically resolved test failures'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

Artemis needs different permissions depending on how much you let it do. Analysis-only mode needs:

```yaml
permissions:
  contents: read
  issues: write
```

Full automation needs:

```yaml
permissions:
  contents: write        # Create branches and commits
  issues: write          # Create/update issues
  pull-requests: write   # Create PRs with fixes
```

The action gracefully falls back to analysis-only mode if these permissions aren't available.
- Sign up at console.anthropic.com
- IMPORTANT: Set up spending limits and budget alerts for your account
- Create an API key with appropriate usage limits
- Add it to your repository secrets as `ANTHROPIC_API_KEY`

Go to your repository → Settings → Secrets and variables → Actions:

- Name: `ANTHROPIC_API_KEY`
- Value: your Anthropic API key (starts with `sk-ant-`)
Add the watchdog step to your existing test workflows (see examples above).
Add the required permissions to your workflow (see permissions section).
You're all set!
Before calling Claude, the action automatically gathers:
- Repository permissions - What actions can be taken
- Existing issues/PRs - Avoid duplicates and update existing items
- Workflow run history - Calculate failure rates and patterns
- Recent commits - Identify potential causes
- Test output files - Find JUnit XML, JSON reports, logs
Claude then:
- Parses test outputs intelligently across multiple formats
- Correlates failures with recent changes and patterns
- Determines severity based on failure rate and impact
- Makes decisions about issues, fixes, and notifications
- Implements fixes safely when confident
- Verifies fixes by re-running tests if requested
Based on the analysis:
- Updates existing issues instead of creating duplicates
- Creates PRs with fixes for automatable problems
- Provides detailed context for human investigation
- Sets appropriate severity for intelligent notifications
Artemis fits a range of testing scenarios:

- Scheduled health checks every few hours
- Contract testing between services
- Authentication timeout detection and fixing
- Network failure vs code bug differentiation
- Flaky selector detection and updating
- Timing issue identification and retry logic
- Environment drift detection
- Test data management issues
- Deprecated API usage updates
- Assertion modernization
- Test isolation improvements
- Performance regression tracking
- Build failure pattern analysis
- Deployment gate reliability monitoring
- Cross-platform test consistency
- Security scan failure investigation
```yaml
# Only handle serious issues
- uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    severity_threshold: 'high'  # Ignore low/medium failures
```

```yaml
# Conservative approach - just create issues
- uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    create_fixes: 'false'
    rerun_tests: 'false'
```

```yaml
# Maximum automation
- uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    create_fixes: 'true'
    rerun_tests: 'true'
    severity_threshold: 'low'  # Handle everything
```

A typical issue created by Artemis looks like this:

```markdown
# Watchdog [API Tests]: Authentication timeout in user service
**Workflow:** API Tests
**Run:** [#1234](https://github.com/org/repo/actions/runs/1234)
**Severity:** High
**Pattern:** Frequent (67% failure rate over last 20 runs)
## 🔍 Failure Analysis
The user authentication endpoint is consistently timing out after 5 seconds. This started happening 2 days ago after commit abc123 which updated the auth service dependencies.
## 📊 Pattern Analysis
- **Total runs analyzed:** 20
- **Failed runs:** 13
- **Failure rate:** 67%
- **Pattern:** Frequent
This represents a significant reliability issue that's blocking multiple workflows.
## 🔧 Recommendations
- [ ] Investigate auth service performance after recent dependency updates
- [ ] Consider increasing timeout from 5s to 10s as temporary fix
- [ ] Check database connection pool settings
- [ ] Review auth service logs for commit abc123 timeframe
## 📋 Context
- **Commit:** abc123456
- **Actor:** developer-name
- **Event:** schedule
---
*Auto-generated by Claude Code Watchdog*
```

Every run generates a detailed analysis report uploaded as a GitHub artifact:
```
watchdog-report-{run-id}/
└── final-report.md   # Comprehensive analysis summary
```
The report includes:
- Analysis results (severity, actions taken)
- Failure patterns and context
- Issue/PR numbers created
- Historical data summary
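If you want the report in a follow-up job, the standard artifact action should work (a sketch; assumes the `watchdog-report-{run-id}` naming shown above):

```yaml
- uses: actions/download-artifact@v4
  with:
    name: watchdog-report-${{ github.run_id }}  # Matches watchdog-report-{run-id}
    path: watchdog-report

- name: Print the analysis summary
  run: cat watchdog-report/final-report.md
```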
Enable debug mode for detailed troubleshooting:
```yaml
- uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    debug_mode: 'true'  # Upload all analysis data
```

Debug artifacts include:
```
watchdog-debug-{run-id}/
├── .watchdog/
│   ├── context-summary.json   # Run context
│   ├── failure-analysis.json  # Failure patterns
│   ├── existing-issues.json   # Related issues
│   ├── recent-runs.json       # Workflow history
│   └── test-files.txt         # Test files found
├── test-results.json          # Your test outputs
├── junit-results.xml          # JUnit files
└── *.log                      # Test logs
```
Perfect for:
- Understanding why Claude made specific decisions
- Debugging pattern recognition
- Seeing exactly what test data was analyzed
- Troubleshooting action behavior
⚠️ IMPORTANT DISCLAIMER: Cost estimates are approximate and may vary significantly based on your specific use case, test output size, and complexity. CardScan.ai provides NO warranty or guarantee regarding actual costs incurred. Usage costs are your responsibility.

🚨 STRONGLY RECOMMENDED: Set up API key spending limits and budgets you are comfortable with before using this action. Monitor your Anthropic API usage regularly.
Claude Code Watchdog uses the Anthropic API, so each run incurs a cost based on token usage.
| Configuration | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Analysis only | ~2-3k | ~500-1k | ~$0.20-$0.40 |
| Analysis + Issue creation | ~3-4k | ~1-2k | ~$0.40-$0.60 |
| Analysis + Fixes + PR | ~4-6k | ~2-4k | ~$0.60-$1.20 |
| Complex fixes + Re-run | ~6-8k | ~3-5k | ~$1.00-$1.80 |
Input tokens (what Claude reads):
- Context data (runs, commits, issues): ~1-2k tokens
- Test output files: ~1-3k tokens (varies by test size)
- Configuration and prompts: ~500 tokens
Output tokens (what Claude generates):
- Analysis and recommendations: ~500-1k tokens
- Issue/PR descriptions: ~500-1k tokens
- Code fixes: ~500-2k tokens (varies by complexity)
- Multiple fix attempts: Can increase cost
- Start conservative: Use `create_fixes: 'false'` initially
- Limit scope: Use `severity_threshold` to avoid low-priority runs
- Monitor usage: Check cost estimates in analysis reports and your Anthropic dashboard
- Schedule wisely: Run demo and scheduled workflows monthly rather than daily
- Debug selectively: Only enable `debug_mode` when needed
- Set spending limits: Configure budget alerts in your Anthropic account
- Test cautiously: Start with non-critical workflows to understand actual costs
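Combined, a cost-conscious starting configuration might look like this (a sketch using the inputs documented above):

```yaml
- uses: cardscan-ai/claude-code-watchdog@v0.2
  if: failure()
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'
    severity_threshold: 'high'  # Skip low-priority failures entirely
    create_fixes: 'false'       # Analysis only while you calibrate costs
    debug_mode: 'false'         # Enable only when troubleshooting
```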
⚠️ These are rough estimates only - your actual costs may be significantly higher or lower
- Light usage (5 failures/month, analysis only): ~$2-3/month
- Regular usage (15 failures/month, fixes enabled): ~$8-12/month
- Heavy usage (30 failures/month, full automation): ~$20-30/month
IMPORTANT: These estimates assume typical test output sizes. Large test suites, verbose logs, or complex codebases can significantly increase token usage and costs.
The action shows actual costs (when available) in console output and detailed breakdowns in analysis reports. Always monitor your Anthropic API usage dashboard for real spending.
β "GitHub CLI not authenticated"
- Ensure your workflow has a valid
GITHUB_TOKEN - Default
GITHUB_TOKENis automatically available in most cases
β "Anthropic API key required"
- Add your API key to repository secrets as
ANTHROPIC_API_KEY - Verify the secret name matches exactly
β "No push permissions - cannot create PRs"
- Add
contents: writeandpull-requests: writeto your workflow permissions - Or set
create_fixes: falsefor analysis-only mode
β "No test output files found"
- Ensure your tests output JUnit XML, JSON reports, or log files
- Check that test files match the patterns:
*test*.xml,*test*.json, etc.
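For example, if your test runner doesn't emit JUnit XML by default, write it to a known location and match it with the glob (a sketch using pytest; substitute your own runner and command):

```yaml
- name: Run tests
  id: tests
  run: pytest --junitxml=test-results/results.xml  # pytest's built-in JUnit XML output
  continue-on-error: true

- name: Artemis analysis
  if: steps.tests.outcome == 'failure'  # step outcome, since continue-on-error masks failure()
  uses: cardscan-ai/claude-code-watchdog@v0.2
  with:
    anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
    test_results_path: 'test-results/**/*.xml'  # Glob matches the file written above
```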
- Check the workflow logs - Artemis provides detailed output about what it's doing
- Review permissions - Many issues are permission-related
- Validate test outputs - Ensure your tests create parseable output files
- Start simple - Begin with `create_fixes: 'false'` and add features gradually
- Use debug mode - Enable `debug_mode: 'true'` to see exactly what data Claude analyzed
For maximum security, pin actions to specific commit SHAs instead of using version tags:
```yaml
# Instead of version tags
- uses: cardscan-ai/claude-code-watchdog@v0.3.2

# Use SHA pinning for production
- uses: cardscan-ai/claude-code-watchdog@975fd591cfaa7179bfdedb112558dceca966e87e # v0.3.2
```

- Security: Prevents malicious code injection if tags are compromised
- Immutability: Ensures exact same code runs every time
- Compliance: Required by many security policies (SLSA, OpenSSF)
- Reproducibility: Guarantees consistent builds across environments
You can find the SHA for any release on the releases page.
We love contributions! Here's how to help:
**Bug reports:**

- Use the issue template
- Include workflow logs
- Describe expected vs. actual behavior

**Feature requests:**

- Describe your use case
- Explain how it would help your team
- Consider whether it fits Artemis's core mission

**Pull requests:**

- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - see LICENSE for details.
This project is maintained by CardScan.ai, makers of AI-powered insurance card scanning and eligibility verification tools.
We built this tool because we run scheduled API tests, WebSocket monitoring, and cross-platform SDK validation that can fail for various reasons. We got tired of waking up to notification storms about flaky tests while real issues got buried in the noise.
Claude Code Watchdog helps us focus on what matters: real bugs and breaking changes, not environment hiccups and timing issues.
This entire project was developed using Claude Code, demonstrating the power of AI-assisted software development: no code was written by hand.
Development statistics:

```
Total cost:            $21.27
Total duration (API):  1h 38m 6.7s
Total duration (wall): 8h 52m 56.3s
Total code changes:    2064 lines added, 696 lines removed

Token usage by model:
  claude-3-5-haiku: 650.1k input, 20.0k output, 0 cache read, 0 cache write
  claude-sonnet:    1.4k input, 127.6k output, 35.0m cache read, 2.2m cache write
```
This represents a complete GitHub Action with:
- Complex GitHub Actions workflow orchestration
- Node.js scripts for data processing and validation
- Intelligent duplicate detection and search algorithms
- Cost monitoring and reporting systems
- Comprehensive documentation and examples
- Full error handling and fallback mechanisms
All accomplished through natural language conversations with Claude Code at a cost of $21.27.
