Workflow: Add kelos-retrospective TaskSpawner for continuous PR outcome analysis and prompt improvement #513

@kelos-bot

Description

🤖 Kelos Agent @gjkim42

Summary

The self-development loop has five specialized agents (triage, workers, self-update, fake-user, fake-strategist), but no agent systematically analyzes the outcomes of their work. The loop generates PRs, but nobody measures whether those PRs are getting better or worse over time, identifies recurring rejection patterns, or feeds concrete learnings back into prompts.

Issue #508 demonstrated the value of this analysis — a one-time manual review found a 40% closed-without-merge rate and identified 5 systemic prompt gaps. But this analysis was performed once by a human-triggered strategist run. It should be a continuous, automated part of the loop.

Problem

The feedback gap in the current self-development pipeline

Issues → [triage] → Triaged Issues → [workers] → PRs → {merged | closed}
                                                           ↑
              [self-update] reviews configs ────────────────┘ (weak link)
              [fake-strategist] proposes features ──────────┘ (no link)
              [fake-user] tests DX ─────────────────────────┘ (no link)

The self-update agent reviews config files daily but:

  • Has no structured data about which PRs succeeded or failed
  • Relies on browsing recent PRs ad-hoc rather than systematic analysis
  • Cannot identify statistical trends (e.g., "PRs for kind/api issues fail 80% of the time")
  • Cannot track whether its own previous improvements actually helped

The workers agent creates PRs but has no way to learn from past failures. Each task starts fresh with no accumulated knowledge beyond what's hardcoded in the prompt.

Evidence this matters

From the current PR data (last ~50 kelos-generated PRs):

The 40% rejection rate found in #508 has fluctuated but not systematically improved. Without continuous measurement, we can't know if prompt changes in #510 (the fix for #508) actually improved outcomes.
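As a sketch of what "continuous measurement" could look like, the snippet below computes a merge rate from records shaped like the JSON that `gh pr list --json number,state,mergedAt` emits. The sample records are illustrative, not real kelos PR data:

```python
import json

# Sample records in the shape of `gh pr list --json number,state,mergedAt`.
# gh reports state as OPEN, CLOSED, or MERGED. Numbers here are made up.
sample = json.loads("""[
  {"number": 101, "state": "MERGED", "mergedAt": "2025-01-06T10:00:00Z"},
  {"number": 102, "state": "CLOSED", "mergedAt": null},
  {"number": 103, "state": "MERGED", "mergedAt": "2025-01-07T12:00:00Z"},
  {"number": 104, "state": "CLOSED", "mergedAt": null},
  {"number": 105, "state": "MERGED", "mergedAt": "2025-01-08T09:30:00Z"}
]""")

def merge_rate(prs):
    """Fraction of PRs that were merged (merged / total)."""
    if not prs:
        return 0.0
    merged = sum(1 for pr in prs if pr["state"] == "MERGED")
    return merged / len(prs)

print(f"merge rate: {merge_rate(sample):.0%}")  # 3 of 5 merged -> 60%
```

Tracking this number week-over-week is exactly the comparison a retrospective run would make against prior retrospective issues.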

What's missing vs. what exists

| Existing issue | What it covers | What it doesn't cover |
|---|---|---|
| #508 (Prompt hardening) | One-time analysis of 5 failure patterns | Continuous monitoring, trend tracking |
| #355 (Cost metrics) | Prometheus counters for cost/tokens | PR outcome tracking, merge rates |
| #390 (Telemetry) | Anonymous usage analytics | Self-development effectiveness |
| #455 (Context enrichment) | Triage-to-worker metadata | Worker-to-outcome feedback |
| #495 (Improve self-dev workflow) | Umbrella for workflow redesign | No specific proposal for retrospective |

Proposal: kelos-retrospective TaskSpawner

Add a new cron-based TaskSpawner that runs weekly, systematically analyzes PR outcomes, and produces either:

  1. Concrete prompt change issues (backed by statistical evidence), or
  2. A structured report when no actionable changes are found

Proposed config: self-development/kelos-retrospective.yaml

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: kelos-retrospective
spec:
  when:
    cron:
      schedule: "0 0 * * 1"  # Weekly on Monday at midnight UTC
  maxConcurrency: 1
  taskTemplate:
    workspaceRef:
      name: kelos-agent
    model: opus
    type: claude-code
    ttlSecondsAfterFinished: 864000
    credentials:
      type: oauth
      secretRef:
        name: kelos-credentials
    podOverrides:
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
          ephemeral-storage: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "2Gi"
    agentConfigRef:
      name: kelos-dev-agent
    promptTemplate: |
      You are a retrospective analyst for the Kelos self-development loop.
      Your job is to measure the effectiveness of agent-generated PRs and
      identify evidence-backed improvements to the worker prompt.

      ## Step 1: Collect PR outcome data (last 7 days)

      Fetch all recent kelos-generated PRs:
      ```
      gh pr list --state all --label generated-by-kelos --limit 50 --json number,title,state,mergedAt,closedAt,body,labels
      ```

      For each PR, classify it:
      - **Merged**: Successfully contributed to the project
      - **Closed without merge**: Rejected — investigate why

      For each closed PR, read its review comments to categorize the rejection:
      ```
      gh api repos/{owner}/{repo}/pulls/{number}/reviews
      gh api repos/{owner}/{repo}/pulls/{number}/comments
      gh pr view {number} --comments
      ```

      Categorize rejections into failure modes:
      - **Scope creep**: Agent added unrequested features
      - **Design disagreement**: Maintainer wanted a different approach
      - **Duplicate/existing code**: Agent created something that already existed
      - **Format/convention violation**: PR description, commit format, etc.
      - **Quality issue**: Tests missing, code incorrect, review feedback ignored
      - **Not actionable**: Issue was not suitable for autonomous agent work
      - **Stale/superseded**: Another PR addressed the same issue first

      ## Step 2: Compute metrics

      Calculate:
      - Total PRs created this week
      - Merge rate (merged / total)
      - Rejection rate by failure mode
      - Average reset count (how many /reset-worker cycles per merged PR)
      - Compare to previous weeks if data is available from prior retrospective issues

      ## Step 3: Identify actionable patterns

      For each failure mode with 2+ occurrences:
      - Read the current worker prompt in `self-development/kelos-workers.yaml`
      - Check if the prompt already addresses this failure mode
      - If not, propose a specific prompt addition with exact wording

      For merged PRs that required multiple resets:
      - What did the agent miss on the first attempt?
      - Could the prompt be clearer about that scenario?

      ## Step 4: Output

      If you find actionable improvements (new failure patterns not yet addressed
      in the prompt, or evidence that previous prompt changes didn't help):
        Create a GitHub issue with:
        - Title: "Retrospective: [week date range] — [key finding]"
        - Body with: metrics summary, failure mode breakdown, specific prompt changes
        ```
        gh issue create --title "..." --body "..." --label generated-by-kelos
        ```

      If all failure modes are already addressed in the prompt and merge rate
      is above 70%: exit without creating an issue. Not every run needs output.

      ## Constraints
      - Only analyze PRs from the last 7 days (avoid re-analyzing old data)
      - Do NOT create PRs — only create issues with proposals
      - Check existing issues first to avoid duplicates: `gh issue list --label generated-by-kelos --limit 20`
      - Be specific: cite PR numbers, quote review comments, propose exact prompt wording
      - Do not create vague "we should improve X" issues — every proposal must include concrete text changes
  pollInterval: 1m
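Steps 1–3 of the prompt above amount to a classify-and-aggregate pass. A minimal sketch of that logic, using the failure-mode labels from the prompt (the classification itself would be done by the agent reading review comments, so here `failure_mode` is simply a pre-filled field on hypothetical input records):

```python
from collections import Counter

# Hypothetical pre-classified PR outcomes; in the real task the agent
# assigns failure_mode by reading each closed PR's review comments.
outcomes = [
    {"number": 201, "state": "MERGED", "failure_mode": None},
    {"number": 202, "state": "CLOSED", "failure_mode": "Scope creep"},
    {"number": 203, "state": "CLOSED", "failure_mode": "Scope creep"},
    {"number": 204, "state": "CLOSED", "failure_mode": "Quality issue"},
    {"number": 205, "state": "MERGED", "failure_mode": None},
]

def actionable_patterns(prs, threshold=2):
    """Return failure modes with >= threshold occurrences (Step 3)."""
    modes = Counter(
        pr["failure_mode"] for pr in prs
        if pr["state"] == "CLOSED" and pr["failure_mode"]
    )
    return {mode: n for mode, n in modes.items() if n >= threshold}

print(actionable_patterns(outcomes))  # {'Scope creep': 2}
```

Only the modes that clear the threshold would trigger a prompt-change proposal; one-off rejections are noise at this volume.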

Why this is different from kelos-self-update

| Aspect | kelos-self-update | kelos-retrospective |
|---|---|---|
| Frequency | Daily | Weekly |
| Scope | Reviews config files for drift/best practices | Analyzes PR outcomes for effectiveness |
| Input data | Self-development YAML files, recent agent activity | Merged/closed PR data, review comments, rejection reasons |
| Output | Config alignment issues | Evidence-backed prompt changes with metrics |
| Method | Read configs → compare to conventions | Collect data → compute metrics → identify patterns → propose changes |
| Tracks progress | No | Yes (compares metrics to previous weeks) |

Expected impact

  1. Continuous merge rate tracking: The current ~53% merge rate becomes a monitored metric that the team can see trending up or down week-over-week
  2. Evidence-backed prompt changes: Instead of one-off analyses like #508 (Harden kelos-workers prompt to reduce closed-without-merge rate), every prompt improvement is backed by data from recent PR outcomes
  3. Faster feedback: When a new prompt change (like #510, the fix for #508) is deployed, the next weekly retrospective can measure whether it actually improved merge rates
  4. Pattern detection at scale: As the volume of agent PRs grows, manual analysis becomes impractical; an automated retrospective scales with the pipeline
  5. Reduced wasted compute: Each rejected PR costs $5-50+ in compute (opus tasks). A 10% improvement in merge rate at current volumes saves ~$50-500/week
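The savings figure in point 5 is consistent with a volume of roughly 100 agent PRs per week; that volume is an assumption for illustration, since the actual weekly volume is not stated above:

```python
# Back-of-the-envelope check of the "~$50-500/week" savings claim.
# The 100 PRs/week volume is an assumption, not measured data.
prs_per_week = 100
cost_low, cost_high = 5, 50        # $ per rejected opus PR
improvement = 0.10                 # 10-point merge-rate improvement

fewer_rejections = prs_per_week * improvement    # 10 fewer rejected PRs
savings_low = fewer_rejections * cost_low
savings_high = fewer_rejections * cost_high
print(f"${savings_low:.0f}-{savings_high:.0f}/week")
```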

Implementation notes

  • The TaskSpawner is self-contained — no code changes needed, just a new YAML file in self-development/
  • Follows the same pattern as the existing cron-based spawners (fake-strategist, fake-user, self-update)
  • Uses the same workspace, credentials, and agentConfig as other self-dev agents
  • A weekly cadence balances accumulating enough data to analyze against running too frequently
