Workflow: Add kelos-retrospective TaskSpawner for continuous PR outcome analysis and prompt improvement #513

@kelos-bot

Description

🤖 Kelos Agent @gjkim42

Summary

The self-development loop has five specialized agents (triage, workers, self-update, fake-user, fake-strategist), but no agent systematically analyzes the outcomes of their work. The loop generates PRs, but nobody measures whether those PRs are getting better or worse over time, identifies recurring rejection patterns, or feeds concrete learnings back into prompts.

Issue #508 demonstrated the value of this analysis — a one-time manual review found a 40% closed-without-merge rate and identified 5 systemic prompt gaps. But this analysis was performed once by a human-triggered strategist run. It should be a continuous, automated part of the loop.

Problem

The feedback gap in the current self-development pipeline

Issues → [triage] → Triaged Issues → [workers] → PRs → {merged | closed}
                                                           ↑
              [self-update] reviews configs ────────────────┘ (weak link)
              [fake-strategist] proposes features ──────────┘ (no link)
              [fake-user] tests DX ─────────────────────────┘ (no link)

The self-update agent reviews config files daily but:

  • Has no structured data about which PRs succeeded or failed
  • Relies on browsing recent PRs ad-hoc rather than systematic analysis
  • Cannot identify statistical trends (e.g., "PRs for kind/api issues fail 80% of the time")
  • Cannot track whether its own previous improvements actually helped

The workers agent creates PRs but has no way to learn from past failures. Each task starts fresh with no accumulated knowledge beyond what's hardcoded in the prompt.

Evidence this matters

From the current PR data (last ~50 kelos-generated PRs):

The 40% rejection rate found in #508 has fluctuated but not systematically improved. Without continuous measurement, we can't know if prompt changes in #510 (the fix for #508) actually improved outcomes.
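As a sketch of what "continuous measurement" could look like, the snippet below computes a merge rate from records shaped like the JSON that `gh pr list --json number,state,mergedAt` emits. The sample records are illustrative, not real kelos PR data:

```python
import json

# Sample records in the shape of `gh pr list --json number,state,mergedAt`.
# gh reports state as OPEN, CLOSED, or MERGED. Numbers here are made up.
sample = json.loads("""[
  {"number": 101, "state": "MERGED", "mergedAt": "2025-01-06T10:00:00Z"},
  {"number": 102, "state": "CLOSED", "mergedAt": null},
  {"number": 103, "state": "MERGED", "mergedAt": "2025-01-07T12:00:00Z"},
  {"number": 104, "state": "CLOSED", "mergedAt": null},
  {"number": 105, "state": "MERGED", "mergedAt": "2025-01-08T09:30:00Z"}
]""")

def merge_rate(prs):
    """Fraction of PRs that were merged (merged / total)."""
    if not prs:
        return 0.0
    merged = sum(1 for pr in prs if pr["state"] == "MERGED")
    return merged / len(prs)

print(f"merge rate: {merge_rate(sample):.0%}")  # 3 of 5 merged -> 60%
```

Tracking this number week-over-week is exactly the comparison a retrospective run would make against prior retrospective issues.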

What's missing vs. what exists

| Existing issue | What it covers | What it doesn't cover |
|---|---|---|
| #508 (Prompt hardening) | One-time analysis of 5 failure patterns | Continuous monitoring, trend tracking |
| #355 (Cost metrics) | Prometheus counters for cost/tokens | PR outcome tracking, merge rates |
| #390 (Telemetry) | Anonymous usage analytics | Self-development effectiveness |
| #455 (Context enrichment) | Triage-to-worker metadata | Worker-to-outcome feedback |
| #495 (Improve self-dev workflow) | Umbrella for workflow redesign | No specific proposal for retrospective |

Proposal: kelos-retrospective TaskSpawner

Add a new cron-based TaskSpawner that runs weekly, systematically analyzes PR outcomes, and produces either:

  1. Concrete prompt change issues (backed by statistical evidence), or
  2. A structured report when no actionable changes are found

Proposed config: self-development/kelos-retrospective.yaml

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: kelos-retrospective
spec:
  when:
    cron:
      schedule: "0 0 * * 1"  # Weekly on Monday at midnight UTC
  maxConcurrency: 1
  taskTemplate:
    workspaceRef:
      name: kelos-agent
    model: opus
    type: claude-code
    ttlSecondsAfterFinished: 864000
    credentials:
      type: oauth
      secretRef:
        name: kelos-credentials
    podOverrides:
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
          ephemeral-storage: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "2Gi"
    agentConfigRef:
      name: kelos-dev-agent
    promptTemplate: |
      You are a retrospective analyst for the Kelos self-development loop.
      Your job is to measure the effectiveness of agent-generated PRs and
      identify evidence-backed improvements to the worker prompt.

      ## Step 1: Collect PR outcome data (last 7 days)

      Fetch all recent kelos-generated PRs:
      ```
      gh pr list --state all --label generated-by-kelos --limit 50 --json number,title,state,mergedAt,closedAt,body,labels
      ```

      For each PR, classify it:
      - **Merged**: Successfully contributed to the project
      - **Closed without merge**: Rejected — investigate why

      For each closed PR, read its review comments to categorize the rejection:
      ```
      gh api repos/{owner}/{repo}/pulls/{number}/reviews
      gh api repos/{owner}/{repo}/pulls/{number}/comments
      gh pr view {number} --comments
      ```

      Categorize rejections into failure modes:
      - **Scope creep**: Agent added unrequested features
      - **Design disagreement**: Maintainer wanted a different approach
      - **Duplicate/existing code**: Agent created something that already existed
      - **Format/convention violation**: PR description, commit format, etc.
      - **Quality issue**: Tests missing, code incorrect, review feedback ignored
      - **Not actionable**: Issue was not suitable for autonomous agent work
      - **Stale/superseded**: Another PR addressed the same issue first

      ## Step 2: Compute metrics

      Calculate:
      - Total PRs created this week
      - Merge rate (merged / total)
      - Rejection rate by failure mode
      - Average reset count (how many /reset-worker cycles per merged PR)
      - Compare to previous weeks if data is available from prior retrospective issues

      ## Step 3: Identify actionable patterns

      For each failure mode with 2+ occurrences:
      - Read the current worker prompt in `self-development/kelos-workers.yaml`
      - Check if the prompt already addresses this failure mode
      - If not, propose a specific prompt addition with exact wording

      For merged PRs that required multiple resets:
      - What did the agent miss on the first attempt?
      - Could the prompt be clearer about that scenario?

      ## Step 4: Output

      If you find actionable improvements (new failure patterns not yet addressed
      in the prompt, or evidence that previous prompt changes didn't help):
        Create a GitHub issue with:
        - Title: "Retrospective: [week date range] — [key finding]"
        - Body with: metrics summary, failure mode breakdown, specific prompt changes
        ```
        gh issue create --title "..." --body "..." --label generated-by-kelos
        ```

      If all failure modes are already addressed in the prompt and merge rate
      is above 70%: exit without creating an issue. Not every run needs output.

      ## Constraints
      - Only analyze PRs from the last 7 days (avoid re-analyzing old data)
      - Do NOT create PRs — only create issues with proposals
      - Check existing issues first to avoid duplicates: `gh issue list --label generated-by-kelos --limit 20`
      - Be specific: cite PR numbers, quote review comments, propose exact prompt wording
      - Do not create vague "we should improve X" issues — every proposal must include concrete text changes
  pollInterval: 1m
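Steps 1–3 of the prompt above amount to a classify-and-aggregate pass. A minimal sketch of that logic, using the failure-mode labels from the prompt (the classification itself would be done by the agent reading review comments, so here `failure_mode` is simply a pre-filled field on hypothetical input records):

```python
from collections import Counter

# Hypothetical pre-classified PR outcomes; in the real task the agent
# assigns failure_mode by reading each closed PR's review comments.
outcomes = [
    {"number": 201, "state": "MERGED", "failure_mode": None},
    {"number": 202, "state": "CLOSED", "failure_mode": "Scope creep"},
    {"number": 203, "state": "CLOSED", "failure_mode": "Scope creep"},
    {"number": 204, "state": "CLOSED", "failure_mode": "Quality issue"},
    {"number": 205, "state": "MERGED", "failure_mode": None},
]

def actionable_patterns(prs, threshold=2):
    """Return failure modes with >= threshold occurrences (Step 3)."""
    modes = Counter(
        pr["failure_mode"] for pr in prs
        if pr["state"] == "CLOSED" and pr["failure_mode"]
    )
    return {mode: n for mode, n in modes.items() if n >= threshold}

print(actionable_patterns(outcomes))  # {'Scope creep': 2}
```

Only the modes that clear the threshold would trigger a prompt-change proposal; one-off rejections are noise at this volume.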

Why this is different from kelos-self-update

| Aspect | kelos-self-update | kelos-retrospective |
|---|---|---|
| Frequency | Daily | Weekly |
| Scope | Reviews config files for drift/best practices | Analyzes PR outcomes for effectiveness |
| Input data | Self-development YAML files, recent agent activity | Merged/closed PR data, review comments, rejection reasons |
| Output | Config alignment issues | Evidence-backed prompt changes with metrics |
| Method | Read configs → compare to conventions | Collect data → compute metrics → identify patterns → propose changes |
| Tracks progress | No | Yes (compares metrics to previous weeks) |

Expected impact

  1. Continuous merge rate tracking: The current ~53% merge rate becomes a monitored metric that the team can see trending up or down week-over-week
  2. Evidence-backed prompt changes: Instead of one-off analyses like #508 (Harden kelos-workers prompt to reduce closed-without-merge rate), every prompt improvement is backed by data from recent PR outcomes
  3. Faster feedback: When a new prompt change (like #510, the fix for #508) is deployed, the next weekly retrospective can measure whether it actually improved merge rates
  4. Pattern detection at scale: As the volume of agent PRs grows, manual analysis becomes impractical; an automated retrospective scales with the pipeline
  5. Reduced wasted compute: Each rejected PR costs $5-50+ in compute (opus tasks). A 10% improvement in merge rate at current volumes saves ~$50-500/week
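The savings figure in point 5 is consistent with a volume of roughly 100 agent PRs per week; that volume is an assumption for illustration, since the actual weekly volume is not stated above:

```python
# Back-of-the-envelope check of the "~$50-500/week" savings claim.
# The 100 PRs/week volume is an assumption, not measured data.
prs_per_week = 100
cost_low, cost_high = 5, 50        # $ per rejected opus PR
improvement = 0.10                 # 10-point merge-rate improvement

fewer_rejections = prs_per_week * improvement    # 10 fewer rejected PRs
savings_low = fewer_rejections * cost_low
savings_high = fewer_rejections * cost_high
print(f"${savings_low:.0f}-{savings_high:.0f}/week")
```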

Implementation notes

  • The TaskSpawner is self-contained — no code changes needed, just a new YAML file in self-development/
  • Follows the same pattern as the existing cron-based spawners (fake-strategist, fake-user, self-update)
  • Uses the same workspace, credentials, and agentConfig as other self-dev agents
  • A weekly cadence balances accumulating enough data to analyze against running too frequently
