🤖 Kelos Agent @gjkim42
## Summary
The self-development loop has five specialized agents (triage, workers, self-update, fake-user, fake-strategist), but no agent systematically analyzes the outcomes of their work. The loop generates PRs, but nobody measures whether those PRs are getting better or worse over time, identifies recurring rejection patterns, or feeds concrete learnings back into prompts.
Issue #508 demonstrated the value of this analysis — a one-time manual review found a 40% closed-without-merge rate and identified 5 systemic prompt gaps. But this analysis was performed once by a human-triggered strategist run. It should be a continuous, automated part of the loop.
## Problem

The feedback gap in the current self-development pipeline:

```
Issues → [triage] → Triaged Issues → [workers] → PRs → {merged | closed}
                                                  ↑
[self-update] reviews configs ────────────────────┘  (weak link)
[fake-strategist] proposes features ──────────────┘  (no link)
[fake-user] tests DX ─────────────────────────────┘  (no link)
```
The self-update agent reviews config files daily but:
- Has no structured data about which PRs succeeded or failed
- Relies on browsing recent PRs ad-hoc rather than systematic analysis
- Cannot identify statistical trends (e.g., "PRs for `kind/api` issues fail 80% of the time")
- Cannot track whether its own previous improvements actually helped
The workers agent creates PRs but has no way to learn from past failures. Each task starts fresh with no accumulated knowledge beyond what's hardcoded in the prompt.
## Evidence this matters
From the current PR data (last ~50 kelos-generated PRs):
- Merged: docs: add Workspace spec.remotes[] and spec.files[] sub-fields to reference #512, Fix race condition in triage workflow label ordering #505, Include labels when creating PRs instead of adding them after #501, Delete custom resources before controller during uninstall #494, Trigger e2e tests when ok-to-test label is added #492, Add credential URLs for all agent types to config file template #491, docs: group promptTemplate variables by source type #488, Remove kelos/needs-input label from new issue creation #484, Fix stale README and incomplete triage label cleanup #477, Fix incorrect field names in example 05 workspace.yaml #476, Document GitHub App authentication for Workspaces #471, Remove PR creation instructions from kelos-self-update TaskSpawner #469, Complete triage lifecycle with priority, actor, and label cleanup #462, self-development: Fix configuration alignment issues #453, Support immediate re-triggering of completed tasks via TriggerComment #452, Fix GitHub App authentication for GitHub Enterprise Server #451, Add assignee and author filtering to GitHubIssues source #450, Use Kubernetes CronJob for cron-based TaskSpawners #449, Optimize label workflow by consolidating four jobs into one #448, Restrict PR creation to axon-workers TaskSpawner only #445 (20 merged)
- Closed without merge: Fix command matching and commenter identification in workflow triggers #497, Add client-side validation for --type and --credential-type flags with custom agent support #493, Add next-step guidance after 'kelos run' task creation #489, Align Quick Start credential guidance with CLI defaults #487, Add --crd flag to install and uninstall commands #486, Add opt-in --crd flag to install and uninstall commands #470, Complete triage lifecycle with priority, actor, and label cleanup #444, Enhance triage agent to complete full triage lifecycle #438, Support secret references for MCP server headers and env #431, Auto-retry failed Tasks in spawner dedup logic #428, Add anonymous phone-home telemetry via CronJob #424, spawner: auto-replace completed Tasks on re-discovery #415, Add PR reviewer TaskSpawner for automated code review #413, Add axon-pr-reviewer TaskSpawner to replace cubic-dev-ai #409, Add orchestrator pattern example (06-task-dependencies) #403, Update docs to reflect axon get view and detail flags #394, self-development: Add priority classification to triage agent #385, CLI: show task outputs (branch, PR, cost) when --watch completes #381 (18 closed)
- Merge rate: ~53% (20 of 38) — nearly half of all agent work is wasted
The 40% rejection rate found in #508 has fluctuated but not systematically improved. Without continuous measurement, we can't know if prompt changes in #510 (the fix for #508) actually improved outcomes.
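The merge rate quoted above can be reproduced mechanically from `gh pr list` JSON output. A minimal sketch, with illustrative sample data in place of a live `gh` call (the `MERGED`/`CLOSED` state values follow GitHub's API):

```python
import json

# Illustrative sample of the JSON emitted by:
#   gh pr list --state all --label generated-by-kelos --json number,state
raw = '''[
  {"number": 512, "state": "MERGED"},
  {"number": 497, "state": "CLOSED"},
  {"number": 505, "state": "MERGED"}
]'''
prs = json.loads(raw)

merged = sum(1 for pr in prs if pr["state"] == "MERGED")
closed = sum(1 for pr in prs if pr["state"] == "CLOSED")
merge_rate = merged / (merged + closed)
print(f"merged={merged} closed={closed} merge_rate={merge_rate:.0%}")
# → merged=2 closed=1 merge_rate=67%
```

With the issue's actual counts (20 merged, 18 closed) the same computation yields 20/38 ≈ 53%.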
## What's missing vs. what exists
| Existing Issue | What it covers | What it doesn't cover |
|---|---|---|
| #508 (Prompt hardening) | One-time analysis of 5 failure patterns | Continuous monitoring, trend tracking |
| #355 (Cost metrics) | Prometheus counters for cost/tokens | PR outcome tracking, merge rates |
| #390 (Telemetry) | Anonymous usage analytics | Self-development effectiveness |
| #455 (Context enrichment) | Triage-to-worker metadata | Worker-to-outcome feedback |
| #495 (Improve self-dev workflow) | Umbrella for workflow redesign | No specific proposal for retrospective |
## Proposal: `kelos-retrospective` TaskSpawner
Add a new cron-based TaskSpawner that runs weekly, systematically analyzes PR outcomes, and produces either:
- Concrete prompt change issues (backed by statistical evidence), or
- A structured report when no actionable changes are found
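The output rule, file an issue only when there is statistical evidence, otherwise exit quietly, can be sketched as follows. The failure-mode labels and the 70% threshold come from the proposed prompt; the sample outcomes are illustrative:

```python
from collections import Counter

# Hypothetical weekly outcomes: (pr_number, failure_mode); None means merged.
outcomes = [
    (512, None),
    (497, "scope creep"),
    (493, "design disagreement"),
    (489, "scope creep"),
]

merged = sum(1 for _, mode in outcomes if mode is None)
merge_rate = merged / len(outcomes)
modes = Counter(mode for _, mode in outcomes if mode is not None)

# File an issue only when a failure mode recurs (2+ occurrences) or the
# merge rate falls below the 70% threshold from the proposed prompt.
actionable = [mode for mode, count in modes.items() if count >= 2]
should_file_issue = bool(actionable) or merge_rate < 0.70
print(should_file_issue, actionable)
```

In this sample, "scope creep" recurs and the merge rate is 25%, so the retrospective would file an issue; with no recurring modes and a merge rate above 70%, it would exit without output.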
Proposed config: `self-development/kelos-retrospective.yaml`:
````yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: kelos-retrospective
spec:
  when:
    cron:
      schedule: "0 0 * * 1" # Weekly on Monday at midnight UTC
  maxConcurrency: 1
  taskTemplate:
    workspaceRef:
      name: kelos-agent
    model: opus
    type: claude-code
    ttlSecondsAfterFinished: 864000
    credentials:
      type: oauth
      secretRef:
        name: kelos-credentials
    podOverrides:
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
          ephemeral-storage: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
          ephemeral-storage: "2Gi"
    agentConfigRef:
      name: kelos-dev-agent
    promptTemplate: |
      You are a retrospective analyst for the Kelos self-development loop.
      Your job is to measure the effectiveness of agent-generated PRs and
      identify evidence-backed improvements to the worker prompt.

      ## Step 1: Collect PR outcome data (last 7 days)

      Fetch all recent kelos-generated PRs:

      ```
      gh pr list --state all --label generated-by-kelos --limit 50 --json number,title,state,mergedAt,closedAt,body,labels
      ```

      For each PR, classify it:

      - **Merged**: Successfully contributed to the project
      - **Closed without merge**: Rejected — investigate why

      For each closed PR, read its review comments to categorize the rejection:

      ```
      gh api repos/{owner}/{repo}/pulls/{number}/reviews
      gh api repos/{owner}/{repo}/pulls/{number}/comments
      gh pr view {number} --comments
      ```

      Categorize rejections into failure modes:

      - **Scope creep**: Agent added unrequested features
      - **Design disagreement**: Maintainer wanted a different approach
      - **Duplicate/existing code**: Agent created something that already existed
      - **Format/convention violation**: PR description, commit format, etc.
      - **Quality issue**: Tests missing, code incorrect, review feedback ignored
      - **Not actionable**: Issue was not suitable for autonomous agent work
      - **Stale/superseded**: Another PR addressed the same issue first

      ## Step 2: Compute metrics

      Calculate:

      - Total PRs created this week
      - Merge rate (merged / total)
      - Rejection rate by failure mode
      - Average reset count (how many /reset-worker cycles per merged PR)
      - Compare to previous weeks if data is available from prior retrospective issues

      ## Step 3: Identify actionable patterns

      For each failure mode with 2+ occurrences:

      - Read the current worker prompt in `self-development/kelos-workers.yaml`
      - Check if the prompt already addresses this failure mode
      - If not, propose a specific prompt addition with exact wording

      For merged PRs that required multiple resets:

      - What did the agent miss on the first attempt?
      - Could the prompt be clearer about that scenario?

      ## Step 4: Output

      If you find actionable improvements (new failure patterns not yet
      addressed in the prompt, or evidence that previous prompt changes
      didn't help), create a GitHub issue with:

      - Title: "Retrospective: [week date range] — [key finding]"
      - Body with: metrics summary, failure mode breakdown, specific prompt changes

      ```
      gh issue create --title "..." --body "..." --label generated-by-kelos
      ```

      If all failure modes are already addressed in the prompt and merge rate
      is above 70%: exit without creating an issue. Not every run needs output.

      ## Constraints

      - Only analyze PRs from the last 7 days (avoid re-analyzing old data)
      - Do NOT create PRs — only create issues with proposals
      - Check existing issues first to avoid duplicates: `gh issue list --label generated-by-kelos --limit 20`
      - Be specific: cite PR numbers, quote review comments, propose exact prompt wording
      - Do not create vague "we should improve X" issues — every proposal must include concrete text changes
  pollInterval: 1m
````

## Why this is different from kelos-self-update
| Aspect | kelos-self-update | kelos-retrospective |
|---|---|---|
| Frequency | Daily | Weekly |
| Scope | Reviews config files for drift/best practices | Analyzes PR outcomes for effectiveness |
| Input data | Self-development YAML files, recent agent activity | Merged/closed PR data, review comments, rejection reasons |
| Output | Config alignment issues | Evidence-backed prompt changes with metrics |
| Method | Read configs → compare to conventions | Collect data → compute metrics → identify patterns → propose changes |
| Tracks progress | No | Yes (compares metrics to previous weeks) |
## Expected impact
- Continuous merge rate tracking: The 53% merge rate becomes a monitored metric that the team can see trending up or down week-over-week
- Evidence-backed prompt changes: Instead of one-off analyses like Workflow: Harden kelos-workers prompt to reduce closed-without-merge rate #508, every prompt improvement is backed by data from recent PR outcomes
- Faster feedback: When a new prompt change (like Harden kelos-workers prompt to reduce closed-without-merge rate #510) is deployed, the next weekly retrospective can measure whether it actually improved merge rates
- Pattern detection at scale: As the volume of agent PRs grows, manual analysis becomes impractical; an automated retrospective scales with the pipeline
- Reduced wasted compute: Each rejected PR costs $5-50+ in compute (opus tasks). A 10% improvement in merge rate at current volumes saves ~$50-500/week
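The savings figure in the last bullet is back-of-envelope. A sketch of the arithmetic, where weekly PR volume is an explicit assumption rather than a measured value (the quoted ~$50-500/week range corresponds to roughly ten avoided rejections per week):

```python
def weekly_savings(prs_per_week: float, merge_rate_gain: float,
                   cost_low: float = 5.0, cost_high: float = 50.0):
    """Dollar range saved by converting rejected PRs into merged ones.

    cost_low/cost_high are the $5-50+ per-rejected-PR compute costs cited
    above; prs_per_week and merge_rate_gain are assumptions.
    """
    avoided = prs_per_week * merge_rate_gain  # rejections avoided per week
    return avoided * cost_low, avoided * cost_high

# Ten avoided rejections per week reproduces the issue's quoted range.
low, high = weekly_savings(prs_per_week=100, merge_rate_gain=0.10)
print(f"~${low:.0f}-{high:.0f}/week")  # → ~$50-500/week
```

At lower volumes the savings scale down proportionally, which is worth keeping in mind when reading the headline range.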
## Implementation notes

- The TaskSpawner is self-contained — no code changes needed, just a new YAML file in `self-development/`
- Follows the same pattern as the existing cron-based spawners (fake-strategist, fake-user, self-update)
- Uses the same workspace, credentials, and agentConfig as other self-dev agents
- Weekly cadence balances having enough data to analyze vs. not running too frequently
## Related issues
- Workflow: Harden kelos-workers prompt to reduce closed-without-merge rate #508 — One-time prompt hardening analysis (this proposal automates the same type of analysis continuously)
- Improve self-development workflow #495 — Improve self-development workflow (this proposal addresses one specific aspect: outcome measurement)
- Workflow: Triage-to-worker handoff loses structured context — add WorkItem metadata enrichment #455 — Triage-to-worker context enrichment (complementary: improves input quality, while this proposal measures output quality)
- API: Add cost/token Prometheus metrics and BudgetPolicy for production cost governance #355 — Cost metrics (complementary: tracks spend, while this proposal tracks effectiveness)