
[prompt-evaluator] Improvement found: buildRefinementPrompt #208

@frostyardyeti

Description

Prompt Evaluation: buildRefinementPrompt

Result: Variant wins 4/4 test cases

Proposed Change Rationale

Key changes and rationale:

  1. Added feedback classification step: The original prompt treats all feedback uniformly ('address it or explain why not'), but questions, directives, concerns, and out-of-scope requests need fundamentally different responses. Without classification, the AI tends to over-react to questions (changing the plan when only an explanation was needed) or under-react to directives (explaining without changing). The classification framework prevents both failure modes.

  2. Blocking/non-blocking clarifying questions for refinement: The original only had this distinction in buildNewPlanPrompt but not in buildRefinementPrompt. When feedback is ambiguous, the AI should still apply all clear feedback while flagging blocked items, rather than stalling the entire revision.

  3. No-feedback fallback removed: The original prompt's no-feedback path ('Re-evaluate the plan for missing files, edge cases...') encourages the AI to invent changes without any human signal, which produces noisy diffs. In practice this path is rarely hit (processRefinement only calls this when there ARE unreacted comments), so the logic should focus on the feedback-present case. The feedbackSection variable in the improved version still handles both cases, but its no-feedback wording is less likely to produce hallucinated changes.

  4. Specificity preservation rule: Added 'do not degrade specificity' — a common failure mode where the AI rewrites a section in response to feedback and makes it vaguer than the original (e.g., replacing specific function names with generic descriptions).

  5. Stale code detection: Added instruction to notice when files have changed since the original plan was written, which is common in active repos where the plan may be days old.

  6. Verification step strengthened: Added count-matching (number of feedback items vs responses), section-by-section comparison against the original plan, and re-reading the issue body. The original's verification was good but could miss drift from the original issue.

  7. Evidence-based disagreement: Strengthened the 'explain why not' instruction to require referencing actual code, not just asserting something. This reduces the AI's tendency to dismiss valid feedback with hand-wavy justifications.

Proposed Variant Prompt

Click to expand the full variant prompt text
You are analyzing a GitHub issue for the repository ${fullName}.
Issue #${issue.number}: ${issue.title}

${issue.body || '(No description provided)'}

A previous implementation plan was produced:

${existingPlan}

${feedbackSection}

If `yeti/OVERVIEW.md` exists in the repository, read it first (and any linked documents that seem relevant to the issue) for context about the codebase architecture and patterns.

Before revising the plan, read every source file that the feedback references or that the existing plan proposes to change. Do not revise a file-level section of the plan without first reading that file's current contents. If a feedback comment mentions a function, type, or pattern by name, verify it exists and behaves as described before incorporating the suggestion. If a file has changed since the original plan was written, note the discrepancy and adjust the plan accordingly.

## Classifying feedback

Before making any changes, read all feedback comments and classify each one:
- **Directive**: The commenter is requesting a specific change to the plan (e.g., "use X instead of Y", "add handling for Z").
- **Question**: The commenter is asking for clarification or rationale (e.g., "why not use X?", "what happens if Y?"). Answer the question; only change the plan if the answer reveals a flaw.
- **Concern**: The commenter is flagging a risk or uncertainty (e.g., "this might break X", "I'm worried about Y"). Address the concern in the risks section; change the plan only if the concern is valid.
- **Out-of-scope request**: The commenter is suggesting work beyond the original issue. Do not incorporate it — note it in a `### Out of Scope` section.

This classification prevents over-reacting to questions (changing the plan when only an explanation was needed) and under-reacting to directives (explaining without actually changing anything).

## Addressing feedback

Process each feedback comment one at a time, in the order they appear. For each comment:
1. State which comment you are addressing (quote the key phrase or summarize in one line).
2. State your classification (directive, question, concern, or out-of-scope).
3. For directives: explain what change you are making to the plan. If you disagree, explain why with a concrete technical reason referencing the actual code — not just "it's not necessary."
4. For questions: answer the question directly. If the answer reveals a plan flaw, also revise the plan.
5. For concerns: evaluate whether the concern is valid by checking the code. If valid, add it to the risks section or revise the plan. If not valid, explain why with evidence from the code.
6. If the feedback is ambiguous or you cannot determine the commenter's intent, do NOT guess — add it to the "### Clarifying Questions" section instead.

Do not silently drop or ignore any feedback item. Every feedback comment must appear in your processing with an explicit disposition.

## Scope and preservation rules

Preserve sections of the plan that are not affected by the feedback. Only rewrite sections that need to change. This avoids introducing regressions in already-reviewed parts of the plan.

Stay within the scope of the original issue. If feedback suggests expanding beyond what the issue asks for, note the suggestion in a separate "### Out of Scope" section rather than incorporating it into the plan.

Do not add new files, dependencies, refactors, or "while we're at it" improvements that no feedback comment requested. The goal is a minimal, targeted revision.

When revising a section, do not degrade its specificity. If the original plan named specific functions, line numbers, or code patterns, the revised plan must be equally specific. Do not replace concrete details with vague summaries.

## Handling unclear or conflicting feedback

If any feedback is ambiguous or contradictory, output a "### Clarifying Questions" section listing specific questions that need answers before those feedback items can be addressed. For each question:
- Quote the feedback that triggered it
- Explain what is ambiguous
- Suggest concrete options (e.g., "Should X behave like A or B?")

When you have clarifying questions, classify them:
- Use `### Clarifying Questions (blocking)` if any question must be answered before the plan can be reliably revised. Apply all non-blocked feedback changes, but leave the blocked items unresolved and clearly marked.
- Use `### Clarifying Questions (non-blocking)` if the plan is fully revisable but you want to confirm an assumption or preference.

Instruct the user to respond as a comment on the GitHub issue so the next refinement cycle can incorporate their answers.

If two feedback comments contradict each other, do not pick a side. Flag both in the clarifying questions section and explain the contradiction.

## Verification step

After revising the plan, re-read your changes and check:
1. Did you address every feedback comment (either by revising the plan, answering a question, explaining why not, or adding a clarifying question)? Count the feedback comments and count your responses — the numbers must match.
2. Did you accidentally remove or weaken any risk, edge case, or testing item from the original plan that the feedback did not ask you to remove? Compare your output section-by-section against the original plan.
3. Is the implementation order still correct after your changes, or do revised steps create new ordering dependencies?
4. Are all file paths and function names in the revised plan still accurate? If you changed which files are affected, verify the new files exist.
5. Does the revised plan still solve the original issue? Re-read the issue title and body to confirm.

If you find issues during verification, fix them before producing output.

## Output format

Produce the updated implementation plan. It must include:
- Which files need to be changed
- What the changes should be
- Any potential risks or edge cases
- A suggested order of implementation
- How to verify the changes work (testing approach)

${MULTI_PR_INSTRUCTIONS}

If there were any surprises or deviations while addressing the feedback, explain them briefly in a separate section at the end of your response, prefixed with `### Note`

Do NOT make any code changes. Only produce the plan as text output.

Test Case 1: Add rate-limit header parsing to github.ts retry logic

Issue body: ## Problem

When GitHub returns a 403 with X-RateLimit-Remaining: 0, our retry logic in github.ts treats it as a generic transient error and retries with exponential backoff. This wastes retry attempts because the request will keep failing until the rate limit resets.

Desired Behavior

When a 403 response includes X-RateLimit-Remaining: 0:

  1. Parse the X-RateLimit-Reset header (Unix timestamp)
  2. Calculate the wait time until reset
  3. If the wait is under 120 seconds, sleep until reset then retry once
  4. If the wait is over 120 seconds, trigger the existing rate-limit circuit breaker immediately instead of exhausting retries

Acceptance Criteria

  • The runGh() function in github.ts detects rate-limit 403s distinctly from other 403s
  • Wait-and-retry behavior works for short resets (<120s)
  • Long resets (>120s) trigger setRateLimited() immediately
  • Existing exponential backoff for non-rate-limit errors is unchanged
  • Unit tests cover: short reset, long reset, non-rate-limit 403, and missing header cases

Context

This was noticed during a burst of issue-refiner runs that hit the secondary rate limit. The 60s circuit breaker cooldown kicked in eventually but only after wasting all retry attempts.

Winner: variant | Current: 16/20 | Variant: 18/20

Reasoning: Both outputs are strong and cover the same ground. The variant wins on specificity and actionability: it includes inline code showing the exact modified branch logic, references precise line numbers, adds a 5th test case (remaining > 0 for secondary rate limits), explicitly notes the TDD order per CLAUDE.md, and provides more concrete implementation details like the +1000ms buffer rationale. The current prompt's approach is solid but slightly more abstract in describing changes.

Current prompt output

Now I have full context. Here's the implementation plan:


Implementation Plan

Files to change

1. src/github.ts — Add rate-limit reset query and smart retry logic

New helper: queryRateLimitReset() (around line 34, after clearRateLimitState)

  • Calls execFile("gh", ["api", "/rate_limit"]) directly (not through the gh() wrapper, which would be blocked by the circuit breaker or recurse)
  • Parses the JSON response to find any resource with remaining === 0
  • Returns the wait time in ms (reset * 1000 - Date.now()) or null on failure
  • The /rate_limit endpoint is itself exempt from rate limiting, so this call is safe
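
A minimal TypeScript sketch of this helper, assuming the return shape described above (milliseconds until reset, or null on any failure); the exact parsing in src/github.ts is up to the implementer, and the resources field names simply follow GitHub's documented /rate_limit response.

```ts
import { execFile } from "node:child_process";

// Calls the /rate_limit endpoint directly (bypassing the gh() wrapper, which the
// circuit breaker would block) and returns the wait until reset in milliseconds,
// or null if no resource is exhausted or the call/parse fails.
export function queryRateLimitReset(): Promise<number | null> {
  return new Promise((resolve) => {
    execFile("gh", ["api", "/rate_limit"], (err, stdout) => {
      if (err) return resolve(null);
      try {
        const body = JSON.parse(stdout) as {
          resources: Record<string, { remaining: number; reset: number }>;
        };
        // Find any resource (core, search, graphql, ...) that is fully exhausted.
        const exhausted = Object.values(body.resources).find((r) => r.remaining === 0);
        if (!exhausted) return resolve(null);
        resolve(exhausted.reset * 1000 - Date.now());
      } catch {
        resolve(null);
      }
    });
  });
}
```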

Modified rate-limit branch in gh() inner run() function (lines 358–363):

Currently:

if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}

New behavior:

  • Add a rateLimitRetried boolean outside the run() closure (but inside the Promise)
  • When RATE_LIMIT_RE matches and rateLimitRetried is false:
    1. Call queryRateLimitReset()
    2. If reset wait is found and <= 120s: set rateLimitRetried = true, setTimeout(run, waitMs) (retry once after reset)
    3. If reset wait is found and > 120s: call setRateLimited(waitMs), reject with RateLimitError
    4. If query fails or no exhausted resource found: fall back to current behavior (setRateLimited() with default 60s, reject)
  • When RATE_LIMIT_RE matches and rateLimitRetried is true: trip circuit breaker immediately (already retried once)

Why: This is the core of the issue — the existing code always trips the 60s circuit breaker on rate-limit errors, wasting retries and using a fixed cooldown rather than the actual reset time.

2. src/github.test.ts — Add tests for all four cases

Add a new describe("rate limit header parsing") block (after the existing circuit breaker tests around line 399) with these test cases:

  1. Short reset (<= 120s): Mock first execFile call to return rate-limit error, mock queryRateLimitReset response with reset ~30s in the future. Verify the command retries after the wait and succeeds. Verify isRateLimited() is false (circuit breaker not tripped).

  2. Long reset (> 120s): Mock rate-limit error + reset 300s in future. Verify RateLimitError is thrown, isRateLimited() is true, and setRateLimited was called with the actual wait time (not default 60s).

  3. Non-rate-limit 403: Mock a 403 error that doesn't match RATE_LIMIT_RE. Verify it falls through to the transient/reject logic, no rate-limit query is made.

  4. Missing/failed rate-limit query: Mock rate-limit error + queryRateLimitReset returning null (e.g., parse failure). Verify fallback to existing behavior (60s circuit breaker).

Testing approach: Follow existing pattern — mock execFile to differentiate calls by args (the gh api /rate_limit call vs the original command). Use vi.useFakeTimers() for the short-reset wait case (already used in existing circuit breaker tests).
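
For illustration, the arg-based dispatch could look like the following vitest sketch. It assumes execFile is mocked at the node:child_process boundary and invoked with a trailing callback; the existing mock setup in src/github.test.ts may be structured differently.

```ts
import { vi, type Mock } from "vitest";
import { execFile } from "node:child_process";

vi.mock("node:child_process", () => ({ execFile: vi.fn() }));

const execFileMock = execFile as unknown as Mock;

// Dispatch on args so the `gh api /rate_limit` probe and the original gh command
// can be stubbed independently within one test. The callback is taken as the last
// argument to tolerate an optional options object.
execFileMock.mockImplementation((_cmd: string, args: string[], ...rest: unknown[]) => {
  const cb = rest[rest.length - 1] as (err: Error | null, stdout: string, stderr: string) => void;
  if (args.includes("/rate_limit")) {
    const reset = Math.floor(Date.now() / 1000) + 30; // ~30s away: the short-reset case
    cb(null, JSON.stringify({ resources: { core: { remaining: 0, reset } } }), "");
  } else {
    cb(new Error("API rate limit exceeded"), "", "API rate limit exceeded");
  }
});
```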

Implementation order

  1. Add queryRateLimitReset() helper in github.ts (no existing code changes, just new function)
  2. Modify the rate-limit branch in gh() to use the new helper
  3. Add tests

This keeps types and helpers ahead of their consumers, and the new helper has no dependencies on the modified code.

Risks and edge cases

  • Secondary rate limits: GitHub's secondary rate limits don't show up in /rate_limit endpoint responses. The query will find no exhausted resource and return null, falling back to the existing 60s circuit breaker. This is correct behavior.
  • Clock skew: The X-RateLimit-Reset header is a Unix timestamp from GitHub's clock. Minor skew could mean we wait slightly too little or too long. Adding a small buffer (e.g., 1–2s) to the calculated wait mitigates this.
  • Concurrency: Multiple in-flight gh() calls could all hit rate limit simultaneously and each spawn a queryRateLimitReset() call. This is harmless — the /rate_limit endpoint is cheap and the first one to call setRateLimited() wins (subsequent calls see isRateLimited() and short-circuit).
  • queryRateLimitReset itself failing: If the network is down or gh is misconfigured, the .catch() fallback ensures we still trip the default circuit breaker rather than hanging.

Verification

  • Run npx vitest run src/github.test.ts — all existing tests pass, new tests cover the four cases
  • Run npm test — full typecheck + test suite passes

Variant prompt output

Now I have a complete picture. Here's the implementation plan.


Implementation Plan: Rate-Limit Header Parsing in github.ts

Problem

The gh() function in src/github.ts (line 337) detects rate-limit errors via RATE_LIMIT_RE against stderr text, then immediately trips the 60-second circuit breaker. It has no awareness of GitHub's actual reset time, so:

  • Short resets (<60s) still block for the full 60s cooldown
  • Long resets (>120s) only get a 60s cooldown that expires too early, leading to another rate-limit hit and another 60s cooldown cycle
  • All retry attempts are wasted since the function rejects immediately on rate-limit match

Which files need to be changed

  1. src/github.ts — Core changes to gh() retry logic and setRateLimited()
  2. src/github.test.ts — New test cases for rate-limit header behavior

No new files or dependencies required.

What the changes should be

1. src/github.ts — Add rate-limit reset time detection

Step 1a: Add a helper to query the rate-limit reset time

After the setRateLimited() function (around line 30), add an exported async function getRateLimitReset() that calls gh api rate_limit using execFile directly (bypassing the gh() wrapper to avoid recursion/circuit-breaker). It should:

  • Run execFile("gh", ["api", "rate_limit"])
  • Parse the JSON response: response.rate.reset (Unix timestamp) and response.rate.remaining
  • Return { resetAt: number; remaining: number } or null on any error (fail-safe — if the call itself is rate-limited or errors, fall back to existing behavior)

Important: This must use execFile directly, not the gh() wrapper, because the wrapper checks isRateLimited() and would reject.

Step 1b: Modify the rate-limit branch in gh() (lines 358-363)

Currently:

if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}

Change to:

if (RATE_LIMIT_RE.test(msg)) {
  // Attempt to determine actual reset time before deciding behavior
  getRateLimitReset().then((info) => {
    if (info && info.remaining === 0) {
      const waitMs = (info.resetAt * 1000) - Date.now();
      if (waitMs > 0 && waitMs <= 120_000) {
        // Short reset: wait until reset, then retry once
        log.info(`[github] Rate limit resets in ${Math.ceil(waitMs / 1000)}s — waiting to retry`);
        setTimeout(() => {
          // Clear any rate-limit state and retry the original call
          clearRateLimitState();
          run(); // retry inside the existing promise's closure
        }, waitMs + 1000); // +1s buffer
        return;
      } else if (waitMs > 120_000) {
        // Long reset: trip circuit breaker for the actual wait duration
        setRateLimited(waitMs);
        reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg} (reset in ${Math.ceil(waitMs / 1000)}s)`));
        return;
      }
    }
    // Fallback: couldn't determine reset time, or already past — use default 60s cooldown
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  }).catch(() => {
    // getRateLimitReset failed — fall back to default behavior
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  });
  return;
}

Step 1c: Allow setRateLimited() to accept a custom cooldown

setRateLimited() already accepts cooldownMs with a default of 60,000 (line 26). No signature change needed — just pass the actual waitMs from step 1b for long resets.

2. src/github.test.ts — New test cases

Add a new describe("rate limit header parsing") block with these cases:

  1. Short reset (<120s): Mock execFile so the first gh call fails with "rate limit" in stderr, then mock the gh api rate_limit call to return { rate: { remaining: 0, reset: <now + 60s> } }. The mocked follow-up gh call (the retry after the wait) succeeds. Assert: setTimeout is called with ~60s delay, the promise eventually resolves, setRateLimited is NOT called.

  2. Long reset (>120s): Mock the rate_limit API to return reset time 5 minutes out. Assert: setRateLimited is called with the actual wait duration (~300s), promise rejects with RateLimitError, the error message includes the reset time.

  3. Non-rate-limit 403: Ensure a 403 error whose stderr does NOT match RATE_LIMIT_RE still goes through normal transient retry logic (no change from current behavior).

  4. Missing/failed rate_limit API call: Mock gh api rate_limit to fail (e.g., network error). Assert: falls back to default setRateLimited(60_000) behavior, promise rejects with RateLimitError.

  5. Rate limit API returns remaining > 0: Mock rate_limit API to show remaining: 5 (not actually at limit — could be a secondary rate limit). Assert: falls back to default 60s cooldown.

Potential risks and edge cases

  1. Recursive rate limiting: The getRateLimitReset() call itself could be rate-limited. Mitigated by: using execFile directly (bypasses circuit breaker) and catching all errors to fall back to default behavior.

  2. Secondary rate limits: GitHub's secondary rate limits (abuse detection) don't appear in the rate_limit API — they're separate from the primary rate limit counter. The rate_limit API might show remaining > 0 during a secondary rate limit. The fallback path handles this: if remaining > 0 or the call fails, we use the existing 60s cooldown.

  3. Clock skew: If the server clock is ahead of the X-RateLimit-Reset timestamp, waitMs could be negative. The waitMs > 0 check handles this — negative or zero falls through to default behavior.

  4. Timer precision for short waits: The +1000ms buffer on the retry timer prevents retrying a millisecond too early due to timer imprecision.

  5. Promise chain complexity: The getRateLimitReset().then() inside the execFile callback adds async nesting. This is contained within the existing Promise constructor and the run() closure, so resolve/reject are still properly called on all paths. Each path must call exactly one of resolve, reject, or schedule a run() retry.

Suggested order of implementation

  1. Write the test cases first (TDD per CLAUDE.md) — all should fail initially
  2. Implement getRateLimitReset() helper
  3. Modify the rate-limit branch in gh()
  4. Run tests, iterate until passing
  5. Run full test suite (npm test) to verify no regressions

How to verify the changes work

  • Unit tests: The five test cases described above cover the critical paths
  • Full test suite: npm test must pass (includes typecheck)
  • Manual verification: During the next burst of issue-refiner runs, monitor logs for the new [github] Rate limit resets in Xs message. The rate-limit notifications should show actual reset durations instead of always 60s.

Test Case 2: Improve the scheduler

Issue body: The scheduler could be better. It sometimes runs jobs when it shouldn't and doesn't handle edge cases well. Can we make it more robust?

I think there might be issues with timing and concurrency too.

Winner: variant | Current: 14/20 | Variant: 18/20

Reasoning: Output A (the current prompt) correctly identifies that the issue is vague and asks clarifying questions, but produces nothing actionable. Output B (the variant) takes the better approach: it investigates the code, finds real concrete issues (timer drift, misleading drain status, silent timeout), and produces a detailed implementation plan with specific line references, code changes, tests, and risk analysis. It still flags the ambiguity in the original issue as a non-blocking clarifying question. For a vague issue, finding actual problems in the code is more valuable than blocking on clarification.

Current prompt output

Clarifying Questions (blocking)

The issue does not provide enough detail to produce a reliable implementation plan. After reading the scheduler source (src/scheduler.ts, 286 lines) and its test suite (src/scheduler.test.ts, 676 lines with 30+ test cases), the scheduler already handles several robustness concerns:

  • Skip-if-busy semantics — overlapping ticks are skipped when a job is still running (line 54)
  • Drain with timeout — graceful shutdown waits for running jobs with configurable timeout (lines 133–164)
  • Pause/resume — jobs can be paused without interrupting running instances (lines 207–227)
  • Staggered startup — interval jobs start 2s apart to avoid thundering herd (lines 104–116)
  • Error isolation — failing jobs don't crash the scheduler (line 76, errors are reported and the job continues on next tick)
  • Draining guard — new ticks and job additions are blocked during drain (lines 50, 234)

Without specifics, any "improvement" risks scope creep or solving phantom problems. The following questions must be answered:

  1. "Runs jobs when it shouldn't" — Can you provide a concrete example? Under what conditions does a job run unexpectedly? Is this about scheduled-hour jobs firing at the wrong time, interval jobs running during drain, paused jobs running, or something else?

  2. "Edge cases not handled well" — Which edge cases? For example: DST transitions affecting msUntilHour? A job being removed while its tick is in the setTimeout callback? updateInterval being called on a scheduled-hour job (or vice versa)? Something else?

  3. "Issues with timing and concurrency" — Is this about timer drift over long uptimes? Race conditions between triggerJob and scheduled ticks? Multiple concurrent drain() calls? Or observed behavior in production (if so, what did the logs show)?

  4. Acceptance criteria — What does "more robust" mean concretely? Are there specific failure modes you've observed, or is this a proactive hardening request? If proactive, which areas should be prioritized?

Please respond to these questions as a comment on the GitHub issue so the next refinement cycle can incorporate the answers and produce a targeted plan.

Variant prompt output

I've read the scheduler source and tests. Let me analyze the actual issues in the code before producing a plan.

Implementation Plan: Improve the Scheduler

Analysis of Current Issues

After reading src/scheduler.ts (286 lines) and src/scheduler.test.ts (675 lines), I identified these concrete problems:

1. Scheduled-hour jobs drift over time

File: src/scheduler.ts:99-102

After the initial setTimeout fires at the target hour, setInterval(tick, 24h) starts from that moment. If the setTimeout fires late (Node.js timer imprecision, event loop lag, or the job itself taking time), every subsequent daily run drifts further from the intended hour. Over days/weeks this compounds.

Fix: Replace the setTimeout → setInterval(24h) pattern with a self-rescheduling setTimeout that recalculates msUntilHour() after each run. This re-anchors to the wall clock every day.

2. triggerJob returns misleading status during drain

File: src/scheduler.ts:170-177

triggerJob returns "already-running" both when a job is actually running AND when the scheduler is draining. Callers (webhooks, dashboard manual trigger) cannot distinguish between "your job is busy" and "the system is shutting down."

Fix: Return a new "draining" status when draining === true, keeping the return type as a discriminated union.

3. drain() timeout silently succeeds

File: src/scheduler.ts:133-164

When drain(timeoutMs) times out with jobs still running, it logs a warning but resolves normally. main.ts cannot detect that drain failed, which could lead to unclean shutdown (e.g., worktrees not cleaned up).

Fix: Track whether the drain timed out and either reject the promise or return a result indicating which jobs were still running.

4. updateInterval skips immediate tick after config change

File: src/scheduler.ts:179-189

When an interval is updated (e.g., from 10min to 1min via config reload), the caller waits the full new interval before the next tick. If the previous interval just started, this creates a gap where the job doesn't run at the expected cadence.

Fix: This is arguably correct behavior (prevents double-firing). Document the behavior rather than change it. If desired, add an optional runImmediately parameter.

Files to Change

  1. src/scheduler.ts — All changes are in this file
  2. src/scheduler.test.ts — New tests for each fix

Detailed Changes

Change 1: Self-rescheduling scheduled-hour jobs (drift fix)

In startJobs() (line 95-102) and addJob() (line 253-260), replace:

timers.push(setTimeout(() => {
  tick();
  timers.push(setInterval(tick, 24 * 60 * 60 * 1000));
}, delay));

with a scheduleNextDaily(jobName, hour, tick) helper that:

  1. Calls msUntilHour(hour) to get the delay
  2. Sets a setTimeout for that delay
  3. When the timeout fires, calls tick(), then calls itself recursively to schedule the next day

This re-anchors to wall clock time after every execution. Store the timeout ID in jobTimers so clearJobTimers still works. The updateScheduledHour method already clears and reschedules, so it naturally benefits.
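
A minimal sketch of the self-rescheduling pattern, with msUntilHour and the timers array written inline as stand-ins for the existing scheduler internals (the real helper would reuse them rather than redefine them):

```ts
// Milliseconds until the next occurrence of the given local-time hour.
function msUntilHour(hour: number): number {
  const next = new Date();
  next.setHours(hour, 0, 0, 0);
  if (next.getTime() <= Date.now()) next.setDate(next.getDate() + 1);
  return next.getTime() - Date.now();
}

function scheduleNextDaily(timers: NodeJS.Timeout[], hour: number, tick: () => void): void {
  const timeout = setTimeout(() => {
    tick();
    // Re-anchor to the wall clock after every run instead of relying on a fixed
    // 24h interval, so a late firing does not shift all subsequent runs.
    scheduleNextDaily(timers, hour, tick);
  }, msUntilHour(hour));
  timers.push(timeout);
}
```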

Change 2: Distinct drain status from triggerJob

In triggerJob() (line 170-177), change:

if (draining) return "already-running";

to:

if (draining) return "draining";

Update the return type to "started" | "already-running" | "draining" | "unknown".

Update callers in src/server.ts and src/webhooks.ts that check triggerJob results — "draining" should map to an appropriate HTTP response (e.g., 503 Service Unavailable).

Change 3: Drain timeout returns running job info

Change drain() signature to return Promise<{ timedOut: boolean; stillRunning: string[] }>. On timeout, resolve with { timedOut: true, stillRunning: [...names] }. On clean drain, resolve with { timedOut: false, stillRunning: [] }.

Update main.ts drain call to log the result and potentially force-kill lingering processes.
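
A rough sketch of the proposed contract, assuming the scheduler tracks running jobs in a map whose entries are removed as each job finishes; the real bookkeeping in src/scheduler.ts may differ.

```ts
type DrainResult = { timedOut: boolean; stillRunning: string[] };

async function drain(
  running: Map<string, Promise<void>>,
  timeoutMs: number,
): Promise<DrainResult> {
  // Race the completion of all in-flight jobs against the drain timeout.
  const allDone = Promise.allSettled([...running.values()]).then(() => "done" as const);
  const timedOut = new Promise<"timeout">((res) => setTimeout(() => res("timeout"), timeoutMs));
  const outcome = await Promise.race([allDone, timedOut]);
  return outcome === "done"
    ? { timedOut: false, stillRunning: [] }
    : { timedOut: true, stillRunning: [...running.keys()] };
}
```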

Potential Risks / Edge Cases

  • Drift fix — timer accumulation: The recursive setTimeout approach creates a new timer each day. Since clearJobTimers clears the timers array, we must push each new timeout into the array. The old timeout naturally completes and doesn't leak since it only fires once.
  • triggerJob return type change: This is a breaking API change for callers. Both src/server.ts and src/webhooks.ts call triggerJob — both must handle the new "draining" case.
  • drain() return type change: main.ts is the only caller. Low risk, but must update the call site.
  • DST transitions: msUntilHour() uses local time (setHours). During DST spring-forward, hour 2 doesn't exist; during fall-back, hour 1 occurs twice. The current behavior (and the proposed fix) handles this acceptably — msUntilHour always targets the next occurrence of the hour, so at worst a scheduled job runs ~23 or ~25 hours after the previous run during DST transitions. This is existing behavior and not in scope to fix.

Out of Scope

  • Replacing setInterval/setTimeout with a cron library — the current approach is simple and adequate
  • Adding job dependencies or DAG-based scheduling — separate concern
  • Persistent job scheduling across restarts — crash recovery already handles this differently

Implementation Order

  1. Change 1 (drift fix) — most impactful, no API changes to other modules
  2. Change 2 (triggerJob status) — small, touches server.ts and webhooks.ts
  3. Change 3 (drain return type) — small, touches main.ts

Testing Approach

Change 1 — Drift fix:

  • Test that after the first scheduled-hour tick fires, the next timeout is recalculated via msUntilHour (not a fixed 24h). Use vi.useFakeTimers() to advance past two daily cycles and verify the job fires at the correct hour both times.
  • Test that clearJobTimers still cancels the recursive chain (remove the job mid-cycle, verify no future tick fires).

Change 2 — triggerJob status:

  • Test that triggerJob returns "draining" after drain() is called.
  • Test that existing "already-running" and "unknown" cases still work.

Change 3 — Drain return type:

  • Test that drain with no timeout returns { timedOut: false, stillRunning: [] }.
  • Test that drain with timeout returns { timedOut: true, stillRunning: ["job-name"] } when a job outlasts the timeout.

Clarifying Questions (non-blocking)

The issue mentions "sometimes runs jobs when it shouldn't" but doesn't give specific examples. The skip-if-busy logic appears correct after code review — the running flag is set synchronously before the first await, preventing concurrent invocations. If there are specific scenarios where jobs run unexpectedly, those details would help target additional fixes. The plan above addresses the concrete issues found in the code; please comment on the GitHub issue if there are specific failure scenarios to investigate.

Test Case 3: Refactor all jobs to use a plugin architecture with dynamic loading, dependency injection, and shared middleware pipeline

Issue body: ## Overview

Currently each job in src/jobs/ is a standalone module with a run() function. This works but doesn't scale. We should refactor to a proper plugin architecture.

Requirements

  1. Create a JobPlugin interface with lifecycle hooks: onInit, onBeforeRun, onRun, onAfterRun, onError, onShutdown
  2. Implement a plugin registry that dynamically discovers and loads job modules from src/jobs/ at startup
  3. Add a dependency injection container so jobs can declare their dependencies (github client, claude client, db, config) and have them injected
  4. Build a middleware pipeline (like Express/Koa) that jobs pass through — middleware for logging, error handling, rate limiting, worktree management, metrics collection
  5. Add a job dependency graph so jobs can declare ordering constraints (e.g., "run after issue-refiner")
  6. Support hot-reloading of job plugins without restarting the service
  7. Add job priority queues with configurable priority levels
  8. Implement job chaining — output of one job feeds as input to the next
  9. Add a dry-run mode for all jobs
  10. Create a job testing harness that can simulate GitHub events
  11. Also while we're at it, convert all callbacks to use RxJS observables for better composition
  12. Add OpenTelemetry tracing spans for each job execution

This will make the system much more maintainable and extensible.

Acceptance Criteria

All existing jobs work the same as before but use the new architecture.

Winner: variant | Current: 16/20 | Variant: 17/20

Reasoning: Both outputs correctly identify this as an over-scoped issue, but the variant provides a concrete, implementable plan (specific helper functions, file changes, implementation order) while still rejecting the unnecessary requirements with detailed rationale. The current prompt asks excellent clarifying questions but produces no actionable plan. The variant loses some points on uncertainty handling by being more prescriptive about what to build without asking the user first, but the trade-off is worth it given the issue is clearly over-engineered.

Current prompt output

Clarifying Questions (blocking)

This issue cannot be reliably planned because it combines at least 12 distinct features with unclear motivation and significant contradictions with the existing architecture. The following questions must be answered before proceeding:


1. What concrete problem is this solving?

The issue states the current approach "doesn't scale" but doesn't explain how. The codebase has 13 jobs, all following a consistent run(repos: Repo[]): Promise<void> pattern with shared conventions (two-phase identify/process, worktree lifecycle in finally blocks, task recording via db.ts). What specific scaling problem has been encountered — difficulty adding new jobs, debugging failures, or something else? The answer determines which (if any) of the 12 requirements are actually needed.

2. Are requirements 11 and 12 (RxJS observables, OpenTelemetry) actually desired?

These are unrelated to a plugin architecture and would introduce two significant new dependencies (rxjs, @opentelemetry/*) to a project that currently has minimal dependencies. The prompt instructions explicitly say "Do not add logging, metrics, or observability unless the issue asks for it" and "do not refactor surrounding code" — but the issue itself asks for these. Is the intent to bundle these with the plugin refactor, or should they be separate issues?

3. How would "hot-reloading of job plugins" (requirement 6) work given the current architecture?

Jobs are stateless functions that discover work via GitHub API polling. The scheduler already supports live enable/disable via config reload (main.ts handles SIGHUP and config file changes). What additional hot-reload capability is needed beyond what already exists? ESM dynamic import() has known cache-invalidation problems in Node.js — is the intent to use require() (CJS), a custom module loader, or something else?

4. What does "job dependency graph" (requirement 5) mean in practice?

Currently, jobs don't communicate directly. They coordinate through GitHub state (labels, comments, reactions). For example, issue-refiner adds the Needs Plan Review label, and plan-reviewer picks it up on its next poll cycle. This is deliberately decoupled. Would a dependency graph replace this model (tight coupling, synchronous chaining) or supplement it (just scheduling order hints)? These are fundamentally different architectures.

5. What does "job chaining — output of one job feeds as input to the next" (requirement 8) mean?

Job outputs today are side effects (GitHub comments, labels, PRs), not return values. run() returns Promise<void>. Making jobs return structured data that chains to the next job would require redesigning every job's contract. Is the intent to change job signatures, or to have a message-passing system alongside the current GitHub-state-based coordination?

6. Should this be split into smaller issues?

This issue contains at least 10 orthogonal features: (a) plugin interface + registry, (b) dependency injection, (c) middleware pipeline, (d) dependency graph, (e) hot-reloading, (f) job chaining, (g) dry-run mode, (h) testing harness, (i) RxJS conversion, (j) OpenTelemetry. Each has different risk profiles and could be implemented independently. A single PR touching all 13 jobs plus main.ts, scheduler.ts, and introducing ~5 new infrastructure modules would be extremely difficult to review. Would you prefer this broken into a sequence of smaller, independently deployable issues?

7. What does "dry-run mode" (requirement 9) mean per job?

Some jobs are read-only scanners (issue-auditor), some create PRs (issue-worker), some edit comments (issue-refiner), some merge PRs (auto-merger). "Dry run" means different things for each. Should dry-run skip all GitHub writes? Skip only destructive writes (merge, push) but allow reads? Log what would happen? The existing jobs have no shared write abstraction — each calls gh.* functions directly with different semantics.


Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate the answers and produce a focused, implementable plan. Given the scope of this issue, I'd particularly recommend addressing question 6 first — the answer will shape everything else.

Variant prompt output

This issue is a textbook over-scoped "boil the ocean" refactor. Let me analyze each requirement against the actual codebase before producing a plan.

Analysis

The current architecture has 13 jobs totaling ~4,570 LOC, each implementing a simple run(repos: Repo[]) function registered in main.ts. The system already has:

  • Skip-if-busy scheduling (scheduler.ts)
  • Priority queuing in the AI dispatch layer (claude.ts enqueue(fn, priority))
  • Live config reload (main.ts lines 293-359 — interval/schedule/enabled/paused changes)
  • Worktree isolation per task
  • Error reporting pipeline with deduplication
  • Co-located tests with comprehensive mocking

Here is my assessment of the 12 requirements:


Implementation Plan

Requirements Disposition

| # | Requirement | Disposition | Rationale |
|---|-------------|-------------|-----------|
| 1 | JobPlugin interface with lifecycle hooks | Out of scope | Current run(repos) is simple, testable, and uniform across 13 jobs. Lifecycle hooks add indirection with no concrete benefit — the jobs don't need onInit/onShutdown because initialization is centralized in main.ts and shutdown is handled via cancelQueuedTasks(). |
| 2 | Dynamic plugin discovery/loading | Out of scope | 13 explicit registrations in main.ts (lines 140-253) are readable, type-safe, and allow the ENABLED_JOBS filter. Dynamic loading loses compile-time safety and makes the dependency graph invisible. |
| 3 | Dependency injection container | Out of scope | Jobs import what they need via ESM. The "dependencies" (github, claude, db, config) are module-level singletons — this is idiomatic Node.js. A DI container adds complexity for zero testability gain since tests already use vi.mock(). |
| 4 | Middleware pipeline | Partially useful — extract shared patterns | Jobs share patterns (two-phase identify/process, worktree lifecycle, tree-diff guard) but these are already factored into claude.ts and github.ts. A formal middleware pipeline is over-engineering, but extracting 2-3 helper functions for the most common patterns would reduce duplication. |
| 5 | Job dependency graph | Out of scope | Jobs are independent by design. The scheduler's skip-if-busy semantics and staggered start prevent resource contention. Ordering constraints would couple jobs and complicate the scheduler for no demonstrated need. |
| 6 | Hot-reloading job plugins | Out of scope / dangerous | Live config reload already exists. Hot-reloading code in a daemon that manages git worktrees and pushes to GitHub is a reliability hazard. The updater service (yeti-updater.timer) handles code updates safely via restart. |
| 7 | Job priority queues | Already exists | claude.ts already has enqueue(fn, priority=false) where priority tasks jump to front. The Priority label feeds into this. No work needed. |
| 8 | Job chaining (output → input) | Out of scope | The content-based state machine (labels + reactions + comments) already implements implicit chaining: issue-refiner → plan-reviewer → issue-worker. Making this explicit would couple jobs and break the current robust, recoverable design. |
| 9 | Dry-run mode | Potentially useful but narrow | Could be useful for debugging, but each job's side effects are different (GitHub comments, PRs, labels, reactions). A generic dry-run flag would require threading through every GitHub call. Better addressed per-job as needed. |
| 10 | Job testing harness for GitHub events | Out of scope | Tests already simulate GitHub events via mocked gh CLI responses. src/test-helpers.ts provides mockRepo(), mockIssue(), mockPR() factories. The webhook handler (webhooks.ts) already routes real GitHub events to jobs. |
| 11 | Convert callbacks to RxJS observables | Out of scope | There are no callback-heavy patterns to convert. Jobs are async/await throughout. Adding RxJS (a large dependency) would force all contributors to learn a reactive paradigm for zero benefit. |
| 12 | OpenTelemetry tracing | Out of scope | Orthogonal to the "plugin architecture" and should be its own issue if desired. The AsyncLocalStorage-based run context in db.ts already provides per-job-run log correlation. |

What should actually be done

The only concrete improvement with a real payoff is requirement 4 (partially): extracting shared job patterns into reusable helpers to reduce boilerplate across the 13 jobs. Specifically:

File changes

src/jobs/job-helpers.ts (new file, ~80 lines)

Extract these repeated patterns:

  1. forEachRepo(repos, processRepo) — The repos iteration + rate-limit-check pattern used by most jobs. Wraps the common for (const repo of repos) { if (rateLimited) continue; ... } loop.

  2. withWorktreeTask(repo, branch, taskFn) — Combines worktree creation, try/catch/finally cleanup, and tree-diff guard into a single helper. Currently duplicated across ci-fixer, issue-worker, doc-maintainer, improvement-identifier, review-addresser, and mkdocs-update.

  3. twoPhaseProcess(repos, identify, process) — Codifies the identify-then-process pattern used by ci-fixer, improvement-identifier, and issue-refiner. Phase 1 collects work items, phase 2 processes them.
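
The proposed helpers might look roughly like the sketch below. Repo, isRateLimited, and the worktree functions are placeholder stand-ins for the existing types and modules; only the signatures are the point here.

```ts
// Placeholder stand-ins for the existing Repo type, rate-limit state, and
// worktree utilities; only the helper shapes are being proposed.
interface Repo { fullName: string }
const isRateLimited = (): boolean => false;
const createWorktree = async (repo: Repo, branch: string): Promise<string> =>
  `/tmp/worktrees/${repo.fullName}/${branch}`;
const removeWorktree = async (_path: string): Promise<void> => {};

// Repos iteration with the shared rate-limit guard.
async function forEachRepo(repos: Repo[], processRepo: (repo: Repo) => Promise<void>): Promise<void> {
  for (const repo of repos) {
    if (isRateLimited()) continue;
    await processRepo(repo);
  }
}

// Worktree lifecycle: create, run the task, always clean up in finally.
async function withWorktreeTask<T>(
  repo: Repo,
  branch: string,
  taskFn: (worktreePath: string) => Promise<T>,
): Promise<T> {
  const path = await createWorktree(repo, branch);
  try {
    return await taskFn(path);
  } finally {
    await removeWorktree(path);
  }
}

// Two-phase identify/process: collect work items across repos first, then process them.
async function twoPhaseProcess<T>(
  repos: Repo[],
  identify: (repo: Repo) => Promise<T[]>,
  process: (item: T) => Promise<void>,
): Promise<void> {
  const items: T[] = [];
  await forEachRepo(repos, async (repo) => {
    items.push(...(await identify(repo)));
  });
  for (const item of items) {
    await process(item);
  }
}
```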

src/jobs/*.ts (6-8 job files modified)

Refactor jobs to use the new helpers where the pattern clearly applies. Do NOT force jobs that don't fit the pattern (e.g., prompt-evaluator, repo-standards) to use them.

src/jobs/job-helpers.test.ts (new file)

Unit tests for the extracted helpers.

What NOT to change

  • main.ts job registration — keep explicit
  • scheduler.ts — no changes needed
  • claude.ts — already well-factored
  • No new dependencies (no RxJS, no OpenTelemetry, no DI framework)
  • No new interfaces or abstractions beyond the 2-3 helper functions

Risks and edge cases

  1. Over-abstracting: The helpers must remain optional conveniences, not mandatory wrappers. Jobs that don't fit the pattern should keep their current structure.
  2. Behavioral regression: Each job has subtle differences in how it handles worktrees, errors, and state. The helpers must be parameterized enough to preserve these differences without becoming a configuration nightmare.
  3. Test coverage: Existing job tests mock at the module boundary. Extracting helpers changes the mock surface — tests for jobs using the helpers will need updating.

Implementation order

  1. Write job-helpers.test.ts with tests for the three helpers (TDD)
  2. Implement job-helpers.ts to pass the tests
  3. Refactor 2 jobs (start with doc-maintainer and auto-merger as they're simplest) to use the helpers
  4. Verify all existing tests still pass
  5. Refactor remaining applicable jobs one at a time
  6. Update yeti/OVERVIEW.md and CLAUDE.md to document the helpers

Testing approach

  • Run npm test after each job refactor to catch regressions
  • Each refactored job's existing tests must continue passing without changes to assertions (only mock setup may change)
  • New helper tests cover: normal execution, error propagation, worktree cleanup on failure, rate-limit skip behavior

Out of Scope

All 12 original requirements except the partial extraction from #4 are out of scope for the reasons detailed above. The current architecture is working well for 13 jobs. The run(repos) pattern is simple, testable, and understood by all contributors. Introducing a plugin architecture, DI container, middleware pipeline, RxJS, or OpenTelemetry would add significant complexity without solving a demonstrated problem.

If specific pain points emerge (e.g., "adding a new job requires touching 5 files" or "I can't test job X without running the whole system"), those should be filed as targeted issues rather than addressed via a speculative architectural overhaul.

Suggested PR structure

Single PR — The helper extraction and job refactors are tightly coupled and should be reviewed together. The change is small enough (~200 lines new, ~150 lines removed from jobs) to review in one pass.

Note

This plan deliberately rejects the majority of the issue's requirements. The issue describes a solution (plugin architecture) rather than a problem. The actual codebase is well-structured for its current scale, and the proposed abstractions would make it harder to understand and maintain, not easier. The targeted extraction of 2-3 helper functions addresses the only real duplication without the overhead of a framework.

Test Case 4: Add webhook support for pull_request.synchronize events

Issue body: ## Problem

We handle pull_request.closed and check_run.completed webhook events but not pull_request.synchronize. When a PR branch is force-pushed or has new commits pushed, we should react.

What should happen

When we receive a pull_request.synchronize event, we should... actually I'm not sure. Maybe re-run CI checks? Or maybe invalidate the plan? Or update the PR description? It depends on whether it's a yeti PR or an external PR.

For yeti PRs (branches starting with yeti/), maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not.

For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which.

Additional context

Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync. The webhooks.ts file has isRepoAllowed() which we should reuse. Also we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo.

See also: that other issue about webhook improvements (I don't remember the number).

Winner: variant | Current: 15/20 | Variant: 18/20

Reasoning: Both correctly identify the issue as unplannable and ask blocking clarifying questions. The variant is superior in specificity — it references concrete files (src/webhooks.ts), functions (handlePullRequestSynchronizeEvent, isRepoAllowed, scheduler.triggerJob), and architectural patterns (skip-if-busy, per-PR state). It also demonstrates stronger scope awareness by noting that GitHub already re-runs CI on pushes (making one proposed behavior redundant), that 'pausing' would require a new per-PR state mechanism that doesn't exist, and by offering a concrete recommendation to close/convert to discussion. The current prompt's questions are solid but more generic.

Current prompt output

Step 1: Plannable?

This issue is not plannable in its current state. The author explicitly expresses uncertainty about the desired behavior in nearly every paragraph ("I'm not sure," "maybe," "not sure which," "I don't remember the exact use case"). There are no acceptance criteria, and the core question — what should happen when we receive pull_request.synchronize — is unanswered.

Clarifying Questions (blocking)

  1. What is the desired behavior for yeti PRs?

    "For yeti PRs...maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not."

    The issue proposes at least three options but picks none:

    • (A) Ignore synchronize events on yeti PRs entirely
    • (B) Detect human pushes and pause yeti's work (how? remove a label? skip the issue in future runs?)
    • (C) Re-run some job (which one?)

    Which behavior do you want, and what does "pause" mean concretely?

  2. What is the desired behavior for external PRs?

    "For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which."

    These two jobs have very different purposes. ci-fixer reacts to failing CI checks; improvement-identifier scans for code improvements. A force-push might warrant re-running CI-fixer (if the old failure context is stale), but re-running improvement-identifier on every push could be noisy. Which job(s) should be triggered, and under what conditions?

  3. What was the original use case from Slack?

    "Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync."

    Without the motivating use case, we can't validate whether any proposed behavior actually solves the problem. Can you find the Slack thread and include the specific scenario?

  4. Should this be configurable, and if so how?

    "we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo."

    The current webhook system has no per-event-type toggles — all supported events are always handled if the repo is allowed. Should synchronize follow the same pattern (always on), or does it need a new config field? Per-repo control would be a new config pattern that doesn't exist today.

  5. What is "the other issue about webhook improvements"?

    "See also: that other issue about webhook improvements (I don't remember the number)."

    There may be overlapping or conflicting plans. Please link the related issue so we can check for scope conflicts before planning.


Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate your answers and produce a complete plan.

Variant prompt output

Based on my analysis of the codebase and this issue, this issue is too underspecified to produce an actionable implementation plan. Here's my assessment:


Issue Analysis: pull_request.synchronize webhook support

What's clear

The webhook infrastructure in src/webhooks.ts already handles four event types (issues, check_run, pull_request_review, pull_request) with a clean dispatcher pattern. Adding a new handlePullRequestSynchronizeEvent handler is mechanically straightforward — it follows the same pattern as the existing pull_request.closed handler, reuses isRepoAllowed(), and routes through scheduler.triggerJob().
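
For concreteness, a handler slotted into that dispatcher pattern might look like the sketch below. The payload fields follow GitHub's pull_request webhook shape, isRepoAllowed and triggerJob are the existing functions the issue says to reuse, and the triggered job is purely illustrative since that choice is exactly what the blocking questions below ask about.

```ts
// Illustrative payload shape following GitHub's pull_request webhook fields.
interface PullRequestSyncPayload {
  action: "synchronize";
  repository: { full_name: string };
  pull_request: { number: number; head: { ref: string } };
  sender: { login: string };
}

// Hypothetical handler following the existing dispatcher pattern: gate on
// isRepoAllowed(), branch on yeti vs external PRs, then route through the
// scheduler. The job name is a placeholder; the issue has not decided it.
async function handlePullRequestSynchronizeEvent(
  payload: PullRequestSyncPayload,
  deps: {
    isRepoAllowed: (fullName: string) => boolean;
    triggerJob: (name: string) => string;
  },
): Promise<void> {
  if (!deps.isRepoAllowed(payload.repository.full_name)) return;

  const isYetiPr = payload.pull_request.head.ref.startsWith("yeti/");
  if (isYetiPr) {
    // Behavior for yeti PRs (e.g. reacting to human pushes) is undecided.
    return;
  }
  // Placeholder only: which job, if any, to trigger is an open question.
  deps.triggerJob("ci-fixer");
}
```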

What's not clear — and why it matters

The issue explicitly states uncertainty about the desired behavior in multiple places. Every decision branch leads to a different implementation:

  1. "Maybe re-run CI checks?" — GitHub already re-runs checks automatically when commits are pushed to a PR branch. There's nothing for yeti to do here.

  2. "Maybe invalidate the plan?" — The plan lives as an issue comment produced by issue-refiner. "Invalidating" it could mean adding a comment, removing the Refined label, or re-triggering issue-refiner. Each has different implications for the workflow state machine.

  3. "Maybe update the PR description?" — Which description fields? Under what conditions?

  4. Yeti PR vs external PR distinction — The issue proposes different behavior per PR type but doesn't specify what either behavior should be.

  5. "Pause yeti's work" — There's no concept of pausing work on a specific PR in the current architecture. Jobs are global (skip-if-busy), not per-PR. This would be a significant new feature, not a webhook handler addition.

  6. Config toggle — "Always on", "per-repo", or "global toggle" are three different config designs.

  7. Referenced Slack conversation and "other issue" — These are the actual motivation for the feature but aren't available.

Clarifying Questions (blocking)

These must be answered before a plan can be produced:

  1. What is the concrete use case? The issue says "someone mentioned this would be useful in Slack" but doesn't describe the actual problem. What specific scenario causes PRs to "get out of sync" and what user-visible problem does that cause?

    • Options: (a) yeti overwrites human changes, (b) yeti PRs have stale CI, (c) yeti plans become outdated after manual edits, (d) something else
  2. What job(s) should be triggered? The issue lists ci-fixer, improvement-identifier, and "pause yeti's work" as candidates but picks none.

    • Note: CI re-runs are already handled by GitHub natively. ci-fixer is triggered by check_run.completed which fires after the re-run fails — so the existing flow already covers "new push → CI runs → CI fails → ci-fixer triggers."
    • Suggestion: If the goal is detecting human edits to yeti PRs, the simplest action would be adding a comment or label, not triggering a job.
  3. Should this block issue-worker from starting new work on an issue whose PR was human-edited? This is the "pause" concept — it requires a new per-issue/per-PR state mechanism that doesn't exist today. Is this in scope or out of scope?

  4. What is the "other issue about webhook improvements"? It may contain design decisions that affect this work.

Please respond as a comment on the GitHub issue so the next refinement cycle can incorporate the answers.

Recommendation

Close or convert this to a discussion until the use case is concrete. The current issue has the shape of "we could handle this event" without a clear "and here's the problem it solves." Adding webhook plumbing is easy; designing the right behavior requires knowing what problem we're solving.


Automated evaluation by yeti prompt-evaluator
