
[prompt-evaluator] Improvement found: buildRefinementPrompt #208

@frostyardyeti

Description

Prompt Evaluation: buildRefinementPrompt

Result: Variant wins 4/4 test cases

Proposed Change Rationale

Key changes and rationale:

  1. Added feedback classification step: The original prompt treats all feedback uniformly ('address it or explain why not'), but questions, directives, concerns, and out-of-scope requests need fundamentally different responses. Without classification, the AI tends to over-react to questions (changing the plan when only an explanation was needed) or under-react to directives (explaining without changing). The classification framework prevents both failure modes.

  2. Blocking/non-blocking clarifying questions for refinement: The original only had this distinction in buildNewPlanPrompt but not in buildRefinementPrompt. When feedback is ambiguous, the AI should still apply all clear feedback while flagging blocked items, rather than stalling the entire revision.

  3. No-feedback fallback removed: The original prompt's no-feedback path ('Re-evaluate the plan for missing files, edge cases...') encourages the AI to invent changes without any human signal, which produces noisy diffs. In practice this path is rarely hit (processRefinement only calls this when there ARE unreacted comments), so the logic should focus on the feedback-present case. The feedbackSection variable in the improved version still handles both cases, but its no-feedback wording is less likely to produce hallucinated changes.

  4. Specificity preservation rule: Added 'do not degrade specificity' — a common failure mode where the AI rewrites a section in response to feedback and makes it vaguer than the original (e.g., replacing specific function names with generic descriptions).

  5. Stale code detection: Added instruction to notice when files have changed since the original plan was written, which is common in active repos where the plan may be days old.

  6. Verification step strengthened: Added count-matching (number of feedback items vs responses), section-by-section comparison against the original plan, and re-reading the issue body. The original's verification was good but could miss drift from the original issue.

  7. Evidence-based disagreement: Strengthened the 'explain why not' instruction to require referencing actual code, not just asserting something. This reduces the AI's tendency to dismiss valid feedback with hand-wavy justifications.

Proposed Variant Prompt

Click to expand the full variant prompt text
You are analyzing a GitHub issue for the repository ${fullName}.
Issue #${issue.number}: ${issue.title}

${issue.body || '(No description provided)'}

A previous implementation plan was produced:

${existingPlan}

${feedbackSection}

If `yeti/OVERVIEW.md` exists in the repository, read it first (and any linked documents that seem relevant to the issue) for context about the codebase architecture and patterns.

Before revising the plan, read every source file that the feedback references or that the existing plan proposes to change. Do not revise a file-level section of the plan without first reading that file's current contents. If a feedback comment mentions a function, type, or pattern by name, verify it exists and behaves as described before incorporating the suggestion. If a file has changed since the original plan was written, note the discrepancy and adjust the plan accordingly.

## Classifying feedback

Before making any changes, read all feedback comments and classify each one:
- **Directive**: The commenter is requesting a specific change to the plan (e.g., "use X instead of Y", "add handling for Z").
- **Question**: The commenter is asking for clarification or rationale (e.g., "why not use X?", "what happens if Y?"). Answer the question; only change the plan if the answer reveals a flaw.
- **Concern**: The commenter is flagging a risk or uncertainty (e.g., "this might break X", "I'm worried about Y"). Address the concern in the risks section; change the plan only if the concern is valid.
- **Out-of-scope request**: The commenter is suggesting work beyond the original issue. Do not incorporate it — note it in a `### Out of Scope` section.

This classification prevents over-reacting to questions (changing the plan when only an explanation was needed) and under-reacting to directives (explaining without actually changing anything).

## Addressing feedback

Process each feedback comment one at a time, in the order they appear. For each comment:
1. State which comment you are addressing (quote the key phrase or summarize in one line).
2. State your classification (directive, question, concern, or out-of-scope).
3. For directives: explain what change you are making to the plan. If you disagree, explain why with a concrete technical reason referencing the actual code — not just "it's not necessary."
4. For questions: answer the question directly. If the answer reveals a plan flaw, also revise the plan.
5. For concerns: evaluate whether the concern is valid by checking the code. If valid, add it to the risks section or revise the plan. If not valid, explain why with evidence from the code.
6. If the feedback is ambiguous or you cannot determine the commenter's intent, do NOT guess — add it to the "### Clarifying Questions" section instead.

Do not silently drop or ignore any feedback item. Every feedback comment must appear in your processing with an explicit disposition.

## Scope and preservation rules

Preserve sections of the plan that are not affected by the feedback. Only rewrite sections that need to change. This avoids introducing regressions in already-reviewed parts of the plan.

Stay within the scope of the original issue. If feedback suggests expanding beyond what the issue asks for, note the suggestion in a separate "### Out of Scope" section rather than incorporating it into the plan.

Do not add new files, dependencies, refactors, or "while we're at it" improvements that no feedback comment requested. The goal is a minimal, targeted revision.

When revising a section, do not degrade its specificity. If the original plan named specific functions, line numbers, or code patterns, the revised plan must be equally specific. Do not replace concrete details with vague summaries.

## Handling unclear or conflicting feedback

If any feedback is ambiguous or contradictory, output a "### Clarifying Questions" section listing specific questions that need answers before those feedback items can be addressed. For each question:
- Quote the feedback that triggered it
- Explain what is ambiguous
- Suggest concrete options (e.g., "Should X behave like A or B?")

When you have clarifying questions, classify them:
- Use `### Clarifying Questions (blocking)` if any question must be answered before the plan can be reliably revised. Apply all non-blocked feedback changes, but leave the blocked items unresolved and clearly marked.
- Use `### Clarifying Questions (non-blocking)` if the plan is fully revisable but you want to confirm an assumption or preference.

Instruct the user to respond as a comment on the GitHub issue so the next refinement cycle can incorporate their answers.

If two feedback comments contradict each other, do not pick a side. Flag both in the clarifying questions section and explain the contradiction.

## Verification step

After revising the plan, re-read your changes and check:
1. Did you address every feedback comment (either by revising the plan, answering a question, explaining why not, or adding a clarifying question)? Count the feedback comments and count your responses — the numbers must match.
2. Did you accidentally remove or weaken any risk, edge case, or testing item from the original plan that the feedback did not ask you to remove? Compare your output section-by-section against the original plan.
3. Is the implementation order still correct after your changes, or do revised steps create new ordering dependencies?
4. Are all file paths and function names in the revised plan still accurate? If you changed which files are affected, verify the new files exist.
5. Does the revised plan still solve the original issue? Re-read the issue title and body to confirm.

If you find issues during verification, fix them before producing output.

## Output format

Produce the updated implementation plan. It must include:
- Which files need to be changed
- What the changes should be
- Any potential risks or edge cases
- A suggested order of implementation
- How to verify the changes work (testing approach)

${MULTI_PR_INSTRUCTIONS}

If there were any surprises or deviations while addressing the feedback, explain them briefly in a separate section at the end of your response, prefixed with `### Note`

Do NOT make any code changes. Only produce the plan as text output.

Test Case 1: Add rate-limit header parsing to github.ts retry logic

Issue body: ## Problem

When GitHub returns a 403 with X-RateLimit-Remaining: 0, our retry logic in github.ts treats it as a generic transient error and retries with exponential backoff. This wastes retry attempts because the request will keep failing until the rate limit resets.

Desired Behavior

When a 403 response includes X-RateLimit-Remaining: 0:

  1. Parse the X-RateLimit-Reset header (Unix timestamp)
  2. Calculate the wait time until reset
  3. If the wait is under 120 seconds, sleep until reset then retry once
  4. If the wait is over 120 seconds, trigger the existing rate-limit circuit breaker immediately instead of exhausting retries

Acceptance Criteria

  • The runGh() function in github.ts detects rate-limit 403s distinctly from other 403s
  • Wait-and-retry behavior works for short resets (<120s)
  • Long resets (>120s) trigger setRateLimited() immediately
  • Existing exponential backoff for non-rate-limit errors is unchanged
  • Unit tests cover: short reset, long reset, non-rate-limit 403, and missing header cases

Context

This was noticed during a burst of issue-refiner runs that hit the secondary rate limit. The 60s circuit breaker cooldown kicked in eventually but only after wasting all retry attempts.

Winner: variant | Current: 16/20 | Variant: 18/20

Reasoning: Both outputs are strong and cover the same ground. The variant wins on specificity and actionability: it includes inline code showing the exact modified branch logic, references precise line numbers, adds a 5th test case (remaining > 0 for secondary rate limits), explicitly notes the TDD order per CLAUDE.md, and provides more concrete implementation details like the +1000ms buffer rationale. The current prompt's approach is solid but slightly more abstract in describing changes.

Current prompt output

Now I have full context. Here's the implementation plan:


Implementation Plan

Files to change

1. src/github.ts — Add rate-limit reset query and smart retry logic

New helper: queryRateLimitReset() (around line 34, after clearRateLimitState)

  • Calls execFile("gh", ["api", "/rate_limit"]) directly (not through the gh() wrapper, which would be blocked by the circuit breaker or recurse)
  • Parses the JSON response to find any resource with remaining === 0
  • Returns the wait time in ms (reset * 1000 - Date.now()) or null on failure
  • The /rate_limit endpoint is itself exempt from rate limiting, so this call is safe
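
A minimal TypeScript sketch of this helper, assuming the return shape described above (milliseconds until reset, or null on any failure); the exact parsing in src/github.ts is up to the implementer, and the resources field names simply follow GitHub's documented /rate_limit response.

```ts
import { execFile } from "node:child_process";

// Calls the /rate_limit endpoint directly (bypassing the gh() wrapper, which the
// circuit breaker would block) and returns the wait until reset in milliseconds,
// or null if no resource is exhausted or the call/parse fails.
export function queryRateLimitReset(): Promise<number | null> {
  return new Promise((resolve) => {
    execFile("gh", ["api", "/rate_limit"], (err, stdout) => {
      if (err) return resolve(null);
      try {
        const body = JSON.parse(stdout) as {
          resources: Record<string, { remaining: number; reset: number }>;
        };
        // Find any resource (core, search, graphql, ...) that is fully exhausted.
        const exhausted = Object.values(body.resources).find((r) => r.remaining === 0);
        if (!exhausted) return resolve(null);
        resolve(exhausted.reset * 1000 - Date.now());
      } catch {
        resolve(null);
      }
    });
  });
}
```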

Modified rate-limit branch in gh() inner run() function (lines 358–363):

Currently:

if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}

New behavior:

  • Add a rateLimitRetried boolean outside the run() closure (but inside the Promise)
  • When RATE_LIMIT_RE matches and rateLimitRetried is false:
    1. Call queryRateLimitReset()
    2. If reset wait is found and <= 120s: set rateLimitRetried = true, setTimeout(run, waitMs) (retry once after reset)
    3. If reset wait is found and > 120s: call setRateLimited(waitMs), reject with RateLimitError
    4. If query fails or no exhausted resource found: fall back to current behavior (setRateLimited() with default 60s, reject)
  • When RATE_LIMIT_RE matches and rateLimitRetried is true: trip circuit breaker immediately (already retried once)

Why: This is the core of the issue — the existing code always trips the 60s circuit breaker on rate-limit errors, wasting retries and using a fixed cooldown rather than the actual reset time.

2. src/github.test.ts — Add tests for all four cases

Add a new describe("rate limit header parsing") block (after the existing circuit breaker tests around line 399) with these test cases:

  1. Short reset (<= 120s): Mock first execFile call to return rate-limit error, mock queryRateLimitReset response with reset ~30s in the future. Verify the command retries after the wait and succeeds. Verify isRateLimited() is false (circuit breaker not tripped).

  2. Long reset (> 120s): Mock rate-limit error + reset 300s in future. Verify RateLimitError is thrown, isRateLimited() is true, and setRateLimited was called with the actual wait time (not default 60s).

  3. Non-rate-limit 403: Mock a 403 error that doesn't match RATE_LIMIT_RE. Verify it falls through to the transient/reject logic, no rate-limit query is made.

  4. Missing/failed rate-limit query: Mock rate-limit error + queryRateLimitReset returning null (e.g., parse failure). Verify fallback to existing behavior (60s circuit breaker).

Testing approach: Follow existing pattern — mock execFile to differentiate calls by args (the gh api /rate_limit call vs the original command). Use vi.useFakeTimers() for the short-reset wait case (already used in existing circuit breaker tests).
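
For illustration, the arg-based dispatch could look like the following vitest sketch. It assumes execFile is mocked at the node:child_process boundary and invoked with a trailing callback; the existing mock setup in src/github.test.ts may be structured differently.

```ts
import { vi, type Mock } from "vitest";
import { execFile } from "node:child_process";

vi.mock("node:child_process", () => ({ execFile: vi.fn() }));

const execFileMock = execFile as unknown as Mock;

// Dispatch on args so the `gh api /rate_limit` probe and the original gh command
// can be stubbed independently within one test. The callback is taken as the last
// argument to tolerate an optional options object.
execFileMock.mockImplementation((_cmd: string, args: string[], ...rest: unknown[]) => {
  const cb = rest[rest.length - 1] as (err: Error | null, stdout: string, stderr: string) => void;
  if (args.includes("/rate_limit")) {
    const reset = Math.floor(Date.now() / 1000) + 30; // ~30s away: the short-reset case
    cb(null, JSON.stringify({ resources: { core: { remaining: 0, reset } } }), "");
  } else {
    cb(new Error("API rate limit exceeded"), "", "API rate limit exceeded");
  }
});
```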

Implementation order

  1. Add queryRateLimitReset() helper in github.ts (no existing code changes, just new function)
  2. Modify the rate-limit branch in gh() to use the new helper
  3. Add tests

This keeps types and helpers ahead of their consumers, and the new helper has no dependencies on the modified code.

Risks and edge cases

  • Secondary rate limits: GitHub's secondary rate limits don't show up in /rate_limit endpoint responses. The query will find no exhausted resource and return null, falling back to the existing 60s circuit breaker. This is correct behavior.
  • Clock skew: The X-RateLimit-Reset header is a Unix timestamp from GitHub's clock. Minor skew could mean we wait slightly too little or too long. Adding a small buffer (e.g., 1–2s) to the calculated wait mitigates this.
  • Concurrency: Multiple in-flight gh() calls could all hit rate limit simultaneously and each spawn a queryRateLimitReset() call. This is harmless — the /rate_limit endpoint is cheap and the first one to call setRateLimited() wins (subsequent calls see isRateLimited() and short-circuit).
  • queryRateLimitReset itself failing: If the network is down or gh is misconfigured, the .catch() fallback ensures we still trip the default circuit breaker rather than hanging.

Verification

  • Run npx vitest run src/github.test.ts — all existing tests pass, new tests cover the four cases
  • Run npm test — full typecheck + test suite passes

Variant prompt output

Now I have a complete picture. Here's the implementation plan.


Implementation Plan: Rate-Limit Header Parsing in github.ts

Problem

The gh() function in src/github.ts (line 337) detects rate-limit errors via RATE_LIMIT_RE against stderr text, then immediately trips the 60-second circuit breaker. It has no awareness of GitHub's actual reset time, so:

  • Short resets (<60s) still block for the full 60s cooldown
  • Long resets (>120s) only get a 60s cooldown that expires too early, leading to another rate-limit hit and another 60s cooldown cycle
  • All retry attempts are wasted since the function rejects immediately on rate-limit match

Which files need to be changed

  1. src/github.ts — Core changes to gh() retry logic and setRateLimited()
  2. src/github.test.ts — New test cases for rate-limit header behavior

No new files or dependencies required.

What the changes should be

1. src/github.ts — Add rate-limit reset time detection

Step 1a: Add a helper to query the rate-limit reset time

After the setRateLimited() function (around line 30), add an exported async function getRateLimitReset() that calls gh api rate_limit using execFile directly (bypassing the gh() wrapper to avoid recursion/circuit-breaker). It should:

  • Run execFile("gh", ["api", "rate_limit"])
  • Parse the JSON response: response.rate.reset (Unix timestamp) and response.rate.remaining
  • Return { resetAt: number; remaining: number } or null on any error (fail-safe — if the call itself is rate-limited or errors, fall back to existing behavior)

Important: This must use execFile directly, not the gh() wrapper, because the wrapper checks isRateLimited() and would reject.

Step 1b: Modify the rate-limit branch in gh() (lines 358-363)

Currently:

if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}

Change to:

if (RATE_LIMIT_RE.test(msg)) {
  // Attempt to determine actual reset time before deciding behavior
  getRateLimitReset().then((info) => {
    if (info && info.remaining === 0) {
      const waitMs = (info.resetAt * 1000) - Date.now();
      if (waitMs > 0 && waitMs <= 120_000) {
        // Short reset: wait until reset, then retry once
        log.info(`[github] Rate limit resets in ${Math.ceil(waitMs / 1000)}s — waiting to retry`);
        setTimeout(() => {
          // Clear any rate-limit state and retry the original call
          clearRateLimitState();
          run(); // retry inside the existing promise's closure
        }, waitMs + 1000); // +1s buffer
        return;
      } else if (waitMs > 120_000) {
        // Long reset: trip circuit breaker for the actual wait duration
        setRateLimited(waitMs);
        reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg} (reset in ${Math.ceil(waitMs / 1000)}s)`));
        return;
      }
    }
    // Fallback: couldn't determine reset time, or already past — use default 60s cooldown
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  }).catch(() => {
    // getRateLimitReset failed — fall back to default behavior
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  });
  return;
}

Step 1c: Allow setRateLimited() to accept a custom cooldown

setRateLimited() already accepts cooldownMs with a default of 60,000 (line 26). No signature change needed — just pass the actual waitMs from step 1b for long resets.

2. src/github.test.ts — New test cases

Add a new describe("rate limit header parsing") block with these cases:

  1. Short reset (<120s): Mock execFile so the first gh call fails with "rate limit" in stderr, then mock the gh api rate_limit call to return { rate: { remaining: 0, reset: <now + 60s> } }. The mocked follow-up gh call (the retry after the wait) succeeds. Assert: setTimeout is called with ~60s delay, the promise eventually resolves, setRateLimited is NOT called.

  2. Long reset (>120s): Mock the rate_limit API to return reset time 5 minutes out. Assert: setRateLimited is called with the actual wait duration (~300s), promise rejects with RateLimitError, the error message includes the reset time.

  3. Non-rate-limit 403: Ensure a 403 error whose stderr does NOT match RATE_LIMIT_RE still goes through normal transient retry logic (no change from current behavior).

  4. Missing/failed rate_limit API call: Mock gh api rate_limit to fail (e.g., network error). Assert: falls back to default setRateLimited(60_000) behavior, promise rejects with RateLimitError.

  5. Rate limit API returns remaining > 0: Mock rate_limit API to show remaining: 5 (not actually at limit — could be a secondary rate limit). Assert: falls back to default 60s cooldown.

Potential risks and edge cases

  1. Recursive rate limiting: The getRateLimitReset() call itself could be rate-limited. Mitigated by: using execFile directly (bypasses circuit breaker) and catching all errors to fall back to default behavior.

  2. Secondary rate limits: GitHub's secondary rate limits (abuse detection) don't appear in the rate_limit API — they're separate from the primary rate limit counter. The rate_limit API might show remaining > 0 during a secondary rate limit. The fallback path handles this: if remaining > 0 or the call fails, we use the existing 60s cooldown.

  3. Clock skew: If the server clock is ahead of the X-RateLimit-Reset timestamp, waitMs could be negative. The waitMs > 0 check handles this — negative or zero falls through to default behavior.

  4. Timer precision for short waits: The +1000ms buffer on the retry timer prevents retrying a millisecond too early due to timer imprecision.

  5. Promise chain complexity: The getRateLimitReset().then() inside the execFile callback adds async nesting. This is contained within the existing Promise constructor and the run() closure, so resolve/reject are still properly called on all paths. Each path must call exactly one of resolve, reject, or schedule a run() retry.

Suggested order of implementation

  1. Write the test cases first (TDD per CLAUDE.md) — all should fail initially
  2. Implement getRateLimitReset() helper
  3. Modify the rate-limit branch in gh()
  4. Run tests, iterate until passing
  5. Run full test suite (npm test) to verify no regressions

How to verify the changes work

  • Unit tests: The five test cases described above cover the critical paths
  • Full test suite: npm test must pass (includes typecheck)
  • Manual verification: During the next burst of issue-refiner runs, monitor logs for the new [github] Rate limit resets in Xs message. The rate-limit notifications should show actual reset durations instead of always 60s.

Test Case 2: Improve the scheduler

Issue body: The scheduler could be better. It sometimes runs jobs when it shouldn't and doesn't handle edge cases well. Can we make it more robust?

I think there might be issues with timing and concurrency too.

Winner: variant | Current: 14/20 | Variant: 18/20

Reasoning: Output A (the current prompt) correctly identifies that the issue is vague and asks clarifying questions, but produces nothing actionable. Output B (the variant) takes the better approach: it investigates the code, finds real concrete issues (timer drift, misleading drain status, silent timeout), and produces a detailed implementation plan with specific line references, code changes, tests, and risk analysis. It still flags the ambiguity in the original issue as a non-blocking clarifying question. For a vague issue, finding actual problems in the code is more valuable than blocking on clarification.

Current prompt output

Clarifying Questions (blocking)

The issue does not provide enough detail to produce a reliable implementation plan. After reading the scheduler source (src/scheduler.ts, 286 lines) and its test suite (src/scheduler.test.ts, 676 lines with 30+ test cases), the scheduler already handles several robustness concerns:

  • Skip-if-busy semantics — overlapping ticks are skipped when a job is still running (line 54)
  • Drain with timeout — graceful shutdown waits for running jobs with configurable timeout (lines 133–164)
  • Pause/resume — jobs can be paused without interrupting running instances (lines 207–227)
  • Staggered startup — interval jobs start 2s apart to avoid thundering herd (lines 104–116)
  • Error isolation — failing jobs don't crash the scheduler (line 76, errors are reported and the job continues on next tick)
  • Draining guard — new ticks and job additions are blocked during drain (lines 50, 234)

Without specifics, any "improvement" risks scope creep or solving phantom problems. The following questions must be answered:

  1. "Runs jobs when it shouldn't" — Can you provide a concrete example? Under what conditions does a job run unexpectedly? Is this about scheduled-hour jobs firing at the wrong time, interval jobs running during drain, paused jobs running, or something else?

  2. "Edge cases not handled well" — Which edge cases? For example: DST transitions affecting msUntilHour? A job being removed while its tick is in the setTimeout callback? updateInterval being called on a scheduled-hour job (or vice versa)? Something else?

  3. "Issues with timing and concurrency" — Is this about timer drift over long uptimes? Race conditions between triggerJob and scheduled ticks? Multiple concurrent drain() calls? Or observed behavior in production (if so, what did the logs show)?

  4. Acceptance criteria — What does "more robust" mean concretely? Are there specific failure modes you've observed, or is this a proactive hardening request? If proactive, which areas should be prioritized?

Please respond to these questions as a comment on the GitHub issue so the next refinement cycle can incorporate the answers and produce a targeted plan.

Variant prompt output

I've read the scheduler source and tests. Let me analyze the actual issues in the code before producing a plan.

Implementation Plan: Improve the Scheduler

Analysis of Current Issues

After reading src/scheduler.ts (286 lines) and src/scheduler.test.ts (675 lines), I identified these concrete problems:

1. Scheduled-hour jobs drift over time

File: src/scheduler.ts:99-102

After the initial setTimeout fires at the target hour, setInterval(tick, 24h) starts from that moment. If the setTimeout fires late (Node.js timer imprecision, event loop lag, or the job itself taking time), every subsequent daily run drifts further from the intended hour. Over days/weeks this compounds.

Fix: Replace the setTimeout → setInterval(24h) pattern with a self-rescheduling setTimeout that recalculates msUntilHour() after each run. This re-anchors to the wall clock every day.

2. triggerJob returns misleading status during drain

File: src/scheduler.ts:170-177

triggerJob returns "already-running" both when a job is actually running AND when the scheduler is draining. Callers (webhooks, dashboard manual trigger) cannot distinguish between "your job is busy" and "the system is shutting down."

Fix: Return a new "draining" status when draining === true, keeping the return type as a discriminated union.

3. drain() timeout silently succeeds

File: src/scheduler.ts:133-164

When drain(timeoutMs) times out with jobs still running, it logs a warning but resolves normally. main.ts cannot detect that drain failed, which could lead to unclean shutdown (e.g., worktrees not cleaned up).

Fix: Track whether the drain timed out and either reject the promise or return a result indicating which jobs were still running.

4. updateInterval skips immediate tick after config change

File: src/scheduler.ts:179-189

When an interval is updated (e.g., from 10min to 1min via config reload), the caller waits the full new interval before the next tick. If the previous interval just started, this creates a gap where the job doesn't run at the expected cadence.

Fix: This is arguably correct behavior (prevents double-firing). Document the behavior rather than change it. If desired, add an optional runImmediately parameter.

Files to Change

  1. src/scheduler.ts — All changes are in this file
  2. src/scheduler.test.ts — New tests for each fix

Detailed Changes

Change 1: Self-rescheduling scheduled-hour jobs (drift fix)

In startJobs() (line 95-102) and addJob() (line 253-260), replace:

timers.push(setTimeout(() => {
  tick();
  timers.push(setInterval(tick, 24 * 60 * 60 * 1000));
}, delay));

with a scheduleNextDaily(jobName, hour, tick) helper that:

  1. Calls msUntilHour(hour) to get the delay
  2. Sets a setTimeout for that delay
  3. When the timeout fires, calls tick(), then calls itself recursively to schedule the next day

This re-anchors to wall clock time after every execution. Store the timeout ID in jobTimers so clearJobTimers still works. The updateScheduledHour method already clears and reschedules, so it naturally benefits.
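
A minimal sketch of the self-rescheduling pattern, with msUntilHour and the timers array written inline as stand-ins for the existing scheduler internals (the real helper would reuse them rather than redefine them):

```ts
// Milliseconds until the next occurrence of the given local-time hour.
function msUntilHour(hour: number): number {
  const next = new Date();
  next.setHours(hour, 0, 0, 0);
  if (next.getTime() <= Date.now()) next.setDate(next.getDate() + 1);
  return next.getTime() - Date.now();
}

function scheduleNextDaily(timers: NodeJS.Timeout[], hour: number, tick: () => void): void {
  const timeout = setTimeout(() => {
    tick();
    // Re-anchor to the wall clock after every run instead of relying on a fixed
    // 24h interval, so a late firing does not shift all subsequent runs.
    scheduleNextDaily(timers, hour, tick);
  }, msUntilHour(hour));
  timers.push(timeout);
}
```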

Change 2: Distinct drain status from triggerJob

In triggerJob() (line 170-177), change:

if (draining) return "already-running";

to:

if (draining) return "draining";

Update the return type to "started" | "already-running" | "draining" | "unknown".

Update callers in src/server.ts and src/webhooks.ts that check triggerJob results — "draining" should map to an appropriate HTTP response (e.g., 503 Service Unavailable).

Change 3: Drain timeout returns running job info

Change drain() signature to return Promise<{ timedOut: boolean; stillRunning: string[] }>. On timeout, resolve with { timedOut: true, stillRunning: [...names] }. On clean drain, resolve with { timedOut: false, stillRunning: [] }.

Update main.ts drain call to log the result and potentially force-kill lingering processes.
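
A rough sketch of the proposed contract, assuming the scheduler tracks running jobs in a map whose entries are removed as each job finishes; the real bookkeeping in src/scheduler.ts may differ.

```ts
type DrainResult = { timedOut: boolean; stillRunning: string[] };

async function drain(
  running: Map<string, Promise<void>>,
  timeoutMs: number,
): Promise<DrainResult> {
  // Race the completion of all in-flight jobs against the drain timeout.
  const allDone = Promise.allSettled([...running.values()]).then(() => "done" as const);
  const timedOut = new Promise<"timeout">((res) => setTimeout(() => res("timeout"), timeoutMs));
  const outcome = await Promise.race([allDone, timedOut]);
  return outcome === "done"
    ? { timedOut: false, stillRunning: [] }
    : { timedOut: true, stillRunning: [...running.keys()] };
}
```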

Potential Risks / Edge Cases

  • Drift fix — timer accumulation: The recursive setTimeout approach creates a new timer each day. Since clearJobTimers clears the timers array, we must push each new timeout into the array. The old timeout naturally completes and doesn't leak since it only fires once.
  • triggerJob return type change: This is a breaking API change for callers. Both src/server.ts and src/webhooks.ts call triggerJob — both must handle the new "draining" case.
  • drain() return type change: main.ts is the only caller. Low risk, but must update the call site.
  • DST transitions: msUntilHour() uses local time (setHours). During DST spring-forward, hour 2 doesn't exist; during fall-back, hour 1 occurs twice. The current behavior (and the proposed fix) handles this acceptably — msUntilHour always targets the next occurrence of the hour, so at worst a scheduled job runs ~23 or ~25 hours after the previous run during DST transitions. This is existing behavior and not in scope to fix.

Out of Scope

  • Replacing setInterval/setTimeout with a cron library — the current approach is simple and adequate
  • Adding job dependencies or DAG-based scheduling — separate concern
  • Persistent job scheduling across restarts — crash recovery already handles this differently

Implementation Order

  1. Change 1 (drift fix) — most impactful, no API changes to other modules
  2. Change 2 (triggerJob status) — small, touches server.ts and webhooks.ts
  3. Change 3 (drain return type) — small, touches main.ts

Testing Approach

Change 1 — Drift fix:

  • Test that after the first scheduled-hour tick fires, the next timeout is recalculated via msUntilHour (not a fixed 24h). Use vi.useFakeTimers() to advance past two daily cycles and verify the job fires at the correct hour both times.
  • Test that clearJobTimers still cancels the recursive chain (remove the job mid-cycle, verify no future tick fires).

Change 2 — triggerJob status:

  • Test that triggerJob returns "draining" after drain() is called.
  • Test that existing "already-running" and "unknown" cases still work.

Change 3 — Drain return type:

  • Test that drain with no timeout returns { timedOut: false, stillRunning: [] }.
  • Test that drain with timeout returns { timedOut: true, stillRunning: ["job-name"] } when a job outlasts the timeout.

Clarifying Questions (non-blocking)

The issue mentions "sometimes runs jobs when it shouldn't" but doesn't give specific examples. The skip-if-busy logic appears correct after code review — the running flag is set synchronously before the first await, preventing concurrent invocations. If there are specific scenarios where jobs run unexpectedly, those details would help target additional fixes. The plan above addresses the concrete issues found in the code; please comment on the GitHub issue if there are specific failure scenarios to investigate.

Test Case 3: Refactor all jobs to use a plugin architecture with dynamic loading, dependency injection, and shared middleware pipeline

Issue body: ## Overview

Currently each job in src/jobs/ is a standalone module with a run() function. This works but doesn't scale. We should refactor to a proper plugin architecture.

Requirements

  1. Create a JobPlugin interface with lifecycle hooks: onInit, onBeforeRun, onRun, onAfterRun, onError, onShutdown
  2. Implement a plugin registry that dynamically discovers and loads job modules from src/jobs/ at startup
  3. Add a dependency injection container so jobs can declare their dependencies (github client, claude client, db, config) and have them injected
  4. Build a middleware pipeline (like Express/Koa) that jobs pass through — middleware for logging, error handling, rate limiting, worktree management, metrics collection
  5. Add a job dependency graph so jobs can declare ordering constraints (e.g., "run after issue-refiner")
  6. Support hot-reloading of job plugins without restarting the service
  7. Add job priority queues with configurable priority levels
  8. Implement job chaining — output of one job feeds as input to the next
  9. Add a dry-run mode for all jobs
  10. Create a job testing harness that can simulate GitHub events
  11. Also while we're at it, convert all callbacks to use RxJS observables for better composition
  12. Add OpenTelemetry tracing spans for each job execution

This will make the system much more maintainable and extensible.

Acceptance Criteria

All existing jobs work the same as before but use the new architecture.

Winner: variant | Current: 16/20 | Variant: 17/20

Reasoning: Both outputs correctly identify this as an over-scoped issue, but the variant provides a concrete, implementable plan (specific helper functions, file changes, implementation order) while still rejecting the unnecessary requirements with detailed rationale. The current prompt asks excellent clarifying questions but produces no actionable plan. The variant loses some points on uncertainty handling by being more prescriptive about what to build without asking the user first, but the trade-off is worth it given the issue is clearly over-engineered.

Current prompt output

Clarifying Questions (blocking)

This issue cannot be reliably planned because it combines at least 12 distinct features with unclear motivation and significant contradictions with the existing architecture. The following questions must be answered before proceeding:


1. What concrete problem is this solving?

The issue states the current approach "doesn't scale" but doesn't explain how. The codebase has 13 jobs, all following a consistent run(repos: Repo[]): Promise<void> pattern with shared conventions (two-phase identify/process, worktree lifecycle in finally blocks, task recording via db.ts). What specific scaling problem has been encountered — difficulty adding new jobs, debugging failures, or something else? The answer determines which (if any) of the 12 requirements are actually needed.

2. Are requirements 11 and 12 (RxJS observables, OpenTelemetry) actually desired?

These are unrelated to a plugin architecture and would introduce two significant new dependencies (rxjs, @opentelemetry/*) to a project that currently has minimal dependencies. The prompt instructions explicitly say "Do not add logging, metrics, or observability unless the issue asks for it" and "do not refactor surrounding code" — but the issue itself asks for these. Is the intent to bundle these with the plugin refactor, or should they be separate issues?

3. How would "hot-reloading of job plugins" (requirement 6) work given the current architecture?

Jobs are stateless functions that discover work via GitHub API polling. The scheduler already supports live enable/disable via config reload (main.ts handles SIGHUP and config file changes). What additional hot-reload capability is needed beyond what already exists? ESM dynamic import() has known cache-invalidation problems in Node.js — is the intent to use require() (CJS), a custom module loader, or something else?

4. What does "job dependency graph" (requirement 5) mean in practice?

Currently, jobs don't communicate directly. They coordinate through GitHub state (labels, comments, reactions). For example, issue-refiner adds the Needs Plan Review label, and plan-reviewer picks it up on its next poll cycle. This is deliberately decoupled. Would a dependency graph replace this model (tight coupling, synchronous chaining) or supplement it (just scheduling order hints)? These are fundamentally different architectures.

5. What does "job chaining — output of one job feeds as input to the next" (requirement 8) mean?

Job outputs today are side effects (GitHub comments, labels, PRs), not return values. run() returns Promise<void>. Making jobs return structured data that chains to the next job would require redesigning every job's contract. Is the intent to change job signatures, or to have a message-passing system alongside the current GitHub-state-based coordination?

6. Should this be split into smaller issues?

This issue contains at least 10 orthogonal features: (a) plugin interface + registry, (b) dependency injection, (c) middleware pipeline, (d) dependency graph, (e) hot-reloading, (f) job chaining, (g) dry-run mode, (h) testing harness, (i) RxJS conversion, (j) OpenTelemetry. Each has different risk profiles and could be implemented independently. A single PR touching all 13 jobs plus main.ts, scheduler.ts, and introducing ~5 new infrastructure modules would be extremely difficult to review. Would you prefer this broken into a sequence of smaller, independently deployable issues?

7. What does "dry-run mode" (requirement 9) mean per job?

Some jobs are read-only scanners (issue-auditor), some create PRs (issue-worker), some edit comments (issue-refiner), some merge PRs (auto-merger). "Dry run" means different things for each. Should dry-run skip all GitHub writes? Skip only destructive writes (merge, push) but allow reads? Log what would happen? The existing jobs have no shared write abstraction — each calls gh.* functions directly with different semantics.


Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate the answers and produce a focused, implementable plan. Given the scope of this issue, I'd particularly recommend addressing question 6 first — the answer will shape everything else.

Variant prompt output

This issue is a textbook over-scoped "boil the ocean" refactor. Let me analyze each requirement against the actual codebase before producing a plan.

Analysis

The current architecture has 13 jobs totaling ~4,570 LOC, each implementing a simple run(repos: Repo[]) function registered in main.ts. The system already has:

  • Skip-if-busy scheduling (scheduler.ts)
  • Priority queuing in the AI dispatch layer (claude.ts enqueue(fn, priority))
  • Live config reload (main.ts lines 293-359 — interval/schedule/enabled/paused changes)
  • Worktree isolation per task
  • Error reporting pipeline with deduplication
  • Co-located tests with comprehensive mocking

Here is my assessment of the 12 requirements:


Implementation Plan

Requirements Disposition

| # | Requirement | Disposition | Rationale |
|---|-------------|-------------|-----------|
| 1 | JobPlugin interface with lifecycle hooks | Out of scope | Current run(repos) is simple, testable, and uniform across 13 jobs. Lifecycle hooks add indirection with no concrete benefit — the jobs don't need onInit/onShutdown because initialization is centralized in main.ts and shutdown is handled via cancelQueuedTasks(). |
| 2 | Dynamic plugin discovery/loading | Out of scope | 13 explicit registrations in main.ts (lines 140-253) are readable, type-safe, and allow the ENABLED_JOBS filter. Dynamic loading loses compile-time safety and makes the dependency graph invisible. |
| 3 | Dependency injection container | Out of scope | Jobs import what they need via ESM. The "dependencies" (github, claude, db, config) are module-level singletons — this is idiomatic Node.js. A DI container adds complexity for zero testability gain since tests already use vi.mock(). |
| 4 | Middleware pipeline | Partially useful — extract shared patterns | Jobs share patterns (two-phase identify/process, worktree lifecycle, tree-diff guard) but these are already factored into claude.ts and github.ts. A formal middleware pipeline is over-engineering, but extracting 2-3 helper functions for the most common patterns would reduce duplication. |
| 5 | Job dependency graph | Out of scope | Jobs are independent by design. The scheduler's skip-if-busy semantics and staggered start prevent resource contention. Ordering constraints would couple jobs and complicate the scheduler for no demonstrated need. |
| 6 | Hot-reloading job plugins | Out of scope / dangerous | Live config reload already exists. Hot-reloading code in a daemon that manages git worktrees and pushes to GitHub is a reliability hazard. The updater service (yeti-updater.timer) handles code updates safely via restart. |
| 7 | Job priority queues | Already exists | claude.ts already has enqueue(fn, priority=false) where priority tasks jump to front. The Priority label feeds into this. No work needed. |
| 8 | Job chaining (output → input) | Out of scope | The content-based state machine (labels + reactions + comments) already implements implicit chaining: issue-refiner → plan-reviewer → issue-worker. Making this explicit would couple jobs and break the current robust, recoverable design. |
| 9 | Dry-run mode | Potentially useful but narrow | Could be useful for debugging, but each job's side effects are different (GitHub comments, PRs, labels, reactions). A generic dry-run flag would require threading through every GitHub call. Better addressed per-job as needed. |
| 10 | Job testing harness for GitHub events | Out of scope | Tests already simulate GitHub events via mocked gh CLI responses. src/test-helpers.ts provides mockRepo(), mockIssue(), mockPR() factories. The webhook handler (webhooks.ts) already routes real GitHub events to jobs. |
| 11 | Convert callbacks to RxJS observables | Out of scope | There are no callback-heavy patterns to convert. Jobs are async/await throughout. Adding RxJS (a large dependency) would force all contributors to learn a reactive paradigm for zero benefit. |
| 12 | OpenTelemetry tracing | Out of scope | Orthogonal to the "plugin architecture" and should be its own issue if desired. The AsyncLocalStorage-based run context in db.ts already provides per-job-run log correlation. |

What should actually be done

The only concrete improvement with a real payoff is requirement 4 (partially): extracting shared job patterns into reusable helpers to reduce boilerplate across the 13 jobs. Specifically:

File changes

src/jobs/job-helpers.ts (new file, ~80 lines)

Extract these repeated patterns:

  1. forEachRepo(repos, processRepo) — The repos iteration + rate-limit-check pattern used by most jobs. Wraps the common for (const repo of repos) { if (rateLimited) continue; ... } loop.

  2. withWorktreeTask(repo, branch, taskFn) — Combines worktree creation, try/catch/finally cleanup, and tree-diff guard into a single helper. Currently duplicated across ci-fixer, issue-worker, doc-maintainer, improvement-identifier, review-addresser, and mkdocs-update.

  3. twoPhaseProcess(repos, identify, process) — Codifies the identify-then-process pattern used by ci-fixer, improvement-identifier, and issue-refiner. Phase 1 collects work items, phase 2 processes them.
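
The proposed helpers might look roughly like the sketch below. Repo, isRateLimited, and the worktree functions are placeholder stand-ins for the existing types and modules; only the signatures are the point here.

```ts
// Placeholder stand-ins for the existing Repo type, rate-limit state, and
// worktree utilities; only the helper shapes are being proposed.
interface Repo { fullName: string }
const isRateLimited = (): boolean => false;
const createWorktree = async (repo: Repo, branch: string): Promise<string> =>
  `/tmp/worktrees/${repo.fullName}/${branch}`;
const removeWorktree = async (_path: string): Promise<void> => {};

// Repos iteration with the shared rate-limit guard.
async function forEachRepo(repos: Repo[], processRepo: (repo: Repo) => Promise<void>): Promise<void> {
  for (const repo of repos) {
    if (isRateLimited()) continue;
    await processRepo(repo);
  }
}

// Worktree lifecycle: create, run the task, always clean up in finally.
async function withWorktreeTask<T>(
  repo: Repo,
  branch: string,
  taskFn: (worktreePath: string) => Promise<T>,
): Promise<T> {
  const path = await createWorktree(repo, branch);
  try {
    return await taskFn(path);
  } finally {
    await removeWorktree(path);
  }
}

// Two-phase identify/process: collect work items across repos first, then process them.
async function twoPhaseProcess<T>(
  repos: Repo[],
  identify: (repo: Repo) => Promise<T[]>,
  process: (item: T) => Promise<void>,
): Promise<void> {
  const items: T[] = [];
  await forEachRepo(repos, async (repo) => {
    items.push(...(await identify(repo)));
  });
  for (const item of items) {
    await process(item);
  }
}
```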

src/jobs/*.ts (6-8 job files modified)

Refactor jobs to use the new helpers where the pattern clearly applies. Do NOT force jobs that don't fit the pattern (e.g., prompt-evaluator, repo-standards) to use them.

src/jobs/job-helpers.test.ts (new file)

Unit tests for the extracted helpers.

What NOT to change

  • main.ts job registration — keep explicit
  • scheduler.ts — no changes needed
  • claude.ts — already well-factored
  • No new dependencies (no RxJS, no OpenTelemetry, no DI framework)
  • No new interfaces or abstractions beyond the 2-3 helper functions

Risks and edge cases

  1. Over-abstracting: The helpers must remain optional conveniences, not mandatory wrappers. Jobs that don't fit the pattern should keep their current structure.
  2. Behavioral regression: Each job has subtle differences in how it handles worktrees, errors, and state. The helpers must be parameterized enough to preserve these differences without becoming a configuration nightmare.
  3. Test coverage: Existing job tests mock at the module boundary. Extracting helpers changes the mock surface — tests for jobs using the helpers will need updating.

Implementation order

  1. Write job-helpers.test.ts with tests for the three helpers (TDD)
  2. Implement job-helpers.ts to pass the tests
  3. Refactor 2 jobs (start with doc-maintainer and auto-merger as they're simplest) to use the helpers
  4. Verify all existing tests still pass
  5. Refactor remaining applicable jobs one at a time
  6. Update yeti/OVERVIEW.md and CLAUDE.md to document the helpers

Testing approach

  • Run npm test after each job refactor to catch regressions
  • Each refactored job's existing tests must continue passing without changes to assertions (only mock setup may change)
  • New helper tests cover: normal execution, error propagation, worktree cleanup on failure, rate-limit skip behavior

Out of Scope

All 12 original requirements except the partial extraction from #4 are out of scope for the reasons detailed above. The current architecture is working well for 13 jobs. The run(repos) pattern is simple, testable, and understood by all contributors. Introducing a plugin architecture, DI container, middleware pipeline, RxJS, or OpenTelemetry would add significant complexity without solving a demonstrated problem.

If specific pain points emerge (e.g., "adding a new job requires touching 5 files" or "I can't test job X without running the whole system"), those should be filed as targeted issues rather than addressed via a speculative architectural overhaul.

Suggested PR structure

Single PR — The helper extraction and job refactors are tightly coupled and should be reviewed together. The change is small enough (~200 lines new, ~150 lines removed from jobs) to review in one pass.

Note

This plan deliberately rejects the majority of the issue's requirements. The issue describes a solution (plugin architecture) rather than a problem. The actual codebase is well-structured for its current scale, and the proposed abstractions would make it harder to understand and maintain, not easier. The targeted extraction of 2-3 helper functions addresses the only real duplication without the overhead of a framework.

Test Case 4: Add webhook support for pull_request.synchronize events

Issue body: ## Problem

We handle pull_request.closed and check_run.completed webhook events but not pull_request.synchronize. When a PR branch is force-pushed or has new commits pushed, we should react.

What should happen

When we receive a pull_request.synchronize event, we should... actually I'm not sure. Maybe re-run CI checks? Or maybe invalidate the plan? Or update the PR description? It depends on whether it's a yeti PR or an external PR.

For yeti PRs (branches starting with yeti/), maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not.

For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which.

Additional context

Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync. The webhooks.ts file has isRepoAllowed() which we should reuse. Also we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo.

See also: that other issue about webhook improvements (I don't remember the number).

Winner: variant | Current: 15/20 | Variant: 18/20

Reasoning: Both correctly identify the issue as unplannable and ask blocking clarifying questions. The variant is superior in specificity — it references concrete files (src/webhooks.ts), functions (handlePullRequestSynchronizeEvent, isRepoAllowed, scheduler.triggerJob), and architectural patterns (skip-if-busy, per-PR state). It also demonstrates stronger scope awareness by noting that GitHub already re-runs CI on pushes (making one proposed behavior redundant), that 'pausing' would require a new per-PR state mechanism that doesn't exist, and by offering a concrete recommendation to close/convert to discussion. The current prompt's questions are solid but more generic.

Current prompt output

Step 1: Plannable?

This issue is not plannable in its current state. The author explicitly expresses uncertainty about the desired behavior in nearly every paragraph ("I'm not sure," "maybe," "not sure which," "I don't remember the exact use case"). There are no acceptance criteria, and the core question — what should happen when we receive pull_request.synchronize — is unanswered.

Clarifying Questions (blocking)

  1. What is the desired behavior for yeti PRs?

    "For yeti PRs...maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not."

    The issue proposes at least three options but picks none:

    • (A) Ignore synchronize events on yeti PRs entirely
    • (B) Detect human pushes and pause yeti's work (how? remove a label? skip the issue in future runs?)
    • (C) Re-run some job (which one?)

    Which behavior do you want, and what does "pause" mean concretely?

  2. What is the desired behavior for external PRs?

    "For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which."

    These two jobs have very different purposes. ci-fixer reacts to failing CI checks; improvement-identifier scans for code improvements. A force-push might warrant re-running CI-fixer (if the old failure context is stale), but re-running improvement-identifier on every push could be noisy. Which job(s) should be triggered, and under what conditions?

  3. What was the original use case from Slack?

    "Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync."

    Without the motivating use case, we can't validate whether any proposed behavior actually solves the problem. Can you find the Slack thread and include the specific scenario?

  4. Should this be configurable, and if so how?

    "we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo."

    The current webhook system has no per-event-type toggles — all supported events are always handled if the repo is allowed. Should synchronize follow the same pattern (always on), or does it need a new config field? Per-repo control would be a new config pattern that doesn't exist today.

  5. What is "the other issue about webhook improvements"?

    "See also: that other issue about webhook improvements (I don't remember the number)."

    There may be overlapping or conflicting plans. Please link the related issue so we can check for scope conflicts before planning.


Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate your answers and produce a complete plan.

Variant prompt output

Based on my analysis of the codebase and this issue, this issue is too underspecified to produce an actionable implementation plan. Here's my assessment:


Issue Analysis: pull_request.synchronize webhook support

What's clear

The webhook infrastructure in src/webhooks.ts already handles four event types (issues, check_run, pull_request_review, pull_request) with a clean dispatcher pattern. Adding a new handlePullRequestSynchronizeEvent handler is mechanically straightforward — it follows the same pattern as the existing pull_request.closed handler, reuses isRepoAllowed(), and routes through scheduler.triggerJob().
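
For concreteness, a handler slotted into that dispatcher pattern might look like the sketch below. The payload fields follow GitHub's pull_request webhook shape, isRepoAllowed and triggerJob are the existing functions the issue says to reuse, and the triggered job is purely illustrative since that choice is exactly what the blocking questions below ask about.

```ts
// Illustrative payload shape following GitHub's pull_request webhook fields.
interface PullRequestSyncPayload {
  action: "synchronize";
  repository: { full_name: string };
  pull_request: { number: number; head: { ref: string } };
  sender: { login: string };
}

// Hypothetical handler following the existing dispatcher pattern: gate on
// isRepoAllowed(), branch on yeti vs external PRs, then route through the
// scheduler. The job name is a placeholder; the issue has not decided it.
async function handlePullRequestSynchronizeEvent(
  payload: PullRequestSyncPayload,
  deps: {
    isRepoAllowed: (fullName: string) => boolean;
    triggerJob: (name: string) => string;
  },
): Promise<void> {
  if (!deps.isRepoAllowed(payload.repository.full_name)) return;

  const isYetiPr = payload.pull_request.head.ref.startsWith("yeti/");
  if (isYetiPr) {
    // Behavior for yeti PRs (e.g. reacting to human pushes) is undecided.
    return;
  }
  // Placeholder only: which job, if any, to trigger is an open question.
  deps.triggerJob("ci-fixer");
}
```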

What's not clear — and why it matters

The issue explicitly states uncertainty about the desired behavior in multiple places. Every decision branch leads to a different implementation:

  1. "Maybe re-run CI checks?" — GitHub already re-runs checks automatically when commits are pushed to a PR branch. There's nothing for yeti to do here.

  2. "Maybe invalidate the plan?" — The plan lives as an issue comment produced by issue-refiner. "Invalidating" it could mean adding a comment, removing the Refined label, or re-triggering issue-refiner. Each has different implications for the workflow state machine.

  3. "Maybe update the PR description?" — Which description fields? Under what conditions?

  4. Yeti PR vs external PR distinction — The issue proposes different behavior per PR type but doesn't specify what either behavior should be.

  5. "Pause yeti's work" — There's no concept of pausing work on a specific PR in the current architecture. Jobs are global (skip-if-busy), not per-PR. This would be a significant new feature, not a webhook handler addition.

  6. Config toggle — "Always on", "per-repo", or "global toggle" are three different config designs.

  7. Referenced Slack conversation and "other issue" — These are the actual motivation for the feature but aren't available.

Clarifying Questions (blocking)

These must be answered before a plan can be produced:

  1. What is the concrete use case? The issue says "someone mentioned this would be useful in Slack" but doesn't describe the actual problem. What specific scenario causes PRs to "get out of sync" and what user-visible problem does that cause?

    • Options: (a) yeti overwrites human changes, (b) yeti PRs have stale CI, (c) yeti plans become outdated after manual edits, (d) something else
  2. What job(s) should be triggered? The issue lists ci-fixer, improvement-identifier, and "pause yeti's work" as candidates but picks none.

    • Note: CI re-runs are already handled by GitHub natively. ci-fixer is triggered by check_run.completed which fires after the re-run fails — so the existing flow already covers "new push → CI runs → CI fails → ci-fixer triggers."
    • Suggestion: If the goal is detecting human edits to yeti PRs, the simplest action would be adding a comment or label, not triggering a job.
  3. Should this block issue-worker from starting new work on an issue whose PR was human-edited? This is the "pause" concept — it requires a new per-issue/per-PR state mechanism that doesn't exist today. Is this in scope or out of scope?

  4. What is the "other issue about webhook improvements"? It may contain design decisions that affect this work.

Please respond as a comment on the GitHub issue so the next refinement cycle can incorporate the answers.

Recommendation

Close or convert this to a discussion until the use case is concrete. The current issue has the shape of "we could handle this event" without a clear "and here's the problem it solves." Adding webhook plumbing is easy; designing the right behavior requires knowing what problem we're solving.


Automated evaluation by yeti prompt-evaluator
