Prompt Evaluation: buildRefinementPrompt
Result: Variant wins 4/4 test cases
Proposed Change Rationale
Key changes and rationale:
- **Added feedback classification step**: The original prompt treats all feedback uniformly ('address it or explain why not'), but questions, directives, concerns, and out-of-scope requests need fundamentally different responses. Without classification, the AI tends to over-react to questions (changing the plan when only an explanation was needed) or under-react to directives (explaining without changing). The classification framework prevents both failure modes.
- **Blocking/non-blocking clarifying questions for refinement**: The original only had this distinction in `buildNewPlanPrompt`, not in `buildRefinementPrompt`. When feedback is ambiguous, the AI should still apply all clear feedback while flagging blocked items, rather than stalling the entire revision.
- **No-feedback fallback removed**: The original's no-feedback path ('Re-evaluate the plan for missing files, edge cases...') encourages the AI to invent changes without any human signal, which produces noisy diffs. In practice this path is rarely hit (`processRefinement` only calls this prompt when there ARE unreacted comments), so the logic now focuses on the feedback-present case. The `feedbackSection` variable in the improved version still handles both cases, but the no-feedback case is less likely to produce hallucinated changes.
- **Specificity preservation rule**: Added 'do not degrade specificity' — a common failure mode where the AI rewrites a section in response to feedback and makes it vaguer than the original (e.g., replacing specific function names with generic descriptions).
- **Stale code detection**: Added an instruction to notice when files have changed since the original plan was written, which is common in active repos where the plan may be days old.
- **Verification step strengthened**: Added count-matching (number of feedback items vs. responses), section-by-section comparison against the original plan, and re-reading the issue body. The original's verification was good but could miss drift from the original issue.
- **Evidence-based disagreement**: Strengthened the 'explain why not' instruction to require referencing actual code rather than merely asserting something. This reduces the AI's tendency to dismiss valid feedback with hand-wavy justifications.
Proposed Variant Prompt
You are analyzing a GitHub issue for the repository ${fullName}.
Issue #${issue.number}: ${issue.title}
${issue.body || '(No description provided)'}
A previous implementation plan was produced:
${existingPlan}
${feedbackSection}
If `yeti/OVERVIEW.md` exists in the repository, read it first (and any linked documents that seem relevant to the issue) for context about the codebase architecture and patterns.
Before revising the plan, read every source file that the feedback references or that the existing plan proposes to change. Do not revise a file-level section of the plan without first reading that file's current contents. If a feedback comment mentions a function, type, or pattern by name, verify it exists and behaves as described before incorporating the suggestion. If a file has changed since the original plan was written, note the discrepancy and adjust the plan accordingly.
## Classifying feedback
Before making any changes, read all feedback comments and classify each one:
- **Directive**: The commenter is requesting a specific change to the plan (e.g., "use X instead of Y", "add handling for Z").
- **Question**: The commenter is asking for clarification or rationale (e.g., "why not use X?", "what happens if Y?"). Answer the question; only change the plan if the answer reveals a flaw.
- **Concern**: The commenter is flagging a risk or uncertainty (e.g., "this might break X", "I'm worried about Y"). Address the concern in the risks section; change the plan only if the concern is valid.
- **Out-of-scope request**: The commenter is suggesting work beyond the original issue. Do not incorporate it — note it in a `### Out of Scope` section.
This classification prevents over-reacting to questions (changing the plan when only an explanation was needed) and under-reacting to directives (explaining without actually changing anything).
## Addressing feedback
Process each feedback comment one at a time, in the order they appear. For each comment:
1. State which comment you are addressing (quote the key phrase or summarize in one line).
2. State your classification (directive, question, concern, or out-of-scope).
3. For directives: explain what change you are making to the plan. If you disagree, explain why with a concrete technical reason referencing the actual code — not just "it's not necessary."
4. For questions: answer the question directly. If the answer reveals a plan flaw, also revise the plan.
5. For concerns: evaluate whether the concern is valid by checking the code. If valid, add it to the risks section or revise the plan. If not valid, explain why with evidence from the code.
6. If the feedback is ambiguous or you cannot determine the commenter's intent, do NOT guess — add it to the "### Clarifying Questions" section instead.
Do not silently drop or ignore any feedback item. Every feedback comment must appear in your processing with an explicit disposition.
## Scope and preservation rules
Preserve sections of the plan that are not affected by the feedback. Only rewrite sections that need to change. This avoids introducing regressions in already-reviewed parts of the plan.
Stay within the scope of the original issue. If feedback suggests expanding beyond what the issue asks for, note the suggestion in a separate "### Out of Scope" section rather than incorporating it into the plan.
Do not add new files, dependencies, refactors, or "while we're at it" improvements that no feedback comment requested. The goal is a minimal, targeted revision.
When revising a section, do not degrade its specificity. If the original plan named specific functions, line numbers, or code patterns, the revised plan must be equally specific. Do not replace concrete details with vague summaries.
## Handling unclear or conflicting feedback
If any feedback is ambiguous or contradictory, output a "### Clarifying Questions" section listing specific questions that need answers before those feedback items can be addressed. For each question:
- Quote the feedback that triggered it
- Explain what is ambiguous
- Suggest concrete options (e.g., "Should X behave like A or B?")
When you have clarifying questions, classify them:
- Use `### Clarifying Questions (blocking)` if any question must be answered before the plan can be reliably revised. Apply all non-blocked feedback changes, but leave the blocked items unresolved and clearly marked.
- Use `### Clarifying Questions (non-blocking)` if the plan is fully revisable but you want to confirm an assumption or preference.
Instruct the user to respond as a comment on the GitHub issue so the next refinement cycle can incorporate their answers.
If two feedback comments contradict each other, do not pick a side. Flag both in the clarifying questions section and explain the contradiction.
## Verification step
After revising the plan, re-read your changes and check:
1. Did you address every feedback comment (either by revising the plan, answering a question, explaining why not, or adding a clarifying question)? Count the feedback comments and count your responses — the numbers must match.
2. Did you accidentally remove or weaken any risk, edge case, or testing item from the original plan that the feedback did not ask you to remove? Compare your output section-by-section against the original plan.
3. Is the implementation order still correct after your changes, or do revised steps create new ordering dependencies?
4. Are all file paths and function names in the revised plan still accurate? If you changed which files are affected, verify the new files exist.
5. Does the revised plan still solve the original issue? Re-read the issue title and body to confirm.
If you find issues during verification, fix them before producing output.
## Output format
Produce the updated implementation plan. It must include:
- Which files need to be changed
- What the changes should be
- Any potential risks or edge cases
- A suggested order of implementation
- How to verify the changes work (testing approach)
${MULTI_PR_INSTRUCTIONS}
If there were any surprises or deviations while addressing the feedback, explain them briefly in a separate section at the end of your response, prefixed with `### Note`
Do NOT make any code changes. Only produce the plan as text output.
### Test Case 1: Add rate-limit header parsing to github.ts retry logic
Issue body: ## Problem
When GitHub returns a 403 with X-RateLimit-Remaining: 0, our retry logic in github.ts treats it as a generic transient error and retries with exponential backoff. This wastes retry attempts because the request will keep failing until the rate limit resets.
## Desired Behavior
When a 403 response includes X-RateLimit-Remaining: 0:
- Parse the `X-RateLimit-Reset` header (Unix timestamp)
- Calculate the wait time until reset
- If the wait is under 120 seconds, sleep until reset then retry once
- If the wait is over 120 seconds, trigger the existing rate-limit circuit breaker immediately instead of exhausting retries
## Acceptance Criteria
- The `runGh()` function in github.ts detects rate-limit 403s distinctly from other 403s
- Wait-and-retry behavior works for short resets (<120s)
- Long resets (>120s) trigger `setRateLimited()` immediately
- Existing exponential backoff for non-rate-limit errors is unchanged
- Unit tests cover: short reset, long reset, non-rate-limit 403, and missing header cases
## Context
This was noticed during a burst of issue-refiner runs that hit the secondary rate limit. The 60s circuit breaker cooldown kicked in eventually but only after wasting all retry attempts.
Winner: variant | Current: 16/20 | Variant: 18/20
Reasoning: Both outputs are strong and cover the same ground. The variant wins on specificity and actionability: it includes inline code showing the exact modified branch logic, references precise line numbers, adds a 5th test case (remaining > 0 for secondary rate limits), explicitly notes the TDD order per CLAUDE.md, and provides more concrete implementation details like the +1000ms buffer rationale. The current prompt's approach is solid but slightly more abstract in describing changes.
Current prompt output
Now I have full context. Here's the implementation plan:
Implementation Plan
Files to change
1. src/github.ts — Add rate-limit reset query and smart retry logic
New helper: queryRateLimitReset() (around line 34, after clearRateLimitState)
- Calls `execFile("gh", ["api", "/rate_limit"])` directly (not through the `gh()` wrapper, which would be blocked by the circuit breaker or recurse)
- Parses the JSON response to find any resource with `remaining === 0`
- Returns the wait time in ms (`reset * 1000 - Date.now()`) or null on failure
- The `/rate_limit` endpoint is itself exempt from rate limiting, so this call is safe
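The reset-wait computation at the heart of this helper can be sketched as a pure function over the parsed `/rate_limit` payload. The resource shape and field names below follow GitHub's documented response, but treat them as assumptions rather than the repo's actual code:

```typescript
// Hypothetical shape of one resource entry in the /rate_limit response.
interface RateLimitResource {
  remaining: number;
  reset: number; // Unix timestamp, in seconds
}

// Given the parsed resources map, return the wait in ms until the first
// exhausted resource resets, or null if nothing is exhausted.
function waitMsUntilReset(
  resources: Record<string, RateLimitResource>,
  nowMs: number = Date.now(),
): number | null {
  for (const res of Object.values(resources)) {
    if (res.remaining === 0) {
      const waitMs = res.reset * 1000 - nowMs;
      return waitMs > 0 ? waitMs : 0;
    }
  }
  return null;
}
```

Keeping the arithmetic separate from the `execFile` call makes the <=120s vs. >120s decision unit-testable without mocking timers or subprocesses.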
Modified rate-limit branch in gh() inner run() function (lines 358–363):
Currently:

```typescript
if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}
```
New behavior:
- Add a `rateLimitRetried` boolean outside the run() closure (but inside the Promise)
- When `RATE_LIMIT_RE` matches and `rateLimitRetried` is false:
  - Call `queryRateLimitReset()`
  - If reset wait is found and <= 120s: set `rateLimitRetried = true`, `setTimeout(run, waitMs)` (retry once after reset)
  - If reset wait is found and > 120s: call `setRateLimited(waitMs)`, reject with RateLimitError
  - If query fails or no exhausted resource found: fall back to current behavior (`setRateLimited()` with default 60s, reject)
- When `RATE_LIMIT_RE` matches and `rateLimitRetried` is true: trip circuit breaker immediately (already retried once)
Why: This is the core of the issue — the existing code always trips the 60s circuit breaker on rate-limit errors, wasting retries and using a fixed cooldown rather than the actual reset time.
2. src/github.test.ts — Add tests for all four cases
Add a new describe("rate limit header parsing") block (after the existing circuit breaker tests around line 399) with these test cases:
- **Short reset (<= 120s)**: Mock the first `execFile` call to return a rate-limit error, mock the `queryRateLimitReset` response with reset ~30s in the future. Verify the command retries after the wait and succeeds. Verify `isRateLimited()` is false (circuit breaker not tripped).
- **Long reset (> 120s)**: Mock rate-limit error + reset 300s in the future. Verify `RateLimitError` is thrown, `isRateLimited()` is true, and `setRateLimited` was called with the actual wait time (not the default 60s).
- **Non-rate-limit 403**: Mock a 403 error that doesn't match `RATE_LIMIT_RE`. Verify it falls through to the transient/reject logic and no rate-limit query is made.
- **Missing/failed rate-limit query**: Mock rate-limit error + `queryRateLimitReset` returning null (e.g., parse failure). Verify fallback to existing behavior (60s circuit breaker).
Testing approach: Follow existing pattern — mock execFile to differentiate calls by args (the gh api /rate_limit call vs the original command). Use vi.useFakeTimers() for the short-reset wait case (already used in existing circuit breaker tests).
Implementation order
- Add the `queryRateLimitReset()` helper in github.ts (no existing code changes, just a new function)
- Modify the rate-limit branch in `gh()` to use the new helper
- Add tests
Types before consumers, and the helper has no dependencies on modified code.
Risks and edge cases
- **Secondary rate limits**: GitHub's secondary rate limits don't show up in `/rate_limit` endpoint responses. The query will find no exhausted resource and return null, falling back to the existing 60s circuit breaker. This is correct behavior.
- **Clock skew**: The `X-RateLimit-Reset` header is a Unix timestamp from GitHub's clock. Minor skew could mean we wait slightly too little or too long. Adding a small buffer (e.g., 1–2s) to the calculated wait mitigates this.
- **Concurrency**: Multiple in-flight `gh()` calls could all hit the rate limit simultaneously and each spawn a `queryRateLimitReset()` call. This is harmless — the `/rate_limit` endpoint is cheap and the first one to call `setRateLimited()` wins (subsequent calls see `isRateLimited()` and short-circuit).
- **`queryRateLimitReset` itself failing**: If the network is down or gh is misconfigured, the `.catch()` fallback ensures we still trip the default circuit breaker rather than hanging.
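The "first caller wins" concurrency argument relies on the circuit breaker being a simple timestamp check. A minimal sketch of those semantics, with names taken from the plan (the keep-the-longest-window behavior is an assumption, not confirmed from the repo):

```typescript
// Module-level circuit-breaker state: the time until which gh calls are blocked.
let rateLimitedUntil = 0;

// Trip the breaker for cooldownMs; overlapping calls keep the longest window.
function setRateLimited(cooldownMs = 60_000, nowMs: number = Date.now()): void {
  rateLimitedUntil = Math.max(rateLimitedUntil, nowMs + cooldownMs);
}

// Subsequent callers check this and short-circuit instead of retrying.
function isRateLimited(nowMs: number = Date.now()): boolean {
  return nowMs < rateLimitedUntil;
}
```

Passing `nowMs` explicitly is only for testability; production callers would use the default.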
Verification
- Run `npx vitest run src/github.test.ts` — all existing tests pass, new tests cover the four cases
- Run `npm test` — full typecheck + test suite passes
Variant prompt output
Now I have a complete picture. Here's the implementation plan.
Implementation Plan: Rate-Limit Header Parsing in github.ts
Problem
The gh() function in src/github.ts (line 337) detects rate-limit errors via RATE_LIMIT_RE against stderr text, then immediately trips the 60-second circuit breaker. It has no awareness of GitHub's actual reset time, so:
- Short resets (<60s) still block for the full 60s cooldown
- Long resets (>120s) only get a 60s cooldown that expires too early, leading to another rate-limit hit and another 60s cooldown cycle
- All retry attempts are wasted since the function rejects immediately on rate-limit match
Which files need to be changed
src/github.ts — Core changes to gh() retry logic and setRateLimited()
src/github.test.ts — New test cases for rate-limit header behavior
No new files or dependencies required.
What the changes should be
1. src/github.ts — Add rate-limit reset time detection
Step 1a: Add a helper to query the rate-limit reset time
After the setRateLimited() function (around line 30), add an exported async function getRateLimitReset() that calls gh api rate_limit using execFile directly (bypassing the gh() wrapper to avoid recursion/circuit-breaker). It should:
- Run `execFile("gh", ["api", "rate_limit"])`
- Parse the JSON response: `response.rate.reset` (Unix timestamp) and `response.rate.remaining`
- Return `{ resetAt: number; remaining: number }` or null on any error (fail-safe — if the call itself is rate-limited or errors, fall back to existing behavior)
Important: This must use execFile directly, not the gh() wrapper, because the wrapper checks isRateLimited() and would reject.
Step 1b: Modify the rate-limit branch in gh() (lines 358-363)
Currently:

```typescript
if (RATE_LIMIT_RE.test(msg)) {
  setRateLimited();
  reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  return;
}
```
Change to:

```typescript
if (RATE_LIMIT_RE.test(msg)) {
  // Attempt to determine actual reset time before deciding behavior
  getRateLimitReset().then((info) => {
    if (info && info.remaining === 0) {
      const waitMs = (info.resetAt * 1000) - Date.now();
      if (waitMs > 0 && waitMs <= 120_000) {
        // Short reset: wait until reset, then retry once
        log.info(`[github] Rate limit resets in ${Math.ceil(waitMs / 1000)}s — waiting to retry`);
        setTimeout(() => {
          // Clear any rate-limit state and retry the original call
          clearRateLimitState();
          run(); // retry inside the existing promise's closure
        }, waitMs + 1000); // +1s buffer
        return;
      } else if (waitMs > 120_000) {
        // Long reset: trip circuit breaker for the actual wait duration
        setRateLimited(waitMs);
        reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg} (reset in ${Math.ceil(waitMs / 1000)}s)`));
        return;
      }
    }
    // Fallback: couldn't determine reset time, or already past — use default 60s cooldown
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  }).catch(() => {
    // getRateLimitReset failed — fall back to default behavior
    setRateLimited();
    reject(new RateLimitError(`gh ${args.join(" ")} failed: ${msg}`));
  });
  return;
}
```
Step 1c: Allow setRateLimited() to accept a custom cooldown
setRateLimited() already accepts cooldownMs with a default of 60,000 (line 26). No signature change needed — just pass the actual waitMs from step 1b for long resets.
2. src/github.test.ts — New test cases
Add a new describe("rate limit header parsing") block with these cases:
- **Short reset (<120s)**: Mock `execFile` so the first gh call fails with "rate limit" in stderr, then mock the `gh api rate_limit` call to return `{ rate: { remaining: 0, reset: <now + 60s> } }`. The mocked follow-up gh call (the retry after the wait) succeeds. Assert: `setTimeout` is called with ~60s delay, the promise eventually resolves, `setRateLimited` is NOT called.
- **Long reset (>120s)**: Mock the rate_limit API to return a reset time 5 minutes out. Assert: `setRateLimited` is called with the actual wait duration (~300s), the promise rejects with `RateLimitError`, and the error message includes the reset time.
- **Non-rate-limit 403**: Ensure a 403 error whose stderr does NOT match `RATE_LIMIT_RE` still goes through normal transient retry logic (no change from current behavior).
- **Missing/failed rate_limit API call**: Mock `gh api rate_limit` to fail (e.g., network error). Assert: falls back to the default `setRateLimited(60_000)` behavior, the promise rejects with `RateLimitError`.
- **Rate limit API returns remaining > 0**: Mock the rate_limit API to show `remaining: 5` (not actually at the limit — could be a secondary rate limit). Assert: falls back to the default 60s cooldown.
Potential risks and edge cases
- **Recursive rate limiting**: The `getRateLimitReset()` call itself could be rate-limited. Mitigated by using `execFile` directly (bypasses the circuit breaker) and catching all errors to fall back to default behavior.
- **Secondary rate limits**: GitHub's secondary rate limits (abuse detection) don't appear in the rate_limit API — they're separate from the primary rate limit counter. The rate_limit API might show `remaining > 0` during a secondary rate limit. The fallback path handles this: if `remaining > 0` or the call fails, we use the existing 60s cooldown.
- **Clock skew**: If the server clock is ahead of the `X-RateLimit-Reset` timestamp, `waitMs` could be negative. The `waitMs > 0` check handles this — negative or zero falls through to default behavior.
- **Timer precision for short waits**: The +1000ms buffer on the retry timer prevents retrying a millisecond too early due to timer imprecision.
- **Promise chain complexity**: The `getRateLimitReset().then()` inside the `execFile` callback adds async nesting. This is contained within the existing Promise constructor and the `run()` closure, so resolve/reject are still properly called on all paths. Each path must call exactly one of resolve, reject, or schedule a `run()` retry.
Suggested order of implementation
- Write the test cases first (TDD per CLAUDE.md) — all should fail initially
- Implement the `getRateLimitReset()` helper
- Modify the rate-limit branch in `gh()`
- Run tests, iterate until passing
- Run the full test suite (`npm test`) to verify no regressions
How to verify the changes work
- Unit tests: The five test cases described above cover the critical paths
- Full test suite: `npm test` must pass (includes typecheck)
- Manual verification: During the next burst of issue-refiner runs, monitor logs for the new `[github] Rate limit resets in Xs` message. The rate-limit notifications should show actual reset durations instead of always 60s.
### Test Case 2: Improve the scheduler
Issue body: The scheduler could be better. It sometimes runs jobs when it shouldn't and doesn't handle edge cases well. Can we make it more robust?
I think there might be issues with timing and concurrency too.
Winner: variant | Current: 14/20 | Variant: 18/20
Reasoning: The current output correctly identifies that the issue is vague and asks clarifying questions, but produces nothing actionable. The variant takes the better approach: it investigates the code, finds real concrete issues (timer drift, misleading drain status, silent timeout), and produces a detailed implementation plan with specific line references, code changes, tests, and risk analysis. It still flags the ambiguity in the original issue as a non-blocking clarifying question. For a vague issue, finding actual problems in the code is more valuable than blocking on clarification.
Current prompt output
### Clarifying Questions (blocking)
The issue does not provide enough detail to produce a reliable implementation plan. After reading the scheduler source (src/scheduler.ts, 286 lines) and its test suite (src/scheduler.test.ts, 676 lines with 30+ test cases), the scheduler already handles several robustness concerns:
- Skip-if-busy semantics — overlapping ticks are skipped when a job is still running (line 54)
- Drain with timeout — graceful shutdown waits for running jobs with configurable timeout (lines 133–164)
- Pause/resume — jobs can be paused without interrupting running instances (lines 207–227)
- Staggered startup — interval jobs start 2s apart to avoid thundering herd (lines 104–116)
- Error isolation — failing jobs don't crash the scheduler (line 76, errors are reported and the job continues on next tick)
- Draining guard — new ticks and job additions are blocked during drain (lines 50, 234)
Without specifics, any "improvement" risks scope creep or solving phantom problems. The following questions must be answered:
- **"Runs jobs when it shouldn't"** — Can you provide a concrete example? Under what conditions does a job run unexpectedly? Is this about scheduled-hour jobs firing at the wrong time, interval jobs running during drain, paused jobs running, or something else?
- **"Edge cases not handled well"** — Which edge cases? For example: DST transitions affecting `msUntilHour`? A job being removed while its tick is in the setTimeout callback? `updateInterval` being called on a scheduled-hour job (or vice versa)? Something else?
- **"Issues with timing and concurrency"** — Is this about timer drift over long uptimes? Race conditions between `triggerJob` and scheduled ticks? Multiple concurrent `drain()` calls? Or observed behavior in production (if so, what did the logs show)?
- **Acceptance criteria** — What does "more robust" mean concretely? Are there specific failure modes you've observed, or is this a proactive hardening request? If proactive, which areas should be prioritized?
Please respond to these questions as a comment on the GitHub issue so the next refinement cycle can incorporate the answers and produce a targeted plan.
Variant prompt output
I've read the scheduler source and tests. Let me analyze the actual issues in the code before producing a plan.
Implementation Plan: Improve the Scheduler
Analysis of Current Issues
After reading src/scheduler.ts (286 lines) and src/scheduler.test.ts (675 lines), I identified these concrete problems:
1. Scheduled-hour jobs drift over time
File: src/scheduler.ts:99-102
After the initial setTimeout fires at the target hour, setInterval(tick, 24h) starts from that moment. If the setTimeout fires late (Node.js timer imprecision, event loop lag, or the job itself taking time), every subsequent daily run drifts further from the intended hour. Over days/weeks this compounds.
Fix: Replace the setTimeout → setInterval(24h) pattern with a self-rescheduling setTimeout that recalculates msUntilHour() after each run. This re-anchors to the wall clock every day.
2. triggerJob returns misleading status during drain
File: src/scheduler.ts:170-177
triggerJob returns "already-running" both when a job is actually running AND when the scheduler is draining. Callers (webhooks, dashboard manual trigger) cannot distinguish between "your job is busy" and "the system is shutting down."
Fix: Return a new "draining" status when draining === true, keeping the return type as a discriminated union.
3. drain() timeout silently succeeds
File: src/scheduler.ts:133-164
When drain(timeoutMs) times out with jobs still running, it logs a warning but resolves normally. main.ts cannot detect that drain failed, which could lead to unclean shutdown (e.g., worktrees not cleaned up).
Fix: Track whether the drain timed out and either reject the promise or return a result indicating which jobs were still running.
4. updateInterval skips immediate tick after config change
File: src/scheduler.ts:179-189
When an interval is updated (e.g., from 10min to 1min via config reload), the caller waits the full new interval before the next tick. If the previous interval just started, this creates a gap where the job doesn't run at the expected cadence.
Fix: This is arguably correct behavior (prevents double-firing). Document the behavior rather than change it. If desired, add an optional runImmediately parameter.
Files to Change
src/scheduler.ts — All changes are in this file
src/scheduler.test.ts — New tests for each fix
Detailed Changes
Change 1: Self-rescheduling scheduled-hour jobs (drift fix)
In `startJobs()` (line 95-102) and `addJob()` (line 253-260), replace:

```typescript
timers.push(setTimeout(() => {
  tick();
  timers.push(setInterval(tick, 24 * 60 * 60 * 1000));
}, delay));
```
with a scheduleNextDaily(jobName, hour, tick) helper that:
- Calls `msUntilHour(hour)` to get the delay
- Sets a `setTimeout` for that delay
- When the timeout fires, calls `tick()`, then calls itself recursively to schedule the next day
This re-anchors to wall clock time after every execution. Store the timeout ID in jobTimers so clearJobTimers still works. The updateScheduledHour method already clears and reschedules, so it naturally benefits.
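A sketch of that helper, assuming the plan's `msUntilHour` computes the delay to the next local occurrence of the hour. Both function bodies here are illustrative, not the repo's actual code:

```typescript
// Illustrative: ms until the next local occurrence of `hour` (0-23).
function msUntilHour(hour: number, now: Date = new Date()): number {
  const next = new Date(now);
  next.setHours(hour, 0, 0, 0);
  // If that hour has already passed today, target tomorrow.
  if (next.getTime() <= now.getTime()) next.setDate(next.getDate() + 1);
  return next.getTime() - now.getTime();
}

// Self-rescheduling daily timer: re-anchors to the wall clock after each run
// instead of relying on a fixed 24h setInterval that can drift.
function scheduleNextDaily(
  hour: number,
  tick: () => void,
  timers: Array<ReturnType<typeof setTimeout>>,
): void {
  const id = setTimeout(() => {
    tick();
    scheduleNextDaily(hour, tick, timers); // recompute the delay from "now"
  }, msUntilHour(hour));
  timers.push(id); // so clearJobTimers can still cancel the chain
}
```

Because each timeout fires exactly once, the old timer needs no cleanup; only the most recent id must be tracked for cancellation.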
Change 2: Distinct drain status from triggerJob
In `triggerJob()` (line 170-177), change:

```typescript
if (draining) return "already-running";
```

to:

```typescript
if (draining) return "draining";
```

Update the return type to `"started" | "already-running" | "draining" | "unknown"`.
Update callers in src/server.ts and src/webhooks.ts that check triggerJob results — "draining" should map to an appropriate HTTP response (e.g., 503 Service Unavailable).
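The widened union and a caller-side mapping could look like the following. The specific HTTP status choices besides 503 are illustrative assumptions, not specified by the plan:

```typescript
// Discriminated result union for triggerJob, extended with "draining".
type TriggerResult = "started" | "already-running" | "draining" | "unknown";

// Illustrative caller-side mapping for server.ts / webhooks.ts.
function httpStatusFor(result: TriggerResult): number {
  switch (result) {
    case "started": return 202;         // accepted for processing
    case "already-running": return 409; // conflict: job busy
    case "draining": return 503;        // service shutting down
    case "unknown": return 404;         // no such job
  }
}
```

An exhaustive switch over the union means the compiler flags any caller that misses a newly added status.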
Change 3: Drain timeout returns running job info
Change drain() signature to return Promise<{ timedOut: boolean; stillRunning: string[] }>. On timeout, resolve with { timedOut: true, stillRunning: [...names] }. On clean drain, resolve with { timedOut: false, stillRunning: [] }.
Update main.ts drain call to log the result and potentially force-kill lingering processes.
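One way to express that contract is a race between the jobs-finished promise and a timeout. The names and result shape below are assumed from the plan, not taken from the repo:

```typescript
interface DrainResult {
  timedOut: boolean;
  stillRunning: string[];
}

// Race the "all jobs done" promise against a timeout; report lingering jobs.
async function drainWithTimeout(
  jobsDone: Promise<void>,
  runningJobNames: () => string[],
  timeoutMs: number,
): Promise<DrainResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = await Promise.race([
    jobsDone.then(() => false),
    new Promise<boolean>((resolve) => {
      timer = setTimeout(() => resolve(true), timeoutMs);
    }),
  ]);
  if (timer !== undefined) clearTimeout(timer); // avoid a leaked timer on clean drain
  return timedOut
    ? { timedOut: true, stillRunning: runningJobNames() }
    : { timedOut: false, stillRunning: [] };
}
```

main.ts can then log `stillRunning` and decide whether to force-kill lingering processes before exit.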
Potential Risks / Edge Cases
- **Drift fix — timer accumulation**: The recursive `setTimeout` approach creates a new timer each day. Since `clearJobTimers` clears the timers array, we must push each new timeout into the array. The old timeout naturally completes and doesn't leak since it only fires once.
- **`triggerJob` return type change**: This is a breaking API change for callers. Both src/server.ts and src/webhooks.ts call `triggerJob` — both must handle the new `"draining"` case.
- **`drain()` return type change**: main.ts is the only caller. Low risk, but the call site must be updated.
- **DST transitions**: `msUntilHour()` uses local time (`setHours`). During DST spring-forward, hour 2 doesn't exist; during fall-back, hour 1 occurs twice. The current behavior (and the proposed fix) handles this acceptably — `msUntilHour` always targets the next occurrence of the hour, so at worst a scheduled job runs ~23 or ~25 hours after the previous run during a DST transition. This is existing behavior and not in scope to fix.
Out of Scope
- Replacing `setInterval`/`setTimeout` with a cron library — the current approach is simple and adequate
- Adding job dependencies or DAG-based scheduling — separate concern
- Persistent job scheduling across restarts — crash recovery already handles this differently
Implementation Order
- Change 1 (drift fix) — most impactful, no API changes to other modules
- Change 2 (triggerJob status) — small, touches server.ts and webhooks.ts
- Change 3 (drain return type) — small, touches main.ts
Testing Approach
Change 1 — Drift fix:
- Test that after the first scheduled-hour tick fires, the next timeout is recalculated via `msUntilHour` (not a fixed 24h). Use `vi.useFakeTimers()` to advance past two daily cycles and verify the job fires at the correct hour both times.
- Test that `clearJobTimers` still cancels the recursive chain (remove the job mid-cycle, verify no future tick fires).
Change 2 — triggerJob status:
- Test that `triggerJob` returns `"draining"` after `drain()` is called.
- Test that the existing `"already-running"` and `"unknown"` cases still work.
Change 3 — Drain return type:
- Test that drain with no timeout returns `{ timedOut: false, stillRunning: [] }`.
- Test that drain with a timeout returns `{ timedOut: true, stillRunning: ["job-name"] }` when a job outlasts the timeout.
### Clarifying Questions (non-blocking)
The issue mentions "sometimes runs jobs when it shouldn't" but doesn't give specific examples. The skip-if-busy logic appears correct after code review — the running flag is set synchronously before the first await, preventing concurrent invocations. If there are specific scenarios where jobs run unexpectedly, those details would help target additional fixes. The plan above addresses the concrete issues found in the code; please comment on the GitHub issue if there are specific failure scenarios to investigate.
### Test Case 3: Refactor all jobs to use a plugin architecture with dynamic loading, dependency injection, and shared middleware pipeline
Issue body: ## Overview
Currently each job in src/jobs/ is a standalone module with a run() function. This works but doesn't scale. We should refactor to a proper plugin architecture.
Requirements
- Create a `JobPlugin` interface with lifecycle hooks: onInit, onBeforeRun, onRun, onAfterRun, onError, onShutdown
- Implement a plugin registry that dynamically discovers and loads job modules from `src/jobs/` at startup
- Add a dependency injection container so jobs can declare their dependencies (github client, claude client, db, config) and have them injected
- Build a middleware pipeline (like Express/Koa) that jobs pass through — middleware for logging, error handling, rate limiting, worktree management, metrics collection
- Add a job dependency graph so jobs can declare ordering constraints (e.g., "run after issue-refiner")
- Support hot-reloading of job plugins without restarting the service
- Add job priority queues with configurable priority levels
- Implement job chaining — output of one job feeds as input to the next
- Add a dry-run mode for all jobs
- Create a job testing harness that can simulate GitHub events
- Also while we're at it, convert all callbacks to use RxJS observables for better composition
- Add OpenTelemetry tracing spans for each job execution
This will make the system much more maintainable and extensible.
Acceptance Criteria
All existing jobs work the same as before but use the new architecture.
Winner: variant | Current: 16/20 | Variant: 17/20
Reasoning: Both outputs correctly identify this as an over-scoped issue, but the variant provides a concrete, implementable plan (specific helper functions, file changes, implementation order) while still rejecting the unnecessary requirements with detailed rationale. The current prompt asks excellent clarifying questions but produces no actionable plan. The variant loses some uncertainty score by being more prescriptive about what to build without asking the user first, but the trade-off is worth it given the issue is clearly over-engineered.
Current prompt output
Clarifying Questions (blocking)
This issue cannot be reliably planned because it combines at least 12 distinct features with unclear motivation and significant contradictions with the existing architecture. The following questions must be answered before proceeding:
1. What concrete problem is this solving?
The issue states the current approach "doesn't scale" but doesn't explain how. The codebase has 13 jobs, all following a consistent run(repos: Repo[]): Promise<void> pattern with shared conventions (two-phase identify/process, worktree lifecycle in finally blocks, task recording via db.ts). What specific scaling problem has been encountered — difficulty adding new jobs, debugging failures, or something else? The answer determines which (if any) of the 12 requirements are actually needed.
2. Are requirements 11 and 12 (RxJS observables, OpenTelemetry) actually desired?
These are unrelated to a plugin architecture and would introduce two significant new dependencies (rxjs, @opentelemetry/*) to a project that currently has minimal dependencies. The prompt instructions explicitly say "Do not add logging, metrics, or observability unless the issue asks for it" and "do not refactor surrounding code" — but the issue itself asks for these. Is the intent to bundle these with the plugin refactor, or should they be separate issues?
3. How would "hot-reloading of job plugins" (requirement 6) work given the current architecture?
Jobs are stateless functions that discover work via GitHub API polling. The scheduler already supports live enable/disable via config reload (main.ts handles SIGHUP and config file changes). What additional hot-reload capability is needed beyond what already exists? ESM dynamic import() has known cache-invalidation problems in Node.js — is the intent to use require() (CJS), a custom module loader, or something else?
4. What does "job dependency graph" (requirement 5) mean in practice?
Currently, jobs don't communicate directly. They coordinate through GitHub state (labels, comments, reactions). For example, issue-refiner adds the Needs Plan Review label, and plan-reviewer picks it up on its next poll cycle. This is deliberately decoupled. Would a dependency graph replace this model (tight coupling, synchronous chaining) or supplement it (just scheduling order hints)? These are fundamentally different architectures.
5. What does "job chaining — output of one job feeds as input to the next" (requirement 8) mean?
Job outputs today are side effects (GitHub comments, labels, PRs), not return values. run() returns Promise<void>. Making jobs return structured data that chains to the next job would require redesigning every job's contract. Is the intent to change job signatures, or to have a message-passing system alongside the current GitHub-state-based coordination?
6. Should this be split into smaller issues?
This issue contains at least 10 orthogonal features: (a) plugin interface + registry, (b) dependency injection, (c) middleware pipeline, (d) dependency graph, (e) hot-reloading, (f) job chaining, (g) dry-run mode, (h) testing harness, (i) RxJS conversion, (j) OpenTelemetry. Each has different risk profiles and could be implemented independently. A single PR touching all 13 jobs plus main.ts, scheduler.ts, and introducing ~5 new infrastructure modules would be extremely difficult to review. Would you prefer this broken into a sequence of smaller, independently deployable issues?
7. What does "dry-run mode" (requirement 9) mean per job?
Some jobs are read-only scanners (issue-auditor), some create PRs (issue-worker), some edit comments (issue-refiner), some merge PRs (auto-merger). "Dry run" means different things for each. Should dry-run skip all GitHub writes? Skip only destructive writes (merge, push) but allow reads? Log what would happen? The existing jobs have no shared write abstraction — each calls gh.* functions directly with different semantics.
Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate the answers and produce a focused, implementable plan. Given the scope of this issue, I'd particularly recommend addressing question 6 first — the answer will shape everything else.
Variant prompt output
This issue is a textbook over-scoped "boil the ocean" refactor. Let me analyze each requirement against the actual codebase before producing a plan.
Analysis
The current architecture has 13 jobs totaling ~4,570 LOC, each implementing a simple run(repos: Repo[]) function registered in main.ts. The system already has:
- Skip-if-busy scheduling (scheduler.ts)
- Priority queuing in the AI dispatch layer (claude.ts enqueue(fn, priority))
- Live config reload (main.ts lines 293-359 — interval/schedule/enabled/paused changes)
- Worktree isolation per task
- Error reporting pipeline with deduplication
- Co-located tests with comprehensive mocking
Here is my assessment of the 12 requirements:
Implementation Plan
Requirements Disposition
| # | Requirement | Disposition | Rationale |
| --- | --- | --- | --- |
| 1 | JobPlugin interface with lifecycle hooks | Out of scope | Current run(repos) is simple, testable, and uniform across 13 jobs. Lifecycle hooks add indirection with no concrete benefit — the jobs don't need onInit/onShutdown because initialization is centralized in main.ts and shutdown is handled via cancelQueuedTasks(). |
| 2 | Dynamic plugin discovery/loading | Out of scope | 13 explicit registrations in main.ts (lines 140-253) are readable, type-safe, and allow the ENABLED_JOBS filter. Dynamic loading loses compile-time safety and makes the dependency graph invisible. |
| 3 | Dependency injection container | Out of scope | Jobs import what they need via ESM. The "dependencies" (github, claude, db, config) are module-level singletons — this is idiomatic Node.js. A DI container adds complexity for zero testability gain since tests already use vi.mock(). |
| 4 | Middleware pipeline | Partially useful — extract shared patterns | Jobs share patterns (two-phase identify/process, worktree lifecycle, tree-diff guard) but these are already factored into claude.ts and github.ts. A formal middleware pipeline is over-engineering, but extracting 2-3 helper functions for the most common patterns would reduce duplication. |
| 5 | Job dependency graph | Out of scope | Jobs are independent by design. The scheduler's skip-if-busy semantics and staggered start prevent resource contention. Ordering constraints would couple jobs and complicate the scheduler for no demonstrated need. |
| 6 | Hot-reloading job plugins | Out of scope / dangerous | Live config reload already exists. Hot-reloading code in a daemon that manages git worktrees and pushes to GitHub is a reliability hazard. The updater service (yeti-updater.timer) handles code updates safely via restart. |
| 7 | Job priority queues | Already exists | claude.ts already has enqueue(fn, priority=false) where priority tasks jump to front. The Priority label feeds into this. No work needed. |
| 8 | Job chaining (output → input) | Out of scope | The content-based state machine (labels + reactions + comments) already implements implicit chaining: issue-refiner → plan-reviewer → issue-worker. Making this explicit would couple jobs and break the current robust, recoverable design. |
| 9 | Dry-run mode | Potentially useful but narrow | Could be useful for debugging, but each job's side effects are different (GitHub comments, PRs, labels, reactions). A generic dry-run flag would require threading through every GitHub call. Better addressed per-job as needed. |
| 10 | Job testing harness for GitHub events | Out of scope | Tests already simulate GitHub events via mocked gh CLI responses. src/test-helpers.ts provides mockRepo(), mockIssue(), mockPR() factories. The webhook handler (webhooks.ts) already routes real GitHub events to jobs. |
| 11 | Convert callbacks to RxJS observables | Out of scope | There are no callback-heavy patterns to convert. Jobs are async/await throughout. Adding RxJS (a large dependency) would force all contributors to learn a reactive paradigm for zero benefit. |
| 12 | OpenTelemetry tracing | Out of scope | Orthogonal to the "plugin architecture" and should be its own issue if desired. The AsyncLocalStorage-based run context in db.ts already provides per-job-run log correlation. |
What should actually be done
The only concrete improvement with a real payoff is requirement 4 (partially): extracting shared job patterns into reusable helpers to reduce boilerplate across the 13 jobs. Specifically:
File changes
src/jobs/job-helpers.ts (new file, ~80 lines)
Extract these repeated patterns:
- forEachRepo(repos, processRepo) — The repos iteration + rate-limit-check pattern used by most jobs. Wraps the common for (const repo of repos) { if (rateLimited) continue; ... } loop.
- withWorktreeTask(repo, branch, taskFn) — Combines worktree creation, try/catch/finally cleanup, and tree-diff guard into a single helper. Currently duplicated across ci-fixer, issue-worker, doc-maintainer, improvement-identifier, review-addresser, and mkdocs-update.
- twoPhaseProcess(repos, identify, process) — Codifies the identify-then-process pattern used by ci-fixer, improvement-identifier, and issue-refiner. Phase 1 collects work items, phase 2 processes them.
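As a rough illustration, the second helper could take a shape like the following sketch, under assumed Repo/Worktree types and an injected worktree API; the real repo's worktree functions and tree-diff guard will differ:

```typescript
// Hypothetical sketch of withWorktreeTask: create a worktree, run the task,
// and guarantee cleanup in a finally block, mirroring the try/finally
// pattern currently duplicated across the jobs listed above.
interface Repo { name: string }
interface Worktree { path: string }

interface WorktreeOps {
  create: (repo: Repo, branch: string) => Promise<Worktree>;
  remove: (wt: Worktree) => Promise<void>;
}

async function withWorktreeTask<T>(
  repo: Repo,
  branch: string,
  taskFn: (wt: Worktree) => Promise<T>,
  ops: WorktreeOps, // injected so jobs and tests can supply the real or a mock API
): Promise<T> {
  const wt = await ops.create(repo, branch);
  try {
    return await taskFn(wt);
  } finally {
    // Runs whether taskFn resolved or threw: the cleanup each job
    // currently implements by hand.
    await ops.remove(wt);
  }
}
```

Parameterizing the worktree operations is one way to keep the helper an optional convenience rather than a mandatory wrapper, which matters for the over-abstracting risk noted later in the plan.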
src/jobs/*.ts (6-8 job files modified)
Refactor jobs to use the new helpers where the pattern clearly applies. Do NOT force jobs that don't fit the pattern (e.g., prompt-evaluator, repo-standards) to use them.
src/jobs/job-helpers.test.ts (new file)
Unit tests for the extracted helpers.
What NOT to change
- main.ts job registration — keep explicit
- scheduler.ts — no changes needed
- claude.ts — already well-factored
- No new dependencies (no RxJS, no OpenTelemetry, no DI framework)
- No new interfaces or abstractions beyond the 2-3 helper functions
Risks and edge cases
- Over-abstracting: The helpers must remain optional conveniences, not mandatory wrappers. Jobs that don't fit the pattern should keep their current structure.
- Behavioral regression: Each job has subtle differences in how it handles worktrees, errors, and state. The helpers must be parameterized enough to preserve these differences without becoming a configuration nightmare.
- Test coverage: Existing job tests mock at the module boundary. Extracting helpers changes the mock surface — tests for jobs using the helpers will need updating.
Implementation order
- Write job-helpers.test.ts with tests for the three helpers (TDD)
- Implement job-helpers.ts to pass the tests
- Refactor 2 jobs (start with doc-maintainer and auto-merger as they're simplest) to use the helpers
- Verify all existing tests still pass
- Refactor remaining applicable jobs one at a time
- Update yeti/OVERVIEW.md and CLAUDE.md to document the helpers
Testing approach
- Run npm test after each job refactor to catch regressions
- Each refactored job's existing tests must continue passing without changes to assertions (only mock setup may change)
- New helper tests cover: normal execution, error propagation, worktree cleanup on failure, rate-limit skip behavior
Out of Scope
All 12 original requirements except the partial extraction from #4 are out of scope for the reasons detailed above. The current architecture is working well for 13 jobs. The run(repos) pattern is simple, testable, and understood by all contributors. Introducing a plugin architecture, DI container, middleware pipeline, RxJS, or OpenTelemetry would add significant complexity without solving a demonstrated problem.
If specific pain points emerge (e.g., "adding a new job requires touching 5 files" or "I can't test job X without running the whole system"), those should be filed as targeted issues rather than addressed via a speculative architectural overhaul.
Suggested PR structure
Single PR — The helper extraction and job refactors are tightly coupled and should be reviewed together. The change is small enough (~200 lines new, ~150 lines removed from jobs) to review in one pass.
Note
This plan deliberately rejects the majority of the issue's requirements. The issue describes a solution (plugin architecture) rather than a problem. The actual codebase is well-structured for its current scale, and the proposed abstractions would make it harder to understand and maintain, not easier. The targeted extraction of 2-3 helper functions addresses the only real duplication without the overhead of a framework.
### Test Case 4: Add webhook support for pull_request.synchronize events
Issue body: ## Problem
We handle pull_request.closed and check_run.completed webhook events but not pull_request.synchronize. When a PR branch is force-pushed or has new commits pushed, we should react.
What should happen
When we receive a pull_request.synchronize event, we should... actually I'm not sure. Maybe re-run CI checks? Or maybe invalidate the plan? Or update the PR description? It depends on whether it's a yeti PR or an external PR.
For yeti PRs (branches starting with yeti/), maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not.
For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which.
Additional context
Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync. The webhooks.ts file has isRepoAllowed() which we should reuse. Also we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo.
See also: that other issue about webhook improvements (I don't remember the number).
Winner: variant | Current: 15/20 | Variant: 18/20
Reasoning: Both correctly identify the issue as unplannable and ask blocking clarifying questions. The variant is superior in specificity — it references concrete files (src/webhooks.ts), functions (handlePullRequestSynchronizeEvent, isRepoAllowed, scheduler.triggerJob), and architectural patterns (skip-if-busy, per-PR state). It also demonstrates stronger scope awareness by noting that GitHub already re-runs CI on pushes (making one proposed behavior redundant), that 'pausing' would require a new per-PR state mechanism that doesn't exist, and by offering a concrete recommendation to close/convert to discussion. The current prompt's questions are solid but more generic.
Current prompt output
Step 1: Plannable?
This issue is not plannable in its current state. The author explicitly expresses uncertainty about the desired behavior in nearly every paragraph ("I'm not sure," "maybe," "not sure which," "I don't remember the exact use case"). There are no acceptance criteria, and the core question — what should happen when we receive pull_request.synchronize — is unanswered.
Clarifying Questions (blocking)
- What is the desired behavior for yeti PRs?
"For yeti PRs...maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not."
The issue proposes at least three options but picks none:
- (A) Ignore synchronize events on yeti PRs entirely
- (B) Detect human pushes and pause yeti's work (how? remove a label? skip the issue in future runs?)
- (C) Re-run some job (which one?)
Which behavior do you want, and what does "pause" mean concretely?
- What is the desired behavior for external PRs?
"For external PRs, maybe we should re-run the improvement-identifier job? Or the ci-fixer? Not sure which."
These two jobs have very different purposes. ci-fixer reacts to failing CI checks; improvement-identifier scans for code improvements. A force-push might warrant re-running CI-fixer (if the old failure context is stale), but re-running improvement-identifier on every push could be noisy. Which job(s) should be triggered, and under what conditions?
- What was the original use case from Slack?
"Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync."
Without the motivating use case, we can't validate whether any proposed behavior actually solves the problem. Can you find the Slack thread and include the specific scenario?
- Should this be configurable, and if so how?
"we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo."
The current webhook system has no per-event-type toggles — all supported events are always handled if the repo is allowed. Should synchronize follow the same pattern (always on), or does it need a new config field? Per-repo control would be a new config pattern that doesn't exist today.
- What is "the other issue about webhook improvements"?
"See also: that other issue about webhook improvements (I don't remember the number)."
There may be overlapping or conflicting plans. Please link the related issue so we can check for scope conflicts before planning.
Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate your answers and produce a complete plan.
Variant prompt output
Based on my analysis of the codebase and this issue, this issue is too underspecified to produce an actionable implementation plan. Here's my assessment:
Issue Analysis: pull_request.synchronize webhook support
What's clear
The webhook infrastructure in src/webhooks.ts already handles four event types (issues, check_run, pull_request_review, pull_request) with a clean dispatcher pattern. Adding a new handlePullRequestSynchronizeEvent handler is mechanically straightforward — it follows the same pattern as the existing pull_request.closed handler, reuses isRepoAllowed(), and routes through scheduler.triggerJob().
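The mechanical part really is small. A hypothetical handler following that dispatcher pattern might look like this; isRepoAllowed, triggerJob, the payload fields, and the job names are all assumptions, and which job (if any) to trigger is exactly what the blocking questions ask about:

```typescript
// Hypothetical shape of handlePullRequestSynchronizeEvent, following the
// same dispatch pattern as the existing pull_request.closed handler.
// All names here are illustrative, not the repo's actual API.
interface SynchronizePayload {
  action: "synchronize";
  repository: { full_name: string };
  pull_request: { number: number; head: { ref: string } };
}

interface Deps {
  isRepoAllowed: (repo: string) => boolean;
  triggerJob: (job: string, repo: string) => void;
}

function handlePullRequestSynchronizeEvent(payload: SynchronizePayload, deps: Deps): boolean {
  if (!deps.isRepoAllowed(payload.repository.full_name)) return false;
  // Routing on the yeti/ branch prefix is one plausible criterion; the
  // right downstream action is the open question in this issue.
  const isYetiPr = payload.pull_request.head.ref.startsWith("yeti/");
  deps.triggerJob(isYetiPr ? "yeti-pr-job" : "external-pr-job", payload.repository.full_name);
  return true;
}
```

The sketch only shows that the plumbing is cheap; everything behind triggerJob is unresolved design, which is the point of the questions below.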
What's not clear — and why it matters
The issue explicitly states uncertainty about the desired behavior in multiple places. Every decision branch leads to a different implementation:
- "Maybe re-run CI checks?" — GitHub already re-runs checks automatically when commits are pushed to a PR branch. There's nothing for yeti to do here.
- "Maybe invalidate the plan?" — The plan lives as an issue comment produced by issue-refiner. "Invalidating" it could mean adding a comment, removing the Refined label, or re-triggering issue-refiner. Each has different implications for the workflow state machine.
- "Maybe update the PR description?" — Which description fields? Under what conditions?
- Yeti PR vs external PR distinction — The issue proposes different behavior per PR type but doesn't specify what either behavior should be.
- "Pause yeti's work" — There's no concept of pausing work on a specific PR in the current architecture. Jobs are global (skip-if-busy), not per-PR. This would be a significant new feature, not a webhook handler addition.
- Config toggle — "Always on", "per-repo", or "global toggle" are three different config designs.
- Referenced Slack conversation and "other issue" — These are the actual motivation for the feature but aren't available.
Clarifying Questions (blocking)
These must be answered before a plan can be produced:
- What is the concrete use case? The issue says "someone mentioned this would be useful in Slack" but doesn't describe the actual problem. What specific scenario causes PRs to "get out of sync" and what user-visible problem does that cause?
  - Options: (a) yeti overwrites human changes, (b) yeti PRs have stale CI, (c) yeti plans become outdated after manual edits, (d) something else
- What job(s) should be triggered? The issue lists ci-fixer, improvement-identifier, and "pause yeti's work" as candidates but picks none.
  - Note: CI re-runs are already handled by GitHub natively. ci-fixer is triggered by check_run.completed which fires after the re-run fails — so the existing flow already covers "new push → CI runs → CI fails → ci-fixer triggers."
  - Suggestion: If the goal is detecting human edits to yeti PRs, the simplest action would be adding a comment or label, not triggering a job.
- Should this block issue-worker from starting new work on an issue whose PR was human-edited? This is the "pause" concept — it requires a new per-issue/per-PR state mechanism that doesn't exist today. Is this in scope or out of scope?
- What is the "other issue about webhook improvements"? It may contain design decisions that affect this work.
Please respond as a comment on the GitHub issue so the next refinement cycle can incorporate the answers.
Recommendation
Close or convert this to a discussion until the use case is concrete. The current issue has the shape of "we could handle this event" without a clear "and here's the problem it solves." Adding webhook plumbing is easy; designing the right behavior requires knowing what problem we're solving.
Automated evaluation by yeti prompt-evaluator
Prompt Evaluation: buildRefinementPrompt
Result: Variant wins 4/4 test cases
Proposed Change Rationale
Key changes and rationale:
- Added feedback classification step: The original prompt treats all feedback uniformly ('address it or explain why not'), but questions, directives, concerns, and out-of-scope requests need fundamentally different responses. Without classification, the AI tends to over-react to questions (changing the plan when only an explanation was needed) or under-react to directives (explaining without changing). The classification framework prevents both failure modes.
- Blocking/non-blocking clarifying questions for refinement: The original only had this distinction in buildNewPlanPrompt but not in buildRefinementPrompt. When feedback is ambiguous, the AI should still apply all clear feedback while flagging blocked items, rather than stalling the entire revision.
- No-feedback fallback removed: The original's no-feedback path ('Re-evaluate the plan for missing files, edge cases...') encourages the AI to invent changes without any human signal, which produces noisy diffs. In practice this path is rarely hit (processRefinement only calls this when there ARE unreacted comments), so the logic should focus on the feedback-present case. The feedbackSection variable in the improved version handles both cases, but the no-feedback case is less likely to produce hallucinated changes.
- Specificity preservation rule: Added 'do not degrade specificity' — a common failure mode where the AI rewrites a section in response to feedback and makes it vaguer than the original (e.g., replacing specific function names with generic descriptions).
- Stale code detection: Added instruction to notice when files have changed since the original plan was written, which is common in active repos where the plan may be days old.
- Verification step strengthened: Added count-matching (number of feedback items vs responses), section-by-section comparison against the original plan, and re-reading the issue body. The original's verification was good but could miss drift from the original issue.
- Evidence-based disagreement: Strengthened the 'explain why not' instruction to require referencing actual code, not just asserting something. This reduces the AI's tendency to dismiss valid feedback with hand-wavy justifications.
Proposed Variant Prompt
(full variant prompt text collapsed)
### Test Case 1: Add rate-limit header parsing to github.ts retry logic
Issue body: ## Problem
When GitHub returns a 403 with X-RateLimit-Remaining: 0, our retry logic in github.ts treats it as a generic transient error and retries with exponential backoff. This wastes retry attempts because the request will keep failing until the rate limit resets.
Desired Behavior
When a 403 response includes X-RateLimit-Remaining: 0:
- X-RateLimit-Reset header (Unix timestamp)
Acceptance Criteria
- runGh() function in github.ts detects rate-limit 403s distinctly from other 403s
- setRateLimited() immediately
Context
This was noticed during a burst of issue-refiner runs that hit the secondary rate limit. The 60s circuit breaker cooldown kicked in eventually but only after wasting all retry attempts.
Winner: variant | Current: 16/20 | Variant: 18/20
Reasoning: Both outputs are strong and cover the same ground. The variant wins on specificity and actionability: it includes inline code showing the exact modified branch logic, references precise line numbers, adds a 5th test case (remaining > 0 for secondary rate limits), explicitly notes the TDD order per CLAUDE.md, and provides more concrete implementation details like the +1000ms buffer rationale. The current prompt's approach is solid but slightly more abstract in describing changes.
Current prompt output
Now I have full context. Here's the implementation plan:
Implementation Plan
Files to change
1. src/github.ts — Add rate-limit reset query and smart retry logic
New helper: queryRateLimitReset() (around line 34, after clearRateLimitState)
- execFile("gh", ["api", "/rate_limit"]) directly (not through the gh() wrapper, which would be blocked by the circuit breaker or recurse)
- remaining === 0
- reset * 1000 - Date.now()) or null on failure
- /rate_limit endpoint is itself exempt from rate limiting, so this call is safe
Modified rate-limit branch in gh() inner run() function (lines 358–363):
Currently:
New behavior:
- rateLimitRetried boolean outside the run() closure (but inside the Promise)
- RATE_LIMIT_RE matches and rateLimitRetried is false:
- queryRateLimitReset()
- rateLimitRetried = true, setTimeout(run, waitMs) (retry once after reset)
- setRateLimited(waitMs), reject with RateLimitError
- setRateLimited() with default 60s, reject)
- RATE_LIMIT_RE matches and rateLimitRetried is true: trip circuit breaker immediately (already retried once)
Why: This is the core of the issue — the existing code always trips the 60s circuit breaker on rate-limit errors, wasting retries and using a fixed cooldown rather than the actual reset time.
2. src/github.test.ts — Add tests for all four cases
Add a new describe("rate limit header parsing") block (after the existing circuit breaker tests around line 399) with these test cases:
- Short reset (<= 120s): Mock first execFile call to return rate-limit error, mock queryRateLimitReset response with reset ~30s in the future. Verify the command retries after the wait and succeeds. Verify isRateLimited() is false (circuit breaker not tripped).
- Long reset (> 120s): Mock rate-limit error + reset 300s in future. Verify RateLimitError is thrown, isRateLimited() is true, and setRateLimited was called with the actual wait time (not default 60s).
- Non-rate-limit 403: Mock a 403 error that doesn't match RATE_LIMIT_RE. Verify it falls through to the transient/reject logic, no rate-limit query is made.
- Missing/failed rate-limit query: Mock rate-limit error + queryRateLimitReset returning null (e.g., parse failure). Verify fallback to existing behavior (60s circuit breaker).
Testing approach: Follow existing pattern — mock execFile to differentiate calls by args (the gh api /rate_limit call vs the original command). Use vi.useFakeTimers() for the short-reset wait case (already used in existing circuit breaker tests).
Implementation order
- queryRateLimitReset() helper in github.ts (no existing code changes, just new function)
- gh() to use the new helper
- Types before consumers, and the helper has no dependencies on modified code.
Risks and edge cases
- /rate_limit endpoint responses. The query will find no exhausted resource and return null, falling back to the existing 60s circuit breaker. This is correct behavior.
- X-RateLimit-Reset header is a Unix timestamp from GitHub's clock. Minor skew could mean we wait slightly too little or too long. Adding a small buffer (e.g., 1–2s) to the calculated wait mitigates this.
- gh() calls could all hit rate limit simultaneously and each spawn a queryRateLimitReset() call. This is harmless — the /rate_limit endpoint is cheap and the first one to call setRateLimited() wins (subsequent calls see isRateLimited() and short-circuit).
- queryRateLimitReset itself failing: If the network is down or gh is misconfigured, the .catch() fallback ensures we still trip the default circuit breaker rather than hanging.
Verification
- npx vitest run src/github.test.ts — all existing tests pass, new tests cover the four cases
- npm test — full typecheck + test suite passes
Variant prompt output
Now I have a complete picture. Here's the implementation plan.
Implementation Plan: Rate-Limit Header Parsing in
github.tsProblem
The
gh()function insrc/github.ts(line 337) detects rate-limit errors viaRATE_LIMIT_REagainst stderr text, then immediately trips the 60-second circuit breaker. It has no awareness of GitHub's actual reset time, so:Which files need to be changed
src/github.ts— Core changes togh()retry logic andsetRateLimited()src/github.test.ts— New test cases for rate-limit header behaviorNo new files or dependencies required.
What the changes should be
1.
src/github.ts— Add rate-limit reset time detectionStep 1a: Add a helper to query the rate-limit reset time
After the
setRateLimited()function (around line 30), add an exported async functiongetRateLimitReset()that callsgh api rate_limitusingexecFiledirectly (bypassing thegh()wrapper to avoid recursion/circuit-breaker). It should:execFile("gh", ["api", "rate_limit"])response.rate.reset(Unix timestamp) andresponse.rate.remaining{ resetAt: number; remaining: number }ornullon any error (fail-safe — if the call itself is rate-limited or errors, fall back to existing behavior)Important: This must use
execFiledirectly, not thegh()wrapper, because the wrapper checksisRateLimited()and would reject.Step 1b: Modify the rate-limit branch in
gh()(lines 358-363)Currently:
Change to:
Step 1c: Allow
setRateLimited()to accept a custom cooldownsetRateLimited()already acceptscooldownMswith a default of 60,000 (line 26). No signature change needed — just pass the actualwaitMsfrom step 1b for long resets.2.
2. `src/github.test.ts` — New test cases

Add a new `describe("rate limit header parsing")` block with these cases:

- Short reset (<120s): Mock `execFile` so the first `gh` call fails with "rate limit" in stderr, then mock the `gh api rate_limit` call to return `{ rate: { remaining: 0, reset: <now + 60s> } }`. The mocked follow-up `gh` call (the retry after the wait) succeeds. Assert: `setTimeout` is called with ~60s delay, the promise eventually resolves, `setRateLimited` is NOT called.
- Long reset (>120s): Mock the rate_limit API to return a reset time 5 minutes out. Assert: `setRateLimited` is called with the actual wait duration (~300s), the promise rejects with `RateLimitError`, and the error message includes the reset time.
- Non-rate-limit 403: Ensure a 403 error whose stderr does NOT match `RATE_LIMIT_RE` still goes through normal transient retry logic (no change from current behavior).
- Missing/failed rate_limit API call: Mock `gh api rate_limit` to fail (e.g., network error). Assert: falls back to default `setRateLimited(60_000)` behavior, and the promise rejects with `RateLimitError`.
- Rate limit API returns remaining > 0: Mock the rate_limit API to show `remaining: 5` (not actually at limit — could be a secondary rate limit). Assert: falls back to the default 60s cooldown.

Potential risks and edge cases
- Recursive rate limiting: The `getRateLimitReset()` call itself could be rate-limited. Mitigated by: using `execFile` directly (bypasses circuit breaker) and catching all errors to fall back to default behavior.
- Secondary rate limits: GitHub's secondary rate limits (abuse detection) don't appear in the `rate_limit` API — they're separate from the primary rate limit counter. The `rate_limit` API might show `remaining > 0` during a secondary rate limit. The fallback path handles this: if `remaining > 0` or the call fails, we use the existing 60s cooldown.
- Clock skew: If the server clock is ahead of the `X-RateLimit-Reset` timestamp, `waitMs` could be negative. The `waitMs > 0` check handles this — negative or zero falls through to default behavior.
- Timer precision for short waits: The +1000ms buffer on the retry timer prevents retrying a millisecond too early due to timer imprecision.
- Promise chain complexity: The `getRateLimitReset().then()` inside the `execFile` callback adds async nesting. This is contained within the existing Promise constructor and the `run()` closure, so resolve/reject are still properly called on all paths. Each path must call exactly one of `resolve`, `reject`, or schedule a `run()` retry.

Suggested order of implementation
1. Add the `getRateLimitReset()` helper
2. Wire it into the rate-limit branch of `gh()`
3. Run the tests (`npm test`) to verify no regressions

How to verify the changes work

- `npm test` must pass (includes typecheck)
- Watch for the `[github] Rate limit resets in Xs` message. The rate-limit notifications should show actual reset durations instead of always 60s.

Issue body: The scheduler could be better. It sometimes runs jobs when it shouldn't and doesn't handle edge cases well. Can we make it more robust?
I think there might be issues with timing and concurrency too.
Winner: variant | Current: 14/20 | Variant: 18/20
Reasoning: Output A correctly identifies the issue is vague and asks clarifying questions, but produces nothing actionable. Output B takes the better approach: it investigates the code, finds real concrete issues (timer drift, misleading drain status, silent timeout), and produces a detailed implementation plan with specific line references, code changes, tests, and risk analysis. It still flags the ambiguity in the original issue as a non-blocking clarifying question. For a vague issue, finding actual problems in the code is more valuable than blocking on clarification.
Current prompt output
Clarifying Questions (blocking)
The issue does not provide enough detail to produce a reliable implementation plan. After reading the scheduler source (`src/scheduler.ts`, 286 lines) and its test suite (`src/scheduler.test.ts`, 676 lines with 30+ test cases), the scheduler already handles several robustness concerns:

Without specifics, any "improvement" risks scope creep or solving phantom problems. The following questions must be answered:
"Runs jobs when it shouldn't" — Can you provide a concrete example? Under what conditions does a job run unexpectedly? Is this about scheduled-hour jobs firing at the wrong time, interval jobs running during drain, paused jobs running, or something else?
"Edge cases not handled well" — Which edge cases? For example: DST transitions affecting `msUntilHour`? A job being removed while its tick is in the `setTimeout` callback? `updateInterval` being called on a scheduled-hour job (or vice versa)? Something else?

"Issues with timing and concurrency" — Is this about timer drift over long uptimes? Race conditions between `triggerJob` and scheduled ticks? Multiple concurrent `drain()` calls? Or observed behavior in production (if so, what did the logs show)?

Acceptance criteria — What does "more robust" mean concretely? Are there specific failure modes you've observed, or is this a proactive hardening request? If proactive, which areas should be prioritized?
Please respond to these questions as a comment on the GitHub issue so the next refinement cycle can incorporate the answers and produce a targeted plan.
Variant prompt output
I've read the scheduler source and tests. Let me analyze the actual issues in the code before producing a plan.
Implementation Plan: Improve the Scheduler
Analysis of Current Issues
After reading `src/scheduler.ts` (286 lines) and `src/scheduler.test.ts` (675 lines), I identified these concrete problems:

1. Scheduled-hour jobs drift over time

File: `src/scheduler.ts:99-102`

After the initial `setTimeout` fires at the target hour, `setInterval(tick, 24h)` starts from that moment. If the setTimeout fires late (Node.js timer imprecision, event loop lag, or the job itself taking time), every subsequent daily run drifts further from the intended hour. Over days/weeks this compounds.

Fix: Replace the `setTimeout` → `setInterval(24h)` pattern with a self-rescheduling `setTimeout` that recalculates `msUntilHour()` after each run. This re-anchors to the wall clock every day.
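The self-rescheduling pattern above can be sketched like this. Note the hedges: `msUntilHour` exists in the plan's codebase, but the reimplementation here is illustrative, and `scheduleNextDaily`'s exact signature (including the `onTimer` callback for tracking timer IDs) is hypothetical.

```typescript
// Illustrative reimplementation of msUntilHour (the real one lives in
// src/scheduler.ts): ms until the next wall-clock occurrence of `hour`.
function msUntilHour(hour: number, now = new Date()): number {
  const next = new Date(now);
  next.setHours(hour, 0, 0, 0);
  if (next.getTime() <= now.getTime()) next.setDate(next.getDate() + 1);
  return next.getTime() - now.getTime();
}

// Self-rescheduling daily timer: recompute the delay from the wall clock
// after every run, so a late firing cannot compound into drift.
function scheduleNextDaily(
  hour: number,
  tick: () => void | Promise<void>,
  onTimer: (t: ReturnType<typeof setTimeout>) => void, // caller tracks IDs for cleanup
): void {
  const t = setTimeout(async () => {
    try {
      await tick();
    } finally {
      scheduleNextDaily(hour, tick, onTimer); // re-anchor to the wall clock
    }
  }, msUntilHour(hour));
  onTimer(t);
}
```

Because the next delay is recomputed from `Date.now()` after each run, a tick that fires 30 seconds late still schedules the following tick for the correct wall-clock hour.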
2. `triggerJob` returns misleading status during drain

File: `src/scheduler.ts:170-177`

`triggerJob` returns `"already-running"` both when a job is actually running AND when the scheduler is draining. Callers (webhooks, dashboard manual trigger) cannot distinguish between "your job is busy" and "the system is shutting down."

Fix: Return a new `"draining"` status when `draining === true`, keeping the return type as a discriminated union.
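The proposed return type can be sketched as follows, with a hypothetical minimal scheduler state standing in for the real `src/scheduler.ts` internals:

```typescript
// Proposed return type: callers can now tell a shutdown ("draining")
// apart from a busy job ("already-running").
type TriggerResult = "started" | "already-running" | "draining" | "unknown";

// Hypothetical minimal state; the real fields live in src/scheduler.ts.
interface SchedulerState {
  draining: boolean;
  jobs: Map<string, { running: boolean }>;
}

function triggerJob(state: SchedulerState, name: string): TriggerResult {
  if (state.draining) return "draining"; // the new case
  const job = state.jobs.get(name);
  if (!job) return "unknown";
  if (job.running) return "already-running";
  job.running = true; // set synchronously, before any await (skip-if-busy)
  return "started";
}
```

An HTTP caller can then map `"draining"` to 503 and `"already-running"` to 409 (or similar), instead of conflating the two.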
3. `drain()` timeout silently succeeds

File: `src/scheduler.ts:133-164`

When `drain(timeoutMs)` times out with jobs still running, it logs a warning but resolves normally. `main.ts` cannot detect that drain failed, which could lead to unclean shutdown (e.g., worktrees not cleaned up).

Fix: Track whether the drain timed out and either reject the promise or return a result indicating which jobs were still running.
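The proposed contract can be sketched as follows. The polling loop is an illustrative stand-in (the real drain would track job completion directly); only the `DrainResult` shape is taken from the plan.

```typescript
// Proposed drain() contract: never silently succeed on timeout — always
// report whether we timed out and which jobs were still running.
interface DrainResult {
  timedOut: boolean;
  stillRunning: string[]; // names of jobs that outlasted the timeout
}

// Illustrative polling implementation of the contract.
async function drain(
  running: () => string[], // names of currently running jobs
  timeoutMs: number,
  pollMs = 20,
): Promise<DrainResult> {
  const deadline = Date.now() + timeoutMs;
  while (running().length > 0) {
    if (Date.now() >= deadline) {
      return { timedOut: true, stillRunning: running() };
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return { timedOut: false, stillRunning: [] };
}
```

Returning a result (rather than rejecting) keeps the shutdown path exception-free while still letting the caller log or force-kill lingering work.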
4. `updateInterval` skips immediate tick after config change

File: `src/scheduler.ts:179-189`

When an interval is updated (e.g., from 10min to 1min via config reload), the caller waits the full new interval before the next tick. If the previous interval just started, this creates a gap where the job doesn't run at the expected cadence.

Fix: This is arguably correct behavior (prevents double-firing). Document the behavior rather than change it. If desired, add an optional `runImmediately` parameter.

Files to Change
- `src/scheduler.ts` — All changes are in this file
- `src/scheduler.test.ts` — New tests for each fix

Detailed Changes
Change 1: Self-rescheduling scheduled-hour jobs (drift fix)

In `startJobs()` (lines 95-102) and `addJob()` (lines 253-260), replace the current pattern with a `scheduleNextDaily(jobName, hour, tick)` helper that:

- Calls `msUntilHour(hour)` to get the delay
- Sets a `setTimeout` for that delay
- On fire, runs `tick()`, then calls itself recursively to schedule the next day

This re-anchors to wall clock time after every execution. Store the timeout ID in `jobTimers` so `clearJobTimers` still works. The `updateScheduledHour` method already clears and reschedules, so it naturally benefits.

Change 2: Distinct drain status from triggerJob
In `triggerJob()` (lines 170-177), change the drain path to return `"draining"`.

Update the return type to `"started" | "already-running" | "draining" | "unknown"`.

Update callers in `src/server.ts` and `src/webhooks.ts` that check `triggerJob` results — `"draining"` should map to an appropriate HTTP response (e.g., 503 Service Unavailable).

Change 3: Drain timeout returns running job info
Change `drain()`'s signature to return `Promise<{ timedOut: boolean; stillRunning: string[] }>`. On timeout, resolve with `{ timedOut: true, stillRunning: [...names] }`. On clean drain, resolve with `{ timedOut: false, stillRunning: [] }`.

Update the `main.ts` drain call to log the result and potentially force-kill lingering processes.

Potential Risks / Edge Cases
- The self-rescheduling `setTimeout` approach creates a new timer each day. Since `clearJobTimers` clears the timers array, we must push each new timeout into the array. The old timeout naturally completes and doesn't leak since it only fires once.
- `triggerJob` return type change: This is a breaking API change for callers. Both `src/server.ts` and `src/webhooks.ts` call `triggerJob` — both must handle the new `"draining"` case.
- `drain()` return type change: `main.ts` is the only caller. Low risk, but must update the call site.
- `msUntilHour()` uses local time (`setHours`). During DST spring-forward, hour 2 doesn't exist; during fall-back, hour 1 occurs twice. The current behavior (and the proposed fix) handles this acceptably — `msUntilHour` always targets the next occurrence of the hour, so at worst a scheduled job runs ~23 or ~25 hours after the previous run during DST transitions. This is existing behavior and not in scope to fix.

Out of Scope

- Replacing `setInterval`/`setTimeout` with a cron library — the current approach is simple and adequate

Implementation Order
Testing Approach
Change 1 — Drift fix:

- Verify a scheduled-hour job's next run is recomputed via `msUntilHour` (not a fixed 24h). Use `vi.useFakeTimers()` to advance past two daily cycles and verify the job fires at the correct hour both times.
- Verify `clearJobTimers` still cancels the recursive chain (remove the job mid-cycle, verify no future tick fires).

Change 2 — triggerJob status:

- Verify `triggerJob` returns `"draining"` after `drain()` is called.
- Verify the existing `"already-running"` and `"unknown"` cases still work.

Change 3 — Drain return type:

- Verify a clean drain resolves with `{ timedOut: false, stillRunning: [] }`.
- Verify drain resolves with `{ timedOut: true, stillRunning: ["job-name"] }` when a job outlasts the timeout.

Clarifying Questions (non-blocking)
The issue mentions "sometimes runs jobs when it shouldn't" but doesn't give specific examples. The skip-if-busy logic appears correct after code review — the `running` flag is set synchronously before the first `await`, preventing concurrent invocations. If there are specific scenarios where jobs run unexpectedly, those details would help target additional fixes. The plan above addresses the concrete issues found in the code; please comment on the GitHub issue if there are specific failure scenarios to investigate.

Issue body: ## Overview
Currently each job in `src/jobs/` is a standalone module with a `run()` function. This works but doesn't scale. We should refactor to a proper plugin architecture.

Requirements

- A `JobPlugin` interface with lifecycle hooks: `onInit`, `onBeforeRun`, `onRun`, `onAfterRun`, `onError`, `onShutdown`
- Dynamic discovery of plugins in `src/jobs/` at startup

This will make the system much more maintainable and extensible.
Acceptance Criteria
All existing jobs work the same as before but use the new architecture.
Winner: variant | Current: 16/20 | Variant: 17/20
Reasoning: Both outputs correctly identify this as an over-scoped issue, but the variant provides a concrete, implementable plan (specific helper functions, file changes, implementation order) while still rejecting the unnecessary requirements with detailed rationale. The current prompt asks excellent clarifying questions but produces no actionable plan. The variant loses some uncertainty score by being more prescriptive about what to build without asking the user first, but the trade-off is worth it given the issue is clearly over-engineered.
Current prompt output
Clarifying Questions (blocking)
This issue cannot be reliably planned because it combines at least 12 distinct features with unclear motivation and significant contradictions with the existing architecture. The following questions must be answered before proceeding:
1. What concrete problem is this solving?
The issue states the current approach "doesn't scale" but doesn't explain how. The codebase has 13 jobs, all following a consistent `run(repos: Repo[]): Promise<void>` pattern with shared conventions (two-phase identify/process, worktree lifecycle in `finally` blocks, task recording via `db.ts`). What specific scaling problem has been encountered — difficulty adding new jobs, debugging failures, or something else? The answer determines which (if any) of the 12 requirements are actually needed.

2. Are requirements 11 and 12 (RxJS observables, OpenTelemetry) actually desired?
These are unrelated to a plugin architecture and would introduce two significant new dependencies (`rxjs`, `@opentelemetry/*`) to a project that currently has minimal dependencies. The prompt instructions explicitly say "Do not add logging, metrics, or observability unless the issue asks for it" and "do not refactor surrounding code" — but the issue itself asks for these. Is the intent to bundle these with the plugin refactor, or should they be separate issues?

3. How would "hot-reloading of job plugins" (requirement 6) work given the current architecture?
Jobs are stateless functions that discover work via GitHub API polling. The scheduler already supports live enable/disable via config reload (`main.ts` handles `SIGHUP` and config file changes). What additional hot-reload capability is needed beyond what already exists? ESM dynamic `import()` has known cache-invalidation problems in Node.js — is the intent to use `require()` (CJS), a custom module loader, or something else?

4. What does "job dependency graph" (requirement 5) mean in practice?
Currently, jobs don't communicate directly. They coordinate through GitHub state (labels, comments, reactions). For example, `issue-refiner` adds the `Needs Plan Review` label, and `plan-reviewer` picks it up on its next poll cycle. This is deliberately decoupled. Would a dependency graph replace this model (tight coupling, synchronous chaining) or supplement it (just scheduling order hints)? These are fundamentally different architectures.

5. What does "job chaining — output of one job feeds as input to the next" (requirement 8) mean?
Job outputs today are side effects (GitHub comments, labels, PRs), not return values. `run()` returns `Promise<void>`. Making jobs return structured data that chains to the next job would require redesigning every job's contract. Is the intent to change job signatures, or to have a message-passing system alongside the current GitHub-state-based coordination?

6. Should this be split into smaller issues?
This issue contains at least 10 orthogonal features: (a) plugin interface + registry, (b) dependency injection, (c) middleware pipeline, (d) dependency graph, (e) hot-reloading, (f) job chaining, (g) dry-run mode, (h) testing harness, (i) RxJS conversion, (j) OpenTelemetry. Each has a different risk profile and could be implemented independently. A single PR touching all 13 jobs plus `main.ts`, `scheduler.ts`, and introducing ~5 new infrastructure modules would be extremely difficult to review. Would you prefer this broken into a sequence of smaller, independently deployable issues?

7. What does "dry-run mode" (requirement 9) mean per job?
Some jobs are read-only scanners (`issue-auditor`), some create PRs (`issue-worker`), some edit comments (`issue-refiner`), some merge PRs (`auto-merger`). "Dry run" means different things for each. Should dry-run skip all GitHub writes? Skip only destructive writes (merge, push) but allow reads? Log what would happen? The existing jobs have no shared write abstraction — each calls `gh.*` functions directly with different semantics.

Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate the answers and produce a focused, implementable plan. Given the scope of this issue, I'd particularly recommend addressing question 6 first — the answer will shape everything else.
Variant prompt output
This issue is a textbook over-scoped "boil the ocean" refactor. Let me analyze each requirement against the actual codebase before producing a plan.
Analysis
The current architecture has 13 jobs totaling ~4,570 LOC, each implementing a simple `run(repos: Repo[])` function registered in `main.ts`. The system already has, among other things, a task queue with priority support (`enqueue(fn, priority)`).

Here is my assessment of the 12 requirements:
Implementation Plan
Requirements Disposition
- `JobPlugin` interface with lifecycle hooks: `run(repos)` is simple, testable, and uniform across 13 jobs. Lifecycle hooks add indirection with no concrete benefit — the jobs don't need `onInit`/`onShutdown` because initialization is centralized in `main.ts` and shutdown is handled via `cancelQueuedTasks()`.
- The explicit registrations in `main.ts` (lines 140-253) are readable, type-safe, and allow the `ENABLED_JOBS` filter. Dynamic loading loses compile-time safety and makes the dependency graph invisible.
- Tests already mock modules with `vi.mock()`.
- Shared patterns already live in `claude.ts` and `github.ts`. A formal middleware pipeline is over-engineering, but extracting 2-3 helper functions for the most common patterns would reduce duplication.
- The existing update mechanism (`yeti-updater.timer`) handles code updates safely via restart.
- `claude.ts` already has `enqueue(fn, priority=false)` where priority tasks jump to the front. The `Priority` label feeds into this. No work needed.
- Tests already mock `gh` CLI responses. `src/test-helpers.ts` provides `mockRepo()`, `mockIssue()`, `mockPR()` factories. The webhook handler (`webhooks.ts`) already routes real GitHub events to jobs.
- The `AsyncLocalStorage`-based run context in `db.ts` already provides per-job-run log correlation.
The only concrete improvement with a real payoff is requirement 4 (partially): extracting shared job patterns into reusable helpers to reduce boilerplate across the 13 jobs. Specifically:
File changes
src/jobs/job-helpers.ts(new file, ~80 lines)Extract these repeated patterns:
forEachRepo(repos, processRepo)— The repos iteration + rate-limit-check pattern used by most jobs. Wraps the commonfor (const repo of repos) { if (rateLimited) continue; ... }loop.withWorktreeTask(repo, branch, taskFn)— Combines worktree creation, try/catch/finally cleanup, and tree-diff guard into a single helper. Currently duplicated across ci-fixer, issue-worker, doc-maintainer, improvement-identifier, review-addresser, and mkdocs-update.twoPhaseProcess(repos, identify, process)— Codifies the identify-then-process pattern used by ci-fixer, improvement-identifier, and issue-refiner. Phase 1 collects work items, phase 2 processes them.src/jobs/*.ts(6-8 job files modified)Refactor jobs to use the new helpers where the pattern clearly applies. Do NOT force jobs that don't fit the pattern (e.g., prompt-evaluator, repo-standards) to use them.
src/jobs/job-helpers.test.ts(new file)Unit tests for the extracted helpers.
What NOT to change
main.tsjob registration — keep explicitscheduler.ts— no changes neededclaude.ts— already well-factoredRisks and edge cases
Implementation order
job-helpers.test.tswith tests for the three helpers (TDD)job-helpers.tsto pass the testsdoc-maintainerandauto-mergeras they're simplest) to use the helpersyeti/OVERVIEW.mdandCLAUDE.mdto document the helpersTesting approach
npm testafter each job refactor to catch regressionsOut of Scope
All 12 original requirements except the partial extraction from #4 are out of scope for the reasons detailed above. The current architecture is working well for 13 jobs. The
run(repos)pattern is simple, testable, and understood by all contributors. Introducing a plugin architecture, DI container, middleware pipeline, RxJS, or OpenTelemetry would add significant complexity without solving a demonstrated problem.If specific pain points emerge (e.g., "adding a new job requires touching 5 files" or "I can't test job X without running the whole system"), those should be filed as targeted issues rather than addressed via a speculative architectural overhaul.
Suggested PR structure
Single PR — The helper extraction and job refactors are tightly coupled and should be reviewed together. The change is small enough (~200 lines new, ~150 lines removed from jobs) to review in one pass.
Note
This plan deliberately rejects the majority of the issue's requirements. The issue describes a solution (plugin architecture) rather than a problem. The actual codebase is well-structured for its current scale, and the proposed abstractions would make it harder to understand and maintain, not easier. The targeted extraction of 2-3 helper functions addresses the only real duplication without the overhead of a framework.
Issue body: ## Problem
We handle `pull_request.closed` and `check_run.completed` webhook events but not `pull_request.synchronize`. When a PR branch is force-pushed or has new commits pushed, we should react.

What should happen
When we receive a `pull_request.synchronize` event, we should... actually I'm not sure. Maybe re-run CI checks? Or maybe invalidate the plan? Or update the PR description? It depends on whether it's a yeti PR or an external PR.

For yeti PRs (branches starting with `yeti/`), maybe we should check if the push was from yeti itself or from a human. If it was from a human, we might want to pause yeti's work on that PR. Or maybe not.

For external PRs, maybe we should re-run the `improvement-identifier` job? Or the `ci-fixer`? Not sure which.

Additional context
Someone mentioned this would be useful in Slack but I don't remember the exact use case they described. It was something about PRs getting out of sync. The `webhooks.ts` file has `isRepoAllowed()` which we should reuse. Also we might need to update the config to add a new toggle for this, or maybe it should always be on, or maybe it should be controlled per-repo.
Winner: variant | Current: 15/20 | Variant: 18/20
Reasoning: Both correctly identify the issue as unplannable and ask blocking clarifying questions. The variant is superior in specificity — it references concrete files (src/webhooks.ts), functions (handlePullRequestSynchronizeEvent, isRepoAllowed, scheduler.triggerJob), and architectural patterns (skip-if-busy, per-PR state). It also demonstrates stronger scope awareness by noting that GitHub already re-runs CI on pushes (making one proposed behavior redundant), that 'pausing' would require a new per-PR state mechanism that doesn't exist, and by offering a concrete recommendation to close/convert to discussion. The current prompt's questions are solid but more generic.
Current prompt output
Step 1: Plannable?
This issue is not plannable in its current state. The author explicitly expresses uncertainty about the desired behavior in nearly every paragraph ("I'm not sure," "maybe," "not sure which," "I don't remember the exact use case"). There are no acceptance criteria, and the core question — what should happen when we receive `pull_request.synchronize` — is unanswered.

Clarifying Questions (blocking)
What is the desired behavior for yeti PRs?
The issue proposes at least three options but picks none:
Which behavior do you want, and what does "pause" mean concretely?
What is the desired behavior for external PRs?
These two jobs have very different purposes.
`ci-fixer` reacts to failing CI checks; `improvement-identifier` scans for code improvements. A force-push might warrant re-running ci-fixer (if the old failure context is stale), but re-running improvement-identifier on every push could be noisy. Which job(s) should be triggered, and under what conditions?

What was the original use case from Slack?
Without the motivating use case, we can't validate whether any proposed behavior actually solves the problem. Can you find the Slack thread and include the specific scenario?
Should this be configurable, and if so how?
The current webhook system has no per-event-type toggles — all supported events are always handled if the repo is allowed. Should synchronize follow the same pattern (always on), or does it need a new config field? Per-repo control would be a new config pattern that doesn't exist today.
What is "the other issue about webhook improvements"?
There may be overlapping or conflicting plans. Please link the related issue so we can check for scope conflicts before planning.
Please respond to these questions as comments on the GitHub issue so the next refinement cycle can incorporate your answers and produce a complete plan.
Variant prompt output
Based on my analysis of the codebase and this issue, this issue is too underspecified to produce an actionable implementation plan. Here's my assessment:
Issue Analysis:
`pull_request.synchronize` webhook support
The webhook infrastructure in
src/webhooks.tsalready handles four event types (issues,check_run,pull_request_review,pull_request) with a clean dispatcher pattern. Adding a newhandlePullRequestSynchronizeEventhandler is mechanically straightforward — it follows the same pattern as the existingpull_request.closedhandler, reusesisRepoAllowed(), and routes throughscheduler.triggerJob().What's not clear — and why it matters
The issue explicitly states uncertainty about the desired behavior in multiple places. Every decision branch leads to a different implementation:
"Maybe re-run CI checks?" — GitHub already re-runs checks automatically when commits are pushed to a PR branch. There's nothing for yeti to do here.
"Maybe invalidate the plan?" — The plan lives as an issue comment produced by
issue-refiner. "Invalidating" it could mean adding a comment, removing theRefinedlabel, or re-triggeringissue-refiner. Each has different implications for the workflow state machine."Maybe update the PR description?" — Which description fields? Under what conditions?
Yeti PR vs external PR distinction — The issue proposes different behavior per PR type but doesn't specify what either behavior should be.
"Pause yeti's work" — There's no concept of pausing work on a specific PR in the current architecture. Jobs are global (skip-if-busy), not per-PR. This would be a significant new feature, not a webhook handler addition.
Config toggle — "Always on", "per-repo", or "global toggle" are three different config designs.
Referenced Slack conversation and "other issue" — These are the actual motivation for the feature but aren't available.
Clarifying Questions (blocking)
These must be answered before a plan can be produced:
What is the concrete use case? The issue says "someone mentioned this would be useful in Slack" but doesn't describe the actual problem. What specific scenario causes PRs to "get out of sync" and what user-visible problem does that cause?
What job(s) should be triggered? The issue lists ci-fixer, improvement-identifier, and "pause yeti's work" as candidates but picks none.
ci-fixeris triggered bycheck_run.completedwhich fires after the re-run fails — so the existing flow already covers "new push → CI runs → CI fails → ci-fixer triggers."Should this block issue-worker from starting new work on an issue whose PR was human-edited? This is the "pause" concept — it requires a new per-issue/per-PR state mechanism that doesn't exist today. Is this in scope or out of scope?
What is the "other issue about webhook improvements"? It may contain design decisions that affect this work.
Please respond as a comment on the GitHub issue so the next refinement cycle can incorporate the answers.
Recommendation
Close or convert this to a discussion until the use case is concrete. The current issue has the shape of "we could handle this event" without a clear "and here's the problem it solves." Adding webhook plumbing is easy; designing the right behavior requires knowing what problem we're solving.
Automated evaluation by yeti prompt-evaluator