59 changes: 59 additions & 0 deletions .claude/commands/review-pr.md
@@ -0,0 +1,59 @@
Use the `gh` CLI to fetch the PR details and diff, then perform a systematic code review.

Steps:
1. Run `gh pr view $ARGUMENTS` to get the PR title, description, and author.
2. Run `gh pr diff $ARGUMENTS` to get the full diff.
3. For each file changed, if you need more context than the diff provides, read the relevant file(s).
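
For a PR numbered 123, for example, the first two steps look like this (the number is a placeholder for `$ARGUMENTS`, which may also be a URL or branch):

```bash
gh pr view 123    # title, description, author
gh pr diff 123    # full unified diff
```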

Then perform a thorough review in this exact order:

---

## Phase 1: Understand the Intent

Summarize in 2-3 sentences what this PR is supposed to do, based on the title, description, and diff. This is your baseline for correctness checks.

## Phase 2: Logic Analysis (Most Critical)

For **each changed function or method**, work through it mechanically:

- **Trace the execution**: Walk through what the code does step by step in plain English. Do not just restate the code — describe what values flow through and what decisions are made.
- **Check conditions**: For every `if`, `while`, `for`, ternary, or boolean expression: is the condition correct? Could it be inverted? Are the operands in the right order?
- **Check edge cases**: What happens with null/empty/zero/negative/maximum inputs? Are bounds correct (off-by-one)?
- **Check missing cases**: Are there code paths the change forgot to handle?
- **Check state mutations**: If the code modifies shared state, is the order of operations correct? Could this cause incorrect behavior if called multiple times or concurrently?

Do not skip this phase for "simple-looking" changes. Many bugs hide in code that appears straightforward, as in the hypothetical sketch below.
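
A minimal illustration of the kind of bug this phase exists to catch (the snippet is invented, not from any real PR):

```bash
# Hypothetical cache-purge guard: it reads fine at a glance, but the
# condition is inverted -- the cache is deleted while it is still fresh.
if [ "$cache_age_minutes" -lt "$TTL_MINUTES" ]; then
  rm -rf "$CACHE_DIR"   # BUG: should purge when age EXCEEDS the TTL (-gt)
fi
```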

## Phase 3: Correctness Against Intent

Compare what the code *actually does* (from Phase 2) against what it *should do* (from Phase 1). Call out any gaps.

## Phase 4: Security

- Input validation and sanitization
- Authentication and authorization checks
- SQL injection, XSS, path traversal
- Sensitive data in logs or responses
- Insecure defaults

## Phase 5: Interactions and Side Effects

- Could this change break existing callers that depend on the old behavior?
- Are there other places in the codebase that should have been updated alongside this change?
- Are tests updated to cover the new behavior?

---

## Output Format

For each issue found, report:

**Finding #*IncrementingNumber* - [Severity: Critical/High/Medium/Low]** — *Category* — `file:line`
> **Issue**: What is wrong.
> **Why it matters**: The impact if unfixed.
> **Suggestion**: How to fix it.
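
For example (the file, line, and bug are purely illustrative):

**Finding #1 - [Severity: High]** — *Logic* — `src/billing.py:88`
> **Issue**: The discount guard uses `>` instead of `>=`, so a total exactly at the threshold never receives the discount.
> **Why it matters**: Customers at the advertised threshold are silently overcharged.
> **Suggestion**: Change the comparison to `>=` and add a boundary test.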

Lead with Critical and High severity issues. After all issues, give a one-paragraph overall assessment.

ultrathink
74 changes: 74 additions & 0 deletions .claude/review-pr-eval/README.md
@@ -0,0 +1,74 @@
# review-pr eval

Evaluates variants of the `review-pr` prompt against a training set of GitHub PRs that contain known bugs, measuring how often the prompt catches them.

Each run invokes Claude on every PR in the training set. With the current training set, expect **10+ minutes** per evaluation. A `--compare` with two names runs both sequentially, so plan for double that.

**Security warning:** The eval script runs Claude with `--dangerously-skip-permissions` so it can read files from the checked-out repo. PR diffs are injected verbatim into Claude's prompt, so a PR containing adversarial instructions in its diff (e.g. in code comments or string literals) could act as a prompt injection attack and cause Claude to execute arbitrary commands without confirmation. Only add PRs from trusted sources — ideally already-merged, internal PRs where the diff content is known.


## Prerequisites

- Python 3.10+
- `claude` CLI authenticated (`claude --version` should work)
- `gh` CLI authenticated (`gh auth status` should confirm)

## Running

```bash
# Evaluate the live prompt (../commands/review-pr.md)
python eval.py

# Evaluate a specific variant
python eval.py prompts/my-variant.md

# Evaluate using a specific model
python eval.py --model claude-opus-4-6

# Compare the live prompt against a variant side by side
python eval.py --compare current my-variant

# Compare the same prompt across two models
python eval.py --compare current@claude-opus-4-6 current@claude-sonnet-4-6

# Compare a variant on a specific model against the live prompt
python eval.py --compare current my-variant@claude-opus-4-6
```

The `name@model` syntax in `--compare` specifies which Claude model to use for the review step. Cache keys include the model, so results for different models are stored separately.

## Training set

`training_set.json` lists GitHub PR URLs and the specific bugs that are expected to be caught. The judge (Claude Haiku) scores each review as `CAUGHT`, `PARTIAL`, or `MISSED` for each expected issue.

To add a PR to the training set, append an entry:

```json
{
  "url": "https://github.com/org/repo/pull/123",
  "expected_issues": [
    "Description of the specific bug that should be caught"
  ]
}
```

## Prompt variants

The live prompt is always `../commands/review-pr.md`. Named variants live in `prompts/`. To create a variant:

```bash
cp ../commands/review-pr.md prompts/my-variant.md
# edit prompts/my-variant.md
python eval.py --compare current my-variant
python eval.py --compare current my-variant@claude-opus-4-6
```

## Repo cache

When evaluating, the script checks out each PR's merge commit so Claude has access to the full repository context. Clones are stored at `build/pr-eval-repos/<org>/<repo-name>` (relative to the server repo root) and reused across runs. Fetches are only performed if the required commit is not already present locally. These clones use `--filter=blob:none` (blobless) so they are relatively lightweight. Note that running `./gradlew clean` will delete the cached clones.
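
In rough terms, the caching behaves like the sketch below (the exact logic lives in `eval.py`; `$MERGE_COMMIT` and the URL are illustrative):

```bash
# First use: blobless clone, reused on later runs
git clone --filter=blob:none https://github.com/org/repo \
  build/pr-eval-repos/org/repo

cd build/pr-eval-repos/org/repo
# Fetch only if the merge commit is not already present locally
git cat-file -e "$MERGE_COMMIT" 2>/dev/null || git fetch origin "$MERGE_COMMIT"
git checkout --detach "$MERGE_COMMIT"
```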

## Results

Results are saved as JSON files in the `build/` directory at the repo root, named `<prompt-stem>_<timestamp>.json`. Each file contains the full review text, per-issue verdicts, and a summary score.

The catch rate counts `CAUGHT` as 1 and `PARTIAL` as 0.5.
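
For example, a run with 3 `CAUGHT`, 2 `PARTIAL`, and 1 `MISSED` across 6 expected issues scores (3 + 2 × 0.5) / 6 ≈ 0.67.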