v0.4.2: real Gemini eval — deep research + UX design #7
RyanAlberts wants to merge 3 commits into main
Conversation
Working example of /run-eval against an external HTTP target. Designed
to be run by the user against their own Gemini subscription as a real
test of the http target type and as the first cross-vendor eval in
pmstack.
Six test cases:
- 3 deep research — multi-source synthesis, counterfactual reasoning, citation honesty (the classic hallucination trap)
- 3 UX design — concrete HTML form, prioritized flow critique, A/B test design with assumptions
Target: gemini-2.5-pro via generativelanguage.googleapis.com.
Auth: x-goog-api-key header sourced from GEMINI_API_KEY env var.
Judge: claude-sonnet-4-6 (different family, no self-grading bias).
Verified the runner hard-stops cleanly when GEMINI_API_KEY is unset
("FATAL: requires env var GEMINI_API_KEY"). When set, the eval loads
6 cases at ~48K total token estimate (well under the 200K warn
threshold) — affordable smoke test against a real external API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
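A minimal sketch of the hard-stop guard described above (illustrative only; not the runner's actual code):

```python
import os
import sys

# Refuse to run at all if the key is missing, before any network call.
if not os.environ.get("GEMINI_API_KEY"):
    sys.exit("FATAL: requires env var GEMINI_API_KEY")
```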
outputs/eval-runs/ is per-user run output, not canonical example material. The parent outputs/ stays tracked so the example artifacts (Ultraplan eval, Gemini eval YAML, roadmap, verification report) remain in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
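The corresponding ignore rule would be something like this (a sketch; the actual .gitignore entry may differ):

```gitignore
# Per-user eval run output; the parent outputs/ stays tracked
outputs/eval-runs/
```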
Code Review
This pull request introduces a new evaluation configuration for Gemini 2.5 Pro, targeting deep research and UX design capabilities. The review feedback identifies several areas for improvement: aligning the `failure_modes` and `response_path` formats with project templates for better compatibility, and adjusting metrics for specific test cases to avoid false negatives where the prompt constraints conflict with the evaluation criteria.
```yaml
failure_modes:
  - id: fm-1
    name: "Hallucinated facts in research synthesis"
    severity: P0
  - id: fm-2
    name: "Refuses or stalls on time-sensitive questions instead of dating its claims"
    severity: P1
  - id: fm-3
    name: "Generic 'best practices' answer instead of design with specifics"
    severity: P1
  - id: fm-4
    name: "Critique that lists everything as a problem (no prioritization)"
    severity: P1
```
The structure of `failure_modes` deviates from `templates/eval-template.yaml` (lines 72-74), which defines them as a list of strings. If the runner expects the format defined in the template, this structured-object approach will cause a failure or incorrect rendering in the summary output.
Suggested format:

```yaml
failure_modes:
  - "fm-1: Hallucinated facts in research synthesis (P0)"
  - "fm-2: Refuses or stalls on time-sensitive questions instead of dating its claims (P1)"
  - "fm-3: Generic 'best practices' answer instead of design with specifics (P1)"
  - "fm-4: Critique that lists everything as a problem (no prioritization) (P1)"
```

```yaml
url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent"
method: POST
request_template: '{"contents":[{"parts":[{"text":"{{input}}"}]}],"generationConfig":{"maxOutputTokens":2048,"temperature":0.3}}'
response_path: ".candidates[0].content.parts[0].text"
```
The `response_path` is missing the `$` root prefix used in `templates/eval-template.yaml` (line 60). While some JSONPath implementations allow omitting it, including it ensures compatibility with standard libraries like jsonpath-ng and maintains consistency with the project's own templates.
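As a quick sanity check, the rooted path extracts cleanly with jsonpath-ng (an illustrative snippet, not project code; the response object is a hypothetical stand-in):

```python
from jsonpath_ng import parse

# Hypothetical parsed Gemini response, for illustration only
response_json = {
    "candidates": [{"content": {"parts": [{"text": "example answer"}]}}]
}

expr = parse("$.candidates[0].content.parts[0].text")
matches = expr.find(response_json)
text = matches[0].value if matches else None  # -> "example answer"
```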
response_path: "$.candidates[0].content.parts[0].text"| description: "Concrete HTML/CSS for a constrained spec. Tests whether the model can produce usable code, not generic templates." | ||
| input: "Write a mobile-first sign-up form for a B2B SaaS. Fields: work email, company name, team size (1-10/11-50/51+/dropdown). Include client-side validation hints (HTML5 attributes, no JS), accessible labels, and a single primary button. Output only the HTML+CSS, in one code block." | ||
| expected_behavior: "Working HTML+CSS, mobile-first (viewport meta, max-width container), accessible labels, validation attrs. Not just a template." | ||
| metrics: ["Substance", "Structure", "Length"] |
The `Structure` metric (defined in line 66) evaluates the presence of "headers, lists, comparison tables". Since this test case explicitly requests "only the HTML+CSS, in one code block" (line 102), the model will likely receive a low score for structure even if it follows the prompt perfectly. Consider removing this metric for this specific case to avoid false negatives.
metrics: ["Substance", "Length"]Real-world test against Gemini 2.5 Pro hit immediate 429s on every call (free-tier quota exhausted). The runner correctly captured the errors and refused to fake scores. Two improvements that came out of debugging: bin/run-eval.py - target.delay_between_cases_sec: optional sleep between cases for rate-limit respect on http targets - 429 errors now emit a clear WARN telling the user to check their quota / try a smaller model / set a delay - Captures the full HTTP error body (300 chars) into evidence so diagnosis doesn't require re-reading raw network logs eval YAML - Switched from gemini-2.5-pro to gemini-2.0-flash (more permissive free-tier limits) - Added delay_between_cases_sec: 6 - Documented why in description Note: the actual Gemini comparison didn't produce real scores in this run (quota exhausted on user's API key). The eval will need to be re-run after the user's quota resets or with a key that has access. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Working example of `/run-eval` against an external HTTP target. First cross-vendor eval in pmstack.

What's in here
`outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml` — 6 test cases:

Deep research (3)
- `dr-1-multisource-synthesis` (P0) — compare 4 LLM agent memory approaches with named systems
- `dr-2-counterfactual` (P1) — counterfactual reasoning under uncertainty
- `dr-3-citations-honesty` (P0) — list 5 academic papers (the classic hallucination trap)

UX design (3)
- `ux-1-html-form` (P1) — concrete mobile-first sign-up form (HTML+CSS, accessible)
- `ux-2-flow-critique` (P1) — prioritize 3 UX gaps in a SaaS onboarding flow
- `ux-3-experiment-design` (P1) — A/B test design with explicit assumptions
Target: `gemini-2.5-pro` via `generativelanguage.googleapis.com`. Auth via `x-goog-api-key` header sourced from `GEMINI_API_KEY` env var. Judge: `claude-sonnet-4-6` (different family, no self-grading bias).

Verified
- `FATAL: requires env var GEMINI_API_KEY` when env not set

How to run (when reviewer / user has a Gemini API key)
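A plausible invocation, assuming the runner entry point named in the commits (`bin/run-eval.py`) takes the eval YAML path as its argument (a sketch; the exact CLI is not confirmed by this PR):

```bash
export GEMINI_API_KEY="your-key-here"   # placeholder
python bin/run-eval.py outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml
```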
To compare against Claude: duplicate the YAML, change `target.type` to `claude-session` with `model: claude-sonnet-4-6`, run again, diff the two `summary.md` files (see the sketch below).
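A sketch of the swapped target block (exact nesting assumed from the field names above):

```yaml
target:
  type: claude-session
  model: claude-sonnet-4-6
```

Cost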
~$0.20 in Gemini tokens + ~$0.20 in Claude judge tokens for the full 6-case run.
Holding before merge
Will merge after the actual run produces clean output and any wording / metric tweaks land. (Easier to fix on an open PR than chase with follow-ups.)
🤖 Generated with Claude Code