v0.4.2: real Gemini eval — deep research + UX design #7
RyanAlberts wants to merge 3 commits into main
Conversation
Working example of /run-eval against an external HTTP target. Designed
to be run by the user against their own Gemini subscription as a real
test of the http target type and as the first cross-vendor eval in
pmstack.
Six test cases:
- 3 deep research — multi-source synthesis, counterfactual reasoning, citation honesty (the classic hallucination trap)
- 3 UX design — concrete HTML form, prioritized flow critique, A/B test design with assumptions
Target: gemini-2.5-pro via generativelanguage.googleapis.com.
Auth: x-goog-api-key header sourced from GEMINI_API_KEY env var.
Judge: claude-sonnet-4-6 (different family, no self-grading bias).
Verified the runner hard-stops cleanly when GEMINI_API_KEY is unset
("FATAL: requires env var GEMINI_API_KEY"). When set, the eval loads
6 cases at ~48K total token estimate (well under the 200K warn
threshold) — affordable smoke test against a real external API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
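A minimal sketch of the hard-stop guard described above (illustrative only; not the runner's actual code):

```python
import os
import sys

# Refuse to run at all if the key is missing, before any network call.
if not os.environ.get("GEMINI_API_KEY"):
    sys.exit("FATAL: requires env var GEMINI_API_KEY")
```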
outputs/eval-runs/ is per-user run output, not canonical example material. The parent outputs/ stays tracked so the example artifacts (Ultraplan eval, Gemini eval YAML, roadmap, verification report) remain in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
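The corresponding ignore rule would be something like this (a sketch; the actual .gitignore entry may differ):

```gitignore
# Per-user eval run output; the parent outputs/ stays tracked
outputs/eval-runs/
```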
Code Review
This pull request introduces a new evaluation configuration for Gemini 2.5 Pro, targeting deep research and UX design capabilities. The review feedback identifies several areas for improvement: aligning the `failure_modes` and `response_path` formats with project templates for better compatibility, and adjusting metrics for specific test cases to avoid false negatives where the prompt constraints conflict with the evaluation criteria.
```yaml
failure_modes:
  - id: fm-1
    name: "Hallucinated facts in research synthesis"
    severity: P0
  - id: fm-2
    name: "Refuses or stalls on time-sensitive questions instead of dating its claims"
    severity: P1
  - id: fm-3
    name: "Generic 'best practices' answer instead of design with specifics"
    severity: P1
  - id: fm-4
    name: "Critique that lists everything as a problem (no prioritization)"
    severity: P1
```
The structure of `failure_modes` deviates from `templates/eval-template.yaml` (lines 72-74), which defines them as a list of strings. If the runner expects the format defined in the template, this structured-object approach will cause a failure or incorrect rendering in the summary output.
Suggested format:

```yaml
failure_modes:
  - "fm-1: Hallucinated facts in research synthesis (P0)"
  - "fm-2: Refuses or stalls on time-sensitive questions instead of dating its claims (P1)"
  - "fm-3: Generic 'best practices' answer instead of design with specifics (P1)"
  - "fm-4: Critique that lists everything as a problem (no prioritization) (P1)"
```

```yaml
url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent"
method: POST
request_template: '{"contents":[{"parts":[{"text":"{{input}}"}]}],"generationConfig":{"maxOutputTokens":2048,"temperature":0.3}}'
response_path: ".candidates[0].content.parts[0].text"
```
The `response_path` is missing the `$` root prefix used in `templates/eval-template.yaml` (line 60). While some JSONPath implementations allow omitting it, including it ensures compatibility with standard libraries like jsonpath-ng and maintains consistency with the project's own templates.
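As a quick sanity check, the rooted path extracts cleanly with jsonpath-ng (an illustrative snippet, not project code; the response object is a hypothetical stand-in):

```python
from jsonpath_ng import parse

# Hypothetical parsed Gemini response, for illustration only
response_json = {
    "candidates": [{"content": {"parts": [{"text": "example answer"}]}}]
}

expr = parse("$.candidates[0].content.parts[0].text")
matches = expr.find(response_json)
text = matches[0].value if matches else None  # -> "example answer"
```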
response_path: "$.candidates[0].content.parts[0].text"| description: "Concrete HTML/CSS for a constrained spec. Tests whether the model can produce usable code, not generic templates." | ||
| input: "Write a mobile-first sign-up form for a B2B SaaS. Fields: work email, company name, team size (1-10/11-50/51+/dropdown). Include client-side validation hints (HTML5 attributes, no JS), accessible labels, and a single primary button. Output only the HTML+CSS, in one code block." | ||
| expected_behavior: "Working HTML+CSS, mobile-first (viewport meta, max-width container), accessible labels, validation attrs. Not just a template." | ||
| metrics: ["Substance", "Structure", "Length"] |
The `Structure` metric (defined in line 66) evaluates the presence of "headers, lists, comparison tables". Since this test case explicitly requests "only the HTML+CSS, in one code block" (line 102), the model will likely receive a low score for structure even if it follows the prompt perfectly. Consider removing this metric for this specific case to avoid false negatives.
metrics: ["Substance", "Length"]Real-world test against Gemini 2.5 Pro hit immediate 429s on every call (free-tier quota exhausted). The runner correctly captured the errors and refused to fake scores. Two improvements that came out of debugging: bin/run-eval.py - target.delay_between_cases_sec: optional sleep between cases for rate-limit respect on http targets - 429 errors now emit a clear WARN telling the user to check their quota / try a smaller model / set a delay - Captures the full HTTP error body (300 chars) into evidence so diagnosis doesn't require re-reading raw network logs eval YAML - Switched from gemini-2.5-pro to gemini-2.0-flash (more permissive free-tier limits) - Added delay_between_cases_sec: 6 - Documented why in description Note: the actual Gemini comparison didn't produce real scores in this run (quota exhausted on user's API key). The eval will need to be re-run after the user's quota resets or with a key that has access. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Working example of `/run-eval` against an external HTTP target. First cross-vendor eval in pmstack.

What's in here
`outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml` — 6 test cases:

Deep research (3)
- `dr-1-multisource-synthesis` (P0) — compare 4 LLM agent memory approaches with named systems
- `dr-2-counterfactual` (P1) — counterfactual reasoning under uncertainty
- `dr-3-citations-honesty` (P0) — list 5 academic papers (the classic hallucination trap)

UX design (3)
- `ux-1-html-form` (P1) — concrete mobile-first sign-up form (HTML+CSS, accessible)
- `ux-2-flow-critique` (P1) — prioritize 3 UX gaps in a SaaS onboarding flow
- `ux-3-experiment-design` (P1) — A/B test design with explicit assumptions
Target: `gemini-2.5-pro` via `generativelanguage.googleapis.com`. Auth via `x-goog-api-key` header sourced from `GEMINI_API_KEY` env var. Judge: `claude-sonnet-4-6` (different family, no self-grading bias).

Verified
- `FATAL: requires env var GEMINI_API_KEY` when env not set

How to run (when reviewer / user has a Gemini API key)
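A plausible invocation, assuming the runner entry point named in the commits (`bin/run-eval.py`) takes the eval YAML path as its argument (a sketch; the exact CLI is not confirmed by this PR):

```bash
export GEMINI_API_KEY="your-key-here"   # placeholder
python bin/run-eval.py outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml
```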
To compare against Claude: duplicate the YAML, change `target.type` to `claude-session` with `model: claude-sonnet-4-6`, run again, diff the two `summary.md` files (see the sketch below).
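A sketch of the swapped target block (exact nesting assumed from the field names above):

```yaml
target:
  type: claude-session
  model: claude-sonnet-4-6
```

Cost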
~$0.20 in Gemini tokens + ~$0.20 in Claude judge tokens for the full 6-case run.
Holding before merge
Will merge after the actual run produces clean output and any wording / metric tweaks land. (Easier to fix on an open PR than chase with follow-ups.)
🤖 Generated with Claude Code