
fix: calibrate evaluator verdict rubric and baseline context #186

Merged
marcellodebernardi merged 3 commits into main from
codex/model-evaluator-rubric-calibration
Mar 3, 2026
Conversation

@marcellodebernardi
Contributor

This PR tightens the model evaluator's synthesis verdict calibration and fixes baseline-phase context injection for core metrics. The aim is to reduce overly critical grading while keeping the evaluation prompt behavior stable and the change minimal in scope. It also aligns the baseline comparison prompt contract with the arguments actually injected.

Testing

poetry run pytest tests/unit/test_imports.py -q

Copilot AI review requested due to automatic review settings March 3, 2026 00:43
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the model evaluation flow to (1) correctly inject Phase 1 core metrics into the Baseline Comparison phase and (2) calibrate the Phase 6 synthesis verdict rubric to reduce overly critical grading while keeping the prompt contract stable.

Changes:

  • Inject core_metrics_report into Phase 5 (Baseline Comparison) runtime context so the baseline comparison prompt’s documented variables match what’s actually provided.
  • Add an explicit PASS/CONDITIONAL_PASS/FAIL rubric to the Phase 6 synthesis prompt to better calibrate verdict assignment.
  • Refresh generated code index timestamps.
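The context-injection fix amounts to passing the Phase 1 report into the Phase 5 prompt arguments. A minimal sketch of the pattern, assuming attribute and key names based on the review comments (`_core_metrics_report`, `core_metrics_report`); the actual layout in `plexe/agents/model_evaluator.py` may differ:

```python
# Hypothetical sketch of the Phase 5 context injection described above.
# Names are assumptions drawn from the review comments, not the exact
# plexe source.

class ModelEvaluatorAgent:
    def __init__(self) -> None:
        self._core_metrics_report: str | None = None  # populated in Phase 1

    def run_core_metrics(self) -> str:
        # Phase 1: compute core metrics. A failure here returns early,
        # so Phase 5 never runs with an empty report in the normal path.
        self._core_metrics_report = "accuracy=0.91, f1=0.88"
        return self._core_metrics_report

    def build_baseline_context(self) -> dict:
        # Phase 5 (Baseline Comparison): the prompt documents
        # core_metrics_report as an available variable, so the fix is
        # to actually inject it into the runtime context.
        return {
            "baseline_report": "majority-class baseline: accuracy=0.55",
            "core_metrics_report": self._core_metrics_report,
        }


agent = ModelEvaluatorAgent()
agent.run_core_metrics()
context = agent.build_baseline_context()
assert context["core_metrics_report"] is not None
```

Before the fix, the equivalent dict simply lacked the `core_metrics_report` key, so the prompt's documented variable resolved to nothing at runtime.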

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| plexe/agents/model_evaluator.py | Passes Phase 1 metrics into Phase 5 prompt context; adds explicit verdict rubric to Phase 6 prompt. |
| plexe/CODE_INDEX.md | Updates generated timestamp. |
| tests/CODE_INDEX.md | Updates generated timestamp. |


@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes two related issues in the ModelEvaluatorAgent and bumps the patch version to 1.4.2.

  • Baseline context fix: core_metrics_report (from Phase 1) is now correctly injected into the baseline_context dict before Phase 5 (Baseline Comparison) runs. The Phase 5 prompt already documented core_metrics_report as an available variable and instructed the agent to read model metrics from it, but the value was never actually passed in; this PR closes that gap.
  • Verdict rubric calibration: explicit FAIL / CONDITIONAL_PASS / PASS criteria are added to the synthesis prompt, along with the guard "Do not assign CONDITIONAL_PASS only because minor improvement opportunities exist." This gives the LLM-based evaluator clearer guidance and should reduce spurious downgrades.
  • Both changes are minimal in scope, well-targeted, and introduce no new abstractions or regressions.
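The rubric change can be pictured as a block of criteria appended to the synthesis prompt. In this sketch, only the quoted guard sentence is taken verbatim from the review summary; the per-verdict criteria and function names are illustrative assumptions, not the actual prompt text:

```python
# Hypothetical sketch of the Phase 6 verdict rubric. The guard sentence
# is quoted from the review summary; the rest of the rubric wording and
# the function name are illustrative, not the real plexe prompt.

VERDICT_RUBRIC = """\
Verdict rubric:
- PASS: core metrics meet expectations and no blocking defects were found.
- CONDITIONAL_PASS: metrics are acceptable, but a specific, material issue
  must be addressed before production use.
- FAIL: metrics are unacceptable or a blocking defect exists.
Do not assign CONDITIONAL_PASS only because minor improvement opportunities exist.
"""


def build_synthesis_prompt(findings: str) -> str:
    # Append the rubric rather than restructuring the prompt, keeping the
    # existing prompt contract stable.
    return (
        "Synthesize an overall verdict from the findings below.\n\n"
        f"{findings}\n\n{VERDICT_RUBRIC}"
    )


prompt = build_synthesis_prompt("Phase 1-5 findings go here.")
assert "CONDITIONAL_PASS" in prompt
```

Making the downgrade guard an explicit instruction, rather than leaving the verdict thresholds implicit, is what reduces spurious CONDITIONAL_PASS assignments.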

Confidence Score: 5/5

  • This PR is safe to merge — changes are narrowly scoped bug fixes with no new risk surface.
  • Both changes are small, well-understood, and address clear defects: the missing argument injection and the under-specified rubric. The phase sequencing guarantees _core_metrics_report is always populated before Phase 5 runs (Phase 1 failure causes an early return), so the .get() call will not silently produce None in the normal path. No new logic paths, no changes to data models or tool interfaces.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| plexe/agents/model_evaluator.py | Two targeted changes: (1) injects core_metrics_report into the baseline comparison phase args, aligning the prompt's documented interface with what's actually passed; (2) adds a verdict rubric to the synthesis prompt to reduce over-critical grading. Both are straightforward and correct. |
| pyproject.toml | Patch version bump from 1.4.1 to 1.4.2, consistent with a bug-fix release. |

Last reviewed commit: b068093

…-rubric-calibration

# Conflicts:
#	plexe/CODE_INDEX.md
#	tests/CODE_INDEX.md
@marcellodebernardi
Contributor Author

@greptileai please review again with latest changes

@marcellodebernardi marcellodebernardi merged commit 81c6b48 into main Mar 3, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the codex/model-evaluator-rubric-calibration branch March 3, 2026 01:05