
fix: calibrate evaluator verdict rubric and baseline context #186

Merged
marcellodebernardi merged 3 commits into main from
codex/model-evaluator-rubric-calibration
Mar 3, 2026
Conversation

@marcellodebernardi
Contributor

This PR tightens the model evaluator's synthesis verdict calibration and fixes baseline-phase context injection for core metrics. The aim is to reduce overly critical grading while keeping the evaluation prompt behavior stable and the change minimal in scope. It also aligns the baseline comparison prompt contract with the arguments actually injected.

Testing

poetry run pytest tests/unit/test_imports.py -q

Copilot AI review requested due to automatic review settings March 3, 2026 00:43
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the model evaluation flow to (1) correctly inject Phase 1 core metrics into the Baseline Comparison phase and (2) calibrate the Phase 6 synthesis verdict rubric to reduce overly critical grading while keeping the prompt contract stable.

Changes:

  • Inject core_metrics_report into Phase 5 (Baseline Comparison) runtime context so the baseline comparison prompt’s documented variables match what’s actually provided.
  • Add an explicit PASS/CONDITIONAL_PASS/FAIL rubric to the Phase 6 synthesis prompt to better calibrate verdict assignment.
  • Refresh generated code index timestamps.
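The context-injection fix amounts to passing the Phase 1 report into the Phase 5 prompt arguments. A minimal sketch of the pattern, assuming attribute and key names based on the review comments (`_core_metrics_report`, `core_metrics_report`); the actual layout in `plexe/agents/model_evaluator.py` may differ:

```python
# Hypothetical sketch of the Phase 5 context injection described above.
# Names are assumptions drawn from the review comments, not the exact
# plexe source.

class ModelEvaluatorAgent:
    def __init__(self) -> None:
        self._core_metrics_report: str | None = None  # populated in Phase 1

    def run_core_metrics(self) -> str:
        # Phase 1: compute core metrics. A failure here returns early,
        # so Phase 5 never runs with an empty report in the normal path.
        self._core_metrics_report = "accuracy=0.91, f1=0.88"
        return self._core_metrics_report

    def build_baseline_context(self) -> dict:
        # Phase 5 (Baseline Comparison): the prompt documents
        # core_metrics_report as an available variable, so the fix is
        # to actually inject it into the runtime context.
        return {
            "baseline_report": "majority-class baseline: accuracy=0.55",
            "core_metrics_report": self._core_metrics_report,
        }


agent = ModelEvaluatorAgent()
agent.run_core_metrics()
context = agent.build_baseline_context()
assert context["core_metrics_report"] is not None
```

Before the fix, the equivalent dict simply lacked the `core_metrics_report` key, so the prompt's documented variable resolved to nothing at runtime.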

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| plexe/agents/model_evaluator.py | Passes Phase 1 metrics into Phase 5 prompt context; adds explicit verdict rubric to Phase 6 prompt. |
| plexe/CODE_INDEX.md | Updates generated timestamp. |
| tests/CODE_INDEX.md | Updates generated timestamp. |


@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes two related issues in the ModelEvaluatorAgent and bumps the patch version to 1.4.2.

  • Baseline context fix: core_metrics_report (from Phase 1) is now correctly injected into the baseline_context dict before Phase 5 (Baseline Comparison) runs. The Phase 5 prompt already documented core_metrics_report as an available variable and instructed the agent to read model metrics from it, but the value was never actually passed in; this PR closes that gap.
  • Verdict rubric calibration: explicit FAIL / CONDITIONAL_PASS / PASS criteria are added to the synthesis prompt, along with the guard "Do not assign CONDITIONAL_PASS only because minor improvement opportunities exist." This gives the LLM-based evaluator clearer guidance and should reduce spurious downgrades.
  • Both changes are minimal in scope, well-targeted, and introduce no new abstractions or regressions.
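The rubric change can be pictured as a block of criteria appended to the synthesis prompt. In this sketch, only the quoted guard sentence is taken verbatim from the review summary; the per-verdict criteria and function names are illustrative assumptions, not the actual prompt text:

```python
# Hypothetical sketch of the Phase 6 verdict rubric. The guard sentence
# is quoted from the review summary; the rest of the rubric wording and
# the function name are illustrative, not the real plexe prompt.

VERDICT_RUBRIC = """\
Verdict rubric:
- PASS: core metrics meet expectations and no blocking defects were found.
- CONDITIONAL_PASS: metrics are acceptable, but a specific, material issue
  must be addressed before production use.
- FAIL: metrics are unacceptable or a blocking defect exists.
Do not assign CONDITIONAL_PASS only because minor improvement opportunities exist.
"""


def build_synthesis_prompt(findings: str) -> str:
    # Append the rubric rather than restructuring the prompt, keeping the
    # existing prompt contract stable.
    return (
        "Synthesize an overall verdict from the findings below.\n\n"
        f"{findings}\n\n{VERDICT_RUBRIC}"
    )


prompt = build_synthesis_prompt("Phase 1-5 findings go here.")
assert "CONDITIONAL_PASS" in prompt
```

Making the downgrade guard an explicit instruction, rather than leaving the verdict thresholds implicit, is what reduces spurious CONDITIONAL_PASS assignments.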

Confidence Score: 5/5

  • This PR is safe to merge — changes are narrowly scoped bug fixes with no new risk surface.
  • Both changes are small, well-understood, and address clear defects: the missing argument injection and the under-specified rubric. The phase sequencing guarantees _core_metrics_report is always populated before Phase 5 runs (Phase 1 failure causes an early return), so the .get() call will not silently produce None in the normal path. No new logic paths, no changes to data models or tool interfaces.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| plexe/agents/model_evaluator.py | Two targeted changes: (1) injects core_metrics_report into the baseline comparison phase args, aligning the prompt's documented interface with what's actually passed; (2) adds a verdict rubric to the synthesis prompt to reduce over-critical grading. Both are straightforward and correct. |
| pyproject.toml | Patch version bump from 1.4.1 to 1.4.2, consistent with a bug-fix release. |

Last reviewed commit: b068093

…-rubric-calibration

# Conflicts:
#	plexe/CODE_INDEX.md
#	tests/CODE_INDEX.md
@marcellodebernardi
Contributor Author

@greptileai please review again with latest changes

@marcellodebernardi marcellodebernardi merged commit 81c6b48 into main Mar 3, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the codex/model-evaluator-rubric-calibration branch March 3, 2026 01:05