fix: calibrate evaluator verdict rubric and baseline context #186
Merged
marcellodebernardi merged 3 commits into main on Mar 3, 2026
Conversation
Contributor
Pull request overview
This PR updates the model evaluation flow to (1) correctly inject Phase 1 core metrics into the Baseline Comparison phase and (2) calibrate the Phase 6 synthesis verdict rubric to reduce overly critical grading while keeping the prompt contract stable.
Changes:
- Inject `core_metrics_report` into the Phase 5 (Baseline Comparison) runtime context so the baseline comparison prompt's documented variables match what's actually provided.
- Add an explicit PASS/CONDITIONAL_PASS/FAIL rubric to the Phase 6 synthesis prompt to better calibrate verdict assignment.
- Refresh generated code index timestamps.
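The first change above can be sketched as follows. This is a minimal illustration only: the function name `build_phase5_context` and the argument shapes are assumptions, not the actual plexe API; the point is simply that the Phase 1 report is now forwarded into the Phase 5 prompt arguments.

```python
# Hypothetical sketch of the fix; build_phase5_context and the dict keys
# are illustrative names, not the real interface in model_evaluator.py.

def build_phase5_context(phase1_results: dict, baseline_report: str) -> dict:
    """Assemble the runtime context for the Baseline Comparison prompt."""
    return {
        "baseline_report": baseline_report,
        # The fix: forward the Phase 1 core metrics so the prompt's
        # documented variables match what is actually provided.
        "core_metrics_report": phase1_results["core_metrics_report"],
    }

context = build_phase5_context(
    {"core_metrics_report": "accuracy=0.91, f1=0.88"},
    "baseline accuracy=0.72",
)
print(sorted(context))
```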
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| plexe/agents/model_evaluator.py | Passes Phase 1 metrics into Phase 5 prompt context; adds explicit verdict rubric to Phase 6 prompt. |
| plexe/CODE_INDEX.md | Updates generated timestamp. |
| tests/CODE_INDEX.md | Updates generated timestamp. |
Contributor
Greptile Summary

This PR fixes two related issues in the model evaluation flow.
Confidence Score: 5/5
| Filename | Overview |
|---|---|
| plexe/agents/model_evaluator.py | Two targeted changes: (1) injects core_metrics_report into the baseline comparison phase args, aligning the prompt's documented interface with what's actually passed; (2) adds a verdict rubric to the synthesis prompt to reduce over-critical grading. Both are straightforward and correct. |
| pyproject.toml | Patch version bump from 1.4.1 to 1.4.2, consistent with a bug-fix release. |
Last reviewed commit: b068093
…-rubric-calibration
# Conflicts:
#	plexe/CODE_INDEX.md
#	tests/CODE_INDEX.md
Contributor
Author
@greptileai please review again with latest changes
This PR tightens model evaluator synthesis verdict calibration and fixes baseline-phase context injection for core metrics. The aim is to reduce overly critical grading while keeping the evaluation prompt behavior stable and minimal in scope. It also aligns the baseline comparison prompt contract with the actual injected arguments.
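The rubric change can be illustrated as below. The rubric text and the helper `build_synthesis_prompt` are assumptions for the sketch; the actual wording lives in the Phase 6 synthesis prompt inside `model_evaluator.py`. The key idea is that the rubric is appended without otherwise altering the prompt contract.

```python
# Illustrative only: this rubric wording and helper are assumptions,
# not the exact text shipped in the Phase 6 synthesis prompt.

VERDICT_RUBRIC = """\
Assign a verdict using this rubric:
- PASS: the model meets or exceeds expectations on core metrics with no
  blocking issues.
- CONDITIONAL_PASS: the model is broadly acceptable but has specific,
  fixable weaknesses that should be addressed.
- FAIL: reserved for fundamental defects (e.g. clearly worse than the
  baseline or unusable outputs); do not assign FAIL for minor flaws.
"""

def build_synthesis_prompt(base_prompt: str) -> str:
    """Append the verdict rubric without changing the base prompt text."""
    return base_prompt + "\n\n" + VERDICT_RUBRIC

prompt = build_synthesis_prompt("Synthesize the evaluation findings.")
print("CONDITIONAL_PASS" in prompt)
```

Making the three verdict levels and their boundaries explicit is what reduces the over-critical grading the PR targets: without a rubric, a grader model tends to treat any weakness as grounds for FAIL.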
Testing

`poetry run pytest tests/unit/test_imports.py -q`