feat: add train/val gap tracking and bootstrap insights #185
marcellodebernardi merged 4 commits into main
Conversation
Pull request overview
Adds train/val gap visibility to the search loop by recording training-sample performance alongside validation performance, and ensures insight extraction runs for bootstrap (initial) variants as well.
Changes:
- Extend `Solution` and journal history to store/serialize `train_performance`.
- Refactor `evaluate_on_sample` into predictor loading plus a reusable evaluation helper, returning `(val_performance, train_performance)`.
- Run insight extraction for bootstrap rounds via a synthetic hypothesis and surface the train/val gap in agent summaries.
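The decomposition described above can be sketched roughly as follows. This is a minimal illustration, not the actual `plexe/helpers.py` implementation: the function names mirror the PR, but the bodies, the callable-predictor representation, and the `(input, label)` sample shape are assumptions.

```python
# Hedged sketch of the evaluate_on_sample refactor: load the predictor once,
# then reuse a single evaluation helper for both the validation and training
# samples. Real predictor loading/metric logic in plexe differs.
def _load_predictor(solution):
    # Stand-in: the real helper would deserialize a trained predictor artifact.
    return solution["predictor"]


def _evaluate_predictor(predictor, sample):
    # Stand-in primary metric: fraction of (input, label) pairs predicted correctly.
    correct = sum(1 for x, y in sample if predictor(x) == y)
    return correct / len(sample)


def evaluate_on_sample(solution, val_sample, train_sample=None):
    # Load once, evaluate twice: validation always, training only when a
    # sample is provided, so existing callers keep working.
    predictor = _load_predictor(solution)
    val_performance = _evaluate_predictor(predictor, val_sample)
    train_performance = (
        _evaluate_predictor(predictor, train_sample) if train_sample is not None else None
    )
    return val_performance, train_performance
```

The key property is that the train-set evaluation becomes a second call on the already-loaded predictor rather than a second load.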
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| plexe/models.py | Adds `train_performance` to `Solution` and includes it in checkpoint serialization. |
| plexe/helpers.py | Refactors evaluation into `_load_predictor`/`_evaluate_predictor` and adds optional train-sample evaluation. |
| plexe/workflow.py | Updates evaluation call sites to capture train performance; enables bootstrap insight extraction; adds TODO. |
| plexe/search/journal.py | Includes `train_performance` in `get_history()` entries. |
| plexe/agents/hypothesiser.py | Surfaces val/train performance and gap in node + history summaries (with one logic issue noted). |
| plexe/agents/insight_extractor.py | Includes train performance and gap in variant result summaries. |
| tests/unit/test_models.py | Adds serialization/backward-compat tests for `train_performance`. |
| tests/unit/search/test_journal.py | Adds coverage for `train_performance` in journal history. |
| plexe/CODE_INDEX.md | Regenerated code index reflecting signature/doc updates. |
| tests/CODE_INDEX.md | Regenerated test code index reflecting new unit tests. |
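The backward-compatible serialization noted for `plexe/models.py` can be sketched as below. The field name and the `d.get(...)` pattern follow the review; the dataclass shape and method names are assumptions for illustration, not the actual `Solution` class.

```python
# Hedged sketch of backward-compatible (de)serialization: a new optional
# field defaults to None, so checkpoints written before the field existed
# still load without errors.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Solution:
    performance: Optional[float] = None
    train_performance: Optional[float] = None  # new field added by the PR

    def to_dict(self) -> dict:
        return {
            "performance": self.performance,
            "train_performance": self.train_performance,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "Solution":
        # d.get(...) returns None for old checkpoints that lack the key,
        # which matches the field's default.
        return cls(
            performance=d.get("performance"),
            train_performance=d.get("train_performance"),
        )
```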
Greptile Summary

This PR adds train/val gap tracking to the ML search loop and extends insight extraction to bootstrap (initial) iterations, giving the hypothesiser and insight extractor an overfitting/underfitting signal from the very first round of experiments. Key changes:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| plexe/helpers.py | Cleanly refactored into _load_predictor + _evaluate_predictor helpers; evaluate_on_sample now returns a `(val_performance, train_performance)` tuple. |
| plexe/models.py | New train_performance field defaults to None, serialized in to_dict, and deserialized with d.get(...) — backward-compatible with old checkpoints. |
| plexe/workflow.py | Callers updated to unpack the new tuple; synthetic bootstrap hypothesis is correctly scoped inside the try block so continue prevents Step 2e from running on failure; retrain_on_full_dataset discards train_performance with _ as expected. |
| plexe/agents/hypothesiser.py | Gap surfacing logic is correct; minor concern with the inherited if success and perf truthiness guard which could suppress gap display when performance is exactly 0.0. |
| plexe/agents/insight_extractor.py | Train/val gap added correctly inside the is_successful guard; sol.performance is guaranteed non-None at that point, so the arithmetic is safe. |
| plexe/search/journal.py | Minimal, correct change — train_performance passthrough added to get_history() entries. |
| tests/unit/test_models.py | Good coverage: default-None, to_dict, backward-compat deserialization, and round-trip tests all present. |
| tests/unit/search/test_journal.py | Two focused tests cover the set and unset train_performance paths in get_history(). |
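The truthiness concern raised for `plexe/agents/hypothesiser.py` is worth spelling out: a legitimate metric value of exactly `0.0` is falsy in Python, so a guard like `if success and perf:` silently suppresses the gap display. The function names below are hypothetical; only the pitfall itself is from the review.

```python
# Illustrates the truthiness pitfall: a 0.0 metric is falsy, so the naive
# guard drops the gap line even though the value is valid.
def gap_line_truthy(success, val_perf, train_perf):
    if success and val_perf and train_perf:  # buggy: 0.0 is treated like "missing"
        return f"gap={val_perf - train_perf:+.3f}"
    return None


def gap_line_explicit(success, val_perf, train_perf):
    # Safer guard: distinguish "missing" (None) from "zero" explicitly.
    if success and val_perf is not None and train_perf is not None:
        return f"gap={val_perf - train_perf:+.3f}"
    return None
```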
Last reviewed commit: 85a0f4c
@greptile-apps review again with latest changes
Summary
`evaluate_on_sample` now computes the primary metric on both the training and validation samples. The train/val gap is stored on each `Solution` and surfaced in agent prompts, giving the hypothesiser and insight extractor a signal for overfitting and underfitting. The `evaluate_on_sample` function is decomposed into `_load_predictor` and `_evaluate_predictor`, reducing duplication and making the train-set evaluation a simple second call on the already-loaded predictor.

Files changed
- plexe/models.py: adds `train_performance` field to `Solution` with serialization support
- plexe/helpers.py: adds `train_sample_uri` parameter
- plexe/workflow.py
- plexe/search/journal.py: includes `train_performance` in `get_history()` entries
- plexe/agents/hypothesiser.py
- plexe/agents/insight_extractor.py
- tests/unit/test_models.py
- tests/unit/search/test_journal.py

Test plan
- `poetry run pytest tests/unit/`: 97 passed, 0 failures
- `poetry run black . && poetry run ruff check .`: clean
- `make test-integration`: staged integration suite