
feat: add train/val gap tracking and bootstrap insights#185

Merged
marcellodebernardi merged 4 commits into main from feature/iterative-evaluation-workflow
Mar 3, 2026

Conversation

@marcellodebernardi
Contributor

Summary

  • Train/val gap tracking: evaluate_on_sample now computes the primary metric on both the training and validation samples. The train-val gap is stored on each Solution and surfaced in agent prompts, giving the hypothesiser and insight extractor a signal for overfitting and underfitting.
  • Bootstrap insight extraction: The insight extractor now runs on bootstrap (initial) solutions, not just hypothesis-driven iterations. A synthetic hypothesis is created for bootstrap rounds so early diverse attempts generate learnings for subsequent search iterations.
  • Refactored evaluation helpers: The monolithic evaluate_on_sample function is decomposed into _load_predictor and _evaluate_predictor, reducing duplication and making the train-set evaluation a simple second call on the already-loaded predictor.
  • TODO for evaluation-driven refinement loop: Added a TODO in the workflow orchestrator for a future enhancement where evaluation findings (FAIL/CONDITIONAL_PASS) feed back into targeted search iterations.
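To make the first bullet concrete, here is a minimal sketch of how a train/val gap could be stored and derived on a `Solution`; the field names `performance` and `train_performance` come from this PR, but everything else (the dataclass shape, the `train_val_gap` property) is a hypothetical illustration, not the actual plexe code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: only performance/train_performance mirror the PR;
# the dataclass layout and the gap property are assumptions.
@dataclass
class Solution:
    performance: Optional[float] = None        # validation-sample metric
    train_performance: Optional[float] = None  # training-sample metric (new field)

    @property
    def train_val_gap(self) -> Optional[float]:
        # Explicit None checks so a legitimate 0.0 metric still yields a gap.
        if self.performance is None or self.train_performance is None:
            return None
        return self.train_performance - self.performance

sol = Solution(performance=0.81, train_performance=0.95)
print(round(sol.train_val_gap, 2))  # 0.14: a large positive gap hints at overfitting
```

A gap near zero with low absolute performance would instead point at underfitting, which is the signal the hypothesiser and insight extractor consume.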

Files changed

| File | Change |
|------|--------|
| plexe/models.py | Add `train_performance` field to `Solution` with serialization support |
| plexe/helpers.py | Refactor into composable helpers; add `train_sample_uri` parameter |
| plexe/workflow.py | Update callers, synthetic bootstrap hypothesis, insight guard, TODO |
| plexe/search/journal.py | Include `train_performance` in `get_history()` entries |
| plexe/agents/hypothesiser.py | Surface train/val gap in node and history summaries |
| plexe/agents/insight_extractor.py | Surface train/val gap in variant results summary |
| tests/unit/test_models.py | `Solution` serialization round-trip tests |
| tests/unit/search/test_journal.py | Journal history `train_performance` tests |

Test plan

  • poetry run pytest tests/unit/ — 97 passed, 0 failures
  • poetry run black . && poetry run ruff check . — clean
  • make test-integration — staged integration suite

Copilot AI review requested due to automatic review settings March 2, 2026 23:21
Contributor

Copilot AI left a comment


Pull request overview

Adds train/val gap visibility to the search loop by recording training-sample performance alongside validation performance, and ensures insight extraction runs for bootstrap (initial) variants as well.

Changes:

  • Extend Solution + journal history to store/serialize train_performance.
  • Refactor evaluate_on_sample into predictor loading + reusable evaluation helper, returning (val_performance, train_performance).
  • Run insight extraction for bootstrap rounds via a synthetic hypothesis and surface train/val gap in agent summaries.
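The second bullet's decomposition can be sketched as follows. The real helpers take a `train_sample_uri` and a real model artifact; the loader, the metric, and the sample format here are stand-ins chosen only so the sketch runs, not plexe's actual API.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Illustrative sketch of splitting evaluate_on_sample into a loader and a
# reusable evaluator; all bodies below are placeholder assumptions.
def _load_predictor(solution_uri: str) -> Callable[[Sequence[Sequence[float]]], List[float]]:
    # Stand-in: a real implementation deserializes the trained model once.
    def predictor(rows):
        return [sum(r) for r in rows]
    return predictor

def _evaluate_predictor(predictor, sample) -> float:
    # Placeholder metric: mean absolute error over (features, target) pairs.
    preds = predictor([features for features, _ in sample])
    return sum(abs(p - y) for p, (_, y) in zip(preds, sample)) / len(sample)

def evaluate_on_sample(solution_uri, val_sample, train_sample=None) -> Tuple[float, Optional[float]]:
    # Load once, evaluate up to twice: the train-set pass is a second call
    # on the already-loaded predictor, as the refactor intends.
    predictor = _load_predictor(solution_uri)
    val_perf = _evaluate_predictor(predictor, val_sample)
    train_perf = _evaluate_predictor(predictor, train_sample) if train_sample else None
    return val_perf, train_perf

val = [([1.0, 2.0], 3.0), ([0.0, 1.0], 2.0)]
train = [([2.0, 2.0], 4.0)]
print(evaluate_on_sample("s3://example-model", val, train))  # (0.5, 0.0)
```

The point of the split is that the expensive step (loading) happens once per call, so adding the train-set evaluation costs only one extra inference pass.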

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| plexe/models.py | Adds `train_performance` to `Solution` and includes it in checkpoint serialization. |
| plexe/helpers.py | Refactors evaluation into `_load_predictor`/`_evaluate_predictor` and adds optional train-sample evaluation. |
| plexe/workflow.py | Updates evaluation call sites to capture train performance; enables bootstrap insight extraction; adds TODO. |
| plexe/search/journal.py | Includes `train_performance` in `get_history()` entries. |
| plexe/agents/hypothesiser.py | Surfaces val/train performance and gap in node + history summaries (with one logic issue noted). |
| plexe/agents/insight_extractor.py | Includes train performance and gap in variant result summaries. |
| tests/unit/test_models.py | Adds serialization/backward-compat tests for `train_performance`. |
| tests/unit/search/test_journal.py | Adds coverage for `train_performance` in journal history. |
| plexe/CODE_INDEX.md | Regenerated code index reflecting signature/doc updates. |
| tests/CODE_INDEX.md | Regenerated test code index reflecting new unit tests. |


@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR adds train/val gap tracking to the ML search loop and extends insight extraction to bootstrap (initial) iterations, giving the hypothesiser and insight extractor an overfitting/underfitting signal from the very first round of experiments.

Key changes:

  • evaluate_on_sample is refactored into composable _load_predictor + _evaluate_predictor helpers; the predictor is loaded once and reused for both the val and (optionally) train evaluation in the same call, which is efficient.
  • Solution.train_performance is added with None as default and proper d.get(...) deserialization, making old checkpoints fully backward-compatible.
  • A synthetic Hypothesis (expand_solution_id=-1) is created for bootstrap rounds so InsightExtractorAgent can run and seed the insight store from early diverse attempts — consistent with how PlannerAgent already uses -1 in its own bootstrap path.
  • The if variant_solutions and expand_solution_id is not None guard is correctly simplified to if variant_solutions, since hypothesis is always defined by the time Step 2e is reached (the try/except continue pattern ensures it).
  • Unit tests cover the default value, serialization, backward-compat deserialization, round-trip, and journal history for the new field.
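The backward-compatibility point above can be sketched as follows; the exact dict layout and class shape are assumptions based on the review notes, not the actual plexe serialization code.

```python
# Sketch of backward-compatible checkpoint (de)serialization: the new field
# defaults to None, and from_dict uses d.get(...) so old checkpoints that
# predate train_performance still load without a KeyError.
class Solution:
    def __init__(self, performance=None, train_performance=None):
        self.performance = performance
        self.train_performance = train_performance

    def to_dict(self):
        return {"performance": self.performance,
                "train_performance": self.train_performance}

    @classmethod
    def from_dict(cls, d):
        # d.get(...) returns None when the key is absent, which is exactly
        # the state of checkpoints written before this PR.
        return cls(performance=d.get("performance"),
                   train_performance=d.get("train_performance"))

old_checkpoint = {"performance": 0.87}  # written before the new field existed
restored = Solution.from_dict(old_checkpoint)
print(restored.train_performance)  # None: old checkpoints still load cleanly
```

The round-trip property (`from_dict(to_dict(sol))` preserving both fields) is what the new unit tests in tests/unit/test_models.py exercise.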

Confidence Score: 4/5

  • Safe to merge; logic is correct and backward-compatible with one minor style nit.
  • The implementation is clean, backward-compatible, and well-tested. The only concern is a minor pre-existing pattern (if success and perf) inherited into the new train/val gap display, which could suppress output for an exact zero-valued metric — an unlikely but technically incorrect edge case.
  • plexe/agents/hypothesiser.py — the if success and perf truthiness guard on the train/val gap block.
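The truthiness nit flagged above is easy to see in isolation. The function below is a hypothetical illustration (names are mine, not the hypothesiser's code): `if success and perf:` would skip the block when `perf == 0.0`, a valid metric value, whereas an explicit `is not None` check does not.

```python
# Hypothetical illustration of the flagged guard. The correct form checks
# for None explicitly so a metric of exactly 0.0 is still displayed.
def format_gap(success, perf, train_perf):
    if success and perf is not None and train_perf is not None:
        return f"val={perf:.3f} train={train_perf:.3f} gap={train_perf - perf:+.3f}"
    return "n/a"

print(format_gap(True, 0.0, 0.05))   # gap reported even for a zero-valued metric
print(format_gap(True, None, None))  # n/a: metrics genuinely missing
```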

Important Files Changed

| Filename | Overview |
|----------|----------|
| plexe/helpers.py | Cleanly refactored into `_load_predictor` + `_evaluate_predictor` helpers; `evaluate_on_sample` now returns a `(val_performance, train_performance)` tuple. |
| plexe/models.py | New `train_performance` field defaults to None, serialized in `to_dict`, and deserialized with `d.get(...)` — backward-compatible with old checkpoints. |
| plexe/workflow.py | Callers updated to unpack the new tuple; synthetic bootstrap hypothesis is correctly scoped inside the try block so `continue` prevents Step 2e from running on failure; `retrain_on_full_dataset` discards `train_performance` with `_` as expected. |
| plexe/agents/hypothesiser.py | Gap surfacing logic is correct; minor concern with the inherited `if success and perf` truthiness guard, which could suppress gap display when performance is exactly 0.0. |
| plexe/agents/insight_extractor.py | Train/val gap added correctly inside the `is_successful` guard; `sol.performance` is guaranteed non-None at that point, so the arithmetic is safe. |
| plexe/search/journal.py | Minimal, correct change — `train_performance` passthrough added to `get_history()` entries. |
| tests/unit/test_models.py | Good coverage: default-None, `to_dict`, backward-compat deserialization, and round-trip tests all present. |
| tests/unit/search/test_journal.py | Two focused tests cover the set and unset `train_performance` paths in `get_history()`. |

Last reviewed commit: 85a0f4c

Contributor

@greptile-apps greptile-apps bot left a comment


11 files reviewed, 6 comments


@marcellodebernardi
Contributor Author

@greptile-apps review again with latest changes


@marcellodebernardi marcellodebernardi merged commit d0d301e into main Mar 3, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the feature/iterative-evaluation-workflow branch March 3, 2026 00:20
