feat: add train/val gap tracking and bootstrap insights #185
marcellodebernardi merged 4 commits into main
Conversation
Pull request overview
Adds train/val gap visibility to the search loop by recording training-sample performance alongside validation performance, and ensures insight extraction runs for bootstrap (initial) variants as well.
Changes:
- Extend `Solution` and journal history to store/serialize `train_performance`.
- Refactor `evaluate_on_sample` into predictor loading plus a reusable evaluation helper, returning `(val_performance, train_performance)`.
- Run insight extraction for bootstrap rounds via a synthetic hypothesis and surface the train/val gap in agent summaries.
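The decomposition described above can be sketched roughly as follows. This is a minimal illustration, not the actual `plexe/helpers.py` implementation: the function names mirror the PR, but the bodies, the callable-predictor representation, and the `(input, label)` sample shape are assumptions.

```python
# Hedged sketch of the evaluate_on_sample refactor: load the predictor once,
# then reuse a single evaluation helper for both the validation and training
# samples. Real predictor loading/metric logic in plexe differs.
def _load_predictor(solution):
    # Stand-in: the real helper would deserialize a trained predictor artifact.
    return solution["predictor"]


def _evaluate_predictor(predictor, sample):
    # Stand-in primary metric: fraction of (input, label) pairs predicted correctly.
    correct = sum(1 for x, y in sample if predictor(x) == y)
    return correct / len(sample)


def evaluate_on_sample(solution, val_sample, train_sample=None):
    # Load once, evaluate twice: validation always, training only when a
    # sample is provided, so existing callers keep working.
    predictor = _load_predictor(solution)
    val_performance = _evaluate_predictor(predictor, val_sample)
    train_performance = (
        _evaluate_predictor(predictor, train_sample) if train_sample is not None else None
    )
    return val_performance, train_performance
```

The key property is that the train-set evaluation becomes a second call on the already-loaded predictor rather than a second load.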
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| plexe/models.py | Adds `train_performance` to `Solution` and includes it in checkpoint serialization. |
| plexe/helpers.py | Refactors evaluation into `_load_predictor`/`_evaluate_predictor` and adds optional train-sample evaluation. |
| plexe/workflow.py | Updates evaluation call sites to capture train performance; enables bootstrap insight extraction; adds TODO. |
| plexe/search/journal.py | Includes `train_performance` in `get_history()` entries. |
| plexe/agents/hypothesiser.py | Surfaces val/train performance and gap in node + history summaries (with one logic issue noted). |
| plexe/agents/insight_extractor.py | Includes train performance and gap in variant result summaries. |
| tests/unit/test_models.py | Adds serialization/backward-compat tests for `train_performance`. |
| tests/unit/search/test_journal.py | Adds coverage for `train_performance` in journal history. |
| plexe/CODE_INDEX.md | Regenerated code index reflecting signature/doc updates. |
| tests/CODE_INDEX.md | Regenerated test code index reflecting new unit tests. |
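The backward-compatible serialization noted for `plexe/models.py` can be sketched as below. The field name and the `d.get(...)` pattern follow the review; the dataclass shape and method names are assumptions for illustration, not the actual `Solution` class.

```python
# Hedged sketch of backward-compatible (de)serialization: a new optional
# field defaults to None, so checkpoints written before the field existed
# still load without errors.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Solution:
    performance: Optional[float] = None
    train_performance: Optional[float] = None  # new field added by the PR

    def to_dict(self) -> dict:
        return {
            "performance": self.performance,
            "train_performance": self.train_performance,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "Solution":
        # d.get(...) returns None for old checkpoints that lack the key,
        # which matches the field's default.
        return cls(
            performance=d.get("performance"),
            train_performance=d.get("train_performance"),
        )
```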
Greptile Summary

This PR adds train/val gap tracking to the ML search loop and extends insight extraction to bootstrap (initial) iterations, giving the hypothesiser and insight extractor an overfitting/underfitting signal from the very first round of experiments. Key changes:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| plexe/helpers.py | Cleanly refactored into _load_predictor + _evaluate_predictor helpers; evaluate_on_sample now returns a `(val_performance, train_performance)` tuple. |
| plexe/models.py | New train_performance field defaults to None, serialized in to_dict, and deserialized with d.get(...) — backward-compatible with old checkpoints. |
| plexe/workflow.py | Callers updated to unpack the new tuple; synthetic bootstrap hypothesis is correctly scoped inside the try block so continue prevents Step 2e from running on failure; retrain_on_full_dataset discards train_performance with _ as expected. |
| plexe/agents/hypothesiser.py | Gap surfacing logic is correct; minor concern with the inherited if success and perf truthiness guard which could suppress gap display when performance is exactly 0.0. |
| plexe/agents/insight_extractor.py | Train/val gap added correctly inside the is_successful guard; sol.performance is guaranteed non-None at that point, so the arithmetic is safe. |
| plexe/search/journal.py | Minimal, correct change — train_performance passthrough added to get_history() entries. |
| tests/unit/test_models.py | Good coverage: default-None, to_dict, backward-compat deserialization, and round-trip tests all present. |
| tests/unit/search/test_journal.py | Two focused tests cover the set and unset train_performance paths in get_history(). |
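The truthiness concern raised for `plexe/agents/hypothesiser.py` is worth spelling out: a legitimate metric value of exactly `0.0` is falsy in Python, so a guard like `if success and perf:` silently suppresses the gap display. The function names below are hypothetical; only the pitfall itself is from the review.

```python
# Illustrates the truthiness pitfall: a 0.0 metric is falsy, so the naive
# guard drops the gap line even though the value is valid.
def gap_line_truthy(success, val_perf, train_perf):
    if success and val_perf and train_perf:  # buggy: 0.0 is treated like "missing"
        return f"gap={val_perf - train_perf:+.3f}"
    return None


def gap_line_explicit(success, val_perf, train_perf):
    # Safer guard: distinguish "missing" (None) from "zero" explicitly.
    if success and val_perf is not None and train_perf is not None:
        return f"gap={val_perf - train_perf:+.3f}"
    return None
```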
Last reviewed commit: 85a0f4c
@greptile-apps review again with latest changes
Summary
`evaluate_on_sample` now computes the primary metric on both the training and validation samples. The train/val gap is stored on each `Solution` and surfaced in agent prompts, giving the hypothesiser and insight extractor a signal for overfitting and underfitting. The `evaluate_on_sample` function is decomposed into `_load_predictor` and `_evaluate_predictor`, reducing duplication and making the train-set evaluation a simple second call on the already-loaded predictor.

Files changed
- plexe/models.py: adds `train_performance` field to `Solution` with serialization support
- plexe/helpers.py: adds `train_sample_uri` parameter
- plexe/workflow.py
- plexe/search/journal.py: includes `train_performance` in `get_history()` entries
- plexe/agents/hypothesiser.py
- plexe/agents/insight_extractor.py
- tests/unit/test_models.py
- tests/unit/search/test_journal.py

Test plan
- `poetry run pytest tests/unit/`: 97 passed, 0 failures
- `poetry run black . && poetry run ruff check .`: clean
- `make test-integration`: staged integration suite