feat: add probability-aware metric support across search and evaluation #187

Merged: marcellodebernardi merged 13 commits into main from codex/probability-metrics-support on Mar 3, 2026
Conversation

@marcellodebernardi (Contributor)

This PR adds probability-aware metric support across search, baseline validation, and final evaluation while keeping the existing workflow structure intact. The aim is to enable probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss), make search/fallback selection direction-correct for both higher-is-better and lower-is-better metrics, and enforce clear predictor semantics for probability outputs.

It also adds focused unit coverage for helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance.

Testing

  • poetry run pytest tests/unit/
  • poetry run ruff check plexe/agents/baseline_builder.py plexe/agents/model_evaluator.py plexe/helpers.py plexe/search/evolutionary_search_policy.py plexe/search/journal.py plexe/search/tree_policy.py plexe/templates/inference/catboost_predictor.py plexe/templates/inference/keras_predictor.py plexe/templates/inference/lightgbm_predictor.py plexe/templates/inference/pytorch_predictor.py plexe/templates/inference/xgboost_predictor.py plexe/tools/submission.py plexe/workflow.py tests/unit/search/test_evolutionary_policy_determinism.py tests/unit/search/test_journal.py tests/unit/search/test_tree_policy_determinism.py tests/unit/test_helpers.py tests/unit/test_lightgbm_predictor.py tests/unit/agents/test_model_evaluator_prompt.py tests/unit/test_baseline_probability_validation.py tests/unit/test_catboost_predictor.py tests/unit/test_keras_predictor.py tests/unit/test_pytorch_predictor.py tests/unit/test_xgboost_predictor.py

Copilot AI review requested due to automatic review settings March 3, 2026 02:36

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds probability-aware metric support across the search, baseline validation, and final evaluation phases. It enables probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss) by: (1) routing evaluation through predict_proba() instead of predict() when these metrics are selected, (2) making search and fallback selection direction-correct for both higher- and lower-is-better metrics, and (3) enforcing clear predictor semantics for probability outputs across all inference templates.

Changes:

  • Adds metric_requires_probabilities(), normalize_probability_predictions() helpers and updates all evaluation paths (baseline validation, _evaluate_predictor, BaselineBuilderAgent) to use predict_proba() for probability metrics
  • Adds predict_proba() to all five inference templates (XGBoost, LightGBM, CatBoost, PyTorch, Keras), with logit-to-probability conversion logic for Keras
  • Makes SearchJournal direction-aware via optimization_direction, selection_score(), sort_key(), and is_better() methods; updates all sorting/selection in TreeSearchPolicy and EvolutionarySearchPolicy to respect metric direction
  • Adds comprehensive unit tests covering helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance
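As a rough illustration of the routing described above, the helper-based dispatch might look like the following. This is a minimal sketch, not the PR's actual implementation: the helper names mirror those in the PR, but the bodies and the case-insensitive membership test are assumptions.

```python
import numpy as np

# Metrics whose computation needs class probabilities rather than hard labels.
# The set mirrors the metrics named in the PR description.
PROBABILITY_METRICS = {"roc_auc", "roc_auc_ovr", "roc_auc_ovo", "log_loss"}


def metric_requires_probabilities(metric_name: str) -> bool:
    """Return True if the metric must be computed from predict_proba output."""
    return metric_name.strip().lower() in PROBABILITY_METRICS


def evaluate(predictor, X, metric_name: str) -> np.ndarray:
    """Route evaluation through predict_proba for probability metrics."""
    if metric_requires_probabilities(metric_name):
        return np.asarray(predictor.predict_proba(X))
    return np.asarray(predictor.predict(X))
```

The key property is that every evaluation path (baseline validation, search, final evaluation) calls the same predicate, so the probability-vs-label decision is made in exactly one place.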

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Summary per file:

  • plexe/helpers.py: Adds PROBABILITY_METRICS set, metric_requires_probabilities(), normalize_probability_predictions(), and updates _evaluate_predictor() to use predict_proba() for probability metrics
  • plexe/search/journal.py: Adds optimization_direction parameter to SearchJournal, plus selection_score(), is_better(), sort_key() methods; updates serialization/deserialization
  • plexe/search/tree_policy.py: Updates sorting and softmax sampling to use journal.sort_key and journal.selection_score
  • plexe/search/evolutionary_search_policy.py: Updates all performance comparisons to use direction-aware methods; fixes should_stop for lower-is-better metrics
  • plexe/workflow.py: Sets journal.optimization_direction from context.metric during checkpoint restore and fresh journal creation; fixes fallback sort to use journal.sort_key
  • plexe/tools/submission.py: Updates validate_baseline_predictor and evaluate_baseline_performance to use predict_proba() for probability metrics
  • plexe/agents/baseline_builder.py: Adds proba_note and proba_requirement prompt guidance; updates _evaluate_performance for probability metrics
  • plexe/agents/model_evaluator.py: Updates _build_agent and _get_phase_1_prompt with probability metric guidance
  • plexe/templates/inference/*.py: Adds predict_proba() to XGBoost, LightGBM, CatBoost, PyTorch templates; refactors Keras predictor to use _probabilities_from_raw() shared helper
  • tests/unit/...: New tests for all above changes plus direction-aware journal/policy behavior
  • Makefile: Adds run-titanic-proba example target
  • plexe/CODE_INDEX.md, tests/CODE_INDEX.md: Updated documentation indexes
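The direction-aware comparison logic described for SearchJournal can be sketched roughly as follows. The method names follow the review summary; the implementation details here are assumptions, not the PR's code. The core trick is mapping raw performance to a "selection score" in which larger is always better, so all sorting and comparison code can ignore the metric's direction.

```python
class SearchJournal:
    """Minimal sketch of direction-aware score comparison."""

    def __init__(self, optimization_direction: str = "higher"):
        if optimization_direction not in ("higher", "lower"):
            raise ValueError(f"Invalid direction: {optimization_direction!r}")
        self.optimization_direction = optimization_direction

    def selection_score(self, performance: float) -> float:
        # Flip the sign for lower-is-better metrics so that a larger
        # selection score always means a better node.
        if self.optimization_direction == "lower":
            return -performance
        return performance

    def is_better(self, a: float, b: float) -> bool:
        """True if performance a beats performance b under this direction."""
        return self.selection_score(a) > self.selection_score(b)

    def sort_key(self, performance: float) -> float:
        # Usable directly as a key for sorted(..., reverse=True),
        # which then yields best-first ordering for either direction.
        return self.selection_score(performance)
```

With this in place, policies sort nodes with `sorted(perfs, key=journal.sort_key, reverse=True)` and never compare raw `n.performance` values directly.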


@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR adds end-to-end probability-aware metric support (ROC-AUC variants, log-loss) across the search, baseline validation, and final evaluation pipeline. It introduces metric_requires_probabilities / normalize_probability_predictions helpers, a validated optimization_direction property on SearchJournal with direction-aware sort/score helpers, and predict_proba implementations for all five predictor templates. All previous review-thread issues (early-stopping direction, multiclass column-count validation, is_better dead API, logit heuristic, NaN/Inf guard, legacy metadata fallthrough for every predictor, optimization_direction bypass, etc.) have been resolved with matching unit tests.

Key changes:

  • helpers.py: New PROBABILITY_METRICS set, metric_requires_probabilities, and normalize_probability_predictions (1-D binary / 2-D multiclass routing, column-count guard, single-column expansion, validated error messages); _evaluate_predictor updated to branch on predict_proba vs predict.
  • search/journal.py: optimization_direction validated property; selection_score, is_better, sort_key helpers wired into best_node, summary, delta trend, and child-node ranking.
  • search/evolutionary_search_policy.py / tree_policy.py: All raw n.performance comparisons replaced with journal helper calls; early-stopping thresholds fully direction-aware including the no-baseline lower-direction fix.
  • Predictor templates: All five templates now have predict_proba; Keras adds _probabilities_from_raw with logit detection, NaN/Inf rejection, and missing-metadata inference.
  • tools/submission.py / validation/validators.py: Split submission now validates split ratios and enforces test-split presence when required; baseline validation tools updated for probability metrics.
  • workflow.py: Journal optimization_direction set (via validated setter) at every creation/restore site; fallback selection uses journal.sort_key; canonicalized split-ratio guard added for final-evaluation path.

Minor observations:

  • roc_auc_ovo for a binary task will still surface a cryptic sklearn error because normalize_probability_predictions returns 1-D for binary but compute_metric_hardcoded passes multi_class="ovo". This is an edge case (OVO is a multiclass strategy) and partially pre-existed this PR, but an explicit guard would improve the UX.
  • Two inline TODO comments in DatasetSplitterAgent describe live limitations of the split prompts (2-way mode still instructing test output). These are worth tracking as issues before the next release.

Confidence Score: 4/5

  • Safe to merge; all prior review-thread issues resolved and test coverage is comprehensive.
  • The implementation is thorough, all previous review-thread issues have been addressed with matching unit tests, and the direction-aware logic is correct throughout. One minor logic gap (binary roc_auc_ovo surfacing a cryptic sklearn error) and two in-flight TODO comments in agent prompts keep this slightly below 5.
  • Files to watch: plexe/helpers.py (binary roc_auc_ovo path) and plexe/agents/dataset_splitter.py (TODO prompt limitations).

Important Files Changed

  • plexe/helpers.py: Adds metric_requires_probabilities and normalize_probability_predictions helpers; updates _evaluate_predictor to route through predict_proba for probability metrics. Direction logic and column-count checks are correct; all prior review-thread issues addressed.
  • plexe/search/journal.py: Adds validated optimization_direction property, selection_score, is_better, and sort_key helpers; direction-aware best_node, summary, and history computations are all correct. Serialisation round-trips correctly with backward-compatible default.
  • plexe/search/evolutionary_search_policy.py: All n.performance comparisons replaced with journal.sort_key / selection_score. Early-stopping thresholds now handle both directions, and the no-baseline edge case for lower-direction metrics is correctly fixed.
  • plexe/templates/inference/keras_predictor.py: Adds _uses_logits_output and _probabilities_from_raw with NaN/Inf guard, legacy logit heuristic, and missing-metadata fallback. All prior review-thread issues addressed; logit sigmoid/softmax paths are correct.
  • plexe/tools/submission.py: Extends get_save_split_uris_tool with Spark-level split validation and test-split enforcement; baseline validation tools updated to route through predict_proba for probability metrics. Logic is correct.
  • plexe/workflow.py: Sets journal.optimization_direction (now validated by property setter) in all journal creation and restore paths; fallback selection uses journal.sort_key; split-ratio canonicalization and 3-way guard added for final-evaluation path.
  • plexe/agents/dataset_splitter.py: Passes split_ratios through to get_save_split_uris_tool for Spark-level validation; two TODO comments flag that prompts still instruct 3-way output even for 2-way modes, which could cause accidental holdout splits.
Last reviewed commit: 2aac28d

@greptile-apps (Contributor) left a comment

32 files reviewed, 21 comments



greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/helpers.py
n_classes inferred from a potentially-incomplete sample

n_classes = len(np.unique(y_true)) counts only the classes that happen to appear in the supplied validation/evaluation sample. If even one class is absent from the sample (e.g., a rare class in a 3-class problem with a small validation split), the function can return a 1-D array instead of the required full probability matrix, silently feeding the wrong shape into roc_auc_ovr / roc_auc_ovo / log_loss.

A safer approach is to pass the expected number of classes explicitly and fall back only when the caller cannot supply it:

def normalize_probability_predictions(
    y_true: np.ndarray,
    y_pred_proba: Any,
    metric_name: str,
    n_classes: int | None = None,
) -> np.ndarray:
    ...
    if n_classes is None:
        n_classes = len(np.unique(y_true))

Alternatively, at the call-sites where task_analysis["num_classes"] is already available, pass it through so the function has a reliable class count.


plexe/search/journal.py
improvement percentage sign for lower-direction improvements is correct but non-obvious

The improvement percentage shown in the summary string is:

score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)

For lower direction: score_delta = -best - (-baseline) = baseline - best.
When best < baseline (genuine improvement), score_delta > 0, so the displayed +X% is correct.

However, the displayed raw value is always best.performance (the raw metric), while the sign of the percentage is computed via the flipped score. This is correct behaviour, but can be confusing at first read. Consider adding a comment so the intent is clear:

# For lower-direction metrics, score_delta is positive when performance improves
score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

plexe/workflow.py
Redundant optimization_direction assignment on restored journals

optimization_direction is already set correctly by SearchJournal.from_dict (via d.get("optimization_direction", "higher")). Setting it again here, and then a third time inside search_models (line 1002), adds noise without benefit. If the intent is "always trust the live metric over the checkpoint value," a single authoritative assignment in search_models (where the journal is actually handed to the policy) is sufficient, and this extra write can be removed.


@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/validation/validators.py
Test-row count inflates total when test_uri is provided but not expected

When the agent calls save_split_uris with a test_uri even though the expected ratios don't include a test split (e.g. expected_ratios = {"train": 0.7, "val": 0.3}), the block below reads and counts the test Parquet:

if test_uri:
    test_count = spark.read.parquet(test_uri).count()
    total = train_count + val_count + test_count

That inflates total, so the ratio check computes:

actual_ratios["val"] = val_count / (train + val + test)   # e.g. 0.15 instead of 0.30

which is 15 percentage points below the expected 0.30, triggering a spurious ">10% off" warning (and, once the TODO escalation lands, a hard rejection). The test_count rows also push actual_ratios["test"] above 0, but "test" is skipped in the loop because it's absent from normalized_expected.

The simplest fix is to only accumulate and count the test split when it is actually expected:

if test_uri and expects_test_split:
    test_count = spark.read.parquet(test_uri).count()
    total = train_count + val_count + test_count
else:
    test_count = 0
    total = train_count + val_count

This aligns total with the set of splits that expected_ratios describes.


plexe/helpers.py
metric variable computed but used only conditionally — and metric_name is used in the 1-D branch

metric (the lowercase-stripped form) is assigned early but is only referenced in the multiclass column-count guard much later. The 1-D error message and the "not 2D" error message use the raw metric_name argument instead:

raise ValueError(f"Metric '{metric_name}' requires per-class probabilities ...")
raise ValueError(f"Expected probability outputs to be 1D or 2D, got shape {probabilities.shape}")

Using the normalised metric in the first message would be consistent with the rest of the function (and would ensure casing quirks in caller-supplied names don't leak into error text). Minor but worth aligning:

    if probabilities.ndim == 1:
        if n_classes > 2:
            raise ValueError(f"Metric '{metric}' requires per-class probabilities for multiclass tasks.")
        return probabilities

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai we have addressed all issues raised on this PR, all current checks are passing, and this should now be ready to merge.


greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/helpers.py, line 313
roc_auc_ovo with binary classification produces a cryptic sklearn error

normalize_probability_predictions correctly returns a 1-D positive-class probability array for binary tasks (lines 123-125). However, compute_metric_hardcoded then calls:

roc_auc_score(y_true, y_pred, multi_class="ovo")

sklearn rejects a 1-D y_score when multi_class="ovo" is set, raising something like:

ValueError: multi_class must be in ('ovo', 'ovr') when y_pred is multidimensional

roc_auc_ovo is an unusual choice for binary classification (OVO is a multiclass strategy), so this may never occur in practice, but if it does the error message will be confusing because it comes from sklearn rather than this function. A defensive guard in compute_metric_hardcoded (or in normalize_probability_predictions) would give a clearer failure:

elif metric == StandardMetric.ROC_AUC_OVO.value:
    n_classes = len(np.unique(y_true))
    if n_classes == 2:
        # OVO is undefined for binary; fall back to standard binary AUC
        return float(roc_auc_score(y_true, y_pred))
    return float(roc_auc_score(y_true, y_pred, multi_class="ovo"))

plexe/agents/dataset_splitter.py, line 224
TODO comments left in production code path

Two TODO comments were left in the production _build_agent method and _build_task_prompt. These mark known limitations (2-way splits still generating test.parquet instructions; prompt always asking for 3 outputs) that could cause the agent to produce an accidental holdout split when a 2-way mode is requested.

If these aren't being fixed in this PR, consider tracking them in an issue rather than leaving them as inline TODOs, since the prompts are live agent instructions that can affect correctness of the split phase.

@marcellodebernardi marcellodebernardi merged commit b028446 into main Mar 3, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the codex/probability-metrics-support branch March 3, 2026 05:24