feat: add probability-aware metric support across search and evaluation #187

Merged: marcellodebernardi merged 13 commits into main from codex/probability-metrics-support on Mar 3, 2026
Conversation

@marcellodebernardi (Contributor)

This PR adds probability-aware metric support across search, baseline validation, and final evaluation while keeping the existing workflow structure intact. The aim is to enable probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss), make search/fallback selection direction-correct for both higher-is-better and lower-is-better metrics, and enforce clear predictor semantics for probability outputs.

It also adds focused unit coverage for helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance.

Testing

  • poetry run pytest tests/unit/
  • poetry run ruff check plexe/agents/baseline_builder.py plexe/agents/model_evaluator.py plexe/helpers.py plexe/search/evolutionary_search_policy.py plexe/search/journal.py plexe/search/tree_policy.py plexe/templates/inference/catboost_predictor.py plexe/templates/inference/keras_predictor.py plexe/templates/inference/lightgbm_predictor.py plexe/templates/inference/pytorch_predictor.py plexe/templates/inference/xgboost_predictor.py plexe/tools/submission.py plexe/workflow.py tests/unit/search/test_evolutionary_policy_determinism.py tests/unit/search/test_journal.py tests/unit/search/test_tree_policy_determinism.py tests/unit/test_helpers.py tests/unit/test_lightgbm_predictor.py tests/unit/agents/test_model_evaluator_prompt.py tests/unit/test_baseline_probability_validation.py tests/unit/test_catboost_predictor.py tests/unit/test_keras_predictor.py tests/unit/test_pytorch_predictor.py tests/unit/test_xgboost_predictor.py

Copilot AI review requested due to automatic review settings March 3, 2026 02:36

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds probability-aware metric support across the search, baseline validation, and final evaluation phases. It enables probability-based primary metrics (roc_auc, roc_auc_ovr, roc_auc_ovo, log_loss) by: (1) routing evaluation through predict_proba() instead of predict() when these metrics are selected, (2) making search and fallback selection direction-correct for both higher- and lower-is-better metrics, and (3) enforcing clear predictor semantics for probability outputs across all inference templates.

Changes:

  • Adds metric_requires_probabilities(), normalize_probability_predictions() helpers and updates all evaluation paths (baseline validation, _evaluate_predictor, BaselineBuilderAgent) to use predict_proba() for probability metrics
  • Adds predict_proba() to all five inference templates (XGBoost, LightGBM, CatBoost, PyTorch, Keras), with logit-to-probability conversion logic for Keras
  • Makes SearchJournal direction-aware via optimization_direction, selection_score(), sort_key(), and is_better() methods; updates all sorting/selection in TreeSearchPolicy and EvolutionarySearchPolicy to respect metric direction
  • Adds comprehensive unit tests covering helper routing, search direction behavior, predictor probability semantics, baseline validation, and evaluator prompt guidance
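As a rough illustration of the routing described above, the helper-based dispatch might look like the following. This is a minimal sketch, not the PR's actual implementation: the helper names mirror those in the PR, but the bodies and the case-insensitive membership test are assumptions.

```python
import numpy as np

# Metrics whose computation needs class probabilities rather than hard labels.
# The set mirrors the metrics named in the PR description.
PROBABILITY_METRICS = {"roc_auc", "roc_auc_ovr", "roc_auc_ovo", "log_loss"}


def metric_requires_probabilities(metric_name: str) -> bool:
    """Return True if the metric must be computed from predict_proba output."""
    return metric_name.strip().lower() in PROBABILITY_METRICS


def evaluate(predictor, X, metric_name: str) -> np.ndarray:
    """Route evaluation through predict_proba for probability metrics."""
    if metric_requires_probabilities(metric_name):
        return np.asarray(predictor.predict_proba(X))
    return np.asarray(predictor.predict(X))
```

The key property is that every evaluation path (baseline validation, search, final evaluation) calls the same predicate, so the probability-vs-label decision is made in exactly one place.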

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Summary per file:

  • plexe/helpers.py: Adds PROBABILITY_METRICS set, metric_requires_probabilities(), normalize_probability_predictions(), and updates _evaluate_predictor() to use predict_proba() for probability metrics
  • plexe/search/journal.py: Adds optimization_direction parameter to SearchJournal, plus selection_score(), is_better(), sort_key() methods; updates serialization/deserialization
  • plexe/search/tree_policy.py: Updates sorting and softmax sampling to use journal.sort_key and journal.selection_score
  • plexe/search/evolutionary_search_policy.py: Updates all performance comparisons to use direction-aware methods; fixes should_stop for lower-is-better metrics
  • plexe/workflow.py: Sets journal.optimization_direction from context.metric during checkpoint restore and fresh journal creation; fixes fallback sort to use journal.sort_key
  • plexe/tools/submission.py: Updates validate_baseline_predictor and evaluate_baseline_performance to use predict_proba() for probability metrics
  • plexe/agents/baseline_builder.py: Adds proba_note and proba_requirement prompt guidance; updates _evaluate_performance for probability metrics
  • plexe/agents/model_evaluator.py: Updates _build_agent and _get_phase_1_prompt with probability metric guidance
  • plexe/templates/inference/*.py: Adds predict_proba() to XGBoost, LightGBM, CatBoost, PyTorch templates; refactors Keras predictor to use _probabilities_from_raw() shared helper
  • tests/unit/...: New tests for all above changes plus direction-aware journal/policy behavior
  • Makefile: Adds run-titanic-proba example target
  • plexe/CODE_INDEX.md, tests/CODE_INDEX.md: Updated documentation indexes
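The direction-aware comparison logic described for SearchJournal can be sketched roughly as follows. The method names follow the review summary; the implementation details here are assumptions, not the PR's code. The core trick is mapping raw performance to a "selection score" in which larger is always better, so all sorting and comparison code can ignore the metric's direction.

```python
class SearchJournal:
    """Minimal sketch of direction-aware score comparison."""

    def __init__(self, optimization_direction: str = "higher"):
        if optimization_direction not in ("higher", "lower"):
            raise ValueError(f"Invalid direction: {optimization_direction!r}")
        self.optimization_direction = optimization_direction

    def selection_score(self, performance: float) -> float:
        # Flip the sign for lower-is-better metrics so that a larger
        # selection score always means a better node.
        if self.optimization_direction == "lower":
            return -performance
        return performance

    def is_better(self, a: float, b: float) -> bool:
        """True if performance a beats performance b under this direction."""
        return self.selection_score(a) > self.selection_score(b)

    def sort_key(self, performance: float) -> float:
        # Usable directly as a key for sorted(..., reverse=True),
        # which then yields best-first ordering for either direction.
        return self.selection_score(performance)
```

With this in place, policies sort nodes with `sorted(perfs, key=journal.sort_key, reverse=True)` and never compare raw `n.performance` values directly.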


@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR adds end-to-end probability-aware metric support (ROC-AUC variants, log-loss) across the search, baseline validation, and final evaluation pipeline. It introduces metric_requires_probabilities / normalize_probability_predictions helpers, a validated optimization_direction property on SearchJournal with direction-aware sort/score helpers, and predict_proba implementations for all five predictor templates. All previous review-thread issues (early-stopping direction, multiclass column-count validation, is_better dead API, logit heuristic, NaN/Inf guard, legacy metadata fallthrough for every predictor, optimization_direction bypass, etc.) have been resolved with matching unit tests.

Key changes:

  • helpers.py: New PROBABILITY_METRICS set, metric_requires_probabilities, and normalize_probability_predictions (1-D binary / 2-D multiclass routing, column-count guard, single-column expansion, validated error messages); _evaluate_predictor updated to branch on predict_proba vs predict.
  • search/journal.py: optimization_direction validated property; selection_score, is_better, sort_key helpers wired into best_node, summary, delta trend, and child-node ranking.
  • search/evolutionary_search_policy.py / tree_policy.py: All raw n.performance comparisons replaced with journal helper calls; early-stopping thresholds fully direction-aware including the no-baseline lower-direction fix.
  • Predictor templates: All five templates now have predict_proba; Keras adds _probabilities_from_raw with logit detection, NaN/Inf rejection, and missing-metadata inference.
  • tools/submission.py / validation/validators.py: Split submission now validates split ratios and enforces test-split presence when required; baseline validation tools updated for probability metrics.
  • workflow.py: Journal optimization_direction set (via validated setter) at every creation/restore site; fallback selection uses journal.sort_key; canonicalized split-ratio guard added for final-evaluation path.

Minor observations:

  • roc_auc_ovo for a binary task will still surface a cryptic sklearn error because normalize_probability_predictions returns 1-D for binary but compute_metric_hardcoded passes multi_class="ovo". This is an edge case (OVO is a multiclass strategy) and partially pre-existed this PR, but an explicit guard would improve the UX.
  • Two inline TODO comments in DatasetSplitterAgent describe live limitations of the split prompts (2-way mode still instructing test output). These are worth tracking as issues before the next release.

Confidence Score: 4/5

  • Safe to merge; all prior review-thread issues resolved and test coverage is comprehensive.
  • The implementation is thorough, all previous review-thread issues have been addressed with matching unit tests, and the direction-aware logic is correct throughout. One minor logic gap (binary roc_auc_ovo surfacing a cryptic sklearn error) and two in-flight TODO comments in agent prompts keep this slightly below 5.
  • Files to watch: plexe/helpers.py (binary roc_auc_ovo path) and plexe/agents/dataset_splitter.py (TODO prompt limitations).

Important Files Changed

  • plexe/helpers.py: Adds metric_requires_probabilities and normalize_probability_predictions helpers; updates _evaluate_predictor to route through predict_proba for probability metrics. Direction logic and column-count checks are correct; all prior review-thread issues addressed.
  • plexe/search/journal.py: Adds validated optimization_direction property, selection_score, is_better, and sort_key helpers; direction-aware best_node, summary, and history computations are all correct. Serialisation round-trips correctly with backward-compatible default.
  • plexe/search/evolutionary_search_policy.py: All n.performance comparisons replaced with journal.sort_key / selection_score. Early-stopping thresholds now handle both directions, and the no-baseline edge case for lower-direction metrics is correctly fixed.
  • plexe/templates/inference/keras_predictor.py: Adds _uses_logits_output and _probabilities_from_raw with NaN/Inf guard, legacy logit heuristic, and missing-metadata fallback. All prior review-thread issues addressed; logit sigmoid/softmax paths are correct.
  • plexe/tools/submission.py: Extends get_save_split_uris_tool with Spark-level split validation and test-split enforcement; baseline validation tools updated to route through predict_proba for probability metrics. Logic is correct.
  • plexe/workflow.py: Sets journal.optimization_direction (now validated by property setter) in all journal creation and restore paths; fallback selection uses journal.sort_key; split-ratio canonicalization and 3-way guard added for final-evaluation path.
  • plexe/agents/dataset_splitter.py: Passes split_ratios through to get_save_split_uris_tool for Spark-level validation; two TODO comments flag that prompts still instruct 3-way output even for 2-way modes, which could cause accidental holdout splits.
Last reviewed commit: 2aac28d

@greptile-apps (Contributor) left a comment

32 files reviewed, 21 comments



greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/helpers.py
n_classes inferred from a potentially-incomplete sample

n_classes = len(np.unique(y_true)) counts only the classes that happen to appear in the supplied validation/evaluation sample. If even one class is absent from the sample (e.g., a rare class in a 3-class problem with a small validation split), the function can return a 1-D array instead of the required full probability matrix, silently feeding the wrong shape into roc_auc_ovr / roc_auc_ovo / log_loss.

A safer approach is to pass the expected number of classes explicitly and fall back only when the caller cannot supply it:

def normalize_probability_predictions(
    y_true: np.ndarray,
    y_pred_proba: Any,
    metric_name: str,
    n_classes: int | None = None,
) -> np.ndarray:
    ...
    if n_classes is None:
        n_classes = len(np.unique(y_true))

Alternatively, at the call-sites where task_analysis["num_classes"] is already available, pass it through so the function has a reliable class count.


plexe/search/journal.py
improvement percentage sign for lower-direction improvements is correct but non-obvious

The improvement percentage shown in the summary string is:

score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)

For lower direction: score_delta = -best - (-baseline) = baseline - best.
When best < baseline (genuine improvement), score_delta > 0, so the displayed +X% is correct.

However, the displayed raw value is always best.performance (the raw metric), while the sign of the percentage is computed via the flipped score. This is correct behaviour, but can be confusing at first read. Consider adding a comment so the intent is clear:

# For lower-direction metrics, score_delta is positive when performance improves
score_delta = self.selection_score(best.performance) - self.selection_score(self.baseline_performance)

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

plexe/workflow.py
Redundant optimization_direction assignment on restored journals

optimization_direction is already set correctly by SearchJournal.from_dict (via d.get("optimization_direction", "higher")). Setting it again here, and then a third time inside search_models (line 1002), adds noise without benefit. If the intent is "always trust the live metric over the checkpoint value," a single authoritative assignment in search_models (where the journal is actually handed to the policy) is sufficient, and this extra write can be removed.


@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes


greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/validation/validators.py
Test-row count inflates total when test_uri is provided but not expected

When the agent calls save_split_uris with a test_uri even though the expected ratios don't include a test split (e.g. expected_ratios = {"train": 0.7, "val": 0.3}), the block below reads and counts the test Parquet:

if test_uri:
    test_count = spark.read.parquet(test_uri).count()
    total = train_count + val_count + test_count

That inflates total, so the ratio check computes:

actual_ratios["val"] = val_count / (train + val + test)   # e.g. 0.15 instead of 0.30

which is 15 percentage points below the expected 0.30, triggering a spurious ">10% off" warning (and, once the TODO escalation lands, a hard rejection). The test_count rows also push actual_ratios["test"] above 0, but "test" is skipped in the loop because it's absent from normalized_expected.

The simplest fix is to only accumulate and count the test split when it is actually expected:

if test_uri and expects_test_split:
    test_count = spark.read.parquet(test_uri).count()
    total = train_count + val_count + test_count
else:
    test_count = 0
    total = train_count + val_count

This aligns total with the set of splits that expected_ratios describes.


plexe/helpers.py
metric variable computed but used only conditionally — and metric_name is used in the 1-D branch

metric (the lowercase-stripped form) is assigned early but is only referenced in the multiclass column-count guard much later. The 1-D error message and the "not 2D" error message use the raw metric_name argument instead:

raise ValueError(f"Metric '{metric_name}' requires per-class probabilities ...")
raise ValueError(f"Expected probability outputs to be 1D or 2D, got shape {probabilities.shape}")

Using the normalised metric in the first message would be consistent with the rest of the function (and would ensure casing quirks in caller-supplied names don't leak into error text). Minor but worth aligning:

    if probabilities.ndim == 1:
        if n_classes > 2:
            raise ValueError(f"Metric '{metric}' requires per-class probabilities for multiclass tasks.")
        return probabilities

@marcellodebernardi (Contributor, Author)

@greptileai please review again with latest changes

@marcellodebernardi (Contributor, Author)

@greptileai we have addressed all issues raised on this PR, all current checks are passing, and this should now be ready to merge.


greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

plexe/helpers.py, line 313
roc_auc_ovo with binary classification produces a cryptic sklearn error

normalize_probability_predictions correctly returns a 1-D positive-class probability array for binary tasks (lines 123-125). However, compute_metric_hardcoded then calls:

roc_auc_score(y_true, y_pred, multi_class="ovo")

sklearn rejects a 1-D y_score when multi_class="ovo" is set, raising something like:

ValueError: multi_class must be in ('ovo', 'ovr') when y_pred is multidimensional

roc_auc_ovo is an unusual choice for binary classification (OVO is a multiclass strategy), so this may never occur in practice, but if it does the error message will be confusing because it comes from sklearn rather than this function. A defensive guard in compute_metric_hardcoded (or in normalize_probability_predictions) would give a clearer failure:

elif metric == StandardMetric.ROC_AUC_OVO.value:
    n_classes = len(np.unique(y_true))
    if n_classes == 2:
        # OVO is undefined for binary; fall back to standard binary AUC
        return float(roc_auc_score(y_true, y_pred))
    return float(roc_auc_score(y_true, y_pred, multi_class="ovo"))

plexe/agents/dataset_splitter.py, line 224
TODO comments left in production code path

Two TODO comments were left in the production _build_agent method and _build_task_prompt. These mark known limitations (2-way splits still generating test.parquet instructions; prompt always asking for 3 outputs) that could cause the agent to produce an accidental holdout split when a 2-way mode is requested.

If these aren't being fixed in this PR, consider tracking them in an issue rather than leaving them as inline TODOs, since the prompts are live agent instructions that can affect correctness of the split phase.

@marcellodebernardi marcellodebernardi merged commit b028446 into main Mar 3, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the codex/probability-metrics-support branch March 3, 2026 05:24