
Support new interface assign_scores to VM dataset #407

Merged
AnilSorathiya merged 8 commits into main from
anilsorathiya/sc-11453/support-new-interface-assign-score-assign
Aug 8, 2025

Conversation

@AnilSorathiya
Contributor

@AnilSorathiya AnilSorathiya commented Aug 6, 2025

Pull Request Description

A new interface allows adding extra columns to the dataset object.

  • Support new assign_scores interface
  • Unit tests for the assign_scores interface

What and why?

  • Introduce an assign_score/assign_metrics interface to allow adding extra columns to the dataset object.
    • The enriched dataset can be leveraged for further analysis, with multiple plots and statistical functions for multi-metric comparisons and tests.

How to test

# Single metric
vm_dataset.assign_scores(vm_model, "F1")

# Multiple metrics  
vm_dataset.assign_scores(vm_model, ["F1", "Precision", "Recall"])

# With parameters
vm_dataset.assign_scores(vm_model, "ROC_AUC", average="weighted")
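
Once assigned, the computed scores become new dataset columns. A minimal sketch for inspecting them, assuming the dataset exposes its underlying DataFrame as vm_dataset.df and follows the {model.input_id}_{metric_name} naming convention described in the PR summary below:

# Inspect the newly added score columns (assumes the
# "{model.input_id}_{metric_name}" naming convention, e.g. "my_model_F1")
score_columns = [c for c in vm_dataset.df.columns if c.startswith(vm_model.input_id)]
print(vm_dataset.df[score_columns].head())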

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya AnilSorathiya added the enhancement (New feature or request) and highlight (Feature to be curated in the release notes) labels Aug 6, 2025
@cachafla
Contributor

cachafla commented Aug 6, 2025

Is there an example notebook to try out the new interface?

@juanmleng
Contributor

Is there an example notebook to try out the new interface?

For a quick test on more realistic data you can add assign_scores() to the scorecard or customer churn notebooks:

[Screenshot: assign_scores() added to an example notebook]

@johnwalz97 johnwalz97 left a comment

I'm a little unsure about why we want to add this feature, but the code lgtm.

@juanmleng
Contributor

Some thoughts:

  • In my view, the primary use case for assign_score() is to support a specific category of test, let’s call it a scorer.
  • A scorer should operate on a per-row basis, where each row represents a single experiment or prediction instance. The scorer takes this individual row as input and returns a score output for that specific case.
  • While the score is typically a single numeric value (e.g., faithfulness, answer relevancy, etc), the design does not necessarily have to be constrained to scalar values, and it could in principle return a structured output (e.g., a dictionary or vector).
  • The key is that assign_score() should compute scores on a row-by-row basis, much like what evaluate in RAGAS or DeepEval does (see the sketch after this list).
  • Right now, all our existing tests, including the unit_metrics, do not behave as scorers. Instead, they are designed to compute global metrics over the full dataset and rely on all rows across the dataset to measure things like F1, accuracy, PSI, etc., which make sense only at the dataset level.
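
To make the contrast concrete, here is a minimal, hypothetical sketch (the function names are illustrative, not existing ValidMind API) of a row-wise scorer versus a dataset-level metric:

import pandas as pd

# A scorer maps one row (one prediction instance) to one score;
# a classic metric needs the whole dataset at once.
def correctness_scorer(row):
    return float(row["y_pred"] == row["y_true"])

def accuracy_metric(df):
    return (df["y_pred"] == df["y_true"]).mean()

df = pd.DataFrame({"y_true": [1, 0, 1], "y_pred": [1, 1, 1]})
df["correctness"] = df.apply(correctness_scorer, axis=1)  # row-by-row, scorer-style
print(accuracy_metric(df))                                # one global value, metric-style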

@johnwalz97
Contributor

Thank you for this @juanmleng, that makes sense. So this is a whole new paradigm for the library.

@johnwalz97 johnwalz97 left a comment

lgtm... probably want a minor version bump for this one since it's a new interface?

@cachafla cachafla left a comment

Looks good to me but I'd suggest renaming the interface to assign_scores(). Two main reasons:

  • To stay consistent with our other assign_predictions() interface
  • To acknowledge that you can assign 1 or more scores to a dataset, not just 1

Contributor

Should this one be called ClassImbalance?

Contributor Author

The function is named ClassBalance because it measures and returns the degree of balance for each prediction, not the imbalance.
Higher scores (closer to 0.5) = more balanced classes
Lower scores (closer to 0) = more imbalanced classes
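
For illustration, one plausible per-row reading of that description (a hedged sketch only, not necessarily the actual ClassBalance implementation):

import numpy as np
import pandas as pd

def class_balance_scores(y_pred: pd.Series) -> pd.Series:
    freq = y_pred.value_counts(normalize=True)  # empirical frequency of each class
    p = y_pred.map(freq)                        # frequency of the class predicted in each row
    return np.minimum(p, 1 - p)                 # 0.5 = perfectly balanced, near 0 = heavily imbalanced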

@cachafla
Contributor

cachafla commented Aug 8, 2025

Great notebook 😄. I'm excited about the possibility of using scorers in the context of Gen AI use cases.

@juanmleng
Contributor

I think this interface looks great. A couple of suggestions:

  • A fantastic side effect of assign_scores() is that it leverages unit_metrics which were in a kind of waiting-to-be-used mode. So this is great. I like the idea of clearly defining which types of tests are supported, as it provides guidance and consistency, and I think this is a nice approach for VM-managed metrics.

  • We know users will likely want to add their own scorers, so in future iterations we may want to support custom tests. For now, we could include a note in the notebook making it clear that only unit metrics are supported, so users are aware.

  • The notebook is quite comprehensive and clearly shows everything you can do with assign_scores(). I think it would be the icing on the cake if it wrapped up with two or three of the generalized tests you implemented recently. That would make it even clearer how cool assign_scores() can be! 😊

@juanmleng juanmleng left a comment

This looks quite cool. Great job! I just left a couple of minor suggestions.

@AnilSorathiya
Contributor Author

Looks good to me but I'd suggest renaming the interface to assign_scores(). Two main reasons:

  • To stay consistent with our other assign_predictions() interface
  • To acknowledge that you can assign 1 or more scores to a dataset, not just 1

Makes sense. Renamed to assign_scores.

@AnilSorathiya
Contributor Author

  • only unit metrics are supported
  • Updated the notebook description to note that this feature supports unit_metrics only
  • The Individual Metrics section includes the new tests

@AnilSorathiya AnilSorathiya changed the title from "Support new interface assign_score to VM dataset" to "Support new interface assign_scores to VM dataset" Aug 8, 2025
@github-actions
Contributor

github-actions bot commented Aug 8, 2025

PR Summary

This PR significantly enhances the functionality for assigning metric scores directly to datasets using the assign_scores() method. Key changes include:

  1. Notebook Tutorial Update: A new interactive notebook (assign_score_complete_tutorial.ipynb) has been added. The notebook explains how to compute and add unit metric scores, handle both scalar and per-row metrics, and work with multiple models. It walks the user through installation, model initialization, prediction assignment, and ultimately using assign_scores() for various metrics including F1, Precision, Recall, ROC_AUC, BrierScore, LogLoss, and several newly introduced individual metrics (e.g., AbsoluteError, CalibrationError, and OutlierScore).

  2. Version Bump: The project version has been updated from 2.8.31 to 2.9.0 in both the pyproject.toml and validmind/__version__.py files, ensuring that the new features are appropriately versioned.

  3. Comprehensive Unit Tests: The tests in tests/test_dataset.py have been extended with several new functions to verify the behavior of assign_scores():

    • Testing single and multiple metric assignments.
    • Ensuring correct column naming based on the model's input_id.
    • Verifying that computed metric values fall within expected bounds (e.g., between 0 and 1 for accuracy-based metrics).
    • Handling edge cases such as missing predictions or an unset model input_id, which should trigger descriptive errors.
    • Validating the support for full metric IDs as well as short names.
  4. Unit Metrics Enhancements: Several new unit metric functions have been added for classification tasks, including implementations for metrics like AbsoluteError, BrierScore, CalibrationError, ClassBalance, Confidence, Correctness, LogLoss, OutlierScore, ProbabilityError, and Uncertainty. These metrics are designed to compute scores on a per-row basis and support additional parameters such as probability averaging or contamination for outlier detection.

  5. Dataset Method Improvements: The assign_scores() method is integrated into the dataset class where it now normalizes metric identifiers, computes scores using the unit metrics’ module, and adds the resulting metric scores as new columns with a consistent naming convention {model.input_id}_{metric_name}. It includes robust error handling and checks for mismatches between computed metric vector lengths and dataset lengths.
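
A highly simplified, hypothetical sketch of the flow described in point 5 (not the actual implementation; the compute_unit_metric helper is invented for illustration):

def assign_scores_sketch(dataset, model, metrics, **params):
    metrics = [metrics] if isinstance(metrics, str) else metrics
    for metric_id in metrics:
        name = metric_id.split(".")[-1]  # normalize full metric IDs to short names
        values = compute_unit_metric(metric_id, dataset, model, **params)  # invented helper
        if hasattr(values, "__len__") and len(values) != len(dataset.df):
            raise ValueError("Computed metric vector length does not match dataset length")
        dataset.df[f"{model.input_id}_{name}"] = values  # consistent column naming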

Overall, this PR provides a comprehensive workflow for integrating, computing, and validating model performance metrics directly into the dataset, thereby streamlining model evaluation and documentation integration with the ValidMind Platform.

Test Suggestions

  • Validate behavior when model.input_id is not set (should raise a ValueError).
  • Test assign_scores with a single metric and verify that the new column is added with a consistent scalar value across rows.
  • Test assign_scores with multiple metrics (both short names and full IDs) and check column naming conventions.
  • Test the failure modes: when an invalid metric name is provided and when predictions have not been assigned yet.
  • Run assign_scores on regression datasets to ensure metrics such as MeanSquaredError and RSquaredScore are computed correctly.
  • Verify that for per-row metrics (like BrierScore or LogLoss), the output vector length matches the number of dataset rows.
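
A hedged sketch of how two of these suggestions could be expressed as tests (the fixtures vm_dataset, vm_model, and vm_model_without_input_id are illustrative; the real tests live in tests/test_dataset.py):

import pytest

def test_assign_scores_requires_input_id(vm_dataset, vm_model_without_input_id):
    # An unset model.input_id should raise a descriptive error
    with pytest.raises(ValueError):
        vm_dataset.assign_scores(vm_model_without_input_id, "F1")

def test_assign_scores_multiple_metrics_column_naming(vm_dataset, vm_model):
    vm_dataset.assign_scores(vm_model, ["F1", "Precision"])
    for metric in ["F1", "Precision"]:
        assert f"{vm_model.input_id}_{metric}" in vm_dataset.df.columns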

@AnilSorathiya AnilSorathiya merged commit 49d0996 into main Aug 8, 2025
7 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-11453/support-new-interface-assign-score-assign branch August 8, 2025 10:36
johnwalz97 pushed a commit that referenced this pull request Aug 8, 2025
* add assign score interface

* unit tests for assign score

* support list of value from unit metrics

* new tests and tutorial notebook

* rename from assign_score to assign_scores

* add text that this feature supports unit_metrics

* 2.9.0
