
Support new interface assign_scores to VM dataset #407

Merged
AnilSorathiya merged 8 commits into main from
anilsorathiya/sc-11453/support-new-interface-assign-score-assign
Aug 8, 2025

Conversation

@AnilSorathiya
Contributor

@AnilSorathiya AnilSorathiya commented Aug 6, 2025

Pull Request Description

A new interface allows adding extra columns to the dataset object.

  • Support new assign_scores interface
  • Unit tests for the assign_scores interface

What and why?

  • Introduce an assign_score/assign_metrics interface to allow adding extra columns to the dataset object.
    • The enriched dataset can be leveraged for further analysis, with multiple plots and statistical functions for multi-metric comparisons and tests.

How to test

# Single metric
vm_dataset.assign_scores(vm_model, "F1")

# Multiple metrics  
vm_dataset.assign_scores(vm_model, ["F1", "Precision", "Recall"])

# With parameters
vm_dataset.assign_scores(vm_model, "ROC_AUC", average="weighted")
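
Once assigned, the computed scores become new dataset columns. A minimal sketch for inspecting them, assuming the dataset exposes its underlying DataFrame as vm_dataset.df and follows the {model.input_id}_{metric_name} naming convention described in the PR summary below:

# Inspect the newly added score columns (assumes the
# "{model.input_id}_{metric_name}" naming convention, e.g. "my_model_F1")
score_columns = [c for c in vm_dataset.df.columns if c.startswith(vm_model.input_id)]
print(vm_dataset.df[score_columns].head())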

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya AnilSorathiya added the enhancement (New feature or request) and highlight (Feature to be curated in the release notes) labels Aug 6, 2025
@cachafla
Contributor

cachafla commented Aug 6, 2025

Is there an example notebook to try out the new interface?

@juanmleng
Contributor

Is there an example notebook to try out the new interface?

For a quick test on more realistic data you can add assign_scores() to the scorecard or customer churn notebooks:

[Screenshot: assign_scores() added to an example notebook]

@johnwalz97 johnwalz97 left a comment

I'm a little unsure about why we want to add this feature, but the code lgtm.

@juanmleng
Contributor

Some thoughts:

  • In my view, the primary use case for assign_score() is to support a specific category of test, let’s call it a scorer.
  • A scorer should operate on a per-row basis, where each row represents a single experiment or prediction instance. The scorer takes this individual row as input and returns a score output for that specific case.
  • While the score is typically a single numeric value (e.g., faithfulness, answer relevancy, etc), the design does not necessarily have to be constrained to scalar values, and it could in principle return a structured output (e.g., a dictionary or vector).
  • The key is that assign_score() should compute scores on a row-by-row basis, much like what evaluate in RAGAS or DeepEval does (see the sketch after this list).
  • Right now, all our existing tests, including the unit_metrics, do not behave as scorers. Instead, they are designed to compute global metrics over the full dataset and rely on all rows across the dataset to measure things like F1, accuracy, PSI, etc., which make sense only at the dataset level.
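
To make the contrast concrete, here is a minimal, hypothetical sketch (the function names are illustrative, not existing ValidMind API) of a row-wise scorer versus a dataset-level metric:

import pandas as pd

# A scorer maps one row (one prediction instance) to one score;
# a classic metric needs the whole dataset at once.
def correctness_scorer(row):
    return float(row["y_pred"] == row["y_true"])

def accuracy_metric(df):
    return (df["y_pred"] == df["y_true"]).mean()

df = pd.DataFrame({"y_true": [1, 0, 1], "y_pred": [1, 1, 1]})
df["correctness"] = df.apply(correctness_scorer, axis=1)  # row-by-row, scorer-style
print(accuracy_metric(df))                                # one global value, metric-style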

@johnwalz97
Contributor

Thank you for this @juanmleng, that makes sense. So this is a whole new paradigm for the library.

@johnwalz97 johnwalz97 left a comment

lgtm... probably want a minor version bump for this one since it's a new interface?

@cachafla cachafla left a comment

Looks good to me but I'd suggest renaming the interface to assign_scores(). Two main reasons:

  • To stay consistent with our other assign_predictions() interface
  • To acknowledge that you can assign 1 or more scores to a dataset, not just 1

Contributor

Should this one be called ClassImbalance?

Contributor Author

The function is named ClassBalance because it measures and returns the degree of balance for each prediction, not the imbalance.
Higher scores (closer to 0.5) = more balanced classes
Lower scores (closer to 0) = more imbalanced classes
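
For illustration, one plausible per-row reading of that description (a hedged sketch only, not necessarily the actual ClassBalance implementation):

import numpy as np
import pandas as pd

def class_balance_scores(y_pred: pd.Series) -> pd.Series:
    freq = y_pred.value_counts(normalize=True)  # empirical frequency of each class
    p = y_pred.map(freq)                        # frequency of the class predicted in each row
    return np.minimum(p, 1 - p)                 # 0.5 = perfectly balanced, near 0 = heavily imbalanced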

@cachafla
Contributor

cachafla commented Aug 8, 2025

Great notebook 😄. I'm excited about the possibility of using scorers in the context of Gen AI use cases.

@juanmleng
Contributor

I think this interface looks great. A couple of suggestions:

  • A fantastic side effect of assign_scores() is that it leverages unit_metrics which were in a kind of waiting-to-be-used mode. So this is great. I like the idea of clearly defining which types of tests are supported, as it provides guidance and consistency, and I think this is a nice approach for VM-managed metrics.

  • We know users will likely want to add their own scorers, so in future iterations we may want to support custom tests. For now, we could include a note in the notebook making it clear that only unit metrics are supported, so users are aware.

  • The notebook is quite comprehensive and clearly shows everything you can do with assign_scores(). I think it would be the icing on the cake if it wrapped up with two or three of the generalized tests you implemented recently. That would make it even clearer how cool assign_scores() can be! 😊

@juanmleng juanmleng left a comment

This looks quite cool. Great job! I just left a couple of minor suggestions.

@AnilSorathiya
Contributor Author

Looks good to me but I'd suggest renaming the interface to assign_scores(). Two main reasons:

  • To stay consistent with our other assign_predictions() interface
  • To acknowledge that you can assign 1 or more scores to a dataset, not just 1

Makes sense. Renamed to assign_scores.

@AnilSorathiya
Contributor Author

  • only unit metrics are supported
  • Updated the notebook description to note that this feature supports unit_metrics only
  • The Individual Metrics section includes the new tests

@AnilSorathiya AnilSorathiya changed the title from "Support new interface assign_score to VM dataset" to "Support new interface assign_scores to VM dataset" Aug 8, 2025
@github-actions
Contributor

github-actions bot commented Aug 8, 2025

PR Summary

This PR significantly enhances the functionality for assigning metric scores directly to datasets using the assign_scores() method. Key changes include:

  1. Notebook Tutorial Update: A new interactive notebook (assign_score_complete_tutorial.ipynb) has been added. The notebook explains how to compute and add unit metric scores, handle both scalar and per-row metrics, and work with multiple models. It walks the user through installation, model initialization, prediction assignment, and ultimately using assign_scores() for various metrics including F1, Precision, Recall, ROC_AUC, BrierScore, LogLoss, and several newly introduced individual metrics (e.g., AbsoluteError, CalibrationError, and OutlierScore).

  2. Version Bump: The project version has been updated from 2.8.31 to 2.9.0 in both the pyproject.toml and validmind/__version__.py files, ensuring that the new features are appropriately versioned.

  3. Comprehensive Unit Tests: The tests in tests/test_dataset.py have been extended with several new functions to verify the behavior of assign_scores():

    • Testing single and multiple metric assignments.
    • Ensuring correct column naming based on the model's input_id.
    • Verifying that computed metric values fall within expected bounds (e.g., between 0 and 1 for accuracy-based metrics).
    • Handling edge cases such as missing predictions or an unset model input_id, which should trigger descriptive errors.
    • Validating the support for full metric IDs as well as short names.
  4. Unit Metrics Enhancements: Several new unit metric functions have been added for classification tasks, including implementations for metrics like AbsoluteError, BrierScore, CalibrationError, ClassBalance, Confidence, Correctness, LogLoss, OutlierScore, ProbabilityError, and Uncertainty. These metrics are designed to compute scores on a per-row basis and support additional parameters such as probability averaging or contamination for outlier detection.

  5. Dataset Method Improvements: The assign_scores() method is integrated into the dataset class where it now normalizes metric identifiers, computes scores using the unit metrics’ module, and adds the resulting metric scores as new columns with a consistent naming convention {model.input_id}_{metric_name}. It includes robust error handling and checks for mismatches between computed metric vector lengths and dataset lengths.
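
A highly simplified, hypothetical sketch of the flow described in point 5 (not the actual implementation; the compute_unit_metric helper is invented for illustration):

def assign_scores_sketch(dataset, model, metrics, **params):
    metrics = [metrics] if isinstance(metrics, str) else metrics
    for metric_id in metrics:
        name = metric_id.split(".")[-1]  # normalize full metric IDs to short names
        values = compute_unit_metric(metric_id, dataset, model, **params)  # invented helper
        if hasattr(values, "__len__") and len(values) != len(dataset.df):
            raise ValueError("Computed metric vector length does not match dataset length")
        dataset.df[f"{model.input_id}_{name}"] = values  # consistent column naming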

Overall, this PR provides a comprehensive workflow for integrating, computing, and validating model performance metrics directly into the dataset, thereby streamlining model evaluation and documentation integration with the ValidMind Platform.

Test Suggestions

  • Validate behavior when model.input_id is not set (should raise a ValueError).
  • Test assign_scores with a single metric and verify that the new column is added with a consistent scalar value across rows.
  • Test assign_scores with multiple metrics (both short names and full IDs) and check column naming conventions.
  • Test the failure modes: when an invalid metric name is provided and when predictions have not been assigned yet.
  • Run assign_scores on regression datasets to ensure metrics such as MeanSquaredError and RSquaredScore are computed correctly.
  • Verify that for per-row metrics (like BrierScore or LogLoss), the output vector length matches the number of dataset rows.
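
A hedged sketch of how two of these suggestions could be expressed as tests (the fixtures vm_dataset, vm_model, and vm_model_without_input_id are illustrative; the real tests live in tests/test_dataset.py):

import pytest

def test_assign_scores_requires_input_id(vm_dataset, vm_model_without_input_id):
    # An unset model.input_id should raise a descriptive error
    with pytest.raises(ValueError):
        vm_dataset.assign_scores(vm_model_without_input_id, "F1")

def test_assign_scores_multiple_metrics_column_naming(vm_dataset, vm_model):
    vm_dataset.assign_scores(vm_model, ["F1", "Precision"])
    for metric in ["F1", "Precision"]:
        assert f"{vm_model.input_id}_{metric}" in vm_dataset.df.columns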

@AnilSorathiya AnilSorathiya merged commit 49d0996 into main Aug 8, 2025
7 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-11453/support-new-interface-assign-score-assign branch August 8, 2025 10:36
johnwalz97 pushed a commit that referenced this pull request Aug 8, 2025
* add assign score interface

* unit tests for assign score

* support list of value from unit metrics

* new tests and tutorial notebook

* rename from assign_score to assign_scores

* add text that this feature supports unit_metrics

* 2.9.0
