[SC-9919] Fix pandas DataFrame dtype preservation in VMDataset initialization by juanmleng · Pull Request #356 · validmind/validmind-library

juanmleng · 2025-04-23T12:05:29Z

Internal Notes for Reviewers

This PR addresses an issue where some pandas DataFrame dtypes (e.g., categorical types) are lost during VMDataset initialization. The change modifies only the DataFrameDataset class: it bypasses the NumPy conversion performed during VMDataset initialization and stores the original DataFrame directly instead of converting it to NumPy arrays and back. This preserves all pandas-specific dtype information and metadata.
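To illustrate the kind of dtype loss described above, here is a minimal sketch using only pandas. It is not the library's code: the column names and data are made up for illustration, and the round trip through `to_numpy()` is only an approximation of what a NumPy-backed dataset would do.

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame(
    {
        "segment": pd.Categorical(["a", "b", "a"]),
        "balance": [10.5, 20.0, 7.25],
    }
)

# Round trip through a NumPy array (assumed stand-in for the old behavior).
raw = df.to_numpy()
round_tripped = pd.DataFrame(raw, columns=df.columns)

# Keeping the DataFrame itself leaves the dtypes untouched.
stored = df.copy()

original_dtypes = df.dtypes.astype(str).to_dict()
round_trip_dtypes = round_tripped.dtypes.astype(str).to_dict()
stored_dtypes = stored.dtypes.astype(str).to_dict()

print("original:  ", original_dtypes)
print("round trip:", round_trip_dtypes)
print("stored:    ", stored_dtypes)
```

The categorical column no longer reports the `category` dtype after the round trip, while the directly stored copy still does.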

Testing
Successfully ran the quickstart_customer_churn_full_suite.ipynb and application_scorecard_executive.ipynb notebooks.

External Release Notes

Drastically reduced memory overhead when initializing VMDataset objects with vm.init_dataset().
Added a new copy_data option to init_dataset() to skip creating a copy of the input DataFrame. This option helps when dealing with large datasets in memory-restricted environments. By default, copy_data is True. Example usage:

vm_ds = vm.init_dataset(
    dataset=df,
    input_id="demo",
    target_column="target",
    copy_data=False,
)

validmind/vm_models/dataset/dataset.py

johnwalz97

Couple of comments to reduce memory overhead

validmind/vm_models/dataset/dataset.py

cachafla

Awesome. Let me do some testing with this notebook 🙂

cachafla · 2025-04-25T03:21:34Z

Big memory savings:

vm_ds = vm.init_dataset(
    dataset=df,
    input_id="demo",
    target_column="target",
)

Small dataset:
num_rows = 500000
num_Float64 = 30
num_float64 = 30
num_int64 = 16

Before:
used 2850.8 MiB RAM in 3.40s (system mean cpu 22%, single max cpu 77%)

After:
used 478.4 MiB RAM in 1.45s (system mean cpu 18%, single max cpu 70%)

Bigger dataset:
num_rows = 1000000
num_Float64 = 50
num_float64 = 50
num_int64 = 30

Before:
used 7284.9 MiB RAM in 10.70s (system mean cpu 16%, single max cpu 100%)

After:
used 2469.3 MiB RAM in 3.55s (system mean cpu 15%, single max cpu 100%)

Changes have already been made.
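For anyone who wants to try a comparable measurement, the sketch below builds a synthetic DataFrame with the same mix of column dtypes as the benchmark above. It is an assumption-laden reconstruction, not the script used for the numbers in this comment: the helper names `make_frame` and `frame_mib`, the random data, the column naming, and the extra `target` column are all hypothetical.

```python
import numpy as np
import pandas as pd


def make_frame(num_rows, num_Float64, num_float64, num_int64, seed=0):
    """Build a synthetic frame with nullable Float64, float64 and int64 columns."""
    rng = np.random.default_rng(seed)
    cols = {}
    for i in range(num_Float64):
        # pandas nullable extension dtype (capital F)
        cols[f"F{i}"] = pd.array(rng.random(num_rows), dtype="Float64")
    for i in range(num_float64):
        # plain NumPy float64
        cols[f"f{i}"] = rng.random(num_rows)
    for i in range(num_int64):
        cols[f"i{i}"] = rng.integers(0, 100, size=num_rows, dtype=np.int64)
    df = pd.DataFrame(cols)
    # Assumed extra binary target column so target_column="target" resolves.
    df["target"] = rng.integers(0, 2, size=num_rows)
    return df


def frame_mib(df):
    """Memory held by the frame itself, in MiB."""
    return df.memory_usage(deep=True).sum() / 2**20


# Small row count so the sketch runs quickly; scale num_rows up to
# 500000 or 1000000 to approach the shapes quoted in the benchmark.
demo = make_frame(num_rows=1_000, num_Float64=30, num_float64=30, num_int64=16)
print(demo.shape, f"{frame_mib(demo):.2f} MiB")
```

Note that this only reports the size of the input frame; the RAM figures in the comment measure process memory during vm.init_dataset(), which would need a separate profiler.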

github-actions · 2025-04-25T08:39:59Z

PR Summary

This pull request introduces several enhancements to the DataFrameDataset class and related components in the ValidMind library:

Preservation of Data Types: The DataFrameDataset class now preserves the original pandas data types, ensuring that categorical data types are maintained throughout the dataset lifecycle.
copy_data Option: A new boolean parameter copy_data has been added to the DataFrameDataset and VMDataset classes. This parameter allows users to specify whether the dataset should be copied or if a reference to the original data should be maintained. By default, copy_data is set to True, meaning the data will be copied to prevent accidental modifications. If set to False, the dataset will share data with the original DataFrame, which can be useful for memory efficiency but requires caution to avoid unintended data changes.
Test Enhancements: A new test, test_dtype_preserved, has been added to verify that the categorical data type is preserved when initializing a DataFrameDataset.
Logging and Warnings: A warning is logged when copy_data is set to False, alerting users to the potential risks of modifying the original DataFrame.

These changes improve the flexibility and robustness of the dataset handling within the library, particularly for users dealing with large datasets or requiring specific data type handling.

Test Suggestions

Test initializing DataFrameDataset with copy_data=True and verify data is copied.
Test initializing DataFrameDataset with copy_data=False and verify data is not copied.
Test that modifying the original DataFrame does not affect DataFrameDataset when copy_data=True.
Test that modifying the original DataFrame affects DataFrameDataset when copy_data=False.
Test dtype preservation for various pandas dtypes (e.g., categorical, datetime).
Test warning message is logged when copy_data=False.
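The suggestions above can be sketched against a small stand-in class. `SketchDataset` below is hypothetical and is not the library's DataFrameDataset; it only mimics the copy_data behavior described in the PR summary (copy by default, share and warn when copy_data=False), and the logger name and warning text are assumptions.

```python
import logging

import pandas as pd

logger = logging.getLogger("dataset_sketch")  # assumed logger name


class SketchDataset:
    """Hypothetical stand-in that mimics the described copy_data behavior."""

    def __init__(self, raw_dataset: pd.DataFrame, copy_data: bool = True):
        if copy_data:
            # Independent copy; dtypes such as category are kept as-is.
            self.df = raw_dataset.copy()
        else:
            logger.warning(
                "copy_data=False: dataset shares data with the original DataFrame"
            )
            self.df = raw_dataset


def check_copy_semantics():
    source = pd.DataFrame(
        {"segment": pd.Categorical(["a", "b"]), "target": [0, 1]}
    )
    copied = SketchDataset(source, copy_data=True)
    shared = SketchDataset(source, copy_data=False)

    # Modify the original after both datasets were created.
    source.loc[0, "target"] = 99

    return {
        "copied_unchanged": int(copied.df.loc[0, "target"]) == 0,
        "shared_changed": int(shared.df.loc[0, "target"]) == 99,
        "dtype_kept": str(copied.df["segment"].dtype) == "category",
        "same_object": shared.df is source,
    }


print(check_copy_semantics())
```

Real tests would target DataFrameDataset or vm.init_dataset() directly; the stand-in only shows the shape of the assertions.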

Fix dtype preservation in DataFrameDataset by bypassing NumPy conversion through the VMDataset

84b9b55

juanmleng added the bug and internal labels Apr 23, 2025

juanmleng self-assigned this Apr 23, 2025

juanmleng requested review from AnilSorathiya, cachafla and johnwalz97 April 23, 2025 12:31

cachafla reviewed Apr 23, 2025

validmind/vm_models/dataset/dataset.py (outdated, resolved)

johnwalz97 previously requested changes Apr 23, 2025

validmind/vm_models/dataset/dataset.py (outdated, resolved)

validmind/vm_models/dataset/dataset.py (outdated, resolved)

juanmleng added 3 commits April 24, 2025 12:30

Add option to not copy data in init_dataset for DataFrameDataset

738bbbc

Add unit test to verify dtype preservation

30a5e13

Merge branch 'main' into juan/sc-9919/fix-pandas-dataframe-dtype-preservation-in-vmdataset-initialization

00fd665

cachafla approved these changes Apr 24, 2025

2.8.21

ad05ba3

juanmleng merged commit ca2140e into main Apr 25, 2025
7 checks passed

johnwalz97 deleted the juan/sc-9919/fix-pandas-dataframe-dtype-preservation-in-vmdataset-initialization branch April 25, 2025 14:25

cachafla added the enhancement label and removed the internal label Apr 25, 2025

juanmleng merged 5 commits into main from juan/sc-9919/fix-pandas-dataframe-dtype-preservation-in-vmdataset-initialization