
[SC-9919] Fix pandas DataFrame dtype preservation in VMDataset initialization#356

Merged
juanmleng merged 5 commits into main from
juan/sc-9919/fix-pandas-dataframe-dtype-preservation-in-vmdataset-initialization
Apr 25, 2025

@juanmleng commented Apr 23, 2025

Internal Notes for Reviewers

This PR addresses an issue where some pandas DataFrame dtypes (e.g., categorical types) are lost during VMDataset initialization. The change bypasses the VMDataset initialization and modifies only the DataFrameDataset class, which now stores the original DataFrame directly instead of converting it to NumPy arrays and back. This ensures that all pandas-specific dtype information and metadata are preserved.
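The dtype loss described above can be reproduced with plain pandas. The snippet below is a minimal sketch, not the library's code: the toy frame and its column names are assumptions, and the round trip through `DataFrame.to_numpy()` only approximates the old conversion path.

```python
import pandas as pd

# Toy frame with a categorical column (names are illustrative only).
df = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "segment": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    }
)

# Round trip through NumPy: the mixed columns collapse into one object array,
# so the rebuilt frame no longer carries the categorical dtype.
round_tripped = pd.DataFrame(df.to_numpy(), columns=df.columns)

# Keeping the DataFrame itself (here via a copy) preserves every dtype,
# including the unused category "c".
stored = df.copy()

print(round_tripped.dtypes)
print(stored.dtypes)
```

Storing the frame directly also keeps category metadata such as unused categories, which a NumPy array has no way to represent.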

BEFORE:
Screenshot 2025-04-22 at 22 21 00

AFTER:
Screenshot 2025-04-23 at 14 03 46

Testing
Successfully ran the quickstart_customer_churn_full_suite.ipynb and application_scorecard_executive.ipynb notebooks.

External Release Notes

  • Drastically reduced memory overhead when initializing VMDataset objects with vm.init_dataset().
  • Added a new copy_data option to init_dataset() that skips creating a copy of the input DataFrame when set to False. This option helps when working with large datasets in memory-restricted environments. By default, copy_data is True. Example usage:
vm_ds = vm.init_dataset(
    dataset=df,
    input_id="demo",
    target_column="target",
    copy_data=False,
)

@juanmleng added the bug ("Something isn't working") and internal ("Not to be externalized in the release notes") labels Apr 23, 2025
@juanmleng self-assigned this Apr 23, 2025
@johnwalz97 left a comment

Couple of comments to reduce memory overhead

@cachafla left a comment

Awesome. Let me do some testing with this notebook 🙂

@cachafla

Big memory savings:

vm_ds = vm.init_dataset(
    dataset=df,
    input_id="demo",
    target_column="target",
)

Small dataset:
num_rows = 500000
num_Float64 = 30
num_float64 = 30
num_int64 = 16

Before:
used 2850.8 MiB RAM in 3.40s (system mean cpu 22%, single max cpu 77%)

After:
used 478.4 MiB RAM in 1.45s (system mean cpu 18%, single max cpu 70%)

Bigger dataset:
num_rows = 1000000
num_Float64 = 50
num_float64 = 50
num_int64 = 30

Before:
used 7284.9 MiB RAM in 10.70s (system mean cpu 16%, single max cpu 100%)

After:
used 2469.3 MiB RAM in 3.55s (system mean cpu 15%, single max cpu 100%)
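
The script that generated these datasets and the tool that measured RAM are not shown in this thread. As a rough sketch of how a frame with the same column mix could be built, the helper below creates nullable `Float64`, plain `float64`, and `int64` columns. The function `make_frame`, its column names, and the small row count are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def make_frame(num_rows, num_Float64, num_float64, num_int64, seed=0):
    """Build a synthetic frame mixing nullable Float64, float64 and int64 columns."""
    rng = np.random.default_rng(seed)
    data = {}
    for i in range(num_Float64):
        # pandas nullable extension dtype (capital "F")
        data[f"F{i}"] = pd.array(rng.random(num_rows), dtype="Float64")
    for i in range(num_float64):
        # plain NumPy-backed float column
        data[f"f{i}"] = rng.random(num_rows)
    for i in range(num_int64):
        data[f"i{i}"] = rng.integers(0, 100, size=num_rows, dtype=np.int64)
    return pd.DataFrame(data)


# Scaled-down version of the column mix; raise num_rows to approach the sizes above.
df = make_frame(num_rows=1_000, num_Float64=3, num_float64=3, num_int64=2)

# Footprint of the frame itself, before any dataset wrapper is created.
frame_mib = df.memory_usage(deep=True).sum() / 2**20
print(f"{len(df)} rows x {df.shape[1]} columns: {frame_mib:.3f} MiB")
```

This reports only the size of the input frame; comparing it with process memory before and after `vm.init_dataset()` would show how much the wrapper adds.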

@juanmleng juanmleng dismissed johnwalz97’s stale review April 25, 2025 08:37

Changes have already been made.

@github-actions

PR Summary

This pull request introduces several enhancements to the DataFrameDataset class and related components in the ValidMind library:

  1. Preservation of Data Types: The DataFrameDataset class now preserves the original pandas data types, ensuring that categorical data types are maintained throughout the dataset lifecycle.

  2. copy_data Option: A new boolean parameter copy_data has been added to the DataFrameDataset and VMDataset classes. This parameter allows users to specify whether the dataset should be copied or if a reference to the original data should be maintained. By default, copy_data is set to True, meaning the data will be copied to prevent accidental modifications. If set to False, the dataset will share data with the original DataFrame, which can be useful for memory efficiency but requires caution to avoid unintended data changes.

  3. Test Enhancements: A new test, test_dtype_preserved, has been added to verify that the categorical data type is preserved when initializing a DataFrameDataset.

  4. Logging and Warnings: A warning is logged when copy_data is set to False, alerting users to the potential risks of modifying the original DataFrame.

These changes improve the flexibility and robustness of the dataset handling within the library, particularly for users dealing with large datasets or requiring specific data type handling.
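
The copy-versus-reference behaviour described in the summary can be illustrated with a small stand-in class. This is a sketch under stated assumptions: `FrameHolder`, the logger name, and the warning text are hypothetical and do not reproduce the ValidMind implementation.

```python
import logging

import pandas as pd

logger = logging.getLogger("copy_data_sketch")  # hypothetical logger name


class FrameHolder:
    """Hypothetical stand-in for a dataset wrapper; not the ValidMind class."""

    def __init__(self, df: pd.DataFrame, copy_data: bool = True):
        if copy_data:
            # Deep copy: later edits to the caller's frame do not leak in.
            self.df = df.copy()
        else:
            # Keep a reference and warn, since edits to the original now show up here.
            logger.warning(
                "copy_data=False: the dataset shares data with the original DataFrame"
            )
            self.df = df


source = pd.DataFrame({"x": [1, 2, 3], "kind": pd.Categorical(["a", "b", "a"])})
copied = FrameHolder(source, copy_data=True)
shared = FrameHolder(source, copy_data=False)

# Mutate the caller's frame in place after both holders were built.
source.loc[0, "x"] = 99

print(copied.df.loc[0, "x"])  # value captured at construction time
print(shared.df.loc[0, "x"])  # follows the caller's frame
```

In both modes the categorical dtype of `kind` survives, because the frame is never converted to NumPy; the only difference is whether later edits to `source` are visible through the holder.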

Test Suggestions

  • Test initializing DataFrameDataset with copy_data=True and verify data is copied.
  • Test initializing DataFrameDataset with copy_data=False and verify data is not copied.
  • Test that modifying the original DataFrame does not affect DataFrameDataset when copy_data=True.
  • Test that modifying the original DataFrame affects DataFrameDataset when copy_data=False.
  • Test dtype preservation for various pandas dtypes (e.g., categorical, datetime).
  • Test warning message is logged when copy_data=False.

@juanmleng juanmleng merged commit ca2140e into main Apr 25, 2025
7 checks passed
@johnwalz97 johnwalz97 deleted the juan/sc-9919/fix-pandas-dataframe-dtype-preservation-in-vmdataset-initialization branch April 25, 2025 14:25
@cachafla added the enhancement ("New feature or request") label and removed the internal ("Not to be externalized in the release notes") label Apr 25, 2025

Labels

bug (Something isn't working), enhancement (New feature or request)

3 participants