SC-2886: PII Detection for Uploaded Test Results by johnwalz97 · Pull Request #410 · validmind/validmind-library

johnwalz97 · 2025-08-06T17:59:51Z

Pull Request Description

What and why?

Uses Microsoft Presidio to add PII detection to the library. It is off by default but can be turned on with an environment variable. When enabled, every test result is checked for PII before it's uploaded, and an error is raised if any is detected. The unsafe=True flag can be used to override the check for a specific test result.
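The behavior described above (off unless an environment variable enables it, an error on detection, and a per-result `unsafe=True` override) could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the variable name `VALIDMIND_PII_DETECTION`, the `PIIDetectedError` class, the `check_before_upload` helper, and the email regex standing in for Presidio are all assumptions.

```python
import os
import re

# Assumed name: the PR description does not say what the variable is called.
PII_ENV_VAR = "VALIDMIND_PII_DETECTION"

# Stand-in detector: a single email regex instead of Microsoft Presidio.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class PIIDetectedError(ValueError):
    """Hypothetical error raised when a result appears to contain PII."""


def pii_detection_enabled(environ=None):
    """Return True only when the variable is set to something other than an 'off' value."""
    environ = os.environ if environ is None else environ
    value = environ.get(PII_ENV_VAR, "").strip().lower()
    return value not in {"", "0", "false", "off", "disabled"}


def check_before_upload(text, unsafe=False, environ=None):
    """Raise if detection is enabled, the caller did not opt out, and PII is found."""
    if unsafe or not pii_detection_enabled(environ):
        return
    if _EMAIL.search(text):
        raise PIIDetectedError("Possible PII found; pass unsafe=True to upload anyway.")
```

Passing `environ` explicitly keeps the sketch easy to exercise without touching the real process environment.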

How to test

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

- Bump Poetry version to 2.1.3 in `poetry.lock`.
- Introduce optional PII detection capabilities using Microsoft Presidio in `README.md`.
- Add `pii_filter.py` for detecting and masking PII in test results.
- Modify `check_for_sensitive_data` to utilize the new PII detection functionality.
- Update `pyproject.toml` to include `presidio-analyzer` as an optional dependency.
- Adjust `TestResult` class to ensure PII checks are performed correctly.

nibalizer · 2025-08-06T18:56:35Z

Looks good to me. Consider allowing the user to PII-scan data before it goes to the LLM. It may be possible to let the env var take one of several values: ["strict", "results", "anonymize", "off"]

…developer-framework

- Update README.md to reflect new PII filtering options, replacing "Enable PII detection" with "Configure PII filtering" and detailing available modes.
- Modify run_e2e_notebooks.py to accept a new command-line option for PII filtering mode, allowing users to specify their desired filtering behavior during notebook execution.
- Implement PII filtering in test descriptions by adding a new function to filter PII from summaries before sending to the LLM.
- Introduce an Enum for PII filtering modes in pii_filter.py, improving clarity and maintainability of PII filtering logic.
- Update existing functions to utilize the new PII filtering capabilities, ensuring that PII is appropriately handled in test results and descriptions.

…nctionality

- Update README.md to change "PII filtering" to "PII detection" for clarity and consistency.
- Modify run_e2e_notebooks.py to reflect the new environment variable for PII detection.
- Refactor test descriptions to check for PII content instead of filtering it, raising exceptions when PII is detected.
- Rename PII filtering-related functions and enums in pii_filter.py to align with the new terminology.
- Ensure all references to PII handling are updated to use the new detection logic.
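As a rough sketch of what a mode enum of this kind might look like, the example below uses the four mode names that appear in the bot's test suggestions (`disabled`, `test_results`, `test_descriptions`, `all`). The class name `PIIDetectionMode`, the helper names, and the environment variable name are assumptions for illustration and may not match `pii_filter.py`.

```python
import os
from enum import Enum

# Assumed variable name; the thread does not spell it out.
PII_ENV_VAR = "VALIDMIND_PII_DETECTION"


class PIIDetectionMode(Enum):
    """Hypothetical enum; values follow the modes named in the test suggestions."""

    DISABLED = "disabled"
    TEST_RESULTS = "test_results"
    TEST_DESCRIPTIONS = "test_descriptions"
    ALL = "all"


def get_mode(environ=None):
    """Read the mode from the environment, falling back to DISABLED when unset or unknown."""
    environ = os.environ if environ is None else environ
    raw = environ.get(PII_ENV_VAR, "disabled").strip().lower()
    try:
        return PIIDetectionMode(raw)
    except ValueError:
        return PIIDetectionMode.DISABLED


def should_scan_results(mode):
    """Test results (tables) are scanned in the test_results and all modes."""
    return mode in (PIIDetectionMode.TEST_RESULTS, PIIDetectionMode.ALL)


def should_scan_descriptions(mode):
    """Descriptions are scanned in the test_descriptions and all modes."""
    return mode in (PIIDetectionMode.TEST_DESCRIPTIONS, PIIDetectionMode.ALL)
```

Falling back to `DISABLED` on an unrecognized value is one possible design choice; raising a configuration error instead would surface typos earlier.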

validbeck · 2025-08-14T16:43:29Z

Just watching your demo during sprint kickoff — I don't see the demo code you're showing off in this PR, perhaps it needs to be added as a mini notebook to notebooks/how_to?

…port

- Introduce Presidio Structured for improved PII detection in structured data.
- Update `check_table_for_pii` and related functions to utilize structured analysis when available.
- Implement lazy loading for Presidio Structured components to ensure compatibility.
- Modify `generate_description` and `TestResult` classes to include PII checks for tables and descriptions.
- Update dependencies in `pyproject.toml` and `poetry.lock` to include `presidio-structured`.
- Enhance error handling and logging for PII detection failures.
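Lazy loading of an optional dependency, as mentioned in the list above, is commonly done by deferring the import until first use and tolerating its absence. The sketch below shows one generic way to do that; the helper names `load_optional` and `structured_analysis_available` are hypothetical and are not taken from the PR.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def load_optional(module_name):
    """Import a module on first use; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def structured_analysis_available():
    """True when the optional presidio-structured package can be imported."""
    return load_optional("presidio_structured") is not None
```

A caller could then fall back to plain text analysis whenever `structured_analysis_available()` returns False, so the library keeps working without the extra package.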

…ctions

…developer-framework

johnwalz97 · 2025-08-20T19:46:37Z

Just watching your demo during sprint kickoff — I don't see the demo code you're showing off in this PR, perhaps it needs to be added as a mini notebook to notebooks/how_to?

Sorry, just saw this @validbeck ... I created a notebook and it's available in this PR: notebooks/how_to/configure_pii_detection.ipynb

github-actions · 2025-08-21T15:42:20Z

PR Summary

This PR introduces significant enhancements centered around two main areas:

Integration Workflow Updates
- The GitHub Actions integration workflow has been refactored to remove the use of an older virtual environment (sdist-venv) in favor of a uniform environment (all-venv). This simplifies the workflow by installing the built package, additional dependencies, and creating a Jupyter kernel consistently from the same environment.
- Logging improvements have been added to the end-to-end notebook runner. The script now prints the current PII detection mode for debugging purposes and accepts a CLI option to configure the PII detection mode dynamically.
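A CLI option that sets the detection mode and prints it for debugging might look like the following sketch. The option name `--pii-detection`, the environment variable name, and the use of `argparse` are assumptions; the actual `run_e2e_notebooks.py` may use a different option name or CLI library.

```python
import argparse

# Assumed names for illustration only.
PII_ENV_VAR = "VALIDMIND_PII_DETECTION"
PII_MODES = ("disabled", "test_results", "test_descriptions", "all")


def parse_args(argv=None):
    """Parse the hypothetical --pii-detection option."""
    parser = argparse.ArgumentParser(description="Run end-to-end notebooks.")
    parser.add_argument(
        "--pii-detection",
        choices=PII_MODES,
        default=None,
        help="PII detection mode to export before notebooks run.",
    )
    return parser.parse_args(argv)


def apply_pii_mode(args, environ):
    """Write the chosen mode into environ (if given) and print the effective mode."""
    if args.pii_detection is not None:
        environ[PII_ENV_VAR] = args.pii_detection
    mode = environ.get(PII_ENV_VAR, "disabled")
    print(f"PII detection mode: {mode}")
    return mode
```

In a real script, `environ` would be `os.environ`, so notebook kernels started afterwards inherit the setting.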

PII Detection Feature
- New documentation sections have been added to the README and a dedicated Jupyter Notebook to explain how to configure and test the optional PII detection capabilities using Microsoft Presidio. This feature allows users to detect and block sensitive data in test descriptions and results.
- In the build configuration (pyproject.toml), dependencies for PII detection (presidio-analyzer and presidio-structured) have been added.
- A new module (pii_filter.py) has been implemented which provides an enum for PII detection modes, along with utility functions (scan_text and scan_df) that detect and raise errors if PII is found in either a text string or a pandas DataFrame.
- Functions in the result logging logic now incorporate PII detection by checking tables and description texts, raising errors when sensitive data is detected under the relevant PII detection modes. This helps prevent accidental logging of sensitive data.
- Minor code formatting improvements and test adjustments have also been made to align with these changes.
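To make the `scan_text` and `scan_df` utilities more concrete, here is a hedged sketch of how such functions could be structured. Only the two function names come from the summary; their signatures, the `PIIDetectedError` class, and the pluggable `detector` argument are assumptions. The default detector is a simple email regex used as a stand-in, while `presidio_detector` shows how Presidio's `AnalyzerEngine` could be plugged in when `presidio-analyzer` and a spaCy language model are installed.

```python
import re


class PIIDetectedError(ValueError):
    """Hypothetical exception; the PR's actual exception class may differ."""


_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def regex_detector(text):
    """Stand-in detector: reports EMAIL_ADDRESS when an email-like string is present."""
    return ["EMAIL_ADDRESS"] if _EMAIL.search(text) else []


def presidio_detector(text):
    """Presidio-backed detector; needs the optional presidio-analyzer package."""
    from presidio_analyzer import AnalyzerEngine  # imported lazily on purpose

    # A real implementation would likely cache the engine instead of rebuilding it.
    results = AnalyzerEngine().analyze(text=text, language="en")
    return sorted({result.entity_type for result in results})


def scan_text(text, detector=regex_detector):
    """Raise PIIDetectedError if the detector reports any entity in the text."""
    entities = detector(text)
    if entities:
        raise PIIDetectedError(f"PII detected in text: {', '.join(entities)}")


def scan_df(df, detector=regex_detector):
    """Scan the string cells of a DataFrame-like object, column by column."""
    for column in df.columns:
        for value in df[column]:
            if isinstance(value, str):
                entities = detector(value)
                if entities:
                    raise PIIDetectedError(
                        f"PII detected in column {column!r}: {', '.join(entities)}"
                    )
```

`scan_df` only relies on `df.columns` and `df[column]`, which a pandas DataFrame provides; calling `scan_text(text, detector=presidio_detector)` swaps in Presidio without changing the callers.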

Overall, this PR streamlines the integration workflow and adds optional PII detection that blocks test results and descriptions containing sensitive data from being logged.

Test Suggestions

- Test the integration workflow by running notebooks with different PII detection modes (disabled, test_results, test_descriptions, all) to verify the correct enabling/disabling of PII scanning.
- Create tests that feed both clean and PII-containing data to the scan_text and scan_df functions to ensure they correctly raise exceptions when appropriate.
- Run end-to-end tests that simulate a complete logging operation to verify that the PII checks correctly block the logging of sensitive data.
- Validate that the environment variable configuration and CLI option for PII detection mode propagate correctly throughout the system.
- Test backward compatibility, ensuring normal functionality when PII detection is disabled.

johnwalz97 added 2 commits August 6, 2025 13:56

refactor: remove unused import from utils.py

ff41353

johnwalz97 requested review from cachafla and nibalizer August 6, 2025 17:59

johnwalz97 added 3 commits August 7, 2025 11:47

Merge branch 'main' into john6797/sc-2886/spike-pii-detection-in-the-…

e64b6e5

…developer-framework

johnwalz97 added 4 commits August 19, 2025 10:56

chore: add noqa comments for complexity warnings in PII detection fun…

a044945

…ctions

Merge branch 'main' into john6797/sc-2886/spike-pii-detection-in-the-…

655948e

…developer-framework

feat: add notebook for documenting pii detection

2a0ea74

johnwalz97 added the enhancement (New feature or request) and internal (Not to be externalized in the release notes) labels Aug 19, 2025

johnwalz97 added 2 commits August 20, 2025 11:10

Merge branch 'main' into john6797/sc-2886/spike-pii-detection-in-the-…

30162d9

…developer-framework

feat: rename notebook

77ec96e

johnwalz97 requested review from cachafla and removed request for cachafla August 20, 2025 15:12

chore: fix broken integration tests

e85c7db

johnwalz97 added 2 commits August 21, 2025 11:29

feat: fixing pii detection

42e772d

chore: upgrading linter and fixing complaints

bec495f

cachafla approved these changes Aug 21, 2025

johnwalz97 merged commit e32f966 into main Aug 21, 2025
17 checks passed

johnwalz97 deleted the john6797/sc-2886/spike-pii-detection-in-the-developer-framework branch August 21, 2025 16:40

validbeck mentioned this pull request Aug 21, 2025

notebook: Enable PII detection in tests #416

Merged

11 tasks

johnwalz97 merged 14 commits into main from john6797/sc-2886/spike-pii-detection-in-the-developer-framework

4 participants
