SC-2886: PII Detection for Uploaded Test Results #410

Merged
johnwalz97 merged 14 commits into main from john6797/sc-2886/spike-pii-detection-in-the-developer-framework
Aug 21, 2025
@johnwalz97
Contributor

Pull Request Description

What and why?

This PR uses Microsoft Presidio to add PII detection to the library. Detection is off by default and can be turned on with an environment variable. When enabled, every test result is checked for PII before it's uploaded, and an error is raised if any is detected. The unsafe=True flag can be used to override the check for a specific test result.
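
For readers who want a concrete picture of the behavior described above, here is a minimal sketch of an opt-in check with an `unsafe` override. It is not the code from this PR: the environment variable name `EXAMPLE_PII_DETECTION`, the function names, and the exception type are all hypothetical, and a single email regex stands in for Presidio so the sketch has no third-party dependencies.

```python
import os
import re

# Hypothetical variable name; the PR description does not state the real one.
PII_ENV_VAR = "EXAMPLE_PII_DETECTION"

# Stand-in detector: one email regex instead of Presidio's analyzer.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class PIIDetectedError(ValueError):
    """Raised when a result appears to contain PII (hypothetical name)."""


def pii_detection_enabled() -> bool:
    # Off unless the variable is explicitly set to a truthy value.
    return os.environ.get(PII_ENV_VAR, "").lower() in {"1", "true", "enabled"}


def check_before_upload(text: str, unsafe: bool = False) -> None:
    """Raise if PII is found, unless detection is off or unsafe=True."""
    if unsafe or not pii_detection_enabled():
        return
    if _EMAIL.search(text):
        raise PIIDetectedError("possible PII found; pass unsafe=True to override")
```

With the variable unset, `check_before_upload` returns without scanning. With it set, a string containing an email address raises unless the caller passes `unsafe=True`.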

How to test

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

- Bump Poetry version to 2.1.3 in `poetry.lock`.
- Introduce optional PII detection capabilities using Microsoft Presidio in `README.md`.
- Add `pii_filter.py` for detecting and masking PII in test results.
- Modify `check_for_sensitive_data` to utilize the new PII detection functionality.
- Update `pyproject.toml` to include `presidio-analyzer` as an optional dependency.
- Adjust `TestResult` class to ensure PII checks are performed correctly.
@nibalizer
Contributor

Looks good to me. Consider allowing the user to PII scan data before it goes to the LLM. Possible to set the env var to a tuple: ["strict", "results", "anonymize", "off"]

- Update README.md to reflect new PII filtering options, replacing "Enable PII detection" with "Configure PII filtering" and detailing available modes.
- Modify run_e2e_notebooks.py to accept a new command-line option for PII filtering mode, allowing users to specify their desired filtering behavior during notebook execution.
- Implement PII filtering in test descriptions by adding a new function to filter PII from summaries before sending to the LLM.
- Introduce an Enum for PII filtering modes in pii_filter.py, improving clarity and maintainability of PII filtering logic.
- Update existing functions to utilize the new PII filtering capabilities, ensuring that PII is appropriately handled in test results and descriptions.
…nctionality

- Update README.md to change "PII filtering" to "PII detection" for clarity and consistency.
- Modify run_e2e_notebooks.py to reflect the new environment variable for PII detection.
- Refactor test descriptions to check for PII content instead of filtering it, raising exceptions when PII is detected.
- Rename PII filtering-related functions and enums in pii_filter.py to align with the new terminology.
- Ensure all references to PII handling are updated to use the new detection logic.
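
As an illustration of the enum-based mode handling these commits describe, the sketch below models the four mode values that the PR summary names (`disabled`, `test_results`, `test_descriptions`, `all`). The class name, the helper names, and the choice to treat unknown values as disabled are assumptions, not the contents of `pii_filter.py`.

```python
from enum import Enum
from typing import Optional


class PIIDetectionMode(Enum):
    """Hypothetical enum; values mirror the modes named in the PR summary."""

    DISABLED = "disabled"
    TEST_RESULTS = "test_results"
    TEST_DESCRIPTIONS = "test_descriptions"
    ALL = "all"


def parse_mode(raw: Optional[str]) -> PIIDetectionMode:
    """Map a raw setting to a mode; unset or unknown values mean disabled."""
    try:
        return PIIDetectionMode((raw or "").strip().lower())
    except ValueError:
        return PIIDetectionMode.DISABLED


def should_scan_results(mode: PIIDetectionMode) -> bool:
    # Test results are scanned in "test_results" and "all" modes.
    return mode in (PIIDetectionMode.TEST_RESULTS, PIIDetectionMode.ALL)


def should_scan_descriptions(mode: PIIDetectionMode) -> bool:
    # Descriptions are scanned in "test_descriptions" and "all" modes.
    return mode in (PIIDetectionMode.TEST_DESCRIPTIONS, PIIDetectionMode.ALL)
```

Falling back to `DISABLED` keeps the feature opt-in when the setting is mistyped. A stricter design could raise on unknown values instead, so that a typo does not silently turn detection off.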
@validbeck
Collaborator

Just watching your demo during sprint kickoff — I don't see the demo code you're showing off in this PR, perhaps it needs to be added as a mini notebook to notebooks/how_to?

…port

- Introduce Presidio Structured for improved PII detection in structured data.
- Update `check_table_for_pii` and related functions to utilize structured analysis when available.
- Implement lazy loading for Presidio Structured components to ensure compatibility.
- Modify `generate_description` and `TestResult` classes to include PII checks for tables and descriptions.
- Update dependencies in `pyproject.toml` and `poetry.lock` to include `presidio-structured`.
- Enhance error handling and logging for PII detection failures.
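
The lazy-loading point above can be sketched as follows. This is one common pattern for optional dependencies, not necessarily the one used in this PR; `load_optional` and `structured_analysis_available` are hypothetical helper names. The only library detail relied on is that the `presidio-structured` package is imported as `presidio_structured`.

```python
import importlib
from functools import lru_cache
from types import ModuleType
from typing import Optional


@lru_cache(maxsize=None)
def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import an optional dependency on first use; return None if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def structured_analysis_available() -> bool:
    # Nothing is imported until this is first called, so importing the
    # surrounding module stays cheap when the extra is not installed.
    return load_optional("presidio_structured") is not None
```

Because the result is cached, including a `None` result, installing the package later in the same process would not be picked up without calling `load_optional.cache_clear()`.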
@johnwalz97 johnwalz97 added enhancement New feature or request internal Not to be externalized in the release notes labels Aug 19, 2025
@johnwalz97 johnwalz97 requested review from cachafla and removed request for cachafla August 20, 2025 15:12
@johnwalz97
Contributor Author

Just watching your demo during sprint kickoff — I don't see the demo code you're showing off in this PR, perhaps it needs to be added as a mini notebook to notebooks/how_to?

Sorry, just saw this @validbeck ... I created a notebook and it's available in this PR: notebooks/how_to/configure_pii_detection.ipynb

@github-actions
Contributor

PR Summary

This PR introduces significant enhancements centered around two main areas:

  1. Integration Workflow Updates

    • The GitHub Actions integration workflow has been refactored to remove the use of an older virtual environment (sdist-venv) in favor of a uniform environment (all-venv). This simplifies the workflow by installing the built package, additional dependencies, and creating a Jupyter kernel consistently from the same environment.
    • Logging improvements have been added to the end-to-end notebook runner. The script now prints the current PII detection mode for debugging purposes and accepts a CLI option to configure the PII detection mode dynamically.
  2. PII Detection Feature

    • New documentation sections have been added to the README and a dedicated Jupyter Notebook to explain how to configure and test the optional PII detection capabilities using Microsoft Presidio. This feature allows users to detect and block sensitive data in test descriptions and results.
    • In the build configuration (pyproject.toml), dependencies for PII detection (presidio-analyzer and presidio-structured) have been added.
    • A new module (pii_filter.py) has been implemented which provides an enum for PII detection modes, along with utility functions (scan_text and scan_df) that detect and raise errors if PII is found in either a text string or a pandas DataFrame.
    • Functions in the result logging logic now incorporate PII detection by checking tables and description texts, raising errors when sensitive data is detected under specific PII detection modes. This helps prevent accidental logging of sensitive data.
    • Minor code formatting improvements and test adjustments have also been made to align with these changes.
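
To make the `scan_text` and `scan_df` utilities mentioned above more concrete, here is a rough sketch of their shape. Only the two function names come from the summary; the signatures, the exception type, and `find_entities` are assumptions. Two regexes stand in for Presidio's analyzers so the example runs without extra packages.

```python
import re
from typing import Any, List

# Stand-in patterns; the PR uses Microsoft Presidio for detection instead.
_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


class PIIDetectedError(ValueError):
    """Hypothetical exception type for a positive detection."""


def find_entities(text: str) -> List[str]:
    """Return the names of the stand-in patterns that match the text."""
    return [name for name, pattern in _PATTERNS.items() if pattern.search(text)]


def scan_text(text: str) -> None:
    """Raise if the text appears to contain PII."""
    found = find_entities(text)
    if found:
        raise PIIDetectedError(f"PII detected in text: {', '.join(found)}")


def scan_df(table: Any) -> None:
    """Raise if any cell in any column appears to contain PII.

    Accepts anything whose .items() yields (column, values) pairs, which
    covers both a dict of lists and a pandas DataFrame.
    """
    for column, values in table.items():
        for value in values:
            found = find_entities(str(value))
            if found:
                raise PIIDetectedError(
                    f"PII detected in column {column!r}: {', '.join(found)}"
                )
```

Scanning cell by cell is simple but slow on large tables. A column-aware analyzer, which is the role the commit notes give to Presidio Structured, can classify whole columns instead.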

Overall, this PR streamlines the integration workflow and adds optional, automatic PII detection, which helps keep potentially sensitive data out of logged output.

Test Suggestions

  • Test the integration workflow by running notebooks with different PII detection modes (disabled, test_results, test_descriptions, all) to verify the correct enabling/disabling of PII scanning.
  • Create tests that feed both clean and PII-containing data to the scan_text and scan_df functions to ensure they correctly raise exceptions when appropriate.
  • Run end-to-end tests that simulate a complete logging operation to verify that the PII checks correctly block the logging of sensitive data.
  • Validate that the environment variable configuration and CLI option for PII detection mode propagate correctly throughout the system.
  • Test backward compatibility ensuring normal functionality when PII detection is disabled.
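
One way to check that a CLI option and an environment variable agree, as the suggestions above ask, is sketched below. It assumes an `argparse`-based runner, a hypothetical `--pii-detection` flag, and a hypothetical variable name `EXAMPLE_PII_DETECTION`. The real `run_e2e_notebooks.py` may use different names or a different CLI library.

```python
import argparse
import os
from typing import List, Optional

# Hypothetical names; the PR lists the modes but not the variable or flag.
PII_ENV_VAR = "EXAMPLE_PII_DETECTION"
MODES = ["disabled", "test_results", "test_descriptions", "all"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Notebook runner (sketch).")
    parser.add_argument(
        "--pii-detection",
        choices=MODES,
        default=None,
        help="PII detection mode; falls back to the environment variable.",
    )
    return parser


def apply_pii_mode(argv: Optional[List[str]] = None) -> str:
    """Resolve the mode (flag, then env var, then disabled) and export it."""
    args = build_parser().parse_args(argv)
    mode = args.pii_detection or os.environ.get(PII_ENV_VAR, "disabled")
    # Processes started afterwards inherit the exported value.
    os.environ[PII_ENV_VAR] = mode
    print(f"PII detection mode: {mode}")
    return mode
```

A propagation test can then call `apply_pii_mode(["--pii-detection", "all"])` and assert on `os.environ[PII_ENV_VAR]`, without launching any notebooks.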

@johnwalz97 johnwalz97 merged commit e32f966 into main Aug 21, 2025
17 checks passed
@johnwalz97 johnwalz97 deleted the john6797/sc-2886/spike-pii-detection-in-the-developer-framework branch August 21, 2025 16:40
Labels

enhancement New feature or request internal Not to be externalized in the release notes

4 participants