Data Subsetting & Attribute Filtering #61

@MaximumTrainer

Context: OpenDataMask is a Kotlin-based data masking tool that transfers data from a source to a target. We need to implement a "Subset" capability that allows users to restrict which records (rows) and attributes (columns) are extracted and populated into the target.

Objective: Implement UI-driven configuration for data subsetting and update the extraction engine/CLI to respect these filters.

  1. Domain & Persistence Layer Updates
    Entity Modification: Update the TableConfiguration (or equivalent entity/schema) to include:

extractionMode: An enum including FULL, SUBSET, and MASKED.

subsetCriteria: A string field to store a SQL WHERE clause or a filtering predicate.

selectedAttributes: A list of column names to be included in the extraction.
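The three fields above could be modelled roughly as follows. This is a minimal sketch, not the project's actual entity: `tableName` and the default values are assumptions made for illustration, and any persistence annotations the real `TableConfiguration` uses are left out.

```kotlin
// Only extractionMode, subsetCriteria and selectedAttributes come from this issue;
// tableName and the defaults are assumptions for illustration.
enum class ExtractionMode { FULL, SUBSET, MASKED }

data class TableConfiguration(
    val tableName: String,
    val extractionMode: ExtractionMode = ExtractionMode.FULL,
    // Body of a WHERE clause, stored without the WHERE keyword (assumed convention).
    val subsetCriteria: String? = null,
    // An empty list is treated here as "all columns" (assumed convention).
    val selectedAttributes: List<String> = emptyList()
)
```

Defaulting to `FULL` with no criteria and no column selection would keep existing configurations behaving as they do today once the new columns are added.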

Database Migration: Create a migration script (e.g., Flyway/Liquibase) to add these fields to the table_configs table.

  2. UI Enhancements (Frontend)
    Table Configuration View: Add a "Mode" dropdown to the table settings with a "Subset" option.

When "Subset" is selected, display a text area for "Record Filter" (e.g., created_at > '2023-01-01' or tenant_id = 5).

Implement a "Column Picker" or checkbox list that allows users to toggle which attributes/columns should be moved to the target.

API Integration: Update the frontend to send these new configuration parameters to the backend via the existing configuration endpoints.

  3. Extraction Engine Logic (Backend)
    Query Generation: Modify the DataExtractor service to dynamically build the source SELECT query:

Instead of SELECT *, it must use the selectedAttributes list: SELECT col1, col2, ....

If extractionMode is SUBSET, append the subsetCriteria to the query as its WHERE clause.
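The two query-generation rules above might be sketched like this. `buildSelectQuery` and `requireSimpleIdentifier` are hypothetical names, not existing `DataExtractor` members. Falling back to `SELECT *` when no attributes are selected is an assumption, and the identifier check is deliberately strict and not dialect-aware.

```kotlin
enum class ExtractionMode { FULL, SUBSET, MASKED }

// Hypothetical helper: accept only plain identifiers so column and table names
// cannot smuggle SQL into the generated query. Real dialects allow more.
private fun requireSimpleIdentifier(name: String): String {
    require(Regex("[A-Za-z_][A-Za-z0-9_]*").matches(name)) { "Unsupported identifier: $name" }
    return name
}

// Hypothetical function: builds the source SELECT so that both the column
// projection and the row filter run in the source database.
fun buildSelectQuery(
    table: String,
    mode: ExtractionMode,
    selectedAttributes: List<String>,
    subsetCriteria: String?
): String {
    // Assumption: an empty selection means "all columns".
    val columns = if (selectedAttributes.isEmpty()) "*"
                  else selectedAttributes.joinToString(", ") { requireSimpleIdentifier(it) }
    val base = "SELECT $columns FROM ${requireSimpleIdentifier(table)}"
    val filter = subsetCriteria?.trim().orEmpty()
    // The filter is applied only in SUBSET mode; parentheses keep it one predicate.
    return if (mode == ExtractionMode.SUBSET && filter.isNotEmpty()) "$base WHERE ($filter)" else base
}
```

For example, `buildSelectQuery("orders", ExtractionMode.SUBSET, listOf("id", "tenant_id"), "tenant_id = 5")` yields `SELECT id, tenant_id FROM orders WHERE (tenant_id = 5)`. The criteria string is user-supplied SQL, so it still needs the validation described under CLI Tooling before it is executed.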

Target Schema Alignment: Ensure that if the tool automatically creates target tables, it only creates the columns defined in the selectedAttributes list to avoid schema mismatches.
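One way to keep an auto-created target table aligned with the selection is to derive its DDL from the same list. The sketch below assumes the source column types are already available as a name-to-type map; `buildCreateTable` is a hypothetical helper, and real DDL would also need constraints, quoting, and dialect-specific types.

```kotlin
// Hypothetical helper: emits CREATE TABLE with only the selected columns.
// sourceColumns maps column name to its SQL type, in source order (assumed input).
fun buildCreateTable(
    table: String,
    sourceColumns: Map<String, String>,
    selectedAttributes: List<String>
): String {
    // Fail early if the configuration names a column the source does not have.
    val missing = selectedAttributes.filterNot { it in sourceColumns }
    require(missing.isEmpty()) { "Unknown columns: $missing" }

    // Assumption: an empty selection means "all columns".
    val kept = if (selectedAttributes.isEmpty()) sourceColumns.keys.toList() else selectedAttributes
    val definitions = kept.joinToString(", ") { "$it ${sourceColumns.getValue(it)}" }
    return "CREATE TABLE $table ($definitions)"
}
```

Using the same `selectedAttributes` list for both the SELECT and the target DDL means the two cannot drift apart.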

  4. CLI Tooling & Execution
    Configuration Loading: Ensure the CLI runner (typically the main entry point or a RunJob command) fetches the latest TableConfiguration from the database/config file before starting the extraction.

Execution Flow:

The CLI should accept a Job ID or Environment ID.

Iterate through tables.

Apply the subset logic during the stream/batch read from the source.

Validation: Add a validation step in the CLI to ensure the subsetCriteria is valid SQL for the source database type before execution.
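A cheap pre-check could run before the database is contacted at all. The sketch below is not full SQL validation for any dialect: `precheckSubsetCriteria` is a hypothetical function that only catches obviously malformed or unsafe input. Confirming that the criteria is valid for the source database type would still require asking that database, for example by preparing the generated query over JDBC (how much a driver checks at prepare time varies).

```kotlin
// Hypothetical pre-check: returns a list of problems, empty if none were found.
// It is intentionally coarse; for instance it also rejects a ';' inside a string literal.
fun precheckSubsetCriteria(criteria: String): List<String> {
    val problems = mutableListOf<String>()
    if (criteria.isBlank()) problems += "criteria is blank"
    if (';' in criteria) problems += "statement separator ';' is not allowed"
    if ("--" in criteria || "/*" in criteria) problems += "SQL comments are not allowed"
    if (criteria.count { it == '\'' } % 2 != 0) problems += "unbalanced single quotes"

    // Track parenthesis nesting; a negative depth means ')' appeared before '('.
    var depth = 0
    for (ch in criteria) {
        if (ch == '(') depth++
        if (ch == ')') depth--
        if (depth < 0) break
    }
    if (depth != 0) problems += "unbalanced parentheses"
    return problems
}
```

The CLI could print these problems and exit before starting the job, then let the source database have the final say on anything that passes.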

Technical Constraints:
Language: Kotlin.

Framework: Standard OpenDataMask patterns (Spring Boot / JDBC / Coroutines if applicable).

Performance: Ensure the subsetting happens at the source (pushdown) rather than filtering in-memory to optimize performance for large datasets.

Tests:
This needs to be verified with unit, rules, verification, and UI-level tests.
