Context: OpenDataMask is a Kotlin-based data masking tool that transfers data from a source to a target. We need to implement a "Subset" capability that allows users to restrict which records (rows) and attributes (columns) are extracted and populated into the target.
Objective: Implement UI-driven configuration for data subsetting and update the extraction engine/CLI to respect these filters.
- Domain & Persistence Layer Updates
Entity Modification: Update the TableConfiguration (or equivalent entity/schema) to include:
extractionMode: An enum including FULL, SUBSET, and MASKED.
subsetCriteria: A string field to store a SQL WHERE clause or a filtering predicate.
selectedAttributes: A list of column names to be included in the extraction.
Database Migration: Create a migration script (e.g., Flyway/Liquibase) to add these fields to the table_configs table.
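A minimal sketch of the extended entity, assuming the field names proposed above; the persistence mapping (JPA, Exposed, etc.) and defaults are assumptions that should follow OpenDataMask's existing conventions:

```kotlin
// Hypothetical sketch of the extended TableConfiguration entity.
// The persistence annotations/mapping are omitted and would follow
// whatever OpenDataMask already uses for table_configs.
enum class ExtractionMode { FULL, SUBSET, MASKED }

data class TableConfiguration(
    val tableName: String,
    val extractionMode: ExtractionMode = ExtractionMode.FULL,
    // Raw SQL WHERE-clause fragment, e.g. "created_at > '2023-01-01'"
    val subsetCriteria: String? = null,
    // Empty list is interpreted as "all columns"
    val selectedAttributes: List<String> = emptyList(),
)
```

Defaulting `extractionMode` to FULL keeps existing configurations backward-compatible after the migration.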
- UI Enhancements (Frontend)
Table Configuration View:
Add a "Mode" dropdown to the table settings with a "Subset" option.

When "Subset" is selected, display a text area for "Record Filter" (e.g., created_at > '2023-01-01' or tenant_id = 5).
Implement a "Column Picker" or checkbox list that allows users to toggle which attributes/columns should be moved to the target.
API Integration: Update the frontend to send these new configuration parameters to the backend via the existing configuration endpoints.
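The request body sent to the existing configuration endpoints could carry the new fields as shown in this hypothetical DTO sketch (the type and field names are assumptions mirroring the entity; serialization in the real service would use the project's JSON library):

```kotlin
// Hypothetical request DTO for the existing table-configuration endpoint.
// Field names mirror the proposed entity fields.
data class TableConfigRequest(
    val tableName: String,
    val extractionMode: String,          // "FULL" | "SUBSET" | "MASKED"
    val subsetCriteria: String? = null,
    val selectedAttributes: List<String> = emptyList(),
) {
    // Hand-rolled JSON purely for illustration; a real implementation
    // would rely on Jackson or kotlinx.serialization instead.
    fun toJson(): String = buildString {
        append("{\"tableName\":\"$tableName\",\"extractionMode\":\"$extractionMode\"")
        subsetCriteria?.let { append(",\"subsetCriteria\":\"$it\"") }
        if (selectedAttributes.isNotEmpty()) {
            append(",\"selectedAttributes\":[")
            append(selectedAttributes.joinToString(",") { "\"$it\"" })
            append("]")
        }
        append("}")
    }
}
```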
- Extraction Engine Logic (Backend)
Query Generation: Modify the DataExtractor service to dynamically build the source SELECT query:
Instead of SELECT *, it must use the selectedAttributes list: SELECT col1, col2, ....
If extractionMode is SUBSET, append the subsetCriteria to the WHERE clause.
Target Schema Alignment: Ensure that if the tool automatically creates target tables, it only creates the columns defined in the selectedAttributes list to avoid schema mismatches.
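The query-generation rules above can be sketched as a pure function. This is a simplified assumption of what the DataExtractor change might look like; identifier quoting and dialect handling are deliberately omitted and must be added per source database type:

```kotlin
enum class ExtractionMode { FULL, SUBSET, MASKED }

// Builds the source SELECT per the rules above: an explicit column list
// instead of SELECT *, with the subset predicate pushed down into WHERE
// so filtering happens at the source, not in memory.
// NOTE: column/table identifiers are not quoted or escaped here; a real
// implementation must quote per the source dialect.
fun buildSelect(
    table: String,
    selectedAttributes: List<String>,
    mode: ExtractionMode,
    subsetCriteria: String?,
): String {
    val columns =
        if (selectedAttributes.isEmpty()) "*"
        else selectedAttributes.joinToString(", ")
    val base = "SELECT $columns FROM $table"
    return if (mode == ExtractionMode.SUBSET && !subsetCriteria.isNullOrBlank()) {
        "$base WHERE $subsetCriteria"
    } else {
        base
    }
}
```

Keeping this a pure String-in/String-out function makes the query-generation rules trivially unit-testable without a database.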
- CLI Tooling & Execution
Configuration Loading: Ensure the CLI runner (typically the main entry point or a RunJob command) fetches the latest TableConfiguration from the database/config file before starting the extraction.
Execution Flow:
The CLI should accept a Job ID or Environment ID.
Iterate through tables.
Apply the subset logic during the stream/batch read from the source.
Validation: Add a validation step in the CLI to ensure the subsetCriteria is valid SQL for the source database type before execution.
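A minimal sketch of a syntactic pre-check for subsetCriteria, under the assumption that full validation is a two-step process: this cheap offline guard first, then a round trip to the source database (e.g. executing the predicate inside a zero-row probe query, or the dialect's EXPLAIN equivalent) to confirm it parses for that database type:

```kotlin
// Syntactic pre-check for a subsetCriteria fragment before execution.
// This does NOT prove the predicate is valid SQL for the source dialect;
// the CLI should still round-trip it to the source DB and catch SQL errors.
fun preValidateCriteria(criteria: String): Boolean {
    val trimmed = criteria.trim()
    if (trimmed.isEmpty()) return false
    // Reject statement separators and comment markers that could smuggle
    // extra statements into the generated query.
    if (trimmed.contains(';') || trimmed.contains("--") || trimmed.contains("/*")) return false
    // Require balanced single quotes.
    if (trimmed.count { it == '\'' } % 2 != 0) return false
    // Require balanced parentheses.
    var depth = 0
    for (c in trimmed) {
        when (c) {
            '(' -> depth++
            ')' -> depth--
        }
        if (depth < 0) return false
    }
    return depth == 0
}
```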
Technical Constraints:
Language: Kotlin.
Framework: Standard OpenDataMask patterns (Spring Boot / JDBC / Coroutines if applicable).
Performance: Ensure the subsetting happens at the source (pushdown) rather than filtering in-memory to optimize performance for large datasets.
Tests:
This needs to be verified with unit, rules, verification, and UI-level tests.