OpenDataMask is an open-source data masking platform that helps developers and data engineers protect sensitive information (PII) by applying realistic fake-data generation, masking, and redaction strategies to datasets. It enables safe use of production-like data in development, testing, and analytics environments, supporting GDPR, CCPA, and HIPAA compliance requirements.
Core capabilities:
- Multi-database support: PostgreSQL, MySQL, MongoDB, Azure SQL, Azure Cosmos DB (MongoDB API), flat files
- 60+ generator types: Names, emails, phones, SSNs, credit cards, addresses, medical IDs, financial data, and more
- Workspace model: Isolated masking configurations with role-based access control and inheritance
- Privacy intelligence: Automatic sensitive column detection, privacy hub dashboards, and compliance reports
- Job scheduling: Cron-based automated masking runs
- Webhook integration: Post-job notifications via custom HTTP webhooks or GitHub Actions triggers
- REST API + CLI: Full programmatic access and a Go-based CLI tool
- PII attribute rule mapping: Column-level masking rules (Redact, Partial Mask, Hash, Regex) with a registry pattern for custom business rules
Prerequisites:

| Requirement | Version | Notes |
|---|---|---|
| JDK | 17+ | Temurin/OpenJDK recommended |
| Docker & Docker Compose | 20.10+ | Required for containerised deployment |
| PostgreSQL | 15+ | Production database backend |
| Node.js | 20+ | Required only for local frontend development |
| Go | 1.21+ | Required only for building the CLI from source |
The fastest way to run OpenDataMask is with Docker Compose:
```bash
# Clone the repository
git clone https://github.com/MaximumTrainer/OpenDataMask.git
cd OpenDataMask

# Generate secrets
export JWT_SECRET=$(openssl rand -base64 32)
export ENCRYPTION_KEY=$(openssl rand -base64 24 | head -c 32)

# Start all services (PostgreSQL, backend, frontend)
docker-compose up -d

# Access the UI
open http://localhost
```

Default services: Frontend on port 80, Backend API on port 8080, PostgreSQL on port 5432.
To build and run the backend locally:

```bash
cd backend

# Build and test
./gradlew build --no-daemon

# Run locally (requires PostgreSQL)
export DATABASE_URL=jdbc:postgresql://localhost:5432/opendatamask
export DATABASE_USERNAME=opendatamask
export DATABASE_PASSWORD=opendatamask
export JWT_SECRET=<your-secret>
export ENCRYPTION_KEY=<your-16-char-key>
./gradlew bootRun --no-daemon
```

The backend starts on http://localhost:8080.
For local frontend development:

```bash
cd frontend

# Install dependencies
npm ci

# Development server (proxies API to localhost:8080)
npm run dev

# Production build
npm run build
```

To build the CLI from source:

```bash
cd cli
# Build
go build -o opendatamask-cli ./...

# Run
./opendatamask-cli --help
```

The backend is configured through the following environment variables:

| Variable | Required | Description | Example |
|---|---|---|---|
| `DATABASE_URL` | Yes | JDBC URL for the PostgreSQL backend database | `jdbc:postgresql://localhost:5432/opendatamask` |
| `DATABASE_USERNAME` | Yes | PostgreSQL username | `opendatamask` |
| `DATABASE_PASSWORD` | Yes | PostgreSQL password | `secret` |
| `JWT_SECRET` | Yes | Secret for signing JWT tokens (min 32 chars) | Output of `openssl rand -base64 32` |
| `ENCRYPTION_KEY` | Yes | Key for encrypting stored credentials (exactly 16 or 32 chars) | Output of `openssl rand -base64 32 \| head -c 32` |
| `SERVER_PORT` | No | Backend HTTP port | `8080` (default) |
| `JWT_EXPIRATION` | No | Token expiry in milliseconds | `86400000` (24 h, default) |
| `MONGODB_URI` | No | MongoDB URI when masking MongoDB sources | `mongodb://localhost:27017` |
See `backend/src/main/resources/application.yml` for the full configuration template. Override any property via environment variable using Spring Boot's relaxed binding (e.g., `SPRING_DATASOURCE_URL` overrides `spring.datasource.url`).
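For example, a minimal sketch of overriding defaults at startup (variable names from the table above; the host and port values are illustrative):

```bash
# Illustrative overrides: change the backend port and point the datasource at another host
export SERVER_PORT=9090
export SPRING_DATASOURCE_URL=jdbc:postgresql://db.internal:5432/opendatamask
./gradlew bootRun --no-daemon
```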
OpenDataMask supports the following connection types:
| Type | Key | JDBC URL Format |
|---|---|---|
| PostgreSQL | `POSTGRESQL` | `jdbc:postgresql://<host>:<port>/<database>` |
| MySQL | `MYSQL` | `jdbc:mysql://<host>:<port>/<database>` |
| Azure SQL | `AZURE_SQL` | `jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<database>` |
| MongoDB | `MONGODB` | `mongodb://<host>:<port>/<database>` |
| MongoDB Cosmos DB | `MONGODB_COSMOS` | `mongodb://<account>.mongo.cosmos.azure.com:10255/<database>?ssl=true&...` |
| File (CSV/JSON) | `FILE` | (uploaded via the UI; stored internally) |
To connect OpenDataMask to an Azure SQL Database as a source or destination:
- Connection type: Select Azure SQL in the UI or use `type: AZURE_SQL` in the API.
- Connection string format: `jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<database>`
- Credentials: Provide your Azure SQL username and password separately in the `username` and `password` fields.
- TLS encryption is enabled automatically; the driver validates the certificate against `*.database.windows.net`.
Example API request body to create an Azure SQL connection:
```json
{
  "name": "prod-azure-sql-source",
  "type": "AZURE_SQL",
  "connectionString": "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb",
  "username": "sqladmin",
  "password": "secret",
  "isSource": true,
  "isDestination": false
}
```

Dependency: the `com.microsoft.sqlserver:mssql-jdbc` driver is bundled in the backend image; no additional driver installation is required.
A Workspace is an isolated configuration scope. Each workspace has:
- One or more Data Connections (source and target databases)
- One or more Connection Pairs linking a source and destination together
- Table Configurations that define which tables to process and how
- Column Generators that specify what fake data to produce per column
- Team members with ADMIN or USER roles
- An optional parent workspace for configuration inheritance
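As a sketch, a workspace can also be created through the REST API (endpoint from the API reference below; the `name` and `description` fields mirror the getting-started steps, and the token placeholder is assumed to come from `/api/auth/login`):

```bash
# Placeholder host, token, and field values
curl -X POST http://localhost:8080/api/workspaces \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "payments-masking", "description": "Masking rules for the payments DB"}'
```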
A Connection Pair groups exactly one Source DataConnection and one Destination DataConnection under a workspace. Using pairs gives you explicit, named source→destination routes and makes it easy to run the same masking rules against different database environments (e.g. staging vs. QA).
| Field | Description |
|---|---|
| `name` | Human-readable name for the pair (required) |
| `description` | Optional notes |
| `sourceConnectionId` | ID of the source DataConnection |
| `destinationConnectionId` | ID of the destination DataConnection |
Both connections must belong to the same workspace.
Deleting a connection pair performs a soft delete — the record is retained in the database with a `deletedAt` timestamp, preserving audit history. Soft-deleted pairs are excluded from list responses but can still be found via raw database queries or audit tooling.
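For illustration only, a soft-deleted pair could be inspected with a raw query against the backend database; the table and column names below (`connection_pairs`, `deleted_at`) are assumed, not documented:

```bash
# Assumed table/column names; adjust to the actual backend schema
psql -h localhost -U opendatamask -d opendatamask -c \
  "SELECT id, name, deleted_at FROM connection_pairs WHERE deleted_at IS NOT NULL;"
```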
| Method | Path | Description |
|---|---|---|
| POST | `/api/workspaces/{id}/connection-pairs` | Create a connection pair |
| GET | `/api/workspaces/{id}/connection-pairs` | List active connection pairs |
| GET | `/api/workspaces/{id}/connection-pairs/{pairId}` | Get a single pair |
| PUT | `/api/workspaces/{id}/connection-pairs/{pairId}` | Update a pair |
| DELETE | `/api/workspaces/{id}/connection-pairs/{pairId}` | Soft-delete a pair |
Example request to create a connection pair:

```http
POST /api/workspaces/1/connection-pairs
Content-Type: application/json

{
  "name": "Prod → Staging",
  "description": "Copy masked production data to staging environment",
  "sourceConnectionId": 3,
  "destinationConnectionId": 7
}
```

When triggering a masking job you can optionally pass a `connectionPairId` in the request body. The engine will then use that pair's source and destination connections instead of searching for the workspace defaults.
```http
POST /api/workspaces/1/jobs
Content-Type: application/json

{
  "connectionPairId": 5
}
```

Omitting the body (or setting `connectionPairId` to null) falls back to the workspace-wide source/destination connection selection, preserving full backward compatibility.
Each table configuration uses one of the following processing modes:

| Mode | Description |
|---|---|
| `MASK` | Replace column values with generated fake data |
| `GENERATE` | Generate a completely new row set |
| `PASSTHROUGH` | Copy data without modification |
| `SUBSET` | Copy a filtered/sampled subset of rows using an optional WHERE clause |
| `UPSERT` | Fetch rows from source, optionally mask them, and write to destination |
| `SKIP` | Exclude the table from processing |
OpenDataMask supports two complementary mechanisms to control which data is extracted:
Row filtering: set a SQL `WHERE` clause on any table configuration to restrict which rows are extracted from the source. This predicate is pushed down to the source database for optimal performance.
```json
{
  "tableName": "orders",
  "mode": "SUBSET",
  "whereClause": "created_at > '2023-01-01' AND tenant_id = 5"
}
```

For MongoDB, the `whereClause` is interpreted as a JSON query filter, e.g. `{"status": "active"}`.
Column filtering: set a `selectedAttributes` list on any table configuration to restrict which columns are extracted from the source. OpenDataMask builds a `SELECT col1, col2, ...` query (instead of `SELECT *`) so the filtering happens at the source for maximum performance.
```json
{
  "tableName": "users",
  "mode": "PASSTHROUGH",
  "selectedAttributes": ["id", "name", "email"]
}
```

When `selectedAttributes` is set:
- Only the listed columns are fetched from the source database.
- The destination table schema is created with only those columns (no schema mismatch).
- Applies to all modes that read from the source: `PASSTHROUGH`, `MASK`, `SUBSET`, and `UPSERT`.
Leave `selectedAttributes` empty or omit it to include all columns (the default behaviour).
In the Tables view, the table configuration modal includes a Selected Columns field. Enter a comma-separated list of column names (e.g. id, name, email) to enable column-level filtering. Leave the field blank to include all columns.
OpenDataMask includes 60+ built-in generators. Key examples:
| Generator | Sample Output |
|---|---|
| `NAME` / `FULL_NAME` | Sofia Martinez |
| `EMAIL` | j.smith@example.net |
| `PHONE` | +1-555-0147 |
| `SSN` | 123-45-6789 |
| `CREDIT_CARD` | 4111-1111-1111-1111 |
| `ADDRESS` | 742 Evergreen Terrace |
| `IP_ADDRESS` | 192.168.1.42 |
| `ICD_CODE` | J45.909 |
| `IBAN` | GB29NWBK60161331926819 |
| `JOB_TITLE` | Senior Software Engineer |
| `NATIONALITY` | Canadian |
| `DOMAIN_NAME` | example.com |
| `TIMESTAMP` | 2024-06-15 14:30:00 |
| `PARTIAL_MASK` | `J*** D***` (preserves format) |
| `NULL` | null |
| `CONSTANT` | Fixed value (configured in params) |
OpenDataMask automatically scans workspaces to detect sensitive columns using pattern matching and confidence scoring. The Privacy Hub provides:
- A summary of sensitive column coverage
- Actionable recommendations (e.g., "Add EMAIL generator to `users.email`")
- Exportable compliance reports (JSON)
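As a sketch, the compliance report can be pulled directly from the API (endpoints from the API reference below; the workspace ID and token are placeholders):

```bash
# Trigger a fresh sensitivity scan, then export the JSON compliance report
curl -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/api/workspaces/1/sensitivity-scan/run
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/api/workspaces/1/privacy-report -o privacy-report.json
```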
The Custom Data Mapping wizard provides a guided, column-level alternative to the Tables view for configuring how each attribute moves from source to destination.
| Action | Description |
|---|---|
| `MIGRATE_AS_IS` | Copy the column value without modification. Use for primary keys, timestamps, and public metadata. |
| `MASK` | Replace the column value using a selected masking strategy. |
| Strategy | Description |
|---|---|
| `FAKE` | Replace with realistic synthetic data. Select a generator type (e.g. `EMAIL`, `FULL_NAME`, `PHONE`). |
| `HASH` | Apply deterministic SHA-256 hashing to preserve joins while anonymising the value. |
| `NULL` | Remove the value entirely (sets the column to NULL). |
- Navigate to Workspace → Data Mappings.
- Step 1 – Select Connection: choose a source database connection from the list.
- Step 2 – Select Table: OpenDataMask reads the live schema and presents all available tables.
- Step 3 – Configure Columns: for every column, select
MIGRATE_AS_ISorMASK. IfMASKis chosen, select a strategy and (forFAKE) a generator type. - Click Save Mappings to persist the configuration.
The following REST endpoints manage custom data mappings:
| Method | Path | Description |
|---|---|---|
| POST | `/api/workspaces/{id}/mappings` | Create a single mapping |
| GET | `/api/workspaces/{id}/mappings` | List all mappings (optional `?connectionId=&tableName=` filters) |
| GET | `/api/workspaces/{id}/mappings/{mappingId}` | Get a specific mapping |
| PUT | `/api/workspaces/{id}/mappings/{mappingId}` | Update a mapping |
| DELETE | `/api/workspaces/{id}/mappings/{mappingId}` | Delete a mapping |
| POST | `/api/workspaces/{id}/mappings/bulk` | Replace all mappings for a table in one request |
| GET | `/api/workspaces/{id}/connections/{connId}/schema` | Browse live tables & columns from a connection |
Example bulk mapping request body:

```json
{
  "connectionId": 3,
  "tableName": "users",
  "columnMappings": [
    { "columnName": "id", "action": "MIGRATE_AS_IS" },
    { "columnName": "email", "action": "MASK", "maskingStrategy": "FAKE", "fakeGeneratorType": "EMAIL" },
    { "columnName": "name", "action": "MASK", "maskingStrategy": "FAKE", "fakeGeneratorType": "FULL_NAME" },
    { "columnName": "ssn", "action": "MASK", "maskingStrategy": "NULL" },
    { "columnName": "ref_id", "action": "MASK", "maskingStrategy": "HASH" }
  ]
}
```

To get started with a first masking run:

- Register / log in at http://localhost (or via the API at `/api/auth/register`)
- Create a Workspace — give it a name and optional description
- Add a Source Connection — provide host, port, database, username, password, and connection type
- Add a Target Connection — the database that will receive masked data (can be the same database with a different schema)
- Import schema — OpenDataMask discovers tables and columns automatically
- Configure tables — set the masking mode for each table
- Configure columns — assign a generator type to each sensitive column
- Run a Job — trigger a masking job; monitor progress and logs in real time (see the API sketch after this list)
- Review Privacy Report — verify all sensitive columns are masked
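The same workflow can be scripted against the REST API. A hedged sketch (endpoint paths from the API reference below, field names from the examples above; host, workspace ID, and credentials are placeholders):

```bash
BASE=http://localhost:8080

# Log in and capture the JWT (requires jq; response shape: {"token":"<jwt>", ...})
TOKEN=$(curl -s -X POST "$BASE/api/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username":"alice","password":"secret123"}' | jq -r .token)

# Add a source connection to workspace 1 (same fields as the Azure SQL example above)
curl -X POST "$BASE/api/workspaces/1/connections" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"name":"staging-pg","type":"POSTGRESQL","connectionString":"jdbc:postgresql://db:5432/app","username":"app","password":"secret","isSource":true,"isDestination":false}'

# Trigger a masking job for workspace 1
curl -X POST "$BASE/api/workspaces/1/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{}'
```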
The Go CLI provides quick access to common operations:
```bash
# Authenticate and save credentials locally
opendatamask-cli auth login --url http://localhost:8080 --username admin --password secret

# List workspaces
opendatamask-cli workspace list

# Get workspace details
opendatamask-cli workspace get <workspace-id>

# List jobs
opendatamask-cli job list --workspace <workspace-id>

# Trigger a masking job
opendatamask-cli job run --workspace <workspace-id>
```

Configuration is stored at `~/.opendatamask/config.yaml`.
The backend exposes a RESTful API at /api/. All endpoints (except /api/auth/*) require a Bearer JWT token.
```
# Register
POST /api/auth/register
{"username":"alice","email":"alice@example.com","password":"secret123"}

# Login
POST /api/auth/login
{"username":"alice","password":"secret123"}

# Response: {"token":"<jwt>","user":{...}}
```

| Resource | Endpoint | Methods |
|---|---|---|
| Workspaces | `/api/workspaces` | GET, POST |
| Workspace | `/api/workspaces/{id}` | GET, PUT, DELETE |
| Workspace Stats | `/api/workspaces/{id}/stats` | GET |
| Connections | `/api/workspaces/{id}/connections` | GET, POST |
| Tables | `/api/workspaces/{id}/tables` | GET, POST |
| Jobs | `/api/workspaces/{id}/jobs` | GET, POST |
| Job Schedules | `/api/workspaces/{id}/schedules` | GET, POST, PUT, DELETE |
| Webhooks | `/api/workspaces/{id}/webhooks` | GET, POST, PUT, DELETE |
| Privacy Hub | `/api/workspaces/{id}/privacy-hub` | GET |
| Privacy Report | `/api/workspaces/{id}/privacy-report` | GET |
| Sensitivity Scan | `/api/workspaces/{id}/sensitivity-scan/run` | POST |
| Schema Changes | `/api/workspaces/{id}/schema-changes` | GET, POST |
| Workspace Export | `/api/workspaces/{id}/export` | GET |
| Workspace Import | `/api/workspaces/{id}/import` | POST |
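For instance, a workspace's configuration can be exported and re-imported using the endpoints above (a sketch; IDs and file names are placeholders, and the import request is assumed to accept the exported JSON as its body):

```bash
# Export workspace 1's configuration
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/api/workspaces/1/export -o workspace-1-export.json

# Import it into workspace 2 (assumes the exported JSON is a valid import payload)
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  --data @workspace-1-export.json \
  http://localhost:8080/api/workspaces/2/import
```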
To deploy with Docker Compose:

```bash
docker-compose up -d
```

The `infra/` directory contains Terraform configuration to provision a production-ready AWS environment. The setup creates:
- VPC with a public subnet, internet gateway, and route tables
- EC2 instance (Amazon Linux 2023, default `t3.small`) running docker-compose
- Security group allowing HTTP (80), HTTPS (443), API (8080), and SSH (22)
- Elastic IP for a stable public address
- S3 + DynamoDB remote state backend for team collaboration
| Tool | Version | Notes |
|---|---|---|
| Terraform | 1.6+ | See the official install guide |
| AWS CLI | 2.x | Configured with IAM credentials |
| AWS account | – | IAM user with EC2, VPC, S3, DynamoDB permissions |
One-time state backend bootstrap (run locally, once per AWS account):
```bash
# Create S3 bucket for Terraform state
aws s3api create-bucket --bucket my-opendatamask-tfstate --region us-east-1
aws s3api put-bucket-versioning \
  --bucket my-opendatamask-tfstate \
  --versioning-configuration Status=Enabled

# Create DynamoDB table for state locking
aws dynamodb create-table \
  --table-name opendatamask-tf-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

Then provision the infrastructure:

```bash
cd infra
# Copy and fill in the example variables
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars — set public_key_material and any overrides

# Initialise with your state backend
terraform init \
  -backend-config="bucket=my-opendatamask-tfstate" \
  -backend-config="dynamodb_table=opendatamask-tf-locks" \
  -backend-config="region=us-east-1"

# Preview changes
terraform plan

# Apply (provisions the EC2 instance and networking)
terraform apply

# Get the server IP
terraform output server_public_ip

# Tear down
terraform destroy
```

Configure these in GitHub → Settings → Secrets and variables → Actions:
| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS IAM access key ID |
| `AWS_SECRET_ACCESS_KEY` | AWS IAM secret access key |
| `AWS_REGION` | AWS region (e.g. us-east-1) |
| `EC2_SSH_PRIVATE_KEY` | PEM private key matching `EC2_SSH_PUBLIC_KEY` |
| `EC2_SSH_PUBLIC_KEY` | SSH public key content (`~/.ssh/id_ed25519.pub`) |
| `JWT_SECRET` | 32+ char JWT signing secret |
| `ENCRYPTION_KEY` | 32 char field encryption key |
| `TF_STATE_BUCKET` | S3 bucket name for Terraform state |
| `TF_STATE_DYNAMODB_TABLE` | DynamoDB table name for state locking |
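These secrets can also be set from the command line with the GitHub CLI; a sketch assuming `gh` is installed and authenticated against your fork:

```bash
# Illustrative: set a few of the required Actions secrets via the GitHub CLI
gh secret set AWS_ACCESS_KEY_ID --body "AKIA..."
gh secret set AWS_REGION --body "us-east-1"
gh secret set JWT_SECRET --body "$(openssl rand -base64 32)"
gh secret set EC2_SSH_PUBLIC_KEY --body "$(cat ~/.ssh/id_ed25519.pub)"
```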
The full automated pipeline runs in 4 sequential stages after every push to main:
```
CI (tests pass)
 └─► Docker Build & Push (images → GHCR)
      └─► Deploy workflow:
           ├─ Job 1: terraform apply (provision/update infrastructure)
           ├─ Job 2: docker push (parallel with Job 1)
           ├─ Job 3: SSH deploy (write .env, docker-compose pull && up)
           └─ Job 4: verify (curl /actuator/health, assert HTTP 200)
```
| Workflow file | Purpose |
|---|---|
| `.github/workflows/ci.yml` | Build, lint, test all three components |
| `.github/workflows/docker.yml` | Build and push images to GHCR |
| `.github/workflows/deploy.yml` | Full deploy pipeline (terraform → docker → deploy → verify) |
| `.github/workflows/verify-deployment.yml` | Spring Boot smoke tests + optional live server health check |
| `.github/workflows/sandbox-verification.yml` | End-to-end masking verification that proves PII masking correctness; publishes a JUnit report |
| `.github/workflows/codeql.yml` | Weekly security analysis |
GitHub Environments (staging, production) are used for deployment tracking, enabling Copilot and the GitHub UI to display live deployment status, history, and URL.
Build and push Docker images, then deploy using standard Kubernetes manifests or Helm. Each component (backend, frontend) has its own Dockerfile.
```bash
# Build images
docker build -t opendatamask-backend ./backend
docker build -t opendatamask-frontend ./frontend
```

The `verification/` directory contains a self-contained Docker-based environment that automatically proves OpenDataMask correctly masks PII while preserving referential integrity.
```bash
cd verification/
chmod +x run_verification.sh
./run_verification.sh
```

The script builds images, starts all services, configures a masking job via the REST API, runs the job, and then validates the output.
| Check | Description |
|---|---|
| Record Integrity | Source and target row counts must match; fails if source is empty |
| Key Persistence | Every source UUID primary key must exist unchanged in the target |
| Masking Effectiveness | full_name and email must differ for every matched row; fails if no rows were compared |
| Human Readability | Samples 5 masked records (ordered by id) and checks format heuristics; skipped (not failed) when masking didn't pass to avoid exposing potential PII |
Both the script and the standalone Python verifier support JUnit XML output:
```bash
# Via the orchestration script:
VERIFY_JUNIT_XML=report.xml ./run_verification.sh

# Directly (when environment is already running):
python3 -m pip install -r requirements.txt
python3 verify.py --junit-xml report.xml
```

`.github/workflows/sandbox-verification.yml` runs the full suite on every push and pull request to main. It publishes:
- A workflow check (per-check pass/fail annotations via `dorny/test-reporter`)
- A downloadable artifact (`sandbox-verification-report`, 30-day retention)
- A markdown job summary with overall pass/fail status
To tear down the verification environment:

```bash
cd verification/
docker compose -f docker-compose.yml down -v
```

See `verification/README.md` for the full reference, including all environment variable overrides.
Troubleshooting common issues:

| Symptom | Likely Cause | Fix |
|---|---|---|
| `JWT_SECRET must be set` | Missing env var | Set `JWT_SECRET` before starting |
| `ENCRYPTION_KEY must be set` | Missing env var | Set `ENCRYPTION_KEY` (exactly 16 or 32 chars) |
| Backend fails to start | PostgreSQL not ready | Ensure PostgreSQL is running and accessible |
| Connection refused on port 8080 | Backend not started | Check `docker-compose logs backend` |
| Empty workspace stats | No connections/jobs created | Add a connection and run at least one job |
| Sensitivity scan finds nothing | No table configurations yet | Add table configurations and re-run the scan |
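When the backend appears unhealthy, a quick diagnostic sketch (the health endpoint is the same one the deploy pipeline curls; the `postgres` service name is assumed from a typical docker-compose setup):

```bash
# Inspect backend logs from the Docker Compose stack
docker-compose logs backend

# Health check used by the deploy pipeline: expect HTTP 200
curl -i http://localhost:8080/actuator/health

# Confirm PostgreSQL is accepting connections (service name assumed)
docker-compose exec postgres pg_isready
```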
OpenDataMask provides a flexible, configuration-driven system to apply fine-grained masking rules to individual source columns as data moves to the target destination.
| `maskingStrategy` | Description | Required `piiRuleParams` keys |
|---|---|---|
| `FAKE` | Replaces the value with realistic synthetic data using the configured `fakeGeneratorType` | — |
| `HASH` | Deterministic SHA-256 hex digest of the value | `salt` (optional) |
| `NULL` | Replaces the value with null / NULL | — |
| `REDACT` | Replaces the value with the literal token `[REDACTED]` | — |
| `PARTIAL_MASK` | Keeps a configurable number of leading/trailing characters and masks the middle | `keepFirst` (default 0), `keepLast` (default 4), `maskChar` (default `*`) |
| `REGEX` | Applies a regular expression replacement to the string representation of the value | `pattern` (required), `replacement` (required) |
Example: create a single mapping that redacts a column:

```http
POST /api/workspaces/{workspaceId}/mappings
Content-Type: application/json

{
  "connectionId": 2,
  "tableName": "customers",
  "columnName": "ssn",
  "action": "MASK",
  "maskingStrategy": "REDACT"
}
```

Example: bulk mappings using rule parameters:

```http
POST /api/workspaces/{workspaceId}/mappings/bulk
Content-Type: application/json
{
"connectionId": 2,
"tableName": "customers",
"columnMappings": [
{ "columnName": "id", "action": "MIGRATE_AS_IS" },
{ "columnName": "full_name", "action": "MASK", "maskingStrategy": "PARTIAL_MASK",
"piiRuleParams": "{\"keepFirst\":\"1\",\"keepLast\":\"0\"}" },
{ "columnName": "email", "action": "MASK", "maskingStrategy": "REGEX",
"piiRuleParams": "{\"pattern\":\"(\\\\w+)@(\\\\w+\\\\.\\\\w+)\",\"replacement\":\"***@$2\"}" },
{ "columnName": "ssn", "action": "MASK", "maskingStrategy": "REDACT" },
{ "columnName": "transaction_amount", "action": "MIGRATE_AS_IS" }
]
}
```

A complete JSON example is available in `docs/pii-masking-rules-sample.json`.
Mappings are evaluated during every PASSTHROUGH table run: for each row fetched from the source, OpenDataMask looks up any active CustomDataMapping entries for that workspace/connection/table combination and applies the configured strategy before writing to the destination. Columns without a mapping pass through unchanged.
The RuleRegistryPort / DefaultRuleRegistry bean accepts runtime-registered PIIMaskingRule implementations. Inject RuleRegistryPort into your Spring component and call registerCustomRule():
```kotlin
@Component
class EuGdprRuleRegistrar(private val ruleRegistry: RuleRegistryPort) {
    @PostConstruct
    fun register() {
        ruleRegistry.registerCustomRule(object : PIIMaskingRule {
            override val ruleId = "eu_gdpr_conditional"
            override fun mask(input: Any?): Any? {
                // Example: if the value looks like an EU phone prefix, redact it fully;
                // otherwise apply partial masking
                val str = input?.toString() ?: return null
                return if (str.startsWith("+3") || str.startsWith("+4"))
                    "[REDACTED]"
                else
                    str.take(0) + "*".repeat((str.length - 4).coerceAtLeast(0)) + str.takeLast(4)
            }
        })
    }
}
```

After registering the rule, invoke it via any column mapping by adding `"ruleId": "<your-rule-id>"` to `piiRuleParams`. The engine checks for a `ruleId` key first and, if found, delegates to the registry before falling back to the built-in strategy logic:
```http
POST /api/workspaces/{workspaceId}/mappings/bulk
Content-Type: application/json

{
  "connectionId": 2,
  "tableName": "customers",
  "columnMappings": [
    {
      "columnName": "phone",
      "action": "MASK",
      "maskingStrategy": "REDACT",
      "piiRuleParams": "{\"ruleId\":\"eu_gdpr_conditional\"}"
    }
  ]
}
```

The `maskingStrategy` field still controls the UI label and fallback type; the `ruleId` in `piiRuleParams` takes precedence for the actual masking logic.
To contribute:

- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Make your changes and add tests
- Run `./gradlew build --no-daemon` (backend) and `npm test -- --run` (frontend)
- Open a pull request against `main`
All contributions must maintain the existing test coverage and pass the CI pipeline.