feat: leak detection scenario with end-to-end data pipeline #466
Conversation
…ources

## Summary

Add diagnostic settings across blueprint resources, per CRISP security review findings LT-4 (Medium). Supports Threat #24: Insufficient logging and monitoring. Defender for Cloud (LT-1) is intentionally **not** managed here — it's subscription-scoped and should be enforced via Azure Policy by platform teams.

### Changes

**Diagnostic Settings (LT-4)** — `azurerm_monitor_diagnostic_setting` in each component:

- **Key Vault**: AuditEvent + AllMetrics
- **ACR**: ContainerRegistryRepositoryEvents, ContainerRegistryLoginEvents + AllMetrics
- **Event Grid**: allLogs + AllMetrics
- **Event Hubs**: allLogs + AllMetrics

### Scope

- Components: `010-security-identity`, `060-acr`, `040-messaging`
- Blueprints: full-single-node, full-multi-node, azure-local, only-cloud, robotics
- 19 files changed, 227 insertions

### Design Decisions

- Diagnostics gated by `should_enable_diagnostic_settings` (bool) + `log_analytics_workspace_id` — enabled automatically when blueprints wire observability
- Component-level ownership: each module manages its own diagnostic settings
- Defender left to Azure Policy to avoid subscription-scoped side effects on `terraform destroy`

### Deploy Validation (2026-04-08)

Rebased on `dev` and deployed the 3 affected blueprints in parallel:

| Blueprint | Region | Diagnostic Settings | Result |
|---|---|---|---|
| full-single-node-cluster | eastus2 | ✅ KV, ACR, EG, EH | All diagnostic resources created. IoT Ops proxy timeout (pre-existing) |
| only-cloud-single-node-cluster | westus2 | ✅ ACR, EG, EH | All diagnostic resources created. KV contacts timeout (pre-existing transient) |
| robotics | westus3 | ✅ ACR, EG, EH, KV | All diagnostic resources created. Grafana SSL EOF (pre-existing transient) |

All diagnostic settings deployed successfully. All failures are pre-existing environmental issues unrelated to this change.

Skipped: `full-multi-node-cluster` (pre-existing count issue), `azure-local` (requires HCI hardware).
Fixes AB#1984

----

#### AI description (iteration 5)

#### PR Classification

Feature enhancement to add diagnostic settings for Azure blueprint resources (ACR, Key Vault, Event Grid, Event Hubs) to address CRISP security finding LT-4 regarding insufficient logging and monitoring.

#### PR Summary

This PR implements diagnostic settings across the Key Vault, ACR, Event Grid, and Event Hubs modules to enable audit logging and metrics collection to Log Analytics workspaces, addressing security compliance gaps. All changes are gated by optional variables and wire the Log Analytics workspace ID from observability modules through blueprint configurations.

- Added `azurerm_monitor_diagnostic_setting` resources in `main.tf` files for Key Vault (AuditEvent), ACR (ContainerRegistryRepositoryEvents, ContainerRegistryLoginEvents), Event Grid (allLogs), and Event Hubs (allLogs) with AllMetrics enabled
- Introduced `log_analytics_workspace_id` and `should_enable_diagnostic_settings` variables across all affected modules

...
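The gating pattern described above can be sketched in Terraform roughly as follows. This is a minimal illustration, not the modules' exact code — variable names match the PR, but the resource name, target, and log categories shown are the Key Vault case only:

```hcl
# Sketch of the diagnostic-settings gate (illustrative, not the exact module code).
variable "should_enable_diagnostic_settings" {
  type    = bool
  default = false
}

variable "log_analytics_workspace_id" {
  type    = string
  default = null
}

resource "azurerm_monitor_diagnostic_setting" "key_vault" {
  # Created only when the blueprint wires observability through.
  count = var.should_enable_diagnostic_settings && var.log_analytics_workspace_id != null ? 1 : 0

  name                       = "kv-diagnostics"
  target_resource_id         = azurerm_key_vault.this.id
  log_analytics_workspace_id = var.log_analytics_workspace_id

  enabled_log {
    category = "AuditEvent"
  }

  metric {
    category = "AllMetrics"
  }
}
```

Because the guard is a static `count`, `terraform destroy` removes only the component's own diagnostic resources — consistent with the decision to leave subscription-scoped Defender settings to Azure Policy.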
… RBAC for connectedk8s proxy

## Summary

Adds Entra ID group-based cluster admin support and Azure Arc RBAC role assignments to the CNCF K3s cluster component, enabling `az connectedk8s proxy` for team members.

## Problem

- `az connectedk8s proxy` failed for non-deploying users — no Azure RBAC roles on the Arc resource
- Only the deploying user received cluster-admin via individual OID/UPN
- No support for Entra ID groups

## Changes

### New variable: `cluster_admin_group_oid`

- Accepts an Entra ID group Object ID
- Creates a Kubernetes `ClusterRoleBinding` with the `--group` flag (k3s-device-setup.sh)
- Assigns Azure Arc RBAC roles on the Arc connected cluster resource

### Azure Arc RBAC role assignments (new)

- `Azure Arc Kubernetes Viewer` — assigned to both user OID and group OID
- `Azure Arc Enabled Kubernetes Cluster User Role` — assigned to both user OID and group OID
- Scoped to the Arc connected cluster resource
- Only created after the cluster exists (static count guard with `has_arc_cluster`)

### Cleanup

- Removed `deploy-cluster-admin-oid.sh` — superseded by Terraform automation

### Files changed

- `src/100-edge/100-cncf-cluster/terraform/` — variables, main.tf, ubuntu-k3s module, role-assignments module
- `src/100-edge/100-cncf-cluster/scripts/k3s-device-setup.sh` — group ClusterRoleBinding
- `src/100-edge/110-iot-ops/scripts/deploy-cluster-admin-oid.sh` — deleted
- All 5 blueprints — expose `cluster_admin_group_oid`

## Usage

```hcl
cluster_admin_group_oid = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
```

## Deployment Testing

All affected blueprints deployed and verified (Arc clusters Connected):

| Blueprint | Region | Result |
|---|---|---|
| full-single-node-cluster | eastus2 | ✅ Arc Connected |
| minimum-single-node-cluster | australiaeast | ✅ Arc Connected |
| partial-single-node-cluster | swedencentral | ✅ Arc Connected |
| dual-peered-single-node-cluster | westus3 | ✅ Both clusters Arc Connected |

Only pre-existing failures observed:

- IoT Ops sync rules (`LinkedAuthorizationFailed`) — known issue, unrelated to this PR
- Grafana dashboard import script — transient 412 errors

----

#### AI description (iteration 12)

#### PR Classification

This PR adds new functionality to enable Entra ID group-based cluster admin access and Azure Arc RBAC support for connectedk8s proxy operations.

#### PR Summary

Adds support for granting cluster-admin permissions to an entire Entra ID group and assigns the required Azure Arc RBAC roles to enable `az connectedk8s proxy` access for group members.

- `k3s-device-setup.sh`: Added `CLUSTER_ADMIN_GROUP_OID` environment variable and logic to create a ClusterRoleBinding for Entra ID groups with cluster-admin permissions
- `terraform/main.tf`: Added Arc RBAC role assignments (`Azure Arc Kubernetes Viewer` and `Azure Arc Enabled Kubernetes Cluster User Role`) for both individual users and Entra ID groups on the Arc connected cluster resource
- All Terraform variable files: Added `cluster_admin_group_oid` variable to specify the Entra …
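The guarded role assignments described above could look roughly like this Terraform sketch. Resource addresses, the `has_arc_cluster` wiring, and the local names are assumptions for illustration — only the role names, the group-OID variable, and the static-count-guard idea come from the PR:

```hcl
# Illustrative sketch — the component's actual resource names and guard wiring may differ.
resource "azurerm_role_assignment" "arc_viewer_group" {
  # Static count guard: only created once the Arc connected cluster exists
  # and a group OID was supplied.
  count = local.has_arc_cluster && var.cluster_admin_group_oid != null ? 1 : 0

  scope                = azurerm_arc_kubernetes_cluster.this.id
  role_definition_name = "Azure Arc Kubernetes Viewer"
  principal_id         = var.cluster_admin_group_oid
  principal_type       = "Group"
}

resource "azurerm_role_assignment" "arc_cluster_user_group" {
  count = local.has_arc_cluster && var.cluster_admin_group_oid != null ? 1 : 0

  scope                = azurerm_arc_kubernetes_cluster.this.id
  role_definition_name = "Azure Arc Enabled Kubernetes Cluster User Role"
  principal_id         = var.cluster_admin_group_oid
  principal_type       = "Group"
}
```

Scoping both assignments to the Arc connected cluster resource (rather than the resource group) keeps the grant as narrow as `az connectedk8s proxy` requires.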
## Summary

Fixes deployment issues discovered during parallel blueprint testing. Each of the 9 deployable blueprints was tested 4 times across different Azure regions (36 total deployments, all successful).

## Fixes

### 1. Grafana dashboard import fails on re-apply and fresh deploy (`020-observability`)

**Problem:** `az grafana dashboard import` fails in two ways: (a) 412 version-mismatch errors when dashboards already exist on retry, and (b) SSL EOF errors when Grafana's SSL cert isn't ready on fresh deploys.

**Fix:** Add the `--overwrite` flag to all import calls, plus retry logic (10 attempts, 30s delay) to wait for SSL cert provisioning.

### 2. AzureML compute cluster requires public IPs without private endpoints (`080-azureml`)

**Problem:** `compute_cluster_node_public_ip_enabled` defaults to `false`, but Azure ML requires workspace private endpoints when node public IPs are disabled.

### 3. dpkg lock race condition on VM first boot (`100-cncf-cluster`)

**Problem:** `k3s-device-setup.sh` used `curl | bash` to install the Azure CLI, which runs internal `apt-get` calls without lock timeout handling. Ubuntu's `unattended-upgrades` holds the dpkg lock on first boot, causing exit code 100.

**Fix:** Replace `curl | bash` with inline `apt-get` calls, each using `DPkg::Lock::Timeout=300`. This eliminates the TOCTOU race entirely — no external installer with uncontrolled apt calls.

### 4. Default VM SKUs unavailable in most regions

**Problem:** `Standard_D8ds_v5` (the AKS node default) is available in only 2 of the 9 IoT Operations regions; the same holds for `Standard_D8s_v3` (VM host default) and `Standard_D4_v4` (minimum blueprint).

**Fix:** Update all defaults to the v6 series — available in 8/9 or 9/9 regions.
## Deployment Test Results — 36/36 Successful

Each blueprint deployed 4 times in different regions:

| Blueprint | R1 | R2 | R3 | R4 |
|---|---|---|---|---|
| only-output-cncf-cluster-script | ✅ westus | ✅ eastus2 | ✅ westeurope | ✅ southcentralus |
| only-cloud-single-node-cluster | ✅ germanywestcentral | ✅ westus3 | ✅ westus3 | ✅ southcentralus |
| minimum-single-node-cluster | ✅ westus | ✅ eastus2 | ✅ eastus2 | ✅ northeurope |
| full-single-node-cluster | ✅ germanywestcentral | ✅ westus | ✅ southcentralus | ✅ westus2 |
| partial-single-node-cluster | ✅ westus3 | ✅ germanywestcentral | ✅ westus | ✅ eastus2 |
| dual-peered-single-node-cluster | ✅ eastus2 | ✅ southcentralus | ✅ southcentralus | ✅ southcentralus |
| full-multi-node-cluster | ✅ eastus2 | ✅ westus3 | ✅ westeurope | ✅ westus |
| azureml | ✅ southcentralus | ✅ westus | ✅ germanywestcentral | ✅ westus2 |
| robotics | ✅ germanywestcentral | ✅ westus | ✅ westus2 | ✅ westeurope |

**Skipped:** fabric (requires Fabric capacity), azure-local (requires Azure Local hardware)

## Changed Files

- `src/000-cloud/020-observability/scripts/import-grafana-dashboards.sh` — overwrite + retry logic
- `src/000-cloud/080-azureml/terraform/variables.tf` — keep secure default
- `blueprints/modules/robotics/terraform/main.tf` — der…
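The retry wrapper from fix 1 can be sketched as a small bash helper. The function name and the demo command are illustrative — the real script wraps `az grafana dashboard import --overwrite` with 10 attempts and a 30s delay, while the demo below uses a zero delay and a command that fails twice before succeeding:

```shell
#!/usr/bin/env bash
# Illustrative retry helper: run a command up to $1 times with $2 seconds
# between attempts, returning the command's final status.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "retry: giving up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Demo: a command that fails twice (e.g. SSL cert not yet provisioned),
# then succeeds on the third call.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky() {
  local n
  n=$(<"$count_file")
  echo $((n + 1)) > "$count_file"
  (( n >= 2 ))
}

retry 5 0 flaky && echo "import succeeded after $(<"$count_file") attempts"
# → import succeeded after 3 attempts
```

Combined with `--overwrite`, this makes the import idempotent: retries overwrite existing dashboards instead of tripping 412 version-mismatch errors.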
Resolves conflicts:

- import-grafana-dashboards.sh: kept dev version (retry wrapper, --overwrite)
- k3s-device-setup.sh: kept dev version (install_azure_cli helper, Entra group OID, apt lock timeouts)
- deploy-cluster-admin-oid.sh: kept deletion (refactored into k3s-device-setup.sh in PR 624)
- Add blueprints/leak-detection/ with 15-module orchestration - Add leak detection scenario deployment guide - Add leak detection pipeline ADR - Add helper scripts for image building and edge deployment
- Fix cspell: add EOBS to project dictionary
- Fix shfmt: format build-app-images.sh and deploy-edge-apps.sh
- Fix markdown-table-formatter: 4 files reformatted
- Fix tflint: eventhubs default null → {} in both blueprints
All 6 megalinter linters validated locally before push.
- Fix cspell: add leakdet, Standardised, chipsets, optimises, managedidentity
- Fix ruff: sort imports in test-real-inference.py
- Fix ruff config: add S108/S603/S607 to test file ignores

All linters validated locally before push.
…gle-node-cluster

- remove blueprints/leak-detection/ and integrate 045-notification via toggle
- decouple EventHubConnection__clientId from azure-functions module to blueprint
- add function_identity output to break dependency cycle
- relocate build/deploy scripts to src/501-ci-cd/scripts/
- create leak-detection.tfvars.example for notification pipeline config

🔧 - Generated by Copilot
Rename build-app-images.sh → build-leak-detection-images.sh and deploy-edge-apps.sh → deploy-leak-detection-apps.sh to reflect their scenario-specific scope. Update the script headers with a component inventory and context on building/deploying the full image set as a unit. Addresses PR review feedback: generic names would conflict with future end-to-end scenario scripts.
…otification

Cherry-pick notification fixes from Kevin's feat/045-notification branch:

045-notification component:
- Add teams_group_id variable for Teams channel posting
- Change teams_post_location default from 'Group chat' to 'Channel'
- Add validation for teams_post_location
- Build conditional recipient JSON for channel vs group chat

Blueprint (full-single-node-cluster):
- Pass teams_group_id and teams_post_location to cloud_notification module
- Add teams_group_id, teams_post_location variables with validation
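The validation plus "conditional recipient JSON" described above could be sketched like this in Terraform. The payload keys and the local name are purely illustrative — the actual recipient shape the notification component emits is not shown in this PR:

```hcl
# Illustrative sketch — the real component's recipient JSON shape may differ.
variable "teams_post_location" {
  type    = string
  default = "Channel"

  validation {
    condition     = contains(["Channel", "Group chat"], var.teams_post_location)
    error_message = "teams_post_location must be 'Channel' or 'Group chat'."
  }
}

locals {
  # Channel posts need the Teams group ID; group-chat posts do not.
  teams_recipient_json = var.teams_post_location == "Channel" ? jsonencode({
    groupId = var.teams_group_id
  }) : jsonencode({
    postAs = "Group chat"
  })
}
```

The `validation` block fails `terraform plan` early on a typo, rather than surfacing a malformed recipient at notification time.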
Remove BDR-001 and PDR-001 links (docs/context/ folder doesn't exist). Remove real-time-vision-inference-architecture.md link (file doesn't exist). Fix blueprint README relative path.
- Reset package-lock.json to dev to fix npm ci sync errors (cspell 8→10, eslint 9→10, secretlint→12 version mismatches) - Add GHSA-vfmq-68hx-4jfw (lxml 5.3.0) to grype ignore list (transitive checkov dependency, fix requires lxml>=6.1.0)
- Exclude **/packages.lock.json from cspell (NuGet lockfiles contain package names like Macross, Newtonsoft that aren't real words) - Add easyops to project-specific dictionary (Docusaurus plugin name)
- Add **/test_*.py pattern to ruff per-file-ignores for S101 (assert in tests outside tests/ directories) - Auto-fix I001 import sorting in 4 test files - Auto-fix F401 unused import (os) in test_onvif_camera.py
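The per-file-ignores entry described above would look something like the following in `pyproject.toml` (the pattern and rule code are from the commit message; the surrounding table location is standard ruff configuration):

```toml
[tool.ruff.lint.per-file-ignores]
# Allow bare `assert` (S101) in test modules that live outside tests/ directories.
"**/test_*.py" = ["S101"]
```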
…4-01

- fix onvif-connector-assets.tfvars.example: commands/events to management_groups/event_groups
- fix sse-connector-assets.tfvars.example: events/event_notifier to event_groups/data_source
- update ONVIF ADR doc with correct namespaced_assets variable schema
- align 111-assets main.tf legacy asset block formatting with schema_validation_enabled

🔧 - Generated by Copilot
…s-webpki CVEs

- bump openssl 0.10.73/0.10.74/0.10.76 to 0.10.78 across 8 Cargo.lock files
- bump rand to 0.8.6/0.9.4/0.10.1 and rustls-webpki to 0.103.13

🔒 - Generated by Copilot
…mpliance 🔧 - Generated by Copilot
- revert 5 Cargo.lock files that diverged during rebase conflict resolution - keeps ai-edge-inference Cargo.lock (intentional small diff) 🔧 - Generated by Copilot
- reverts getrandom 0.3.4 downgrade back to 0.4.2 in tempfile dependency 🔧 - Generated by Copilot
📚 Documentation Health Report

Generated on: 2026-05-01 23:55:53 UTC

📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics

This report is automatically generated by the Documentation Automation workflow.
katriendg
left a comment
Thanks for adding this one in. Note, however, that the huge number of changed files makes it harder to review correctly. Can you revert all the changes to bash files unless they were really touched?
Also left some comments, but in general the item below is worth reviewing too:
Mislabeling: there is no leak-detection blueprint

- Symptom: The PR title `feat(blueprints): add leak-detection blueprint …` and issue #465 advertise a new blueprint.
- Reality: No new directory under `blueprints/`. Changes go into `blueprints/full-single-node-cluster/terraform/{main,outputs,variables}.tf` and a `leak-detection.tfvars.example` alongside the existing `onvif-connector-assets.tfvars.example` / `sse-connector-assets.tfvars.example`.
- Impact: Future readers will look for a blueprint that does not exist; release notes will be inaccurate; reviewers cannot find scoped changes; users may copy/paste references that don't resolve.
- Required actions:
  1. Retitle the PR and issue to `feat(blueprints): wire 045-notification into full-single-node-cluster + add leak-detection scenario guide` (or similar).
  2. Update the `docs/getting-started/leak-detection-scenario.md` H1 and prose to clarify this is a scenario built on top of the `full-single-node-cluster` blueprint, not a blueprint of its own. The current doc opens with "Deploy a Leak Detection Pipeline" without telling the reader that the blueprint is `full-single-node-cluster`.
  3. Update the `leak-detection.tfvars.example` header comment ("Full Single Node Cluster with Leak Detection") to match.
  4. Update the ADR `docs/solution-adr-library/leak-detection-e2e-pipeline-architecture.md` to describe the architecture pattern, not "the blueprint".
- Reformat new leak-detection CI/CD scripts to 2-space indent (shfmt baseline)
- Add Notification field to BlueprintOutputs Go contract for full-single-node-cluster
- Add stub notification output in bicep/main.bicep for IaC parity
- Use neutral default (event_id) for notification_partition_key_field; clarify description
- Clarify in the scenario doc, ADR, and tfvars header that this is a scenario on top of the full-single-node-cluster blueprint, not a standalone blueprint
- Regenerate terraform-docs README for variable description change
katriendg
left a comment
Other than a few items changed here that should instead be synced from ADO, I believe this is looking ready to merge. A few comments remain, plus the PR body:
🟡 PR description is stale

The PR title was updated, but the body still opens with:

> Adds an end-to-end vision-based leak detection pipeline blueprint …

and still advertises content that is no longer in the diff:

- "Shell script reformatting to 4-space indent for shfmt compliance (project-wide)" — reverted ✅, but the bullet remains.
- "`scripts/`, `src/**/scripts/` shfmt 4-space indent reformatting" — reverted ✅.
- "`.azuredevops/` release branch create pipeline, GitVersion config" — not in current diff.
- "Package-lock.json sync and lxml CVE exclusion removal" — not in current diff.
- "Blueprint outputs for notification and messaging endpoints" — only `notification` was added; no top-level `messaging` aggregate output exists in Terraform.
- "225 files across the following areas" — current diff is 108 files.

Action: Refresh the PR body to match what actually shipped. Reviewers and release-notes generation use this text.
…luster

Closes the BlueprintOutputs contract gap flagged by katriendg: the Go test struct field `Messaging map[string]any` (tagged `output:"messaging"`) had no matching top-level Terraform output, while bicep/main.bicep already declared `output messaging object`. Adds the aggregate output mirroring the Bicep shape so TestTerraformOutputsContract passes on the Terraform deploy path.
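The aggregate output closing that contract gap could look roughly like this sketch. The attribute names inside `value` are illustrative assumptions — the commit only says the output mirrors the Bicep `output messaging object` shape, not which fields it carries:

```hcl
# Illustrative sketch — field names are assumptions, not the blueprint's exact output.
output "messaging" {
  description = "Aggregate messaging endpoints, mirroring the Bicep `output messaging object` shape."
  value = {
    eventhub_namespace_name = module.messaging.eventhub_namespace_name
    eventgrid_topic_name    = module.messaging.eventgrid_topic_name
  }
}
```

With a top-level `messaging` output present on both IaC paths, the Go contract test can unmarshal the same `output:"messaging"` field regardless of whether Terraform or Bicep produced the deployment.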
## Description

Closes #465

Adds an end-to-end vision-based leak detection pipeline scenario to the edge-ai accelerator, integrating camera ingestion, AI inference, video capture, edge-to-cloud dataflow routing, and Microsoft Teams notification.

## Architecture

## Changes

### New Features

- `full-single-node-cluster` blueprint with `leak-detection.tfvars.example`

### Documentation

- `docs/getting-started/leak-detection-scenario.md`
- `docs/solution-adr-library/leak-detection-e2e-pipeline-architecture.md`

### Build & Code Quality

## Files Changed

108 files across the following areas:

- `blueprints/full-single-node-cluster/terraform/`
- `src/000-cloud/040-messaging/terraform/`
- `src/000-cloud/045-notification/terraform/`
- `src/501-ci-cd/scripts/` — `build-leak-detection-images.sh`, `deploy-leak-detection-apps.sh`
- `docs/getting-started/`
- `docs/solution-adr-library/`
- `scripts/`, `src/**/scripts/`
- `.azuredevops/`

## Testing