
Harden Azure provisioning recovery#15697

Draft
davidfowl wants to merge 4 commits into main from
davidfowl/azure-provisioning-refactor

Conversation

@davidfowl
Contributor

@davidfowl davidfowl commented Mar 30, 2026

Description

This PR introduces AzureProvisioningController, a serialized control loop that coordinates all run-mode Azure provisioning operations. It replaces the inline provisioning logic that previously lived in AzureProvisioner with a channel-based queue that serializes startup provisioning, dashboard commands, CLI commands, and background drift detection through a single processing loop.

Controller architecture

The controller uses a Channel<QueuedOperation> with a single reader. Every operation — provision, reprovision, reset, change-location, change-context, delete, drift-check — is modeled as a typed intent record that gets enqueued and processed one at a time. This eliminates races between concurrent dashboard commands, CLI commands, and the periodic drift monitor.
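The single-reader pattern described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the PR's C# code: `asyncio.Queue` stands in for `Channel<QueuedOperation>`, and the `Reprovision`, `DriftCheck`, and `Controller` names are hypothetical.

```python
import asyncio
from dataclasses import dataclass


# Hypothetical intent records; the PR models these as typed C# records.
@dataclass(frozen=True)
class Reprovision:
    resource: str


@dataclass(frozen=True)
class DriftCheck:
    pass


class Controller:
    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self.log: list[str] = []

    async def enqueue(self, intent) -> None:
        """Queue an intent and wait until the loop has processed it."""
        done = asyncio.get_running_loop().create_future()
        await self._queue.put((intent, done))
        await done

    async def run(self) -> None:
        """Single reader: intents are handled strictly one at a time."""
        while True:
            intent, done = await self._queue.get()
            try:
                self.log.append(f"start {intent}")
                await asyncio.sleep(0)  # stand-in for real provisioning work
                self.log.append(f"end {intent}")
                done.set_result(None)
            except Exception as exc:
                done.set_exception(exc)


async def demo() -> list[str]:
    controller = Controller()
    reader = asyncio.create_task(controller.run())
    # Two callers submit at the same time; the loop still runs them in turn.
    await asyncio.gather(
        controller.enqueue(Reprovision("storage")),
        controller.enqueue(DriftCheck()),
    )
    reader.cancel()
    return controller.log
```

Running `asyncio.run(demo())` yields a log in which each intent's start and end entries are adjacent, because only the one reader ever touches provisioning state.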

Within a provisioning pass, individual resources fan out concurrently but are ordered by dependency. Each resource gets its own ProvisioningTaskCompletionSource (TCS), which downstream resources await before starting their own deployment. The TCS is completed through exactly two paths (CompleteProvisioning / FailProvisioning), so dependents unblock as soon as their prerequisites finish rather than waiting for the entire batch.
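A rough sketch of the dependency gating, again in Python rather than the PR's C#: one `asyncio` future per resource plays the role of the per-resource completion source. The `provision_all` function and its `deps` / `failing` parameters are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
import asyncio


async def provision_all(deps: dict[str, list[str]], failing: set[str]) -> dict[str, str]:
    """Provision every resource concurrently, gated on its dependencies."""
    loop = asyncio.get_running_loop()
    # One future per resource stands in for the per-resource completion source.
    done = {name: loop.create_future() for name in deps}
    status: dict[str, str] = {}

    async def provision(name: str) -> None:
        try:
            # Wait only for this resource's own prerequisites, not the whole batch.
            for dep in deps[name]:
                await done[dep]
            if name in failing:
                raise RuntimeError(f"{name} failed to deploy")
            await asyncio.sleep(0)  # stand-in for the real deployment
        except Exception as exc:
            status[name] = "failed"
            done[name].set_exception(exc)  # the "fail" path
        else:
            status[name] = "ok"
            done[name].set_result(None)  # the "complete" path

    await asyncio.gather(*(provision(name) for name in deps))
    # Mark exceptions as retrieved on futures that no dependent awaited.
    for fut in done.values():
        fut.exception()
    return status
```

In this sketch a failed prerequisite propagates to its dependents, while unrelated resources still complete. Whether the PR propagates failures the same way is not stated in the description.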

What the provisioning stack can do now

Resource commands (per-resource):

  • Reprovision — clears cached deployment state for a resource and its children/role-assignments, then redeploys
  • Change location — prompts for a new Azure region, deletes the existing ARM resource if it conflicts, sets a per-resource location override, and reprovisions
  • Forget state — clears cached deployment state without reprovisioning

Environment commands (all resources):

  • Reset provisioning state — wipes all cached deployment state and resets the environment to NotStarted
  • Change Azure context — re-prompts for subscription/tenant/resource-group/location, then reprovisions all resources with the new context
  • Reprovision all — clears and redeploys all Azure resources while preserving location overrides
  • Delete Azure resources — deletes the resource group and resets state

Background drift detection:

  • Periodic timer probes ARM to verify each running resource still exists
  • Marks missing resources as "Missing in Azure" and the environment as "Drifted"
  • Non-overlapping — queues at most one check through the same serialized channel
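The non-overlapping behaviour can be approximated with a single pending flag. This is a minimal Python sketch under stated assumptions: `DriftMonitor`, `tick`, `completed`, and `find_missing` are hypothetical names, a `deque` stands in for the controller's channel, and the ARM probe is reduced to a set of names.

```python
from collections import deque


class DriftMonitor:
    """Queues at most one drift check at a time onto a shared queue."""

    def __init__(self, queue: deque) -> None:
        self._queue = queue
        self._pending = False

    def tick(self) -> bool:
        """Called by a periodic timer; returns True if a check was queued."""
        if self._pending:
            return False  # a check is already queued or running, so skip this tick
        self._pending = True
        self._queue.append("drift-check")
        return True

    def completed(self) -> None:
        """Called by the processing loop once the check has finished."""
        self._pending = False


def find_missing(expected: set[str], existing_in_azure: set[str]) -> set[str]:
    """Resources the app expects that the (hypothetical) ARM probe did not find."""
    return expected - existing_in_azure
```

A timer that fires while a check is still pending does nothing, so slow probes cannot pile up behind user-initiated commands in the queue.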

Azure resource metadata:

  • Both fresh and cached-state resources now expose: azure.subscription.id, azure.resource.group, azure.tenant.id, azure.tenant.domain, azure.location, and resource.source (full ARM deployment id)
  • Failed resources stamp the predicted deployment id before the ARM call, so agents and tools can still query Azure even when provisioning fails
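To illustrate why stamping before the ARM call matters, here is a small Python sketch. The id layout is the standard ARM format for resource-group-scoped deployments. How the PR derives the deployment name is not stated, so naming the deployment after the resource is an assumption, as are the `provision` and `deploy` names and the `state` values.

```python
def predicted_deployment_id(subscription_id: str, resource_group: str, deployment_name: str) -> str:
    """Build the ARM id a resource-group-scoped deployment would have."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Resources/deployments/{deployment_name}"
    )


def provision(resource: dict, deploy) -> None:
    """Stamp metadata first so it survives a failed deployment."""
    resource["resource.source"] = predicted_deployment_id(
        resource["azure.subscription.id"],
        resource["azure.resource.group"],
        resource["name"],  # assumption: the deployment is named after the resource
    )
    try:
        deploy(resource)  # hypothetical callable wrapping the ARM call
        resource["state"] = "Running"
    except Exception:
        # resource.source is already set, so tools can still look the deployment up.
        resource["state"] = "Failed"
```

Because the id is written before `deploy` runs, a failure leaves the resource with enough metadata for an agent or tool to query the deployment in Azure.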

Location overrides:

  • Per-resource overrides are persisted in deployment state and survive resets/reprovisioning
  • When changing location, the controller deletes the existing Azure resource first to avoid ARM InvalidResourceLocation conflicts
  • Stale overrides are cleared when the effective location changes

Other changes

  • BicepProvisioner — hardened checksum reuse validation, unified Azure identity metadata across fresh/cached paths, predicted deployment-id stamping for failed resources
  • RunModeProvisioningContextProvider — refactored Azure context acquisition and interactive prompting
  • AzureResourcePreparer — wires per-resource commands into the app model
  • Only registers AzureProvisioningController in run mode (fixes DI failures in publish/test scenarios)

Test coverage

  • Controller regression tests covering: reprovision, reset, change-location, change-context, delete, drift detection, dependency ordering, command state management, location override preservation
  • Provisioner regression tests covering: checksum validation, cached-state identity properties, stale location overrides, failed-resource metadata stamping

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15697

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15697"

davidfowl and others added 2 commits March 29, 2026 19:12
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@davidfowl davidfowl force-pushed the davidfowl/azure-provisioning-refactor branch from d224200 to 4956a53 Compare March 30, 2026 04:11
The controller depends on IAzureProvisioningOptionsManager which is only
registered in run mode. Moving the controller registration inside the
run-mode block fixes the DI resolution failure in publish/test scenarios.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@davidfowl davidfowl requested review from danegsta and karolz-ms March 30, 2026 04:24
…r fallback

AzureProvisioner is resolved in all modes as an eventing subscriber but
depends on AzureProvisioningController. Register the controller and a
no-op IAzureProvisioningOptionsManager unconditionally so DI succeeds in
publish/test mode. In run mode, RunModeProvisioningContextProvider
overrides the no-op via AddSingleton (registered before TryAdd).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

@github-actions
Contributor

🎬 CLI E2E Test Recordings — 52 recordings uploaded (commit 7bbaf1f)

View recordings
Test Recording
AddPackageInteractiveWhileAppHostRunningDetached ▶️ View Recording
AddPackageWhileAppHostRunningDetached ▶️ View Recording
AgentCommands_AllHelpOutputs_AreCorrect ▶️ View Recording
AgentInitCommand_DefaultSelection_InstallsSkillOnly ▶️ View Recording
AgentInitCommand_MigratesDeprecatedConfig ▶️ View Recording
AspireAddPackageVersionToDirectoryPackagesProps ▶️ View Recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps ▶️ View Recording
Banner_DisplayedOnFirstRun ▶️ View Recording
Banner_DisplayedWithExplicitFlag ▶️ View Recording
Banner_NotDisplayedWithNoLogoFlag ▶️ View Recording
CertificatesClean_RemovesCertificates ▶️ View Recording
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate ▶️ View Recording
CertificatesTrust_WithUntrustedCert_TrustsCertificate ▶️ View Recording
ConfigSetGet_CreatesNestedJsonFormat ▶️ View Recording
CreateAndRunAspireStarterProject ▶️ View Recording
CreateAndRunAspireStarterProjectWithBundle ▶️ View Recording
CreateAndRunEmptyAppHostProject ▶️ View Recording
CreateAndRunJavaEmptyAppHostProject ▶️ View Recording
CreateAndRunJsReactProject ▶️ View Recording
CreateAndRunPythonReactProject ▶️ View Recording
CreateAndRunTypeScriptEmptyAppHostProject ▶️ View Recording
CreateAndRunTypeScriptStarterProject ▶️ View Recording
CreateJavaAppHostWithViteApp ▶️ View Recording
CreateStartAndStopAspireProject ▶️ View Recording
CreateTypeScriptAppHostWithViteApp ▶️ View Recording
DescribeCommandResolvesReplicaNames ▶️ View Recording
DescribeCommandShowsRunningResources ▶️ View Recording
DetachFormatJsonProducesValidJson ▶️ View Recording
DoctorCommand_DetectsDeprecatedAgentConfig ▶️ View Recording
DoctorCommand_WithSslCertDir_ShowsTrusted ▶️ View Recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted ▶️ View Recording
GlobalMigration_HandlesCommentsAndTrailingCommas ▶️ View Recording
GlobalMigration_HandlesMalformedLegacyJson ▶️ View Recording
GlobalMigration_PreservesAllValueTypes ▶️ View Recording
GlobalMigration_SkipsWhenNewConfigExists ▶️ View Recording
GlobalSettings_MigratedFromLegacyFormat ▶️ View Recording
InvalidAppHostPathWithComments_IsHealedOnRun ▶️ View Recording
LogsCommandShowsResourceLogs ▶️ View Recording
PsCommandListsRunningAppHost ▶️ View Recording
PsFormatJsonOutputsOnlyJsonToStdout ▶️ View Recording
PublishWithDockerComposeServiceCallbackSucceeds ▶️ View Recording
RestoreGeneratesSdkFiles ▶️ View Recording
RunWithMissingAwaitShowsHelpfulError ▶️ View Recording
SecretCrudOnDotNetAppHost ▶️ View Recording
SecretCrudOnTypeScriptAppHost ▶️ View Recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels ▶️ View Recording
StopAllAppHostsFromAppHostDirectory ▶️ View Recording
StopAllAppHostsFromUnrelatedDirectory ▶️ View Recording
StopNonInteractiveMultipleAppHostsShowsError ▶️ View Recording
StopNonInteractiveSingleAppHost ▶️ View Recording
StopWithNoRunningAppHostExitsSuccessfully ▶️ View Recording
TypeScriptAppHostWithProjectReferenceIntegration ▶️ View Recording

📹 Recordings uploaded automatically from CI run #23728175797

@davidfowl
Contributor Author

@copilot fix the failing test

Contributor

Copilot AI commented Mar 30, 2026

@copilot fix the failing test

I investigated all Azure test classes locally and all tests pass (BicepProvisionerTests, EnvironmentResourceExtensionsTests, ResourcePreparerTests, ProvisioningContextProviderTests, DeployerTests, BicepResourceTests — 178+ tests total). I was unable to access the CI logs to identify the specific failing test due to DNS restrictions in this environment. The previous two commits (95026c9, 7bbaf1f) addressed DI resolution failures for AzureProvisioningController in publish/test mode — could you point to the specific failing test or share the CI run URL so I can investigate further?

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • azcliprod.blob.core.windows.net
    • Triggering command: /usr/bin/../../opt/az/bin/python3 /usr/bin/../../opt/az/bin/python3 -Im azure.cli account get-access-token --output json --resource REDACTED (dns block)
  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh gh pr list --repo microsoft/aspire --head davidfowl/azure-provisioning-refactor --json number,title committer.name ptables docker netw�� ls --format /usr/bin/docker HEAD (http block)
  • https://api.github.com/repos/microsoft/aspire/actions/runs
    • Triggering command: /usr/bin/gh gh run list --repo microsoft/aspire --branch davidfowl/azure-provisioning-refactor --limit 5 --json databaseId,status,conclusion,name docker netw�� ls --format /usr/bin/docker --verify davidfowl/azure--p rosoft.developercontrolplane.linls docker (http block)
    • Triggering command: /usr/bin/gh gh run list --repo microsoft/aspire --branch davidfowl/azure-provisioning-refactor --limit 3 get-access-tokenrun-controllers /home/REDACTED/.ca--kubeconfig get/packages/mic/tmp/aspire-dcp7J1czA/kubeconfig netw�� ls oj /usr/bin/docker account TR7N1/kubeconfignetwork /home/REDACTED/.cols docker (http block)
  • https://api.github.com/repos/microsoft/aspire/commits/7bbaf1f11e709efb1696d30fc76e2795e57bd97a/check-runs
    • Triggering command: /usr/bin/gh gh api /repos/microsoft/aspire/commits/7bbaf1f11e709efb1696d30fc76e2795e57bd97a/check-runs?per_page=100 controlplane.linget-access-token netw�� ls oj /usr/bin/docker --local artifacts/bin/As-rs /home/REDACTED/.dotnet/tools echo ls docker netw�� ls --format rosoft.developercontrolplane.linux-amd64/0.22.9/tools/dcp --noprofile (http block)
    • Triggering command: /usr/bin/gh gh api /repos/microsoft/aspire/commits/7bbaf1f11e709efb1696d30fc76e2795e57bd97a/check-runs?per_page=100 docker netw�� ls --format /usr/bin/docker set-url origin name docker _.py�� ls --format /usr/bin/docker AzureProvisioninuname tests/Aspire.Ora-rs t/lsb_release docker (http block)
  • https://api.github.com/repos/microsoft/aspire/pulls
    • Triggering command: /usr/bin/gh gh api repos/microsoft/aspire/pulls?head=microsoft:davidfowl/azure-provisioning-refactor docker netw�� ls --format /usr/bin/docker --local committer.email get/packages/mic--filter-not-trait docker netw�� ls --format /usr/bin/docker --oneline 2hMCk/kubeconfig/usr/bin/az tnet/tools/unameaccount docker (http block)
    • Triggering command: /usr/bin/gh gh api repos/microsoft/aspire/pulls?head=microsoft:davidfowl/azure-provisioning-refactor&state=open docker netw�� ls --format /usr/bin/docker ls --format /usr/bin/docker docker netw�� k/aspire/aspire/artifacts/log/De-- --format /usr/bin/docker ls --format es docker (http block)
  • https://api.github.com/repos/microsoft/aspire/statuses/7bbaf1f11e709efb1696d30fc76e2795e57bd97a
    • Triggering command: /usr/bin/gh gh api /repos/microsoft/aspire/statuses/7bbaf1f11e709efb1696d30fc76e2795e57bd97a docker netw�� ls --format /usr/bin/docker azd auth token -/home/REDACTED/.nuget/packages/microsoft.developercontrolplane.linux-amd64/0.22.9/--format (http block)

If you need me to access, download, or install something from one of these locations, you can either:

