feat(infrastructure): add manage-node-pools script and documentation#548

Draft
bindsi wants to merge 3 commits into `main` from `feature/manage-node-pools`

Conversation


@bindsi (Member) commented Apr 24, 2026

feat(infrastructure): add manage-node-pools script for post-deployment pool edits

Description

Adds a script and documentation for adding, removing, or resizing AKS node pools on a running cluster without redeploying the infrastructure or the OSMO control plane. The original cluster was sized with 4 vCPU nodes, but an SDG workflow needed more than 6.5 vCPU, and the only previous fix was a full reinstall. This PR narrows the blast radius to a single node pool and its subnet by driving changes through Terraform's existing `for_each` over `node_pools`, and then reconciles OSMO's `POD_TEMPLATE`, `POOL`, and `BACKEND` configs automatically.
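The intended operator flow looks roughly like the transcript below. To keep the sketch self-contained, `manage-node-pools.sh` is stubbed with a shell function, and the flag spellings (`--name`, `--vm-size`, `--node-count`) and the `Standard_D8s_v5` SKU are illustrative assumptions, not taken from the script:

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Stub standing in for infrastructure/setup/optional/manage-node-pools.sh so
# this sketch runs anywhere; it only echoes the invocation it received.
manage_node_pools() {
  echo "manage-node-pools.sh $*"
}

# 1. Inspect the pools Terraform currently knows about.
manage_node_pools list

# 2. Add a larger CPU pool for the SDG workflow that needs > 6.5 vCPU.
manage_node_pools add --name sdg-cpu --vm-size Standard_D8s_v5 --node-count 2

# 3. Retire the pool once the workflow is done.
manage_node_pools remove --name sdg-cpu
```

With the real script, step 2 would also run `terraform apply` and re-render the OSMO configs, which is exactly the reconciliation this PR automates.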

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Local verification performed:

  • shellcheck passes on infrastructure/setup/optional/manage-node-pools.sh.
  • markdownlint-cli2 passes on both docs/infrastructure/manage-node-pools.md and docs/infrastructure/cluster-setup-advanced.md.
  • bash manage-node-pools.sh list against the current Terraform state returns the existing gpu pool row as expected.
  • bash manage-node-pools.sh --help renders the full usage block.

End-to-end add/remove runs against a live cluster have not been executed in this branch; the boxes above are intentionally left unchecked.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

Changes

Script

  • Added infrastructure/setup/optional/manage-node-pools.sh with four subcommands:
    • list prints the current node_pools table (name, VM size, priority, autoscale range, taints).
    • add creates a new pool from CLI flags covering vm-size, subnet, priority, node-count or auto-scale with min-count/max-count, repeatable taint/label/zone, eviction-policy (Spot only), and gpu-driver. Rejects duplicate pool names and validates flag combinations.
    • remove deletes a pool from the overlay and warns when removal empties the map or when DEFAULT_POOL from .env.local matches the pool being removed.
    • sync re-renders OSMO configs without a Terraform apply (useful after manual terraform.tfvars edits).
  • The script maintains a managed overlay at infrastructure/terraform/node-pools.managed.auto.tfvars.json. On first mutation it seeds the overlay by evaluating var.node_pools through terraform console, so Terraform's existing for_each on azurerm_kubernetes_cluster_node_pool, subnets, NSG associations, and NAT gateway associations only touches the added or removed pool.
  • After writing the overlay, the script runs terraform apply -auto-approve (skippable with --skip-apply) and then invokes infrastructure/setup/04-deploy-osmo-backend.sh to regenerate OSMO POD_TEMPLATE, POOL, and BACKEND configs. Operator-supplied flags pass through via --osmo-args so the same auth and ACR settings from the original deploy are preserved.
  • Follows the repo shell-script template: set -o errexit -o nounset, sources scripts/lib/common.sh and defaults.conf, and uses the info/warn/fatal/section/print_kv helpers.
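A minimal sketch of the flag validation the `add` subcommand is described as performing (duplicate-name rejection, `--node-count` vs. autoscale conflict, Spot-only eviction policy). The function name, argument order, and error wording here are hypothetical, not lifted from the script:

```shell
#!/usr/bin/env bash
# validate_add NAME PRIORITY NODE_COUNT AUTOSCALE EVICTION EXISTING...
# Prints "ok" when the combination is valid; otherwise prints an error
# for the first violated rule and returns non-zero.
validate_add() {
  local name="$1" priority="$2" node_count="$3" autoscale="$4" eviction="$5"
  shift 5
  local existing
  for existing in "$@"; do               # names already present in the overlay
    if [ "$existing" = "$name" ]; then
      echo "error: duplicate pool name '$name'"; return 1
    fi
  done
  if [ "$autoscale" = "true" ] && [ -n "$node_count" ]; then
    echo "error: --node-count conflicts with --auto-scale (use --min-count/--max-count)"
    return 1
  fi
  if [ -n "$eviction" ] && [ "$priority" != "Spot" ]; then
    echo "error: --eviction-policy is only valid for Spot pools"; return 1
  fi
  echo "ok"
}

validate_add sdg-cpu Regular 2 false "" gpu system       # prints "ok"
validate_add gpu Spot "" true Delete gpu system || true  # prints duplicate-name error
```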

Documentation

  • Added docs/infrastructure/manage-node-pools.md with a when-to-use rationale (including the SDG workflow that surfaced this gap), a how-it-works explanation of the overlay and `for_each` semantics, prerequisites, full flag tables, worked examples (list, a CPU pool for SDG, a Spot H100 pool with autoscaling, remove, and sync), verification commands (`kubectl get nodes`, `az aks nodepool list`, `osmo config show POOL`), and operational notes on subnet planning, DEFAULT_POOL drift, overlay-as-source-of-truth, Spot constraints, and autoscaling.
  • Updated docs/infrastructure/cluster-setup-advanced.md to list the new script in the Optional Scripts table with a link to the new page.

Related Issues

None.

Notes

  • The .auto.tfvars.json overlay is not added to .gitignore; operators can either commit it to share pool composition with the team or keep it local alongside terraform.tfvars.
  • If edits are mixed between terraform.tfvars and the overlay, the overlay wins (Terraform loads *.auto.tfvars.json files after terraform.tfvars). The new documentation flags this explicitly.
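The precedence pitfall in the note above can be checked mechanically. Below is a grep-based sketch (file names follow the PR; file contents and the warning text are fabricated for illustration) that warns when the same pool name is defined in both `terraform.tfvars` and the managed overlay:

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Fabricated stand-ins for the two files an operator might edit.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/terraform.tfvars" <<'EOF'
node_pools = {
  gpu = { vm_size = "Standard_NC24ads_A100_v4" }
}
EOF
cat > "$tmpdir/node-pools.managed.auto.tfvars.json" <<'EOF'
{ "node_pools": { "gpu": { "vm_size": "Standard_NC96ads_A100_v4" } } }
EOF

# Warn for any pool name present in both files: Terraform loads
# *.auto.tfvars.json after terraform.tfvars, so the overlay silently wins.
warnings="$(
  for pool in gpu sdg-cpu; do
    if grep -q "^  $pool = " "$tmpdir/terraform.tfvars" \
       && grep -q "\"$pool\"" "$tmpdir/node-pools.managed.auto.tfvars.json"; then
      echo "WARN: pool '$pool' defined in both files; the overlay wins"
    fi
  done
)"
echo "$warnings"
rm -rf "$tmpdir"
```

A real check would parse the HCL and JSON properly (e.g. via `terraform console`) rather than pattern-matching, but the grep version illustrates the drift the docs warn about.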

Follow-up Tasks

  • Exercise add and remove end-to-end on a dev cluster and update the Testing Performed checkboxes above.

- implement script for managing AKS node pools
- create documentation for node pool management
- include usage examples and command options

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>

github-actions Bot commented Apr 24, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA c26e2da.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None


codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.56%. Comparing base (48c38dc) to head (c26e2da).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
+ Coverage   63.91%   66.56%   +2.65%     
==========================================
  Files         250      262      +12     
  Lines       15409    16639    +1230     
  Branches     2122     2260     +138     
==========================================
+ Hits         9848    11076    +1228     
  Misses       5274     5274              
- Partials      287      289       +2     
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 83.13% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-data-pipeline | 100.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-dataviewer | 65.12% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-dm-tools | 100.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-evaluation | 99.83% <ø> (?) | |
| pytest-fuzz | 4.97% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-inference | 0.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-training | 82.14% <ø> (ø) | Carriedforward from 10a7dae |
| vitest | 51.08% <ø> (ø) | Carriedforward from 10a7dae |

*This pull request uses carry forward flags.

See 12 files with indirect coverage changes.


> Use this when a workload requires resources the existing pools cannot provide. Examples:
>
> - An SDG workflow requires `>= 6.5` vCPU but the initial pool uses `Standard_B4` (4 vCPU).
To clarify, in this case, do I add a new pool with a different VM sku, or can I add a different VM sku to the existing pool?

> | `list` | Print configured node pools from current Terraform state |
> | `add` | Create a new node pool, apply Terraform, and sync OSMO configs |
> | `remove` | Destroy a node pool, apply Terraform, and sync OSMO configs |
> | `sync` | Re-render OSMO `POD_TEMPLATE`, `POOL`, and `BACKEND` configs only |

Is this the same as the option --config-preview for the other scripts (01-04)? If so, renaming it to the same parameter makes it more consistent.
