feat(infrastructure): add manage-node-pools script and documentation#548

Draft
bindsi wants to merge 3 commits into `main` from `feature/manage-node-pools`

Conversation


@bindsi (Member) commented Apr 24, 2026

feat(infrastructure): add manage-node-pools script for post-deployment pool edits

Description

Adds a script and documentation for adding, removing, or resizing AKS node pools on a running cluster without redeploying the infrastructure or the OSMO control plane. The original cluster was sized with 4 vCPU nodes, but an SDG workflow needed more than 6.5 vCPU, and the only previous fix was a full reinstall. This PR narrows the blast radius to a single node pool and its subnet by driving changes through Terraform's existing `for_each` over `node_pools`, and then reconciles OSMO's `POD_TEMPLATE`, `POOL`, and `BACKEND` configs automatically.
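The intended operator flow looks roughly like the transcript below. To keep the sketch self-contained, `manage-node-pools.sh` is stubbed with a shell function, and the flag spellings (`--name`, `--vm-size`, `--node-count`) and the `Standard_D8s_v5` SKU are illustrative assumptions, not taken from the script:

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Stub standing in for infrastructure/setup/optional/manage-node-pools.sh so
# this sketch runs anywhere; it only echoes the invocation it received.
manage_node_pools() {
  echo "manage-node-pools.sh $*"
}

# 1. Inspect the pools Terraform currently knows about.
manage_node_pools list

# 2. Add a larger CPU pool for the SDG workflow that needs > 6.5 vCPU.
manage_node_pools add --name sdg-cpu --vm-size Standard_D8s_v5 --node-count 2

# 3. Retire the pool once the workflow is done.
manage_node_pools remove --name sdg-cpu
```

With the real script, step 2 would also run `terraform apply` and re-render the OSMO configs, which is exactly the reconciliation this PR automates.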

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Local verification performed:

  • shellcheck passes on infrastructure/setup/optional/manage-node-pools.sh.
  • markdownlint-cli2 passes on both docs/infrastructure/manage-node-pools.md and docs/infrastructure/cluster-setup-advanced.md.
  • bash manage-node-pools.sh list against the current Terraform state returns the existing gpu pool row as expected.
  • bash manage-node-pools.sh --help renders the full usage block.

End-to-end add/remove runs against a live cluster have not been executed in this branch; the boxes above are intentionally left unchecked.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

Changes

Script

  • Added infrastructure/setup/optional/manage-node-pools.sh with four subcommands:
    • list prints the current node_pools table (name, VM size, priority, autoscale range, taints).
    • add creates a new pool from CLI flags covering vm-size, subnet, priority, node-count or auto-scale with min-count/max-count, repeatable taint/label/zone, eviction-policy (Spot only), and gpu-driver. Rejects duplicate pool names and validates flag combinations.
    • remove deletes a pool from the overlay and warns when removal empties the map or when DEFAULT_POOL from .env.local matches the pool being removed.
    • sync re-renders OSMO configs without a Terraform apply (useful after manual terraform.tfvars edits).
  • The script maintains a managed overlay at infrastructure/terraform/node-pools.managed.auto.tfvars.json. On first mutation it seeds the overlay by evaluating var.node_pools through terraform console, so Terraform's existing for_each on azurerm_kubernetes_cluster_node_pool, subnets, NSG associations, and NAT gateway associations only touches the added or removed pool.
  • After writing the overlay, the script runs terraform apply -auto-approve (skippable with --skip-apply) and then invokes infrastructure/setup/04-deploy-osmo-backend.sh to regenerate OSMO POD_TEMPLATE, POOL, and BACKEND configs. Operator-supplied flags pass through via --osmo-args so the same auth and ACR settings from the original deploy are preserved.
  • Follows the repo shell-script template: set -o errexit -o nounset, sources scripts/lib/common.sh and defaults.conf, and uses the info/warn/fatal/section/print_kv helpers.
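A minimal sketch of the flag validation the `add` subcommand is described as performing (duplicate-name rejection, `--node-count` vs. autoscale conflict, Spot-only eviction policy). The function name, argument order, and error wording here are hypothetical, not lifted from the script:

```shell
#!/usr/bin/env bash
# validate_add NAME PRIORITY NODE_COUNT AUTOSCALE EVICTION EXISTING...
# Prints "ok" when the combination is valid; otherwise prints an error
# for the first violated rule and returns non-zero.
validate_add() {
  local name="$1" priority="$2" node_count="$3" autoscale="$4" eviction="$5"
  shift 5
  local existing
  for existing in "$@"; do               # names already present in the overlay
    if [ "$existing" = "$name" ]; then
      echo "error: duplicate pool name '$name'"; return 1
    fi
  done
  if [ "$autoscale" = "true" ] && [ -n "$node_count" ]; then
    echo "error: --node-count conflicts with --auto-scale (use --min-count/--max-count)"
    return 1
  fi
  if [ -n "$eviction" ] && [ "$priority" != "Spot" ]; then
    echo "error: --eviction-policy is only valid for Spot pools"; return 1
  fi
  echo "ok"
}

validate_add sdg-cpu Regular 2 false "" gpu system       # prints "ok"
validate_add gpu Spot "" true Delete gpu system || true  # prints duplicate-name error
```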

Documentation

  • Added docs/infrastructure/manage-node-pools.md with a when-to-use rationale (including the SDG workflow that surfaced this gap), a how-it-works explanation of the overlay and `for_each` semantics, prerequisites, full flag tables, worked examples (list, a CPU pool for SDG, a Spot H100 pool with autoscaling, remove, and sync), verification commands (`kubectl get nodes`, `az aks nodepool list`, `osmo config show POOL`), and operational notes on subnet planning, DEFAULT_POOL drift, overlay-as-source-of-truth, Spot constraints, and autoscaling.
  • Updated docs/infrastructure/cluster-setup-advanced.md to list the new script in the Optional Scripts table with a link to the new page.

Related Issues

None.

Notes

  • The .auto.tfvars.json overlay is not added to .gitignore; operators can either commit it to share pool composition with the team or keep it local alongside terraform.tfvars.
  • If edits are mixed between terraform.tfvars and the overlay, the overlay wins (Terraform loads *.auto.tfvars.json files after terraform.tfvars). The new documentation flags this explicitly.
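The precedence pitfall in the note above can be checked mechanically. Below is a grep-based sketch (file names follow the PR; file contents and the warning text are fabricated for illustration) that warns when the same pool name is defined in both `terraform.tfvars` and the managed overlay:

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Fabricated stand-ins for the two files an operator might edit.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/terraform.tfvars" <<'EOF'
node_pools = {
  gpu = { vm_size = "Standard_NC24ads_A100_v4" }
}
EOF
cat > "$tmpdir/node-pools.managed.auto.tfvars.json" <<'EOF'
{ "node_pools": { "gpu": { "vm_size": "Standard_NC96ads_A100_v4" } } }
EOF

# Warn for any pool name present in both files: Terraform loads
# *.auto.tfvars.json after terraform.tfvars, so the overlay silently wins.
warnings="$(
  for pool in gpu sdg-cpu; do
    if grep -q "^  $pool = " "$tmpdir/terraform.tfvars" \
       && grep -q "\"$pool\"" "$tmpdir/node-pools.managed.auto.tfvars.json"; then
      echo "WARN: pool '$pool' defined in both files; the overlay wins"
    fi
  done
)"
echo "$warnings"
rm -rf "$tmpdir"
```

A real check would parse the HCL and JSON properly (e.g. via `terraform console`) rather than pattern-matching, but the grep version illustrates the drift the docs warn about.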

Follow-up Tasks

  • Exercise add and remove end-to-end on a dev cluster and update the Testing Performed checkboxes above.

- implement script for managing AKS node pools
- create documentation for node pool management
- include usage examples and command options

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>

github-actions Bot commented Apr 24, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA c26e2da.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None


codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.56%. Comparing base (48c38dc) to head (c26e2da).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
+ Coverage   63.91%   66.56%   +2.65%     
==========================================
  Files         250      262      +12     
  Lines       15409    16639    +1230     
  Branches     2122     2260     +138     
==========================================
+ Hits         9848    11076    +1228     
  Misses       5274     5274              
- Partials      287      289       +2     
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 83.13% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-data-pipeline | 100.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-dataviewer | 65.12% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-dm-tools | 100.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-evaluation | 99.83% <ø> (?) | |
| pytest-fuzz | 4.97% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-inference | 0.00% <ø> (ø) | Carriedforward from 10a7dae |
| pytest-training | 82.14% <ø> (ø) | Carriedforward from 10a7dae |
| vitest | 51.08% <ø> (ø) | Carriedforward from 10a7dae |

*This pull request uses carry forward flags.

See 12 files with indirect coverage changes.


> Use this when a workload requires resources the existing pools cannot provide. Examples:
>
> - An SDG workflow requires `>= 6.5` vCPU but the initial pool uses `Standard_B4` (4 vCPU).
To clarify, in this case, do I add a new pool with a different VM sku, or can I add a different VM sku to the existing pool?

> | `list` | Print configured node pools from current Terraform state |
> | `add` | Create a new node pool, apply Terraform, and sync OSMO configs |
> | `remove` | Destroy a node pool, apply Terraform, and sync OSMO configs |
> | `sync` | Re-render OSMO `POD_TEMPLATE`, `POOL`, and `BACKEND` configs only |

Is this the same as the option --config-preview for the other scripts (01-04)? If so, renaming it to the same parameter makes it more consistent.
