feat(infrastructure): add manage-node-pools script and documentation #548
Draft
- implement script for managing AKS node pools
- create documentation for node pool management
- include usage examples and command options

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
Dependency Review
✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
Snapshot Warnings
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
Scanned Files
None
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

| Metric   | main   | #548   | +/-    |
|----------|--------|--------|--------|
| Coverage | 63.91% | 66.56% | +2.65% |
| Files    | 250    | 262    | +12    |
| Lines    | 15409  | 16639  | +1230  |
| Branches | 2122   | 2260   | +138   |
| Hits     | 9848   | 11076  | +1228  |
| Misses   | 5274   | 5274   |        |
| Partials | 287    | 289    | +2     |
liupeirong reviewed on Apr 26, 2026

Use this when a workload requires resources the existing pools cannot provide. Examples:

- An SDG workflow requires `>= 6.5` vCPU but the initial pool uses `Standard_B4` (4 vCPU).
To clarify, in this case, do I add a new pool with a different VM sku, or can I add a different VM sku to the existing pool?
| `list` | Print configured node pools from current Terraform state |
| `add` | Create a new node pool, apply Terraform, and sync OSMO configs |
| `remove` | Destroy a node pool, apply Terraform, and sync OSMO configs |
| `sync` | Re-render OSMO `POD_TEMPLATE`, `POOL`, and `BACKEND` configs only |
Is this the same as the option --config-preview for the other scripts (01-04)? If so, renaming it to the same parameter makes it more consistent.
feat(infrastructure): add manage-node-pools script for post-deployment pool edits
Description
Adds a script and documentation for adding, removing, or resizing AKS node pools on a running cluster without redeploying infrastructure or the OSMO control plane. The original cluster was sized with 4 vCPU nodes, but an SDG workflow needed more than 6.5 vCPU, and the previous path to fix that was a full reinstall. This PR narrows the blast radius to a single node pool and its subnet by driving changes through Terraform's existing `for_each` over `node_pools`, then reconciles OSMO's POD_TEMPLATE, POOL, and BACKEND configs automatically.
Type of Change
Component(s) Affected
- `infrastructure/terraform/prerequisites/` - Azure subscription setup
- `infrastructure/terraform/` - Terraform infrastructure
- `infrastructure/setup/` - OSMO control plane / Helm
- `workflows/` - Training and evaluation workflows
- `training/` - Training pipelines and scripts
- `docs/` - Documentation

Testing Performed
- `plan` reviewed (no unexpected changes)
- `apply` tested in dev environment
- (`smoke_test_azure.py`)

Local verification performed:

- `shellcheck` passes on infrastructure/setup/optional/manage-node-pools.sh.
- `markdownlint-cli2` passes on both docs/infrastructure/manage-node-pools.md and docs/infrastructure/cluster-setup-advanced.md.
- `bash manage-node-pools.sh list` against the current Terraform state returns the existing `gpu` pool row as expected.
- `bash manage-node-pools.sh --help` renders the full usage block.

End-to-end `add`/`remove` runs against a live cluster have not been executed in this branch; the boxes above are intentionally left unchecked.

Documentation Impact
Bug Fix Checklist
Complete this section for bug fix PRs. Skip for other contribution types.
Checklist
Changes
Script
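As a rough illustration of the flag-combination checks that this section attributes to `add`, the sketch below validates a handful of pool settings in plain bash. It is not taken from manage-node-pools.sh: the function name `validate_add_flags`, its positional arguments, and the error messages are hypothetical. It also assumes that a fixed node-count and auto-scale are mutually exclusive, and that priority takes the Azure values `Regular` or `Spot`.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Hypothetical helper, not taken from manage-node-pools.sh.
# Returns 0 when the flag combination is acceptable; otherwise prints a
# reason to stderr and returns 1.
# Positional arguments (pass an empty string for an unset flag):
#   $1 priority         Regular | Spot
#   $2 auto_scale       true | false
#   $3 node_count       fixed node count
#   $4 min_count        autoscale lower bound
#   $5 max_count        autoscale upper bound
#   $6 eviction_policy  e.g. Delete or Deallocate
validate_add_flags() {
  local priority="$1" auto_scale="$2" node_count="$3"
  local min_count="$4" max_count="$5" eviction_policy="$6"
  local is_number='^[0-9]+$'

  if [[ "$priority" != "Regular" && "$priority" != "Spot" ]]; then
    echo "priority must be Regular or Spot" >&2
    return 1
  fi

  # An eviction policy is only meaningful for Spot pools.
  if [[ -n "$eviction_policy" && "$priority" != "Spot" ]]; then
    echo "eviction-policy requires Spot priority" >&2
    return 1
  fi

  if [[ "$auto_scale" == "true" ]]; then
    # Assumption: a fixed node-count and auto-scale are mutually exclusive.
    if [[ -n "$node_count" ]]; then
      echo "node-count cannot be combined with auto-scale" >&2
      return 1
    fi
    if [[ ! "$min_count" =~ $is_number || ! "$max_count" =~ $is_number ]]; then
      echo "auto-scale needs numeric min-count and max-count" >&2
      return 1
    fi
    # Force base 10 so values such as "08" are not parsed as octal.
    if (( 10#$min_count > 10#$max_count )); then
      echo "min-count must not exceed max-count" >&2
      return 1
    fi
  else
    if [[ -n "$min_count" || -n "$max_count" ]]; then
      echo "min-count/max-count require auto-scale" >&2
      return 1
    fi
    if [[ ! "$node_count" =~ $is_number ]]; then
      echo "node-count must be a number when auto-scale is off" >&2
      return 1
    fi
  fi
  return 0
}
```

Running all checks before any Terraform state is touched lets a bad combination fail fast, which fits the stated goal of keeping the blast radius to a single pool.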
- `list` prints the current `node_pools` table (name, VM size, priority, autoscale range, taints).
- `add` creates a new pool from CLI flags covering vm-size, subnet, priority, node-count or auto-scale with min-count/max-count, repeatable taint/label/zone, eviction-policy (Spot only), and gpu-driver. Rejects duplicate pool names and validates flag combinations.
- `remove` deletes a pool from the overlay and warns when removal empties the map or when `DEFAULT_POOL` from .env.local matches the pool being removed.
- `sync` re-renders OSMO configs without a Terraform apply (useful after manual terraform.tfvars edits).
- `var.node_pools` through `terraform console`, so Terraform's existing `for_each` on `azurerm_kubernetes_cluster_node_pool`, subnets, NSG associations, and NAT gateway associations only touches the added or removed pool.
- `terraform apply -auto-approve` (skippable with `--skip-apply`) and then invokes infrastructure/setup/04-deploy-osmo-backend.sh to regenerate OSMO POD_TEMPLATE, POOL, and BACKEND configs. Operator-supplied flags pass through via `--osmo-args` so the same auth and ACR settings from the original deploy are preserved.
- `set -o errexit -o nounset`, sources `scripts/lib/common.sh` and `defaults.conf`, and uses the `info`/`warn`/`fatal`/`section`/`print_kv` helpers.

Documentation
`for_each` semantics, prerequisites, full flag tables, four worked examples (list, CPU pool for SDG, Spot H100 with autoscaling, remove, sync), verification commands (`kubectl get nodes`, `az aks nodepool list`, `osmo config show POOL`), and operational notes on subnet planning, `DEFAULT_POOL` drift, overlay-as-source-of-truth, Spot constraints, and autoscaling.

Related Issues
None.
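The `DEFAULT_POOL` drift mentioned in the Changes section (the warning on `remove` and the operational note in the new documentation) could be detected along the lines of the sketch below. It is illustrative only: the helper names `read_default_pool` and `warn_if_default_pool` are hypothetical, and it assumes `.env.local` holds plain `KEY=value` lines, optionally double-quoted, with no `export` prefix.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Hypothetical helper: print the DEFAULT_POOL value from an env file,
# or nothing when the file or the key is absent.
# Assumes plain KEY=value lines; the last occurrence wins.
read_default_pool() {
  local env_file="$1" line value=""
  [[ -f "$env_file" ]] || return 0
  while IFS= read -r line || [[ -n "$line" ]]; do
    case "$line" in
      DEFAULT_POOL=*) value="${line#DEFAULT_POOL=}" ;;
    esac
  done < "$env_file"
  # Strip one pair of optional surrounding double quotes.
  value="${value%\"}"
  value="${value#\"}"
  printf '%s' "$value"
}

# Hypothetical helper: returns 0 and warns on stderr when the pool about
# to be removed ($1) is still the DEFAULT_POOL in the env file ($2);
# returns 1 when there is nothing to warn about.
warn_if_default_pool() {
  local pool="$1" env_file="$2" current
  current="$(read_default_pool "$env_file")"
  if [[ -n "$current" && "$current" == "$pool" ]]; then
    echo "WARN: DEFAULT_POOL in ${env_file} still points at '${pool}'" >&2
    return 0
  fi
  return 1
}
```

A caller might run `warn_if_default_pool gpu .env.local || true` before removing the `gpu` pool, so the warning is shown without aborting under `errexit`.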
Notes
- `.auto.tfvars.json` overlay is not added to `.gitignore`; operators can either commit it to share pool composition with the team or keep it local alongside `terraform.tfvars`.
- `*.auto.tfvars*` after `terraform.tfvars`). The new documentation flags this explicitly.

Follow-up Tasks
- Run `add` and `remove` end-to-end on a dev cluster and update the Testing Performed checkboxes above.
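For the end-to-end run, the three verification commands named in the Documentation summary (`kubectl get nodes`, `az aks nodepool list`, `osmo config show POOL`) could be bundled into one helper, sketched below. The wrapper name `verify_node_pool` and its arguments are hypothetical. The `agentpool` node label is the one AKS applies to pool nodes, and the resource group and cluster name are placeholders the operator must supply.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset

# Hypothetical wrapper around the verification commands listed in the PR.
#   $1 pool name
#   $2 Azure resource group (placeholder, supplied by the operator)
#   $3 AKS cluster name     (placeholder, supplied by the operator)
verify_node_pool() {
  local pool="$1" resource_group="$2" cluster="$3"

  echo "== Kubernetes nodes in pool ${pool}"
  # AKS labels each node with the name of the pool it belongs to.
  kubectl get nodes --selector "agentpool=${pool}" --output wide

  echo "== AKS node pools"
  az aks nodepool list \
    --resource-group "$resource_group" \
    --cluster-name "$cluster" \
    --output table

  echo "== OSMO pool config"
  osmo config show POOL
}
```

Calling `verify_node_pool cpu my-resource-group my-aks-cluster` after an `add` should show the new nodes, the new pool row, and the regenerated OSMO POOL config. After a `remove`, the first command should return no nodes for that pool.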