@@ -16,34 +16,34 @@ In summary, the way this functionality works is as follows:
    --limit control,login -e "appliances_server_action=unlock"
    ```
    is run to unlock the control and login nodes for reimaging.
-2. `tofu apply` is run which rebuilds the login and control nodes to the new
+3. `tofu apply` is run which rebuilds the login and control nodes to the new
    image(s). The new image reference for compute nodes is ignored, but is
    written into the hosts inventory file (and is therefore available as an
    Ansible hostvar).
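
   As a sketch, this step might look as follows from the relevant
   environment's OpenTofu directory; the path and the plan-review workflow
   shown are illustrative assumptions, not something the docs prescribe:

   ```shell
   # Sketch only: the environment path is an illustrative assumption.
   cd environments/$ENVIRONMENT/terraform
   tofu plan    # review: only the login and control nodes should be replaced
   tofu apply   # rebuilds login/control; the new compute image reference is
                # only written into the hosts inventory file
   ```
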
-3. The `site.yml` playbook is run which locks the instances again and reconfigures
+4. The `site.yml` playbook is run which locks the instances again and reconfigures
    the cluster as normal. At this point the cluster is functional, but using a new
    image for the login and control nodes and the old image for the compute nodes.
    This playbook also:
    - Writes cluster configuration to the control node, using the
      [compute_init](../../ansible/roles/compute_init/README.md) role.
    - Configures an application credential and helper programs on the control
      node, using the [rebuild](../../ansible/roles/rebuild/README.md) role.
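
   A minimal sketch of running this step, assuming the appliance's usual
   layout with the environment already activated:

   ```shell
   # Reconfigures the cluster, locks the instances again, and (via the
   # compute_init and rebuild roles) prepares the control node for rebuilds.
   ansible-playbook ansible/site.yml
   ```
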
-4. An admin submits Slurm jobs, one for each node, to a special "rebuild"
+5. An admin submits Slurm jobs, one for each node, to a special "rebuild"
    partition using an Ansible playbook. Because this partition has higher
    priority than the partitions normal users can use, these rebuild jobs become
    the next job in the queue for every node (although any jobs currently
    running will complete as normal).
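
   A hedged sketch of the per-node submissions such a playbook is equivalent
   to; the partition names and the trivial job body are assumptions:

   ```shell
   # Submit one no-op job per compute node to the high-priority "rebuild"
   # partition; --reboot is what later triggers the RebootProgram (next step).
   for node in $(sinfo --Node --partition=standard --noheader --format='%N'); do
     sbatch --partition=rebuild --nodelist="$node" --reboot --wrap 'true'
   done
   ```
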
-5. Because these rebuild jobs have the `--reboot` flag set, before launching them
+6. Because these rebuild jobs have the `--reboot` flag set, before launching them
    the Slurm control node runs a [RebootProgram](https://slurm.schedmd.com/slurm.conf.html#OPT_RebootProgram)
    which compares the current image for the node to the one in the cluster
    configuration, and if it does not match, uses OpenStack to rebuild the
    node to the desired (updated) image.
    TODO: Describe the logic if they DO match
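
   The actual RebootProgram is installed by the rebuild role; the following
   is only a sketch of the comparison logic described above, with the
   configuration path, the way the node name is supplied, and the image-field
   handling all being assumptions:

   ```shell
   #!/usr/bin/env bash
   # Sketch of RebootProgram logic: rebuild the node via OpenStack if its
   # current image differs from the one in the cluster configuration.
   node="${1:-$(hostname -s)}"                          # how the node name is supplied is an assumption
   desired=$(cat /var/lib/cluster-config/"$node".image) # hypothetical location
   current=$(openstack server show "$node" -f value -c image)  # field parsing simplified
   if [ "$current" != "$desired" ]; then
     openstack server rebuild --image "$desired" "$node"
   fi
   # If the images already match, an ordinary reboot would presumably suffice
   # (the TODO above leaves that case undescribed).
   ```
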
-6. After a rebuild, the compute node runs various Ansible tasks during boot,
+7. After a rebuild, the compute node runs various Ansible tasks during boot,
    controlled by the [compute_init](../../ansible/roles/compute_init/README.md)
    role, to fully configure the node again. It retrieves the required cluster
    configuration information from the control node via an NFS mount.
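
   For example, on a rebuilt compute node one might confirm that the control
   node's export is mounted; the mount point shown is hypothetical:

   ```shell
   # List NFS mounts; the exported cluster configuration should appear here.
   findmnt -t nfs,nfs4
   ls /mnt/cluster   # hypothetical mount point for the exported configuration
   ```
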
-7. Once the `slurmd` daemon starts on a compute node, the Slurm controller
+8. Once the `slurmd` daemon starts on a compute node, the Slurm controller
    registers the node as having finished rebooting. It then launches the actual
    job, which does not do anything.
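
   Rebuild progress can be followed with standard Slurm commands, for example:

   ```shell
   # Nodes show a boot/reboot-related state while rebuilding and return to
   # service once slurmd re-registers and the no-op job has run.
   sinfo --Node --format='%N %t %E'   # node, state, reason
   squeue --partition=rebuild         # rebuild jobs still queued or running
   ```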