Commit e355f49

fix
1 parent 5192f60 commit e355f49


docs/experimental/slurm-controlled-rebuild.md

Lines changed: 6 additions & 6 deletions
@@ -16,34 +16,34 @@ In summary, the way this functionality works is as follows:
     --limit control,login -e "appliances_server_action=unlock"
     ```
    is run to unlock the control and login nodes for reimaging.
-2. `tofu apply` is run which rebuilds the login and control nodes to the new
+3. `tofu apply` is run which rebuilds the login and control nodes to the new
    image(s). The new image reference for compute nodes is ignored, but is
    written into the hosts inventory file (and is therefore available as an
    Ansible hostvar).
-3. The `site.yml` playbook is run which locks the instances again and reconfigures
+4. The `site.yml` playbook is run which locks the instances again and reconfigures
    the cluster as normal. At this point the cluster is functional, but using a new
    image for the login and control nodes and the old image for the compute nodes.
    This playbook also:
    - Writes cluster configuration to the control node, using the
      [compute_init](../../ansible/roles/compute_init/README.md) role.
    - Configures an application credential and helper programs on the control
      node, using the [rebuild](../../ansible/roles/rebuild/README.md) role.
-4. An admin submits Slurm jobs, one for each node, to a special "rebuild"
+5. An admin submits Slurm jobs, one for each node, to a special "rebuild"
    partition using an Ansible playbook. Because this partition has higher
    priority than the partitions normal users can use, these rebuild jobs become
    the next job in the queue for every node (although any jobs currently
    running will complete as normal).
-5. Because these rebuild jobs have the `--reboot` flag set, before launching them
+6. Because these rebuild jobs have the `--reboot` flag set, before launching them
    the Slurm control node runs a [RebootProgram](https://slurm.schedmd.com/slurm.conf.html#OPT_RebootProgram)
    which compares the current image for the node to the one in the cluster
    configuration, and if it does not match, uses OpenStack to rebuild the
    node to the desired (updated) image.
    TODO: Describe the logic if they DO match
-6. After a rebuild, the compute node runs various Ansible tasks during boot,
+7. After a rebuild, the compute node runs various Ansible tasks during boot,
    controlled by the [compute_init](../../ansible/roles/compute_init/README.md)
    role, to fully configure the node again. It retrieves the required cluster
    configuration information from the control node via an NFS mount.
-7. Once the `slurmd` daemon starts on a compute node, the slurm controller
+8. Once the `slurmd` daemon starts on a compute node, the slurm controller
    registers the node as having finished rebooting. It then launches the actual
    job, which does not do anything.
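For orientation, the operator-side sequence in steps 2–4 of the renumbered list above amounts to roughly the shell session below. This is a minimal sketch only: the unlock playbook path is truncated by the hunk boundary and is left as a placeholder, and the `ansible/site.yml` path and working directory are assumptions rather than details taken from this commit.

```sh
# Sketch of steps 2-4; playbook paths are placeholders/assumptions.
ansible-playbook <unlock-playbook>.yml \
    --limit control,login -e "appliances_server_action=unlock"  # unlock control/login for reimaging
tofu apply                          # rebuilds control/login to the new image(s);
                                    # the compute image reference only lands in the hosts inventory
ansible-playbook ansible/site.yml   # re-locks the instances and reconfigures the cluster
```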

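Step 5 submits the rebuild jobs via an Ansible playbook; as a hedged illustration of what those per-node jobs amount to, the loop below assumes the special partition is literally named `rebuild`, that the normal compute partition is named `standard`, and that a do-nothing job body is acceptable (per step 8). None of these names are taken from this commit.

```sh
# Illustrative only: partition names and job options are assumptions.
for node in $(sinfo --Node --noheader --format=%N --partition=standard); do
    sbatch --partition=rebuild --nodelist="$node" --reboot \
           --job-name="rebuild-$node" --wrap "true"   # job itself does nothing (step 8)
done
```

Because the `rebuild` partition has higher priority, these jobs queue ahead of normal user work on each node without interrupting jobs that are already running.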
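The RebootProgram in step 6 is, in effect, a compare-then-rebuild script. The sketch below is not the appliance's actual program: `get_desired_image` is a hypothetical helper standing in for however the cluster configuration exposes the target image, how the node name reaches the script is an assumption, the image-string comparison is simplified, and the behaviour when the images already match is assumed to be a no-op (the document itself still marks that case as TODO).

```sh
#!/usr/bin/env bash
# Sketch of a RebootProgram-style check; NOT the appliance's real script.
node="$1"                                     # node name; how it is passed in is an assumption
current_image=$(openstack server show "$node" -f value -c image)
desired_image=$(get_desired_image "$node")    # hypothetical helper reading the cluster configuration
case "$current_image" in
    *"$desired_image"*)                       # already on the desired image: assumed no-op
        ;;
    *)
        # Reimage the node's root disk to the desired image via OpenStack.
        openstack server rebuild --image "$desired_image" --wait "$node"
        ;;
esac
```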