Advanced Use Cases
After going through the tutorial, you should be familiar and comfortable enough with OpenCHAMI to make changes to the deployment process and configuration. We're going to cover some of the more common use cases that an OpenCHAMI user would want to pursue.
At this point, we can use what we have learned so far in the tutorial to customize our nodes in various ways, such as changing how we serve images, deriving new images, updating our cloud-init config, and running MPI jobs. This section explores some of the use cases you may want to pursue to make OpenCHAMI fit your own needs.
Some of the use cases include:
- Adding SLURM and MPI to the Compute Node
- Serving the Root Filesystem with NFS
- Enabling WireGuard Security for the `cloud-init-server`
- Using Image Layers to Customize Boot Images with a Common Base
- Using `kexec` to Reboot Nodes for a Kernel Upgrade or Specialized Kernel
- Discovering Nodes Dynamically with Redfish
Note
This guide generally assumes that you have completed the tutorial and already have a working OpenCHAMI deployment.
After getting our nodes to boot using our compute images, let's try to install SLURM and run a test MPI job. There are at least two ways to do this:
- Create a new `compute-slurm` image similar to the `compute-debug` image, using the `compute-base` image as a base. You do not have to rebuild the parent images unless you want to make changes to them, but keep in mind that you will then also have to rebuild any derivative images. See the Building Into the Image section for this method.
- Change the cloud-init config to install SLURM and OpenMPI (or any other MPI package of choice) on boot. See the Installing via Cloud-Init section for this method.
One thing to note here: we need to install and start munge and share the munge key before we start SLURM on our nodes. Since we want to protect the key, we will use WireGuard with cloud-init to share it across the compute nodes.
Before we install SLURM, we need to install munge and set up the munge keys across our cluster. Let's download and install the latest release.
curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
tar xJf munge-0.5.16.tar.xz
cd munge-0.5.16
./configure \
--prefix=/usr \
--sysconfdir=/etc \
--localstatedir=/var \
--runstatedir=/run
make
make check
sudo make install
This will install munge on the head node. Then, create a munge key stored in /etc/munge/munge.key as a non-root user.
sudo -u munge /usr/sbin/mungekey --verbose
Then we can enable and start the munge service.
sudo systemctl enable --now munge.service
Warning
The clock must be synced across all of your nodes for munge to work!
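A quick way to verify this on each node (assuming the nodes run systemd with chrony or timesyncd keeping time, as Rocky 9 does by default):

```bash
# Check that the system clock is NTP-synchronized before starting munge
timedatectl status | grep -E 'System clock synchronized|NTP service'
```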
Let's now install the SLURM server using the recommended method for production.
curl -fsSLO https://download.schedmd.com/slurm/slurm-25.05.1.tar.bz2
rpmbuild -ta slurm-25.05.1.tar.bz2
This builds the SLURM RPM packages from the release tarball; install the resulting RPMs (under ~/rpmbuild/RPMS/) with dnf, then enable and start the slurmctld service on the head node (aka the "controller" node) since the munge service is already running.
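Before starting the controller, slurmctld also needs a /etc/slurm/slurm.conf describing the cluster. The snippet below is only a minimal sketch: the cluster name, hostnames (demo-head, compute[1-5]), and node sizes are assumptions for this guide's VMs, so adjust them to your own cluster (or generate a config with SchedMD's online configurator).

```bash
# Minimal slurm.conf sketch -- hostnames and hardware figures are assumptions
cat <<'EOF' | sudo tee /etc/slurm/slurm.conf
ClusterName=demo
SlurmctldHost=demo-head
AuthType=auth/munge
ProctrackType=proctrack/linuxproc
SchedulerType=sched/backfill
SelectType=select/cons_tres
# Compute node and partition definitions -- names and sizes are assumptions
NodeName=compute[1-5] CPUs=1 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[1-5] Default=YES MaxTime=INFINITE State=UP
EOF
```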
systemctl enable --now slurmctld
We need to set up the compute nodes similarly to the head node with munge and SLURM. Like before, we need to do two things:
- Propagate the `/etc/munge/munge.key` created on the head node
- Install SLURM and start the `slurmd` service
As mentioned before, we're going to do this in cloud-init to pass around our secrets securely to the nodes.
We can use the image-builder tool to build a new image with the SLURM and OpenMPI packages directly in the image. Since the new image will be for the compute nodes, we'll base our new image on the compute-base image definition from the tutorial.
You should already have a directory at /opt/workdir/images. Make sure you already have a base compute image with s3cmd ls.
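For example, assuming the boot-images bucket and compute/base/ prefix used in the tutorial:

```bash
# List the compute base images already pushed to the local S3 (MinIO) bucket
s3cmd ls s3://boot-images/compute/base/
```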
# TODO: put the output of `s3cmd ls` here with the compute-base image
Tip
If you do not already have the compute-base image, go back to this step from the tutorial, build the image, and push it to S3. Once you have done that, proceed to the next step.
Now, create a new file at /opt/workdir/images/compute-slurm-rocky9.yaml and copy the contents below into it.
options:
layer_type: 'base'
name: 'compute-slurm'
publish_tags:
- 'rocky9'
pkg_manager: 'dnf'
parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
registry_opts_pull:
- '--tls-verify=false'
# Publish SquashFS image to local S3
publish_s3: 'http://demo.openchami.cluster:9000'
s3_prefix: 'compute/base/'
s3_bucket: 'boot-images'
# Publish OCI image to container registry
#
# This is the only way to be able to re-use this image as
# a parent for another image layer.
publish_registry: 'demo.openchami.cluster:5000/demo'
registry_opts_push:
- '--tls-verify=false'
repos:
- alias: 'Epel9'
url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'
packages:
- slurm
- openmpi
cmds:
# Add 'slurm' and 'munge' users to run 'slurmd' and 'munge' respectively
- cmd: "useradd -mG wheel slurm"
- cmd: "useradd -mG wheel munge"
# Install munge like on head node
- cmd: "curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz"
- cmd: "tar xJf munge-0.5.16.tar.xz"
- cmd: "cd munge-0.5.16"
- cmd: "./configure \
--prefix=/usr \
--sysconfdir=/etc \
--localstatedir=/var \
--runstatedir=/run"
- cmd: "make"
- cmd: "make check"
- cmd: "sudo make install"
Notice the changes in the new image definition: we changed options.name and added the packages and cmds sections. Since we're basing this image on another image, we only need to list the packages we want to add on top of the parent. We can now build the image and push it to S3.
podman run --rm --device /dev/fuse --network host -e S3_ACCESS=admin -e S3_SECRET=admin123 -v /opt/workdir/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml ghcr.io/openchami/image-build:latest image-build --config config.yaml --log-level DEBUG
Wait until the build finishes and check the S3 bucket with `s3cmd ls` again to confirm that the image is there. Then add a new boot script at /opt/workdir/boot/boot-compute-slurm.yaml which we will use to boot our compute nodes.
kernel: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-570.21.1.el9_6.x86_64'
initrd: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/initramfs-5.14.0-570.21.1.el9_6.x86_64.img'
params: 'nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/debug/rocky9.6-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
- 52:54:00:be:ef:01
- 52:54:00:be:ef:02
- 52:54:00:be:ef:03
- 52:54:00:be:ef:04
- 52:54:00:be:ef:05
Set and confirm that the boot parameters have been set correctly.
ochami bss boot params set -f yaml -d @/opt/workdir/boot/boot-compute-slurm.yaml
ochami bss boot params get -F yaml
Alternatively, we can install the necessary SLURM and OpenMPI packages after booting by adding packages to our cloud-init config and running a couple of commands to configure them. This also gives us an opportunity to install and configure munge in one go instead of installing into the image and then setting up with cloud-init.
Let's start by making changes to the cloud-init config file at /opt/workdir/cloud-init/computes.yaml that we used previously. Note that we are using pre-built RPMs to install SLURM and OpenMPI from the Rocky 9 repos.
- name: compute
description: "compute config"
file:
encoding: plain
content: |
## template: jinja
#cloud-config
merge_how:
- name: list
settings: [append]
- name: dict
settings: [no_replace, recurse_list]
users:
- name: root
ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
disable_root: false
packages:
- slurm
- openmpi
runcmd:
- curl -fsSLO https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
- tar xJf munge-0.5.16.tar.xz
- cd munge-0.5.16
- "./configure \
--prefix=/usr \
--sysconfdir=/etc \
--localstatedir=/var \
--runstatedir=/run"
- make
- make check
- sudo make install
We added the packages section to tell cloud-init to install the slurm and openmpi packages after booting the compute node. Then, we install munge just like we did before on the head node.
TODO: add section about sharing the munge key using cloud-init wireguard
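Until that section is written, here is one possible shape for it as a rough sketch: a write_files entry in the compute cloud-config that drops the head node's key into /etc/munge/munge.key, plus a runcmd that starts munge and slurmd afterwards. The base64 content is a placeholder, and this only remains secret if the config is delivered over the WireGuard tunnel described later on this page.

```yaml
# Hypothetical cloud-config fragment -- only safe when served over WireGuard
write_files:
  - path: /etc/munge/munge.key
    owner: munge:munge
    permissions: '0400'
    encoding: b64
    content: <base64 of the head node's munge.key>   # placeholder, replace with real key material
runcmd:
  - systemctl enable --now munge.service
  - systemctl enable --now slurmd.service
```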
Finally, once we have everything set up, we can boot the compute nodes.
sudo virt-install \
--name compute1 \
--memory 4096 \
--vcpus 1 \
--disk none \
--pxe \
--os-variant centos-stream9 \
--network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \
--graphics none \
--console pty,target_type=serial \
--boot network,hd \
--boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
--virt-type kvm
Your compute node should start up with iPXE output. If your node does not boot, check the troubleshooting sections for common issues. Both SLURM and OpenMPI should be installed too, but we don't want to start the services yet since we have not set up munge on the node. Start another compute node and call it compute2 using the MAC address specified below.
sudo virt-install \
--name compute2 \
--memory 4096 \
--vcpus 1 \
--disk none \
--pxe \
--os-variant centos-stream9 \
--network network=openchami-net,model=virtio,mac=52:54:00:be:ef:02 \
--graphics none \
--console pty,target_type=serial \
--boot network,hd \
--boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
--virt-type kvm
After we have installed both SLURM and OpenMPI on the compute nodes, let's try to launch a "hello world" MPI job. To do so, we will need three things:
- Source code for MPI program
- Compiled MPI executable binary
- SLURM job script
We'll write the MPI program in C. First, create a new directory to store our source code. Then, edit the /opt/workdir/apps/mpi/hello/hello.c file.
mkdir -p /opt/workdir/apps/mpi/hello
# edit /opt/workdir/apps/mpi/hello/hello.c
Now copy the contents below into the hello.c file.
/*The Parallel Hello World Program*/
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
int node;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &node);
printf("Hello World from Node %d\n",node);
MPI_Finalize();
}
Compile the program.
cd /opt/workdir/apps/mpi/hello
mpicc hello.c -o hello
You should now have a hello executable in the /opt/workdir/apps/mpi/hello directory. We can use this binary with SLURM to launch processes in parallel.
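Before involving SLURM, you can sanity-check the binary locally with OpenMPI's own launcher, for example with two ranks on the head node:

```bash
# Quick local test: launch two MPI ranks of the hello program
mpirun -np 2 /opt/workdir/apps/mpi/hello/hello
```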
Let's create a job script to launch the executable we just created. Create a new directory to hold our SLURM job script. Then, edit a new file called launch-hello.sh in the new /opt/workdir/jobscripts directory.
mkdir -p /opt/workdir/jobscripts
cd /opt/workdir/jobscripts
# edit launch-hello.sh
Copy the contents below into the launch-hello.sh job script.
Note
The contents of your job script may vary significantly depending on your cluster. Refer to the documentation for your institution and adjust the script accordingly to your needs.
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --account=account_name
#SBATCH --partition=partition_name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:00:30
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
/opt/workdir/apps/mpi/hello/hello
We should now have everything we need to test our MPI job with our compute node(s). Launch the job with the sbatch command.
sbatch /opt/workdir/jobscripts/launch-hello.sh
We can confirm the job is running with the squeue command.
squeue
You should see a list containing a job named hello, the job name given in the launch-hello.sh job script.
# TODO: add output of squeue above
If you saw the output above, you should now be able to inspect the output of the job when it completes.
# TODO: add output of MPI job (should be something like hello.o and/or hello.e)
And that's it! You have successfully launched an MPI job with SLURM on an OpenCHAMI-deployed system.
For the tutorial, we served images via HTTP with a local S3 bucket using MinIO and an OCI registry. We could instead serve our images by network-mounting the directories that hold them with NFS. We can spin up an NFS server on the head node by including the NFS tools in our base image and configuring our nodes to mount the images.
Configure NFS to serve your SquashFS nfsroot with as much performance as possible. First, create a directory to hold the images we want to export (we use /srv/nfs here so it matches the HTTP server configuration below).
sudo mkdir -p /srv/nfs
sudo chown rocky: /srv/nfs
Then, create the /etc/exports file with the following contents to export the /srv/nfs directory for use by our compute nodes.
/srv/nfs *(ro,no_root_squash,no_subtree_check,noatime,async,fsid=0)
Finally, reload the kernel NFS daemon to apply the changes.
sudo modprobe -r nfsd && sudo modprobe nfsd
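From a booted compute node (assuming the nfs-utils client tools are in the image and the head node is reachable at 172.16.0.254, the address used elsewhere in this guide), you can check the export and mount it read-only:

```bash
# List the exports offered by the head node, then mount them read-only as a quick check
showmount -e 172.16.0.254
sudo mount -t nfs -o ro,noatime 172.16.0.254:/srv/nfs /mnt
ls /mnt
```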
We expose our NFS directory over HTTP as well to make it easy to serve boot artifacts.
# nginx.container
[Unit]
Description=Serve /srv/nfs over HTTP
After=network-online.target
Wants=network-online.target
[Container]
ContainerName=nginx
Image=docker.io/library/nginx:1.28-alpine
Volume=/srv/nfs:/usr/share/nginx/html:Z
PublishPort=80:80
[Service]
TimeoutStartSec=0
Restart=always
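This is a Podman Quadlet unit: to activate it, place the file under /etc/containers/systemd/ and let systemd generate and start the service (this assumes a Podman version with Quadlet support, 4.4 or newer):

```bash
# Install the quadlet and start the generated nginx service
sudo cp nginx.container /etc/containers/systemd/nginx.container
sudo systemctl daemon-reload
sudo systemctl start nginx.service
```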
When nodes boot in OpenCHAMI, they make a request out to the cloud-init-server to retrieve a cloud-init config. The request is not encrypted and can be intercepted and modified. The OpenCHAMI cloud-init metadata server includes a feature to bring up a WireGuard tunnel before running cloud-init.
TODO: Add more content on how to do this
[Service]
PassEnvironment=ochami_wg_ip
ExecStartPre=/usr/local/bin/ochami-ci-setup.sh
ExecStopPost=/bin/bash -c "ip link delete wg0"
#!/bin/sh
set -e -o pipefail
# As configured in systemd, we expect to inherit the "ochami_wg_ip" cmdline
# parameter as an env var. Exit if this is not the case.
if [ -z "${ochami_wg_ip}" ];
then
echo "ERROR: Failed to find the 'ochami_wg_url' environment variable."
echo "It should be specified on the kernel cmdline, and will be inherited from there."
if [ -f "/etc/cloud/cloud.cfg.d/ochami.cfg" ];
then
echo "Removing ochami-specific cloud-config; cloud-init will use other defaults"
rm /etc/cloud/cloud.cfg.d/ochami.cfg
else
echo "Not writing ochami-specific cloud-config; cloud-init will use other defaults"
fi
exit 0
fi
echo "Found OpenCHAMI cloud-init URL '${ochami_wg_ip}'"
echo "!!!!Starting pre cloud-init config!!!!"
echo "Loading WireGuard kernel mod"
modprobe wireguard
echo "Generating WireGuard keys"
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
echo "Making Request to configure wireguard tunnel"
PUBLIC_KEY=$(cat /etc/wireguard/public.key)
PAYLOAD="{ \"public_key\": \"${PUBLIC_KEY}\" }"
WG_PAYLOAD=$(curl -s -X POST -d "${PAYLOAD}" http://${ochami_wg_ip}:27777/cloud-init/wg-init)
echo $WG_PAYLOAD | jq
CLIENT_IP=$(echo $WG_PAYLOAD | jq -r '."client-vpn-ip"')
SERVER_IP=$(echo $WG_PAYLOAD | jq -r '."server-ip"' | awk -F'/' '{print $1}')
SERVER_PORT=$(echo $WG_PAYLOAD | jq -r '."server-port"')
SERVER_KEY=$(echo $WG_PAYLOAD | jq -r '."server-public-key"')
echo "Setting up local wireguard interface"
echo "Adding wg0 link"
ip link add dev wg0 type wireguard
echo "Adding ip address ${CLIENT_IP}/32"
ip address add dev wg0 ${CLIENT_IP}/32
echo "Setting the private key"
wg set wg0 private-key /etc/wireguard/private.key
echo "Bringing up the wg0 link"
ip link set wg0 up
echo "Setting up the peer with the server"
wg set wg0 peer ${SERVER_KEY} allowed-ips ${SERVER_IP}/32 endpoint ${ochami_wg_ip}:$SERVER_PORT
rm /etc/wireguard/private.key
rm /etc/wireguard/public.key
copyfiles:
- src: '/opt/workdir/images/files/cloud-init-override.conf'
dest: '/etc/systemd/system/cloud-init.service.d/override.conf'
- src: '/opt/workdir/images/files/ochami-ci-setup.sh'
dest: '/usr/local/bin/ochami-ci-setup.sh'
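The setup script reads ochami_wg_ip from the kernel command line (the kernel passes unrecognized name=value arguments to systemd as environment variables, which is how PassEnvironment=ochami_wg_ip in the override above can hand it to the service). The boot parameters served by BSS therefore need that argument as well. For example, assuming the head node address used elsewhere in this guide:

```yaml
# Excerpt of a BSS boot script: append ochami_wg_ip to the existing kernel parameters.
# The 172.16.0.254 address is an assumption based on the rest of this guide; push the
# updated file with `ochami bss boot params set -f yaml -d @<boot script>` as before.
params: '<existing kernel parameters> ochami_wg_ip=172.16.0.254'
```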
Finally, restart the cloud-init-server with WireGuard enabled so that booting nodes can set up the tunnel.
Often, we want to allocate nodes for different purposes using different images. Let's use the base image that we created before and build another layer on top of it called kubernetes-worker. We would need to modify the boot script to use this new Kubernetes image and update cloud-init to set up the nodes.
kexec-load.sh
#!/usr/bin/env sh
if [ 512000000 -gt $(cat /proc/meminfo | grep -F 'MemTotal' | grep -oE '[0-9]+' | tr -d '\n'; echo 000) ]; then
echo 'Not enough memory to safely load the kernel' >&2
exit 0
fi
if lspci 2>/dev/null | grep -qi '3D controller'; then
echo 'GPUs detected. Not loading kernel to prevent system instability' >&2
exit 0
fi
# Might need to tweak this if the kernel is in a different spot
exec kexec -l "/boot/vmlinuz-$(uname -r)" --initrd="/boot/initramfs-$(uname -r).img" --reuse-cmdline
kexec-update.sh
#!/bin/bash
#set -x
set -e
# This whole script is a bit heavy on the heuristics.
# It would be much better to patch BSS to do JSON output.
# This gets the MAC address of the first interface with an IP address
MAC="$(ip addr | grep -A10 'state UP' | grep -oP -m1 '(?<=link/ether )[a-f0-9:]+')"
# This gets the bss IP address from the kernel commandline
BSS_IP="$(grep -oP '(?<=bss=)[^:/ ]+' /proc/cmdline | tail -n1)"
# When I use the NID it just returns a script that chains into the MAC address one
echo 'Getting boot script...'
BOOT_SCRIPT="$(curl -s "http://$BSS_IP:8081/apis/bss/boot/v1/bootscript?mac=$MAC&json=1")"
if [ -z "$BOOT_SCRIPT" ]; then
echo 'Empty boot script! Aborting...'
exit 1
fi
INITRD="$(echo $BOOT_SCRIPT | jq -r .initrd.path)"
if [ -z "$INITRD" ]; then
echo 'No initrd URL. Aborting...'
exit 2
fi
KERNEL="$(echo $BOOT_SCRIPT | jq -r .kernel.path)"
if [ -z "$KERNEL" ]; then
echo 'No kernel URL. Aborting...'
exit 3
fi
PARAMS="$(echo $BOOT_SCRIPT | jq -r .params)"
if [ -z "$PARAMS" ]; then
echo 'No kernel params. Aborting...'
exit 2
fi
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
echo 'Getting kernel...'
curl -so "$TMP/kernel" "$KERNEL"
echo 'Getting initrd...'
curl -so "$TMP/initrd" "$INITRD"
kexec -l "$TMP/kernel" --initrd "$TMP/initrd" --command-line "$PARAMS"
echo 'All done!'
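Both scripts only stage the new kernel and initrd; nothing changes until you tell the node to jump into them. One way to follow up after running kexec-update.sh:

```bash
# Switch into the staged kernel. `systemctl kexec` shuts services down cleanly first;
# `kexec -e` jumps immediately without a clean shutdown.
sudo systemctl kexec
# ...or, for an immediate switch:
# sudo kexec -e
```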
In the tutorial, we used static discovery to populate our inventory in SMD instead of dynamically discovering nodes on our network. Static discovery works well when we know the MAC address, IP address, xname, and/or node ID of our nodes beforehand, and it guarantees deterministic behavior. However, sometimes we might not know these properties, or we may want to check the current state of our hardware, say after a failure. In these scenarios, we can probe our hardware dynamically using magellan's scanning feature and then update the state of our inventory.
For this demonstration, we have two prerequisites before we get started:
- Emulate board management controllers (BMCs) with running Redfish services
- Have a running instance of SMD or a full running deployment of the OpenCHAMI services
The magellan repository includes an emulator that we can use for quick and dirty testing. This is useful if we want to try out the capabilities of the tool without having to put too much time and effort into setting up an environment. However, we want to use multiple BMCs to show how magellan can distinguish between Redfish and non-Redfish services.
TODO: Add content setting up multiple emulated BMCs with Redfish services (the quickstart in the deployment-recipes has this already).
A scan sends out requests to all devices on a network specified with the --subnet flag. If the device responds, it is added to a cache database that we'll need for the next section.
Let's do a scan and see what we can find on our network. We should be able to find all of our emulated BMCs without having to worry too much about any other services.
magellan scan --subnet 172.16.0.100/24 --cache ./assets.db
This command should not produce any output if it runs successfully. By default, the cache is stored in /tmp/$USER/magellan/assets.db as a tiny SQLite 3 database; here, we stored the cache locally with the --cache flag.
We can see what BMCs with Redfish were found with the list command.
magellan list
You should see the emulated BMCs.
# TODO: add list of emulated BMCs from `magellan list` output
Now that we know the IP addresses of the BMCs, let's collect inventory data using the collect command.
We can use the cache to pull in inventory data from the BMCs. If the BMCs require a username and password, we can set them using the secrets store before we run collect.
TEMP_KEY=$(magellan secrets generatekey) # ...or whatever you want to use for your key
export MASTER_KEY=$TEMP_KEY
magellan secrets store default $default_bmc_username:$default_bmc_password
This stores a default BMC username and password to use for all BMC nodes that do not have credentials specified. If we want to add credentials for a specific BMC, we just need to change default to the host.
magellan secrets store https://172.16.0.101 $bmc01_username:$bmc01_password
The credentials will be used automatically when collect or crawl are run. Additionally, when running collect we have to add the -v flag to see the output and -o to save it to a file.
magellan collect -v -F yaml -o nodes.yaml
There should be a nodes.yaml file in the current directory. The file can be edited to use different values before uploading to SMD. Once done editing, send it off with the send command.
magellan send -F yaml -d @nodes.yaml https://demo.openchami.cluster:8443
This will store the inventory data in SMD, as before, using the information found during the scan.
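To double-check that the components landed, you can query SMD's inventory API directly. The path below assumes SMD's HSM-compatible v2 API and that reads are allowed without a token on your deployment; add an Authorization header with your access token if they are not:

```bash
# List the components SMD now knows about (path is an assumption based on SMD's HSM v2 API)
curl -sk https://demo.openchami.cluster:8443/hsm/v2/State/Components | jq
```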