diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index aecb8e7d..3ed4fe1e 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -16,6 +16,7 @@ CWP
 CXI
 Ceph
 Containerfile
+DCGM
 DNS
 Dockerfiles
 Dufourspitze
@@ -94,6 +95,7 @@ Piz
 Plesset
 Podladchikov
 Pulay
+PyPi
 RCCL
 RDMA
 ROCm
@@ -169,7 +171,9 @@ gpu
 gromos
 groundstate
 gsl
+gssr
 hdf
+heatmaps
 hotmail
 huggingface
 hwloc
@@ -206,6 +210,7 @@ mkl
 mpi
 mps
 multitenancy
+mycontainer
 nanoscale
 nanotron
 nccl
diff --git a/docs/images/gssr/heatmap_eg.png b/docs/images/gssr/heatmap_eg.png
new file mode 100644
index 00000000..dfd674ea
Binary files /dev/null and b/docs/images/gssr/heatmap_eg.png differ
diff --git a/docs/images/gssr/timeseries_eg.png b/docs/images/gssr/timeseries_eg.png
new file mode 100644
index 00000000..37346f19
Binary files /dev/null and b/docs/images/gssr/timeseries_eg.png differ
diff --git a/docs/software/devtools/gssr/containers.md b/docs/software/devtools/gssr/containers.md
new file mode 100644
index 00000000..993791f6
--- /dev/null
+++ b/docs/software/devtools/gssr/containers.md
@@ -0,0 +1,121 @@
+[](){#ref-gssr-containers}
+# gssr - Containers Guide
+
+This guide explains how to install and use `gssr` within a container.
+
+Most CSCS users build on the base containers from NVIDIA that ship with CUDA pre-installed.
+As such, in the following documentation, we will use a PyTorch base container as an example.
+
+## Preparing a container with `gssr`
+
+### Base Container from NVIDIA
+
+The most commonly used NVIDIA container on Alps is [NVIDIA's PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The latest version is typically preferred for the most up-to-date PyTorch functionality.
+
+#### Example: Preparing an NVIDIA PyTorch Containerfile
+```dockerfile
+FROM --platform=linux/arm64 nvcr.io/nvidia/pytorch:25.08-py3
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update \
+    && apt-get install -y wget rsync rclone vim git htop nvtop nano \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+# Installing gssr
+RUN pip install gssr
+
+# Install your application and dependencies as required
+...
+```
+As the example above shows, `gssr` can be installed with a single `RUN pip install gssr` command.
+
+Once your `Containerfile` is ready, you can build it on any Alps platform with the following commands to create a container labelled `mycontainer`.
+
+```bash
+srun -A {groupID} --pty bash
+# Once you have an interactive session, use the podman command to build the
+# container
+# -v mounts the fast storage on Alps into the container.
+podman build -v $SCRATCH:$SCRATCH -t mycontainer:0.1 .
+# Export the container from podman's cache to a local squashFS file with
+# enroot
+enroot import -x mount -o mycontainer.sqsh podman://local:mycontainer:0.1
+```
+
+You should now have a squashFS file of your container. You can replace the `mycontainer` label with any other label of your choice. The version `0.1` can also be omitted or replaced with another version as required.
+
+## Create CSCS configuration for Container
+
+The next step is to tell the CSCS container engine where your container is and how you would like to run it. To do so, create a `{label}.toml` file in your `$HOME/.edf` directory.
+
+### Example of a `mycontainer.toml` file
+```toml
+image = "/capstor/scratch/cscs/username/{yourDir}/mycontainer.sqsh"
+mounts = ["/capstor/scratch/cscs/username:/capstor/scratch/cscs/username"]
+workdir = "/capstor/scratch/cscs/username"
+writable = true
+
+[annotations]
+com.hooks.dcgm.enabled = "true"
+```
+
+Please note that the `mounts` line is required if you want `$SCRATCH` to be available in your container. You can also mount a specific directory or file in `$HOME` and/or `$SCRATCH` as required. Modify the username and the image directory to match your setup.
+
+To use `gssr` in a container, you need the `dcgm` hook, configured in the `[annotations]` section, which makes the DCGM libraries available within the container.
+
+### Run the application and container with gssr
+
+To invoke `gssr`, add the following to your sbatch file.
+
+#### Example of a mycontainer.sbatch file
+```bash
+#!/bin/bash
+#SBATCH -N4
+#SBATCH -A groupname
+#SBATCH -J mycontainer
+#SBATCH -t 1:00:00
+#SBATCH ...
+
+srun --environment=mycontainer bash -c 'gssr profile --wrap="python abc.py"'
+
+```
+
+Please replace `...` with any other SBATCH configuration that your job requires.
+The `--environment` flag tells Slurm which container (the name of the toml file) you would like to run.
+The `bash -c` wrapper is required to initialise the bash environment within your container.
+
+Without `gssr`, the `srun` command for your container would look like this:
+
+```bash
+srun --environment=mycontainer bash -c 'python abc.py'
+```
+
+Now you are ready to submit your sbatch file to Slurm with the `sbatch` command.
+
+## Analyze the output
+
+Once your job has concluded successfully, you should find a folder named `profile_out_{slurm_jobid}` containing the `gssr` JSON outputs.
+
+You can analyze the outputs interactively within any container where `gssr` is installed, e.g. the `mycontainer` container from this guide.
+
+To get an interactive session of this container:
+
+```bash
+srun -A groupname --environment=mycontainer --pty bash
+cd {directory where the gssr output data is generated}
+```
+Alternatively, you can install `gssr` locally, copy the `profile_out_{slurm_jobid}` folder to your computer, and visualize it there.
+
+#### Metric Output
+The profiled output can be analysed as follows:
+
+    gssr analyze -i ./profile_out
+
+#### PDF File Output with Plots
+
+    gssr analyze -i ./profile_out --report
+
+At least one PDF report will be generated.
+
diff --git a/docs/software/devtools/gssr/index.md b/docs/software/devtools/gssr/index.md
new file mode 100644
index 00000000..f7f86846
--- /dev/null
+++ b/docs/software/devtools/gssr/index.md
@@ -0,0 +1,14 @@
+[](){#ref-gssr-overview}
+# gssr
+
+GPU Saturation Scorer (gssr) provides a simple way to profile your code and get the results as both tables and plots for easy visualisation. gssr works on top of [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm), so only NVIDIA GPUs are currently supported.
+
+The following guides are available:
+
+* [Quickstart Guide][ref-gssr-quickstart]
+* [Container Guide][ref-gssr-containers]
+
+This tool produces time-series plots and heatmaps of the profiled metric values. Here is an example set of plots generated by the tool for the Megatron-LLM application from EPFL.
+
+![gssr timeseries](../../../images/gssr/timeseries_eg.png)
+![gssr heatmap](../../../images/gssr/heatmap_eg.png)
diff --git a/docs/software/devtools/gssr/quickstart.md b/docs/software/devtools/gssr/quickstart.md
new file mode 100644
index 00000000..655eff93
--- /dev/null
+++ b/docs/software/devtools/gssr/quickstart.md
@@ -0,0 +1,54 @@
+[](){#ref-gssr-quickstart}
+# gssr - Quickstart Guide
+
+## Installation
+
+### From PyPi
+
+`gssr` can be installed as follows:
+
+    pip install gssr
+
+### From GitHub Source
+
+To install directly from the source:
+
+    pip install git+https://github.com/eth-cscs/GPU-Saturation-Scorer.git
+
+To install from a specific branch of the source, e.g. the development branch:
+
+    pip install git+https://github.com/eth-cscs/GPU-Saturation-Scorer.git@dev
+
+To install a specific release tag from the source, e.g. gssr-v0.3:
+
+    pip install git+https://github.com/eth-cscs/GPU-Saturation-Scorer.git@gssr-v0.3
+
+## Profile
+
+### Example
+
+If you are submitting a batch job and the command you are executing is:
+
+    srun python abc.py
+
+then the srun command should be modified as follows:
+
+    srun gssr profile --wrap="python abc.py"
+
+* The `gssr` subcommand to run is `profile`
+* The `--wrap` flag wraps the command that you would like to run
+* The default output directory is `profile_out_{slurm_job_id}`
+* A label for the output data can be set with the `-l` flag
+
+## Analyze
+
+### Metric Output
+The profiled output can be analysed as follows:
+
+    gssr analyze -i ./profile_out
+
+### PDF File Output with Plots
+
+    gssr analyze -i ./profile_out --report
+
+One or more PDF reports will be generated.
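Note on the quickstart above: the profile and analyze steps can be sketched as a small dry-run script. This is a minimal sketch, not part of the patch. `wrap_with_gssr` is a hypothetical helper, the job ID `123456` is a placeholder, and passing the default `profile_out_{slurm_job_id}` directory to `gssr analyze -i` is an assumption based on the quickstart text. The script only prints the commands and does not call Slurm or gssr.

```shell
#!/bin/bash
# Hypothetical helper (not part of gssr): print the profiled form of a command,
# following the `gssr profile --wrap="..."` pattern from the quickstart.
wrap_with_gssr() {
    printf 'gssr profile --wrap="%s"' "$*"
}

app_cmd="python abc.py"
wrapped="$(wrap_with_gssr "$app_cmd")"

# Assumption: outside a Slurm job SLURM_JOB_ID is unset, so a placeholder is used.
job_id="${SLURM_JOB_ID:-123456}"
out_dir="profile_out_${job_id}"

# Dry run: only print the commands; nothing is submitted or profiled here.
echo "srun ${wrapped}"
echo "gssr analyze -i ./${out_dir} --report"
```

Inside an sbatch script, the printed `srun` line is what would replace the original `srun python abc.py`.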
diff --git a/docs/software/devtools/index.md b/docs/software/devtools/index.md index 7d8f3177..061eacdb 100644 --- a/docs/software/devtools/index.md +++ b/docs/software/devtools/index.md @@ -31,3 +31,4 @@ In this section we introduce the various performance analysis solutions availabl * [NVIDIA Nsight Developer Tools][ref-devtools-nsight] * [Linaro Forge MAP][ref-devtools-map] * [VI-HPS Tools][ref-devtools-vihps] +* [gssr][ref-gssr-overview] diff --git a/mkdocs.yml b/mkdocs.yml index b52c63c9..b34ae606 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -118,6 +118,10 @@ nav: - 'Linaro debugger': software/devtools/linaro-ddt.md - 'Using Score-P/Scalasca': software/devtools/vihps.md - 'Job report': running/jobreport.md + - 'GPU Saturation Scorer (gssr)': + - software/devtools/gssr/index.md + - 'Quickstart Guide': software/devtools/gssr/quickstart.md + - 'Container Guide': software/devtools/gssr/containers.md - 'Data Management and Storage': - storage/index.md - 'File Systems': storage/filesystems.md
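
Note on the container guide in this patch: the `{label}.toml` file it describes can also be generated from a script. This is a minimal sketch under stated assumptions. `write_edf` is a hypothetical helper that is not part of gssr or the CSCS tooling, the paths are the placeholder values from the guide, and the demo writes into a temporary directory rather than `$HOME/.edf`.

```shell
#!/bin/bash
# Hypothetical helper (not part of gssr or the CSCS tooling): write an EDF file
# with the same keys as the mycontainer.toml example in the container guide.
write_edf() {
    local edf_dir="$1" label="$2" image="$3" scratch="$4"
    mkdir -p "$edf_dir"
    cat > "${edf_dir}/${label}.toml" <<EOF
image = "${image}"
mounts = ["${scratch}:${scratch}"]
workdir = "${scratch}"
writable = true

[annotations]
com.hooks.dcgm.enabled = "true"
EOF
}

# Demo: write into a temporary directory so no real configuration is touched.
demo_dir="$(mktemp -d)"
write_edf "$demo_dir" mycontainer \
    "/capstor/scratch/cscs/username/{yourDir}/mycontainer.sqsh" \
    "/capstor/scratch/cscs/username"
cat "${demo_dir}/mycontainer.toml"
```

To create the real file, the first argument would be `"$HOME/.edf"` and the paths would be replaced with your own username and image directory.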