This tutorial will use Parsl, a parallel programming library for Python, to:
- Request GPU nodes from the fe.ai.cs.uchicago.edu cluster
- Demonstrate launching a PyTorch-based MNIST training example in various GPU configurations
- Show checkpoint and restart functionality.
The fe.ai.cs.uchicago.edu cluster comes with a conda installation already available. We'll use it to install the requirements for the demo.
$ conda create --yes --name parsl_py3.7 python=3.7
$ conda activate parsl_py3.7
$ pip install parsl==1.0.0
$ conda install -c pytorch pytorch torchvision
Let's sanity-check the environment with a basic MNIST application:
# This should print a sequence of training results; it will be slow running on the login node
(parsl_py3.7) $ python3 torch_mnist.py --epochs=1
# Let's confirm parsl is installed. This should print: Parsl version: 1.0.0
(parsl_py3.7) $ python3 -c "import parsl; print(f'Parsl version: {parsl.__version__}')"
A sample configuration file config.py contains a config object that requests nodes
from the Slurm scheduler and launches one manager+worker pair for each available GPU.
Note: The SrunLauncher.overrides feature is used here to make the launcher start more manager+worker groups per node than usual. Normally one manager manages the whole node, but in this situation we want each manager and its child processes (workers) to be bound to a GPU.
Please tune the config via these variables (a sketch of such a config follows the block below):
# Configure options here:
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 1
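For reference, here is a minimal sketch of what such a config.py can look like. The executor label, partition name, walltime, and srun flags are assumptions and will likely differ from the actual config.py in this tutorial:

```python
# config.py -- a minimal sketch of the kind of configuration described above, NOT the
# exact file shipped with this tutorial. The executor label, partition name, walltime,
# and srun flags are assumptions; adjust them for your allocation on fe.ai.cs.uchicago.edu.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

# Configure options here:
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 1

config = Config(
    executors=[
        HighThroughputExecutor(
            label='fe.ai_gpu',          # assumed label
            max_workers=1,              # 1 worker per manager; each manager is pinned to its GPU(s)
            provider=SlurmProvider(
                partition='general',    # assumed partition name
                nodes_per_block=NODES_PER_JOB,
                init_blocks=1,
                max_blocks=1,
                walltime='00:30:00',
                # The "hack": ask srun to start one manager per GPU instead of one per
                # node, handing each manager GPUS_PER_WORKER GPUs. Exact flags depend
                # on the cluster's Slurm GPU setup.
                launcher=SrunLauncher(
                    overrides='--ntasks-per-node={} --gpus-per-task={}'.format(
                        GPUS_PER_NODE // GPUS_PER_WORKER, GPUS_PER_WORKER)
                ),
                worker_init='conda activate parsl_py3.7',
            ),
        )
    ],
)
```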
The basic_grid_search.py example shows you the following (a sketch is given after the list):
- Running a very simple `python_app` called `platinfo` that returns the nodename and CUDA information
- `run_mnist` takes a range of batch sizes and epochs, and launches the `torch_mnist` application
- The `torch_mnist` application is a `bash_app` that invokes the `torch_mnist.py` example from the PyTorch examples on the command line, on each worker, which is bound to 1 GPU on the cluster nodes.
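A condensed sketch of that structure (the parameter ranges, script path handling, and app bodies are assumptions, not the repo's exact code):

```python
# A condensed sketch of the pattern in basic_grid_search.py, not the repo's exact code.
import os
import parsl
from parsl import bash_app, python_app

from config import config

MNIST_SCRIPT = os.path.abspath('torch_mnist.py')


@python_app
def platinfo():
    # Report which node we landed on and which GPU(s) the worker can see.
    import os
    import platform
    return platform.node(), os.environ.get('CUDA_VISIBLE_DEVICES')


@bash_app
def torch_mnist(batch_size, epochs, script=MNIST_SCRIPT,
                stdout=parsl.AUTO_LOGNAME, stderr=parsl.AUTO_LOGNAME):
    # Each worker is bound to one GPU, so a plain invocation uses that GPU.
    return 'python3 {} --batch-size={} --epochs={}'.format(script, batch_size, epochs)


def run_mnist(batch_sizes, epoch_counts):
    # Launch one training task per (batch_size, epochs) combination, then wait for all.
    futures = [torch_mnist(b, e) for b in batch_sizes for e in epoch_counts]
    return [f.result() for f in futures]


if __name__ == '__main__':
    parsl.load(config)
    print(platinfo().result())
    run_mnist(batch_sizes=[32, 64, 128], epoch_counts=[1, 2])
```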
(parsl_py3.7) $ python3 basic_grid_search.py
The checkpoint_test.py example, along with torch_mnist_checkpointed.py, shows how to run PyTorch applications with checkpoint and restart functionality.
The key updates to torch_mnist_checkpointed.py are:
- Updated `train_model` method that takes proper Python params rather than `argparse.args`
- `checkpoint_period` kwarg option that specifies how many minutes apart checkpoint events should be triggered
- `checkpoint_input` and `checkpoint_output` paths that define paths from/to which checkpoints should be read/written
- Minor code blocks that load and write checkpoints (sketched below)
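A sketch of what that checkpoint load/save logic can look like in PyTorch; the signature follows the bullets above, but the body is illustrative rather than the repo's exact code:

```python
# Illustrative checkpoint load/save logic; not the exact torch_mnist_checkpointed.py code.
import os
import time

import torch
import torch.nn.functional as F


def train_model(model, optimizer, train_loader, epochs=1,
                checkpoint_period=5,              # minutes between checkpoint events
                checkpoint_input=None,            # path to resume from, if it exists
                checkpoint_output='mnist.ckpt'):  # path checkpoints are written to
    start_epoch = 0
    if checkpoint_input and os.path.exists(checkpoint_input):
        # Restart: restore model/optimizer state and the epoch we had reached.
        state = torch.load(checkpoint_input)
        model.load_state_dict(state['model'])
        optimizer.load_state_dict(state['optimizer'])
        start_epoch = state['epoch']

    last_checkpoint = time.time()
    for epoch in range(start_epoch, epochs):
        for data, target in train_loader:
            optimizer.zero_grad()
            loss = F.nll_loss(model(data), target)
            loss.backward()
            optimizer.step()

            # Every checkpoint_period minutes, persist enough state to resume later.
            if time.time() - last_checkpoint > checkpoint_period * 60:
                torch.save({'model': model.state_dict(),
                            'optimizer': optimizer.state_dict(),
                            'epoch': epoch},
                           checkpoint_output)
                last_checkpoint = time.time()
```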
checkpoint_test.py uses a python_app that explicitly adds the current directory to the module path,
so that the methods in the torch_mnist_checkpointed module can be imported on the compute nodes, which do not share the Python environment of the login node. The test sets a low walltime so that the workers and their MNIST training tasks are terminated, simulating a failure due to node loss. The test also sets config.retries=3 so that the application is rerun, and with checkpoint restart support very little compute is lost.
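A sketch of this pattern, assuming a hypothetical run_training entry point and checkpoint file name in torch_mnist_checkpointed:

```python
# A sketch of the checkpoint_test.py pattern described above. The run_training entry
# point and the checkpoint file name are hypothetical; the real module's API may differ.
import os

import parsl
from parsl import python_app

from config import config

config.retries = 3   # rerun tasks killed when the deliberately short walltime expires


@python_app
def mnist_checkpointed(rundir):
    # Compute nodes do not share the login node's Python path, so add the run
    # directory explicitly before importing the checkpointed training module.
    import os
    import sys
    sys.path.insert(0, rundir)
    import torch_mnist_checkpointed as tmc

    ckpt = os.path.join(rundir, 'mnist.ckpt')   # assumed checkpoint location
    return tmc.run_training(epochs=4,           # hypothetical entry point
                            checkpoint_period=1,
                            checkpoint_input=ckpt,
                            checkpoint_output=ckpt)


if __name__ == '__main__':
    parsl.load(config)
    print(mnist_checkpointed(os.getcwd()).result())
```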
(parsl_py3.7) $ python3 checkpoint_test.py