ML workflows are a PITA because:
- Every model is a special snowflake, and other people’s tools can come with weird version requirements. You want to provide as much isolation between these dependencies as possible, which effectively means running each one on its own OS.
- Many models have external dependencies (e.g. executables that demand specific versions of GCC) that typically live outside the Python ecosystem. The best way to reproducibly maintain / share / scale these dependencies is with a Docker container.
- Tracking model experiments + metrics is annoying, and data management is hard. It’s worth investing in good tooling for this, so you don’t end up having to guess which run produced which numbers.
- DevOps is super annoying, and you want to minimize the effort / maximize the learning on your own machine, so that your environment is as scalable as possible.
- The actual ML itself is also hard 😊
Solution: Dockerize everything.
This has a steep learning curve, but will solve several key pain points:
- Reproducibility: the Dockerfile pins the whole environment, so you (or anyone else) can rebuild it from scratch.
- Model scalability: it’s much easier to deploy a Docker container.
- Learning: many ML folks have weak sysadmin / DevOps skills, and learning this stack will be incredibly helpful to bring you up to speed.
My stack (circa April 2022):
- Docker: every package gets its own Docker container, which runs in isolation.
- docker-compose: I want to manage many packages, which means coordinating a few containers. docker-compose works well for this.
- Other containers:
    - portainer for GUI management of Docker images
    - port forwarding, so I can e.g. use multiple Jupyter notebooks from different containers
    - pachyderm for data versioning and pipelines
    - weights-and-biases for metric tracking and experiment management
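portainer isn’t in the compose file below, so for reference, here’s the standard invocation from portainer’s docs (circa 2022; double-check the image tag and port for your version, as newer releases default to HTTPS on 9443):

```
# Persistent volume for portainer's own state, then the server itself
docker volume create portainer_data
docker run -d -p 9000:9000 --name portainer --restart=always \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v portainer_data:/data \
    portainer/portainer-ce
```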
Project layout looks like this:
- One repo = one container = one folder in /code. Don’t be clever.
- External libraries / dependencies go in /projects/code/repo_name/external/
- Data is separate from code.

```
/projects
    /data
        /some_data_1
            data_1.csv
        /some_data_2
            data_2.csv
    /code
        docker-compose.yml
        /ml_github_repo_1
            Dockerfile
            requirements.txt
        /ml_github_repo_2
            Dockerfile
            requirements.txt
            some_python_file.py
            /external
                /some_annoying_dependency
                    install.sh
```
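If you’re starting a project from scratch, here’s a quick sketch for scaffolding that layout (the folder names are just the placeholders from the tree above; adjust to taste):

```
# Hypothetical scaffold matching the tree above; rename folders for your repos
mkdir -p /projects/data/some_data_1 /projects/data/some_data_2
mkdir -p /projects/code/ml_github_repo_1 /projects/code/ml_github_repo_2/external
touch /projects/code/docker-compose.yml
```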
Quickstart:
- We’re going to optimize for running CUDA workflows inside a docker container. This will save you a bunch of problems with NVIDIA drivers and general compatibility.
- We’re going to use the NVIDIA drivers here, the NVIDIA container toolkit for Docker here, and NVIDIA-distributed containers as your base here.
- I’m assuming you don’t spend a ton of time doing Linux sysadmin work (like me) but also that you know your way around bash (like me).
BUT here be dragons:
- Unfortunately, the NVIDIA drivers you already have installed are your enemy. There’s a pretty narrow path to getting dockerized CUDA to work well, and you’ll have to nuke your existing drivers to make it work. The open-source nouveau drivers are also not helpful, and extremely difficult to remove; we’ll blacklist them to help.

To do this, edit the blacklist config:

```
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
```

and make it look like:
```
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
```
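You’ll also want to rebuild the initramfs so the blacklist takes effect at the next boot (standard on Ubuntu):

```
# Regenerate the initramfs so nouveau stays blacklisted at boot
sudo update-initramfs -u
```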
Update your system as much as possible. This will install a bunch of drivers you don’t want, but we’ll remove them later.

```
sudo apt update
sudo apt full-upgrade --yes
sudo apt autoremove --yes
sudo apt autoclean --yes
reboot
```

Then remove the NVIDIA and CUDA packages the upgrade pulled in:

```
sudo apt-get autoremove
sudo apt-get remove 'nvidia*'
sudo apt-get purge 'cuda*'
reboot
```

Now download the NVIDIA driver, following the instructions here. You usually want the most recent one, which (for Ubuntu 20.04 in April 2022) is version 510.60+. Once you’ve downloaded the package, install it as:
```
sudo sh ./NVIDIA-Linux-x86_64-510.68.02.run
reboot
```

Warning: when you reboot, you may have a wonky display. Your BIOS will need to be convinced to use the right GPU for your monitor. Typically you fix this by entering your BIOS config menu. This is pretty easy: hold down F2 during startup and find the primary display / GPU setting.
Once your monitor situation is sorted, verify your installation with the NVIDIA System Management Interface as:
```
nvidia-smi
```

You should see something like this. If you see a ton of N/As in the first few lines, then something went wrong.
Important values (you’ll need both of these later):
- NVIDIA-SMI: this is the driver version.
- CUDA Version: this is the CUDA version.
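If you’d rather grab these values programmatically, nvidia-smi has a query mode (run nvidia-smi --help-query-gpu for the full field list):

```
# Driver version and GPU name, one line per GPU
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```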
Install Docker + nvidia-docker following instructions here.
```
# Install docker
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

# Set up the NVIDIA container toolkit repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install nvidia-docker2, then restart the daemon so it picks up the NVIDIA runtime
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```

If you’re new to docker, here’s a breakdown: we run an image (which is like a bare-bones OS) to generate a container (which is like a running computer).
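Before involving GPUs at all, it’s worth a plain sanity check that Docker itself works, using Docker’s standard test image:

```
# Pulls and runs the hello-world image; prints a success message if Docker is healthy
sudo docker run --rm hello-world
```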
Now test GPU access from inside a container:

```
sudo docker run --rm --gpus all nvidia/cuda:11.6-base nvidia-smi
```

Breakdown of the command:
- docker run: we’re going to run a container.
- --rm: this container is temporary, and we’ll remove it once we’re done.
- --gpus all: let’s use our GPUs plz.
- nvidia/cuda: this is an NVIDIA-hosted container.
- nvidia/cuda:11.6-base: this container targets CUDA 11.6, which is what I’ve got installed. NOTE: change this value for your own setup! It’s probably backwards-compatible (i.e. a newer CUDA version should mostly work fine with an older container), but better safe than sorry.
- nvidia-smi: this is the actual command we’re running.
You should see something like this. It’s basically the same as the report above, but the running processes are masked because the containerized version of nvidia-smi can’t see outside its container.
Next, check that PyTorch can actually see your GPU. To do this, you want to run an NVIDIA container that already has pytorch installed, with a command like this (see here for details):
```
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:xx.xx-py3 python
```
where:
- docker run --gpus all --rm is as above,
- -it generates an interactive login session,
- nvcr.io is the NVIDIA container registry,
- nvidia/pytorch is the name of the container we’re running,
- xx.xx-py3 is a placeholder for the version, and
- python is the actual command: we’re running python this time.
To make this run, you have to pick the version from here, specifically the Framework Containers Matrix. You need to choose the Container Image version that matches the other drivers:
- CUDA version, e.g. 11.6.0 above
- NVIDIA driver version, e.g. 510.60.02 above
As of April 2022, the matrix looks like this:
In my case, I use Container Image version 22.02, because I’m using NVIDIA CUDA 11.6.0, not 11.6.1.
So the full command FOR ME is:

```
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.02-py3 python
```

This will load me into a python REPL, where I can see if pytorch is happy:

```
import torch
torch.cuda.is_available()
```

This should return True.
If it didn’t work, you can either (1) troubleshoot, or (2) run this whole thing over again. My honest suggestion is to just start over at the top (and maybe read someone else’s blog as a sanity check). There are lots of ways to break this, and diagnosis is finicky!
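For quick re-checks after driver or toolkit changes, the same test works as a non-interactive one-liner (using my image tag from above; substitute yours):

```
# Prints True if the containerized pytorch can reach the GPU
sudo docker run --gpus all --rm nvcr.io/nvidia/pytorch:22.02-py3 \
    python -c "import torch; print(torch.cuda.is_available())"
```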
Every GitHub repo gets its own Dockerfile. We try to solve a bunch of problems at this level:
- Reproducible build. The goal is a container setup that is easy and reproducible. Ideally, the only thing you should have to add to a new repo is a Dockerfile.
- File access. Every container has access to /code and /data, but nothing else.
- File ownership. By default, files generated from a container will have weird ownership, and you won’t be able to change/delete them in your main OS (i.e. the host)! We fix this by making a new user/group in the docker container, following the discussion here and here.
- Python env management. I prefer to set everything up in a Conda env, but I just do it in the base env. I then run some combination of conda install and pip install to actually get the repo up and running.
- Run Jupyter Lab. Each container runs its own jupyter-lab server. This is launched as the last command (CMD jupyter-lab ...), but needs to be modified to enable easy login.
- Caveat: security is intentionally broken here. You can, for instance, log into any jupyter-lab server without a password (or with a dumb password). You can then delete all the data in /data if you want. So this approach is not ideal for shared environments, mission-critical workflows, or network-exposed computers without good firewalls. I probably don’t do as much as I should here.
Here’s an example Dockerfile:

```
FROM nvcr.io/nvidia/pytorch:22.02-py3
### PASSED IN FROM docker-compose AS BUILD ARGS
# IF NOT, THE DEFAULTS HERE ARE USED
ARG USER_ID
ARG GROUP_ID
ARG JUPYTER_PORT
ARG JUPYTER_TOKEN
### Directory of repo in docker container
ARG REPO_DIR=/code/pytorch_test
### Python requirements file
ARG REQUIREMENTS_FILE=requirements.txt
### Name of conda env.
ARG CONDA_ENV=container_env
RUN echo USER_ID ${USER_ID} && \
echo GROUP_ID ${GROUP_ID} && \
echo JUPYTER_PORT ${JUPYTER_PORT} && \
echo JUPYTER_TOKEN ${JUPYTER_TOKEN} && \
echo REPO_DIR ${REPO_DIR} && \
echo REQUIREMENTS_FILE ${REQUIREMENTS_FILE} && \
echo CONDA_ENV ${CONDA_ENV}
# #################### BOILERPLATE ####################################
### CHANGE USER
RUN addgroup --gid $GROUP_ID user
RUN adduser --disabled-password --gecos '' --uid ${USER_ID} --gid ${GROUP_ID} user
USER user
### INSTALL oh-my-bash
WORKDIR ${REPO_DIR}/external/
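# NOTE: the "; exit 0" below forces a zero exit code, so a failure in the
# oh-my-bash installer (which is cosmetic anyway) won't kill the docker build.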
RUN bash -c "$(wget https://raw.githubusercontent.com/ohmybash/oh-my-bash/master/tools/install.sh -O -)"; exit 0
WORKDIR ${REPO_DIR}
### SET UP CONDA ENV
RUN conda create --name ${CONDA_ENV} --clone base
RUN conda env list
## NOTE - REPLACED BY THE CONDA .BASHRC EDIT BELOW
# CHANGE CONDA ENV per https://pythonspeed.com/articles/activate-conda-dockerfile/
# We have to do "echo ${CONDA_ENV}" because each element in the shell JSON is executed by the shell separately.
# If conda >= 4.9
#SHELL ["conda", "run", "--no-capture-output", "-n", "echo $CONDA_ENV", "/bin/bash", "-c"]
# If conda < 4.9
# SHELL ["conda", "run", "-n", "echo $CONDA_ENV", "/bin/bash", "-c"]
#RUN conda env list
### Copy External libs to local
# COPY ./external ${REPO_DIR}/external
### TO INSTALL DEPENDENCIES WRAP THEM IN A CMD LIKE
# RUN cmake <<whatever>> \
# && make \
# && make install
### MAKE CONDA WORK AT THE COMMAND LINE
RUN conda init bash
RUN echo "conda activate container_env" >> ~/.bashrc
RUN conda env list
RUN conda list
### INSTALL REQUIREMENTS
WORKDIR ${REPO_DIR}
COPY ./requirements.txt .
RUN ls -al
# --yes so the build doesn't stall at conda's confirmation prompt
RUN conda install --yes --file requirements.txt
### ACTIVATE JUPYTER_LAB ON ${JUPYTER_PORT}
# COPY JUPYTER-LAB CONFIGS
# We run as "user" (see above), so the config goes in that user's home, not /root
COPY ./jupyter_lab_config.py /home/user/.jupyter/jupyter_lab_config.py
# Build-time echo of the launch command, for debugging
RUN echo jupyter-lab --ip=0.0.0.0 --port ${JUPYTER_PORT} --no-browser --allow-root --ServerApp.token=\${JUPYTER_TOKEN}
# Run jupyter-lab in the foreground; backgrounding it with "nohup ... &" would end
# the shell that CMD starts, and the container would exit immediately. The variables
# expand at runtime from the environment set by docker-compose.
CMD jupyter-lab --ip=0.0.0.0 --port ${JUPYTER_PORT} --no-browser --allow-root --ServerApp.token=${JUPYTER_TOKEN}
# RUN ["echo","jupyter-lab", "--ip=0.0.0.0 --port", "echo ${JUPYTER_PORT}","--no-browser --allow-root --ServerApp.token=","echo ${JUPYTER_TOKEN}"]
# CMD ["jupyter-lab", "--ip=0.0.0.0 --port", "echo ${JUPYTER_PORT}","--no-browser --allow-root --ServerApp.token=","echo ${JUPYTER_TOKEN}"]
```
Here’s an example docker-compose file. I run a few services:
- dns-proxy-server: this allows me to access any service (i.e. repo -> container -> service) as http://repo_name.docker:PORT. This is helpful for making multiple jupyter-lab servers play nicely together.
- pytorch_test: this is the example repo from above, which does the following:
version: "3.2"
services:
dns-proxy-server:
image: defreitas/dns-proxy-server
hostname: dns.mageddo
# name: dns-proxy-server
ports:
- "5380:5380"
volumes:
- type: bind
source: /opt/dns-proxy-server/conf
target: /app/conf
- type: bind
source: /var/run/docker.sock
target: /var/run/docker.sock
- type: bind
source: /etc/resolv.conf
target: /etc/resolv.conf
pytorch_test:
image: nvcr.io/nvidia/pytorch:22.02-py3
build: ./code/pytorch_test
environment:
JUPYTER_PORT: 8082 # CHANGE FOR EVERY REPO - must match ports values below
JUPYTER_TOKEN: ${JUPYTER_TOKEN}
USER_ID: ${USER_ID}
GROUP_ID: ${GROUP_ID}
hostname: pytorch_test.docker
ports:
- "8082:8082" # MUST MATCH JUPYTER_PORT ABOVE
volumes:
- type: bind
source: ./data
target: /data
- type: bind
source: ./code
target: /code
# links:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities:
- gpu
- utility # nvidia-smi
- compute # CUDA. Required to avoid "CUDA version: N/A"
ipc: host
ulimits:
memlock: -1
stack: 67108864
stdin_open: true
tty: true
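To bring the stack up, run docker-compose from the directory holding docker-compose.yml (the relative ./code and ./data paths assume it sits at the project root). The ${...} placeholders above are read from the shell environment:

```
# Feed the ${...} placeholders, then build and start everything detached
export JUPYTER_TOKEN=some_token USER_ID=$(id -u) GROUP_ID=$(id -g)
docker-compose up --build -d
docker-compose logs -f pytorch_test

# With dns-proxy-server up, jupyter-lab should answer at:
curl http://pytorch_test.docker:8082
```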