Welcome to the Schmidt-AI repository! This project is a comprehensive collection of examples for machine learning and data processing on Nautilus. Within this repository, you will find several directories, each containing well-documented use cases that demonstrate practical implementations of various machine learning methods and data processing techniques.
These examples are designed to guide you through the setup, execution, and customization of workflows. Whether you're a beginner looking to understand the basics or an experienced developer aiming to optimize your projects, you'll find valuable resources here to enhance your skills and accelerate your development process.
Below are the steps to run mnist-pytorch using different environments. `main.py` was taken from the official PyTorch examples and adapted for this tutorial.
Note that you will need to `cd` into `mnist-pytorch` for the rest of this README (except for the Kubernetes section):

```
cd mnist-pytorch
```
To further modify the environment (if needed), create a `.env` file with the following contents:

```
S3_ENDPOINT=https://s3-west.nrp-nautilus.io
```
Note that the MNIST dataset has already been uploaded to a public NRP S3 bucket, so the rest of these steps are not needed and are included only for completeness.
Download the MNIST dataset locally:

```
python3 download-mnist.py
```
Upload the dataset to the schmidt-ai bucket and make it public:

```
aws s3 cp ./data/MNIST/raw s3://schmidt-ai/mnist/ --recursive \
  --profile nrp --endpoint-url https://s3-west.nrp-nautilus.io \
  --acl public-read
```
This section explains how to use MLflow for experiment tracking hosted on NDP. Before running, make sure the required environment variables are set; without them MLflow won't track experiments, though training will still run successfully.
To enable MLflow tracking on NDP, export the following variables:

```
MLFLOW_TRACKING_URI=https://nationaldataplatform.org/mlflow
MLFLOW_TRACKING_USERNAME=<your-ndp-username>
MLFLOW_TRACKING_PASSWORD=<your-ndp-password>
```
Alternatively, add these lines to your .env file.
With the environment variables set, run your training script as usual; MLflow will log parameters and metrics to the configured tracking server. After running your script, verify the experiment and run details in the MLflow UI at https://nationaldataplatform.org/mlflow. If these variables are missing or misconfigured, MLflow tracking will not be activated.
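Since tracking silently stays off when the variables are missing, a small sanity check before training can save a wasted run. This guard is an illustrative sketch, not code from `main.py`:

```python
import os

REQUIRED_MLFLOW_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)

def mlflow_enabled(environ=None):
    """True only when all three tracking variables are set and non-empty."""
    environ = os.environ if environ is None else environ
    return all(environ.get(name) for name in REQUIRED_MLFLOW_VARS)

# A training script might warn early instead of silently skipping tracking:
if not mlflow_enabled():
    print("MLflow tracking disabled: set the MLFLOW_TRACKING_* variables.")
```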
1. Set Up a Virtual Environment:

   With Python's built-in venv:

   ```
   python3 -m venv env
   source env/bin/activate   # On Windows use: env\Scripts\activate
   ```

2. Install Dependencies:

   Install PyTorch, torchvision, and any other required packages:

   ```
   pip3 install -r requirements.txt
   ```

3. Run the Application:

   Once dependencies are installed, run:

   ```
   python main.py
   ```
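`main.py` is adapted from the official PyTorch MNIST example, which exposes its hyperparameters as command-line flags; run `python main.py --help` for the actual set in this repo. The upstream example's flags look roughly like:

```python
import argparse

# Flags as in the upstream PyTorch MNIST example (defaults shown there);
# the adapted main.py in this repo may add or change some.
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--epochs", type=int, default=14)
parser.add_argument("--lr", type=float, default=1.0)
parser.add_argument("--seed", type=int, default=1)
parser.add_argument("--save-model", action="store_true")

# e.g. a quick one-epoch smoke run: python main.py --epochs 1
args = parser.parse_args(["--epochs", "1"])
```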
4. Deleting a Pip Virtual Environment:

   Deactivate the environment (if active):

   ```
   deactivate
   ```

   Remove the virtual environment directory:

   ```
   # On macOS/Linux:
   rm -rf env
   # On Windows:
   rmdir /s /q env
   ```

   Ensure you're not inside the virtual environment directory when deleting it.
- Build the Docker Image:

  ```
  docker build -t mnist-pytorch .
  ```

- Run the Docker Container:

  ```
  docker run -it --rm mnist-pytorch
  ```

- Run the Docker Container with the .env File:

  Use the `--env-file` option to load the environment variables from the `.env` file when running the container:

  ```
  docker run -it --rm --env-file .env mnist-pytorch
  ```
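For reference, a typical Dockerfile for this kind of setup looks like the sketch below; the repository's actual Dockerfile may differ (base image, pins, entrypoint):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```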
## Using Kubernetes (NRP)
For Kubernetes, there are already prebuilt images in NRP's GitLab registry to use with the job manifests.
Why Use Kubernetes for ML Training Jobs on NRP:

- Scalability: Kubernetes allows you to easily scale your training jobs by dynamically adjusting the number of pods based on workload, ensuring efficient resource usage.
- Resource Management: It provides robust scheduling and resource allocation, making it simple to manage GPU and CPU resources for demanding machine learning tasks.
- Fault Tolerance: Automatic restarts and pod health checks help maintain job reliability, minimizing downtime and ensuring training jobs complete successfully.
- Environment Consistency: Containers guarantee that your ML environments remain consistent across different stages, reducing configuration errors and streamlining deployments.
With your .env file ready, create a Kubernetes secret:

```
kubectl create secret generic mnist-pytorch --from-env-file=.env
```

This command packages your environment variables into a secret named `mnist-pytorch`.
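The job manifests in `kubernetes/mnist-pytorch/` presumably consume this secret via `envFrom`; schematically it looks like the fragment below (field values are illustrative, not copied from the actual manifest):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-pytorch-job
spec:
  ttlSecondsAfterFinished: 600      # self-cleanup 10 min after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mnist-pytorch
          image: <nrp-gitlab-registry-image>   # prebuilt image from NRP's GitLab
          envFrom:
            - secretRef:
                name: mnist-pytorch            # the secret created above
```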
Note that if you are using the S3 bucket for the MNIST dataset, the S3 endpoint must be changed to the in-cluster endpoint:

```
S3_ENDPOINT=http://rook-ceph-rgw-nautiluss3.rook
```
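The training code could also pick the endpoint itself; below is a hypothetical helper (not in this repo) that uses the standard `KUBERNETES_SERVICE_HOST` variable to detect in-cluster execution, whereas the tutorial instead just swaps `S3_ENDPOINT` in the secret:

```python
import os

PUBLIC_S3 = "https://s3-west.nrp-nautilus.io"
IN_CLUSTER_S3 = "http://rook-ceph-rgw-nautiluss3.rook"

def pick_s3_endpoint(environ=None):
    """Prefer an explicit S3_ENDPOINT; otherwise use the in-cluster Ceph RGW
    endpoint when running in a pod, else the public one."""
    environ = os.environ if environ is None else environ
    if environ.get("S3_ENDPOINT"):
        return environ["S3_ENDPOINT"]
    if environ.get("KUBERNETES_SERVICE_HOST"):
        return IN_CLUSTER_S3
    return PUBLIC_S3
```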
For MLflow tracking, append the following variables to the .env:

```
MLFLOW_TRACKING_URI=https://nationaldataplatform.org/mlflow
MLFLOW_TRACKING_USERNAME=<your-ndp-username>
MLFLOW_TRACKING_PASSWORD=<your-ndp-password>
```
Create a job using data stored in an S3 bucket:

```
kubectl create -f kubernetes/mnist-pytorch/job-s3.yaml
```
Create a job using data stored in a PVC:

```
kubectl create -f kubernetes/mnist-pytorch/pvc.yaml
kubectl create -f kubernetes/mnist-pytorch/job-s3.yaml
```

Note that the PVC only needs to be created once.
View your pod statuses by executing:

```
kubectl get pods
```

For detailed logs on a specific pod, use:

```
kubectl logs <pod-name>
```
To remove the job (and its associated pods), run:

```
kubectl delete job mnist-pytorch-job
```

Note that this step is optional: the job cleans itself up automatically 10 minutes after all of its pods have completed.