
# MLOps Pipeline with Kubeflow, DVC, and GitHub Actions

## Project Overview

This project demonstrates a complete Machine Learning Operations (MLOps) pipeline for predicting Boston Housing prices using a Random Forest Regressor. The pipeline showcases industry best practices for ML workflow automation, version control, and continuous integration.

## ML Problem

- **Task**: Regression, predicting median house values in Boston
- **Model**: Random Forest Regressor (100 estimators)
- **Evaluation**: R² score
- **Dataset**: Boston Housing Dataset
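The R² score used for evaluation measures the fraction of variance in the targets that the model explains. A quick sketch of the formula, checked against scikit-learn's implementation (the sample values below are made up for illustration):

```python
import numpy as np

# R² = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares
# and SS_tot is the total sum of squares around the mean of y_true.
def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred = [25.0, 20.0, 33.0, 35.0, 34.0]
score = r_squared(y_true, y_pred)
```

A score of 1.0 means perfect prediction; 0.0 means the model does no better than always predicting the mean.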

## Key Technologies

- **Kubeflow Pipelines (KFP)**: ML workflow orchestration on Kubernetes
- **DVC (Data Version Control)**: dataset versioning and management
- **Minikube**: local Kubernetes cluster
- **GitHub Actions**: automated CI/CD testing
- **Python 3.9**: core language
- **scikit-learn**: ML library

## Pipeline Stages

1. **Load Data**: fetch the dataset from a remote URL
2. **Preprocess**: clean the data and split it into train/test sets (80/20)
3. **Train Model**: fit a Random Forest on the training data
4. **Evaluate**: compute the R² score on the test data
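Stripped of the Kubeflow packaging, the four stages boil down to a few lines of scikit-learn. A minimal standalone sketch, using synthetic data in place of the remote `BostonHousing.csv` fetch (shapes and seeds here are illustrative, not taken from the repo):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Load Data (synthetic stand-in for pandas.read_csv(url))
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

# 2. Preprocess: 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Model: Random Forest with 100 estimators
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate: R² on the held-out test set
score = r2_score(y_test, model.predict(X_test))
```

In the actual pipeline, each of these steps runs as its own container and passes artifacts (CSV files, a pickled model) to the next stage.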

## Project Structure

```text
mlops_kubeflow/
├── .github/workflows/
│   └── ci.yml                    # GitHub Actions CI/CD
├── components/                   # Compiled Kubeflow components
│   ├── load_data.yaml
│   ├── preprocess_data.yaml
│   ├── train_model.yaml
│   └── evaluate_model.yaml
├── data/                         # Dataset (DVC tracked)
├── src/
│   └── pipeline_components.py    # Component definitions
├── pipeline.py                   # Main pipeline definition
├── pipeline.yaml                 # Compiled pipeline
├── requirements.txt              # Python dependencies
├── Jenkinsfile                   # Jenkins CI/CD
├── test-ci-locally.bat           # Local CI test script
└── README.md                     # This file
```

---


## Setup Instructions

### Prerequisites
- Python 3.9+
- Docker
- kubectl
- Git

### 1. Install Minikube

**Windows**:
```powershell
choco install minikube
```

**macOS**:
```bash
brew install minikube
```

**Linux**:
```bash
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
```
### 2. Start Minikube

```bash
minikube start --cpus=4 --memory=8192 --disk-size=20g
minikube status
```
### 3. Deploy Kubeflow Pipelines

```bash
export PIPELINE_VERSION=1.8.5

kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"

# Wait for all pods to become ready
kubectl get pods -n kubeflow -w
```

### 4. Fix Workflow Controller

The `platform-agnostic-pns` manifests default to the PNS executor, which often fails on Minikube; switch the workflow controller to the emissary executor:

```bash
kubectl patch configmap workflow-controller-configmap -n kubeflow --type merge -p '{"data":{"containerRuntimeExecutor":"emissary"}}'
kubectl rollout restart deployment workflow-controller -n kubeflow
```

### 5. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 6. Set Up DVC (Optional)

```bash
dvc init
dvc remote add -d local_storage ../dvc_storage_simulation
dvc add data/
dvc push
```

## Pipeline Walkthrough

### Step 1: Compile Components

```bash
python src/pipeline_components.py
```

This generates four YAML files in `components/`:

- `load_data.yaml`
- `preprocess_data.yaml`
- `train_model.yaml`
- `evaluate_model.yaml`

### Step 2: Compile Pipeline

```bash
python pipeline.py
```

This generates `pipeline.yaml`, the complete workflow definition.

### Step 3: Submit Pipeline to Kubeflow

The one-liner below injects the dataset URL as a workflow parameter before submitting:

```bash
kubectl -n kubeflow create -f pipeline.yaml --dry-run=client -o json | python -c "import sys, json; d=json.load(sys.stdin); d['spec']['arguments']={'parameters':[{'name':'url','value':'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'}]}; print(json.dumps(d))" | kubectl create -f - -n kubeflow
```

### Step 4: Monitor Execution

```bash
# Check workflow status
kubectl get workflow -n kubeflow

# Watch pods
kubectl get pods -n kubeflow | grep boston-housing

# View logs
kubectl logs <pod-name> -n kubeflow

# Get workflow details
kubectl describe workflow <workflow-name> -n kubeflow
```

## CI/CD Pipeline

### GitHub Actions

**File**: `.github/workflows/ci.yml`

**Stages**:

1. **Environment Setup**: install Python and dependencies
2. **Pipeline Compilation**: compile components and pipeline
3. **Verification**: verify the YAML files were generated

**Triggers**: push to `main`, pull requests, manual dispatch
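The three stages map onto a small workflow file. A sketch of what `ci.yml` plausibly contains (action versions and the job name are assumptions; the actual file in the repo is authoritative):

```yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:

jobs:
  compile-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      # Stage 1: environment setup
      - run: pip install -r requirements.txt
      # Stage 2: pipeline compilation
      - run: |
          python src/pipeline_components.py
          python pipeline.py
      # Stage 3: verification
      - run: |
          test -f pipeline.yaml
          ls components/*.yaml
```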

### Local Testing

```bash
# Windows
test-ci-locally.bat

# Linux/macOS
python src/pipeline_components.py
python pipeline.py
ls -l pipeline.yaml components/
```

## Common Commands

### Minikube

```bash
minikube start
minikube stop
minikube status
minikube delete
```

### Kubernetes

```bash
kubectl get pods -n kubeflow
kubectl get workflow -n kubeflow
kubectl logs <pod-name> -n kubeflow
kubectl delete workflow <workflow-name> -n kubeflow
```

### Pipeline

```bash
python src/pipeline_components.py            # Compile components
python pipeline.py                           # Compile pipeline
kubectl create -f pipeline.yaml -n kubeflow  # Submit
```

### DVC

```bash
dvc add data/
dvc push
dvc pull
dvc status
```

## Troubleshooting

### ImagePullBackOff Errors

**Solution**: switch to the emissary executor:

```bash
kubectl patch configmap workflow-controller-configmap -n kubeflow --type merge -p '{"data":{"containerRuntimeExecutor":"emissary"}}'
kubectl rollout restart deployment workflow-controller -n kubeflow
```

### Pipeline Fails with a Missing URL Parameter

**Solution**: submit with parameters (see Step 3 above).

### Pods Stuck in ContainerCreating

**Solution**: check the pod events:

```bash
kubectl describe pod <pod-name> -n kubeflow
```

### Minikube Won't Start

**Solution**:

```bash
minikube delete
minikube start --driver=docker --cpus=4 --memory=8192
```

## Testing

### Unit Tests

```bash
python src/pipeline_components.py
echo "Components compiled successfully!"
```

### Integration Tests

```bash
test-ci-locally.bat
```

### Pipeline Tests

```bash
python pipeline.py
ls -l pipeline.yaml
```

## Assignment Deliverables

### Task 1: DVC Setup ✅

- Data versioning with DVC
- Remote storage configuration
- `.dvc` files

### Task 2: Kubeflow Components ✅

- Four components (Load, Preprocess, Train, Evaluate)
- Component YAML files
- Function-based components

### Task 3: Pipeline Orchestration ✅

- Pipeline definition
- Minikube deployment
- Pipeline execution

### Task 4: CI/CD ✅

- GitHub Actions workflow
- Jenkinsfile
- Three-stage pipeline

### Task 5: Documentation ✅

- Comprehensive README
- Setup instructions
- Pipeline walkthrough

## Dependencies

```text
kfp==1.8.22
dvc
pandas
scikit-learn
```

## Quick Start (TL;DR)

```bash
# 1. Start Minikube
minikube start --cpus=4 --memory=8192

# 2. Deploy Kubeflow
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=1.8.5"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=1.8.5"

# 3. Fix executor
kubectl patch configmap workflow-controller-configmap -n kubeflow --type merge -p '{"data":{"containerRuntimeExecutor":"emissary"}}'
kubectl rollout restart deployment workflow-controller -n kubeflow

# 4. Install dependencies
pip install -r requirements.txt

# 5. Compile components and pipeline
python src/pipeline_components.py
python pipeline.py

# 6. Submit pipeline
kubectl -n kubeflow create -f pipeline.yaml --dry-run=client -o json | python -c "import sys, json; d=json.load(sys.stdin); d['spec']['arguments']={'parameters':[{'name':'url','value':'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'}]}; print(json.dumps(d))" | kubectl create -f - -n kubeflow

# 7. Monitor
kubectl get workflow -n kubeflow -w
```

## Author

- **Student ID**: i222141
- **Course**: MLOps
- **Assignment**: 4
- **Semester**: 7


## License

Educational project for MLOps coursework.


*Last updated: November 29, 2025*
