YOLO Auto-Training System

An AI-driven end-to-end YOLO model training and deployment platform. From dataset discovery to edge inference — fully automated with CrewAI multi-agent orchestration, Ray Tune hyperparameter optimization, and one-click export to NVIDIA Jetson / Rockchip RK3588.

Key Features

Feature	Description
Dataset Discovery	Multi-source search across Roboflow, Kaggle, and HuggingFace with relevance scoring
Auto Training	YOLO11 training with Ray Tune HPO, MLflow experiment tracking
Knowledge Distillation	Train compact student models from large teacher models
Edge Deployment	One-click export to ONNX / TensorRT for Jetson Nano, Jetson Orin, RK3588
Multi-Agent	CrewAI orchestration — Data Discovery, Generation, Training, Deployment agents
Async Pipeline	Celery + Redis task queue for GPU-intensive background jobs
MLOps Observability	Prometheus metrics, Grafana dashboards, structured logging (ELK stack)
Auto Labeling	Semi-automated annotation pipeline using Grounded SAM

Architecture

┌──────────────────────────────────────────────────────────────┐
│                         Web UI (Next.js)                     │
│               Discovery │ Training │ Labeling │ Analysis     │
└──────────────────────────┬─────────────────────────────────┘
                           │ HTTP / REST
┌──────────────────────────▼─────────────────────────────────┐
│              Business API  (FastAPI, port 8000)            │
│  Auth │ Dataset Discovery │ Agent Orchestration │ Routing    │
└──────────┬──────────────────────────────────────────────────┘
           │ Internal HTTP
┌──────────▼──────────────────────────────────────────────────┐
│              Training API  (FastAPI, port 8001)             │
│     YOLO Training │ Ray Tune HPO │ Model Export │ MLflow    │
└──────────┬──────────────────────────────────────────────────┘
           │
    ┌──────▼──────┐     ┌─────────────┐     ┌─────────────┐
    │  GPU Server  │     │ Redis 7     │     │ MLflow      │
    │  (CUDA 12.1)│     │ (Broker)    │     │ (Tracking)  │
    └─────────────┘     └─────────────┘     └─────────────┘

Quick Start

Prerequisites

Python 3.10+
Docker & Docker Compose
(GPU training) NVIDIA GPU with CUDA 12.1 support

1. Clone & Configure

git clone https://github.com/mightyoung/yolo-auto-trainning.git
cd yolo-auto-training
cp .env.example .env
# Edit .env with your API keys (Roboflow, Kaggle, HuggingFace, etc.)

2. Start Services (Docker Compose)

# All-in-one: Redis + Business API + Celery worker + GPU training
docker-compose up -d --build

# With full MLOps stack (Prometheus + Grafana + ELK)
docker-compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

3. Access

Service	URL
Business API	http://localhost:8000
API Docs	http://localhost:8000/docs
Training API	http://localhost:8001
Training API Docs	http://localhost:8001/docs
Grafana	http://localhost:3000
Kibana	http://localhost:5601

Development Setup

Python Environment

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

Start APIs Manually

# Business API
uvicorn business-api.src.api.gateway:app --host 0.0.0.0 --port 8000 --reload

# Training API (GPU node)
uvicorn training-api.src.api.gateway:app --host 0.0.0.0 --port 8001 --reload

# Celery worker
celery -A business-api.src.api.tasks worker --loglevel=info

API Reference

Dataset Discovery

# Search datasets across Roboflow, Kaggle, HuggingFace
curl -X POST http://localhost:8000/api/v1/data/search \
  -H "Content-Type: application/json" \
  -d '{"query": "vehicle detection", "max_results": 10}'

Submit Training Job

# Start YOLO11 training
curl -X POST http://localhost:8000/api/v1/train/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yolo11m",
    "data_yaml": "/data/my_dataset.yaml",
    "epochs": 100,
    "imgsz": 640
  }'

Hyperparameter Optimization

# Start Ray Tune HPO
curl -X POST http://localhost:8000/api/v1/train/hpo/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yolo11m",
    "data_yaml": "/data/my_dataset.yaml",
    "n_trials": 50,
    "epochs_per_trial": 50
  }'

Export to Edge Platform

# Export for NVIDIA Jetson Orin
curl -X POST http://localhost:8000/api/v1/deploy/export \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "/runs/train/exp/weights/best.pt",
    "platform": "jetson_orin",
    "imgsz": 640
  }'

# Export for Rockchip RK3588
curl -X POST http://localhost:8000/api/v1/deploy/export \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "/runs/train/exp/weights/best.pt",
    "platform": "rk3588",
    "imgsz": 640
  }'

Authentication

# Obtain JWT token
curl -X POST http://localhost:8000/api/v1/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "your-password"}'

# Use token in requests
curl -X POST http://localhost:8000/api/v1/train/submit \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"model": "yolo11n", "data_yaml": "/data/dataset.yaml", "epochs": 50}'

Project Structure

yolo-auto-training/
├── business-api/           # Business orchestration API (port 8000)
│   └── src/api/
│       ├── gateway.py      # FastAPI app + JWT auth
│       ├── routes.py       # Data, training, export, analysis endpoints
│       ├── training_client.py   # HTTP client → Training API
│       ├── agent_routes.py     # CrewAI agent endpoints
│       └── agents/
│           └── orchestration.py  # Multi-agent workflow definitions
├── training-api/           # GPU training API (port 8001)
│   └── src/
│       ├── training/
│       │   ├── runner.py   # YOLO trainer (ultralytics wrapper)
│       │   └── mlflow_tracker.py  # MLflow integration
│       └── deployment/
│           └── exporter.py # ONNX / TensorRT / TFLite export
├── web-ui-react/           # Next.js frontend
│   └── src/app/
│       ├── discovery/      # Dataset discovery UI
│       ├── training/       # Training management UI
│       ├── labeling/       # Auto-labeling UI
│       └── analysis/       # Analysis dashboard UI
├── src/                    # Monolithic core (legacy)
│   ├── api/               # FastAPI routes + Celery tasks
│   ├── agents/            # CrewAI orchestration
│   ├── data/              # Dataset discovery + quality filter
│   ├── training/          # YOLO training + Ray Tune HPO
│   ├── deployment/        # Model exporter
│   ├── inference/         # Inference engine
│   ├── monitoring/        # Drift detection
│   ├── pipeline/          # End-to-end orchestrator
│   └── features/         # Feature store
├── tests/                 # Unit & integration tests
├── docs/                  # Documentation (en/zh)
│   ├── en/               # English docs
│   └── zh/               # Chinese docs
├── docker-compose.yml     # Core stack (Redis + APIs + Celery)
├── docker-compose.monitoring.yml  # Prometheus + Grafana
├── docker-compose.logging.yml    # ELK stack
└── pyproject.toml        # Project metadata + dependencies

Configuration

All sensitive configuration is managed via environment variables. Copy .env.example to .env and configure:

# Core
JWT_SECRET_KEY=<generate-with: python -c "import secrets; print(secrets.token_urlsafe(32))">
REDIS_URL=redis://localhost:6379/0

# Training API URL (intra-network address of GPU server)
TRAINING_API_URL=http://localhost:8001
TRAINING_API_KEY=<your-api-key>

# Dataset Sources
ROBOFLOW_API_KEY=<your-roboflow-key>
KAGGLE_USERNAME=<your-kaggle-username>
KAGGLE_KEY=<your-kaggle-key>
HF_TOKEN=<your-huggingface-token>

# AI Providers
DEEPSEEK_API_KEY=<your-deepseek-key>

# MLflow
MLFLOW_TRACKING_URI=http://localhost:5000

Edge Deployment Targets

Platform	Format	Toolkit	Max Batch Size
Jetson Nano	TensorRT FP16	`tensorrt`	8
Jetson Orin	TensorRT FP16/INT8	`tensorrt`	32
RK3588	ONNX + RKNN	`rknn`	16
x86 Server	ONNX	`onnx`	64

Monitoring & Alerting

Prometheus Metrics

Metric	Description
`yolo_training_jobs_total`	Total training jobs submitted
`yolo_training_duration_seconds`	Training job duration histogram
`yolo_api_requests_total`	API request counter by endpoint
`yolo_gpu_memory_usage`	GPU memory usage gauge
`yolo_export_jobs_total`	Model export job counter

Alert Rules

Alert	Condition	Severity
`HighErrorRate`	Error rate > 5% in 5 min	warning
`APIDown`	Business API unreachable > 1 min	critical
`GPUMemoryHigh`	GPU memory > 90% for 5 min	warning
`TrainingJobFailed`	3+ consecutive failures	critical

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgements

Ultralytics — YOLO implementation
CrewAI — Multi-agent framework
Ray Tune — Hyperparameter optimization
Roboflow — Dataset platform
HuggingFace — Model hub

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
business-api		business-api
colab		colab
design-docs		design-docs
docs		docs
src		src
tests		tests
training-api		training-api
web-ui-react		web-ui-react
web-ui		web-ui
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
GPU_DEPLOYMENT.md		GPU_DEPLOYMENT.md
GPU_SERVER_SETUP.md		GPU_SERVER_SETUP.md
MANUAL_DEPLOY.md		MANUAL_DEPLOY.md
README.md		README.md
README_zh.md		README_zh.md
SESSION_SUMMARY.md		SESSION_SUMMARY.md
alertmanager.yml		alertmanager.yml
alerts.yml		alerts.yml
deploy_gpu.sh		deploy_gpu.sh
deploy_online.sh		deploy_online.sh
deploy_package.tar.gz		deploy_package.tar.gz
docker-compose.logging.yml		docker-compose.logging.yml
docker-compose.monitoring.yml		docker-compose.monitoring.yml
docker-compose.yml		docker-compose.yml
dvc.yaml		dvc.yaml
findings.md		findings.md
findings_react_migration.md		findings_react_migration.md
fluent-bit.conf		fluent-bit.conf
manual_deploy.sh		manual_deploy.sh
parsers.conf		parsers.conf
progress.md		progress.md
prometheus.yml		prometheus.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
task_plan.md		task_plan.md
task_plan_react_migration.md		task_plan_react_migration.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

YOLO Auto-Training System

Key Features

Architecture

Quick Start

Prerequisites

1. Clone & Configure

2. Start Services (Docker Compose)

3. Access

Development Setup

Python Environment

Start APIs Manually

API Reference

Dataset Discovery

Submit Training Job

Hyperparameter Optimization

Export to Edge Platform

Authentication

Project Structure

Configuration

Edge Deployment Targets

Monitoring & Alerting

Prometheus Metrics

Alert Rules

Contributing

License

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

YOLO Auto-Training System

Key Features

Architecture

Quick Start

Prerequisites

1. Clone & Configure

2. Start Services (Docker Compose)

3. Access

Development Setup

Python Environment

Start APIs Manually

API Reference

Dataset Discovery

Submit Training Job

Hyperparameter Optimization

Export to Edge Platform

Authentication

Project Structure

Configuration

Edge Deployment Targets

Monitoring & Alerting

Prometheus Metrics

Alert Rules

Contributing

License

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages