A progressive learning path for computer vision, from fundamentals to production-ready solutions.
| Module | Topic | Approach |
|---|---|---|
| 001 | Object Detection Fundamentals | From scratch with PyTorch |
| 002 | YOLO11 Image Tasks | Pre-trained models |
| 003 | YOLO11 Video Processing | Real-time inference |
| 004 | Object Tracking & ReID | Multi-object tracking |
| 005 | Image Captioning | Vision-Language Models (BLIP) |
## Build an object detector from scratch
Learn the fundamentals by implementing a simplified YOLO-style detector:
- Grid-based detection architecture
- Multi-task learning (objectness, localization, classification)
- Loss functions and evaluation metrics
- Non-Maximum Suppression (NMS)
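Two of the pieces above, IoU and greedy NMS, fit in a few lines of plain Python (a minimal illustration of the idea, not the module's actual PyTorch implementation):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring overlap survives: `nms([(0,0,10,10), (1,1,11,11), (20,20,30,30)], [0.9, 0.8, 0.7])` keeps indices `[0, 2]`.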
```bash
cd 001-object-classification
python train.py
python predict.py --image test.jpg
```

## YOLO11 for static images
Introduction to Ultralytics YOLO11 with all 5 task types:
- Object Detection
- Instance Segmentation
- Image Classification
- Pose Estimation
- Oriented Bounding Boxes (OBB)
```bash
cd 002-ultralytics
python demo_detect.py
python demo_pose.py
```

## Real-time video processing
Process video streams (webcam, files, RTSP) in real-time:
- Frame-by-frame inference
- Multiple video sources
- Saving annotated videos
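The frame-by-frame pattern is the same for every source. Stripped of OpenCV specifics, the loop looks like this (a schematic sketch where `read_frame` stands in for `cv2.VideoCapture.read` and `write_frame` for `cv2.VideoWriter.write`; both are placeholders, not this module's code):

```python
def process_stream(read_frame, infer, write_frame=None, max_frames=None):
    """Generic frame loop: read -> infer -> optionally write, until exhausted.

    read_frame() returns (ok, frame), mirroring cv2.VideoCapture.read().
    """
    count = 0
    results = []
    while max_frames is None or count < max_frames:
        ok, frame = read_frame()
        if not ok:                    # end of file or closed stream
            break
        result = infer(frame)
        if write_frame is not None:
            write_frame(result)       # mirrors cv2.VideoWriter.write()
        results.append(result)
        count += 1
    return results
```

In the real demos, `infer` would be a YOLO11 model call and each annotated frame would also be displayed on screen.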
```bash
cd 003-ultralytics-video
python demo_detect.py                               # Webcam
python demo_pose.py data/video_people_walking.mp4   # Video file
python demo_save_video.py input.mp4 output.mp4      # Save output
```

## Object tracking with re-identification
Track objects across video frames with persistent IDs:
- Multi-Object Tracking (MOT)
- Re-Identification (ReID) after occlusion
- Movement trail visualization
- People counting
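BoT-SORT and ByteTrack add motion models and appearance features, but the core of ID persistence can be shown with a toy greedy IoU matcher (an illustrative sketch, not the trackers this module uses):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

class GreedyTracker:
    """Assigns persistent IDs by matching detections to last-frame tracks."""
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}       # track id -> last known box
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = list(detections)
        # Match each existing track to its best-overlapping detection.
        for tid, box in list(self.tracks.items()):
            best = max(unmatched, key=lambda d: iou(box, d), default=None)
            if best is not None and iou(box, best) >= self.iou_thresh:
                assigned[tid] = best
                unmatched.remove(best)
        # Detections with no matching track start new tracks.
        for det in unmatched:
            assigned[self.next_id] = det
            self.next_id += 1
        self.tracks = assigned
        return assigned
```

A box that drifts a little between frames keeps its ID; a box appearing far from any track gets a fresh one. Real trackers also keep lost tracks alive for a while so ReID can re-attach them after occlusion.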
```bash
cd 004-reid
python demo_track_basic.py data/video_city_4k.mp4
python demo_track_persons.py data/video_people_walking.mp4
```

## Image captioning with Vision-Language Models
Generate natural language descriptions of images:
- BLIP model for captioning
- Two-stage pipeline (YOLO + BLIP)
- Visual Question Answering (VQA)
- Video captioning
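The two-stage idea, detect regions first and then caption each crop, is independent of the specific models. Structurally it reduces to the following (a shape-only sketch with stand-in `detect` and `caption` callables, not the YOLO/BLIP code in this module):

```python
def two_stage_caption(image, detect, caption):
    """Stage 1: detect(image) -> [((x1, y1, x2, y2), label), ...].
    Stage 2: caption each cropped region for a per-object description."""
    results = []
    for (x1, y1, x2, y2), label in detect(image):
        # Crop from a 2-D list-of-rows "image"; real code would slice an array.
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append({"label": label, "caption": caption(crop)})
    return results
```

The benefit over whole-image captioning is granularity: each detected object gets its own description instead of one sentence for the entire scene.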
```bash
cd 005-image-caption
python demo_caption_blip.py image.jpg
python demo_caption_yolo_blip.py image.jpg   # Two-stage
python demo_vqa.py image.jpg                 # Ask questions
```

## Quick start

```bash
# Clone and set up
git clone <repo-url>
cd introduction-to-computer-vision

# Choose a module
cd 002-ultralytics

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run a demo
python demo_detect.py
```

## Requirements

- Python 3.8+
- PyTorch 2.0+ (for 001)
- Ultralytics 8.3+ (for 002-004)
- OpenCV 4.8+
- CUDA (optional, for GPU acceleration)
## Project structure

```
introduction-to-computer-vision/
├── 001-object-classification/   # From-scratch detector
│   ├── model.py                 # CNN architecture
│   ├── train.py                 # Training loop
│   └── predict.py               # Inference
│
├── 002-ultralytics/             # YOLO11 image demos
│   ├── demo_detect.py
│   ├── demo_segment.py
│   ├── demo_classify.py
│   ├── demo_pose.py
│   └── demo_obb.py
│
├── 003-ultralytics-video/       # YOLO11 video demos
│   ├── demo_detect.py
│   ├── demo_save_video.py
│   └── data/                    # Sample videos
│
├── 004-reid/                    # Tracking with ReID
│   ├── demo_track_basic.py
│   ├── demo_track_reid.py
│   ├── demo_track_trails.py
│   └── demo_track_persons.py
│
└── 005-image-caption/           # Image captioning
    ├── demo_caption_blip.py
    ├── demo_caption_yolo_blip.py
    ├── demo_vqa.py
    └── demo_caption_video.py
```
## Learning path

```
┌─────────────────────────────────────────────────────────────────┐
│  001: Fundamentals     →  Understand how detection works        │
│  (From scratch)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  002: Image Inference  →  Use pre-trained YOLO11 models         │
│  (Static images)                                                │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  003: Video Processing →  Real-time inference on video          │
│  (Webcam/Files)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  004: Tracking & ReID  →  Persistent IDs across frames          │
│  (Multi-object)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  005: Image Captioning →  Vision-Language understanding         │
│  (BLIP/VQA)                                                     │
└─────────────────────────────────────────────────────────────────┘
```
## Core concepts by module

| Module | Core Concepts |
|---|---|
| 001 | CNN, Grid detection, IoU, NMS, mAP, Multi-task loss |
| 002 | Pre-trained models, Inference API, Task types |
| 003 | VideoCapture, Frame processing, VideoWriter |
| 004 | MOT, BoT-SORT, ByteTrack, ReID, Track persistence |
| 005 | Vision-Language Models, BLIP, Captioning, VQA |
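Of the module 001 concepts, mAP is the least obvious: it averages per-class average precision, and one simple non-interpolated AP form reduces to a short loop (a sketch of the idea, not the module's evaluation code; `tp_flags` is a hypothetical input, already sorted by descending confidence):

```python
def average_precision(tp_flags, num_gt):
    """AP over detections sorted by descending confidence.

    tp_flags[i] is True if detection i matched a ground-truth box;
    num_gt is the number of ground-truth boxes for this class.
    Uses the raw area under the precision-recall curve (no interpolation).
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    ap = 0.0
    for flag in tp_flags:
        if flag:
            tp += 1
            # Each new true positive adds a recall step of 1/num_gt,
            # weighted by the precision at that point in the ranking.
            ap += (tp / (tp + fp)) / num_gt
        else:
            fp += 1
    return ap
```

mAP is then the mean of this value over all classes (and, in COCO-style evaluation, over several IoU thresholds as well).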
## Sample videos

Located in `003-ultralytics-video/data/`:

| Video | Resolution | Best For |
|---|---|---|
| `video_city_4k.mp4` | 4K | Detection, Segmentation |
| `video_people_walking.mp4` | 1080p | Pose, Person tracking |
| `video_aerial_roundabout.mp4` | 1080p | OBB |
| `video_broadway_nyc.mp4` | 1080p | Urban scenes |
## License

MIT License - Educational use encouraged.