
# Introduction to Computer Vision

A progressive learning path for computer vision, from fundamentals to production-ready solutions.

## Learning Path

| Module | Topic | Approach |
|--------|-------|----------|
| 001 | Object Detection Fundamentals | From scratch with PyTorch |
| 002 | YOLO11 Image Tasks | Pre-trained models |
| 003 | YOLO11 Video Processing | Real-time inference |
| 004 | Object Tracking & ReID | Multi-object tracking |
| 005 | Image Captioning | Vision-Language Models (BLIP) |

## Modules Overview

### 001-object-classification

Build an object detector from scratch

Learn the fundamentals by implementing a simplified YOLO-style detector:

- Grid-based detection architecture
- Multi-task learning (objectness, localization, classification)
- Loss functions and evaluation metrics
- Non-Maximum Suppression (NMS)

```bash
cd 001-object-classification
python train.py
python predict.py --image test.jpg
```
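The NMS step in particular is easy to prototype on its own. A minimal sketch in plain NumPy (function names are illustrative, not this module's actual API):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Keep only boxes whose overlap with the chosen box is below threshold
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.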

### 002-ultralytics

YOLO11 for static images

Introduction to Ultralytics YOLO11 with all 5 task types:

- Object Detection
- Instance Segmentation
- Image Classification
- Pose Estimation
- Oriented Bounding Boxes (OBB)

```bash
cd 002-ultralytics
python demo_detect.py
python demo_pose.py
```
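The demos wrap the standard Ultralytics Python API. A minimal detection sketch (the weight file `yolo11n.pt` and the image path are illustrative, and weights are downloaded on first use; the repo's demos may configure things differently):

```python
from ultralytics import YOLO

# Load a pre-trained YOLO11 nano detection model
model = YOLO("yolo11n.pt")

# Run inference; returns a list of Results objects (one per image)
results = model("test.jpg")

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        conf = float(box.conf)
        print(f"{cls_name}: {conf:.2f} at {box.xyxy[0].tolist()}")
```

Swapping the weight file (e.g. `yolo11n-seg.pt`, `yolo11n-pose.pt`, `yolo11n-obb.pt`) switches the task while keeping the same calling pattern.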

### 003-ultralytics-video

Real-time video processing

Process video streams (webcam, files, RTSP) in real-time:

- Frame-by-frame inference
- Multiple video sources
- Saving annotated videos

```bash
cd 003-ultralytics-video
python demo_detect.py                              # Webcam
python demo_pose.py data/video_people_walking.mp4  # Video file
python demo_save_video.py input.mp4 output.mp4     # Save output
```
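The frame-by-frame pattern behind these demos can be sketched with OpenCV plus Ultralytics (source `0` is the default webcam; file paths and the weight file are placeholders, not necessarily what the demos use):

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
cap = cv2.VideoCapture(0)  # 0 = webcam; or a file path / RTSP URL

# Writer for saving the annotated stream
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("output.mp4", fourcc, 30.0,
                      (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                       int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))))

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)         # inference on a single frame
    annotated = results[0].plot()  # draw boxes/labels onto the frame
    out.write(annotated)
    cv2.imshow("YOLO11", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
out.release()
cv2.destroyAllWindows()
```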

### 004-reid

Object tracking with re-identification

Track objects across video frames with persistent IDs:

- Multi-Object Tracking (MOT)
- Re-Identification (ReID) after occlusion
- Movement trail visualization
- People counting

```bash
cd 004-reid
python demo_track_basic.py data/video_city_4k.mp4
python demo_track_persons.py data/video_people_walking.mp4
```
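In the Ultralytics API, tracking is the same per-frame loop with `model.track(..., persist=True)`, which carries track state (and IDs) across frames; the tracker (BoT-SORT or ByteTrack) is chosen via a config argument. A sketch, with the weight file and video path as placeholders:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
cap = cv2.VideoCapture("data/video_people_walking.mp4")

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # persist=True keeps tracker state from the previous frame,
    # so each object keeps its ID across the video
    results = model.track(frame, persist=True, tracker="botsort.yaml")
    boxes = results[0].boxes
    if boxes.id is not None:  # IDs can be None before tracks are confirmed
        for track_id, xyxy in zip(boxes.id.int().tolist(),
                                  boxes.xyxy.tolist()):
            print(f"track {track_id}: {xyxy}")

cap.release()
```

Passing `tracker="bytetrack.yaml"` instead selects ByteTrack.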

### 005-image-caption

Image captioning with Vision-Language Models

Generate natural language descriptions of images:

- BLIP model for captioning
- Two-stage pipeline (YOLO + BLIP)
- Visual Question Answering (VQA)
- Video captioning

```bash
cd 005-image-caption
python demo_caption_blip.py image.jpg
python demo_caption_yolo_blip.py image.jpg  # Two-stage
python demo_vqa.py image.jpg                # Ask questions
```
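BLIP captioning is typically driven through Hugging Face Transformers. Whether this module uses exactly this checkpoint is an assumption, but the standard pattern looks like:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Publicly available BLIP captioning checkpoint (assumed; the module
# may pin a different one)
name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(name)
model = BlipForConditionalGeneration.from_pretrained(name)

image = Image.open("image.jpg").convert("RGB")

# Unconditional captioning: encode the image, decode generated tokens
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

A two-stage pipeline crops YOLO detections first and captions each crop; VQA uses the companion `BlipForQuestionAnswering` class with a text prompt alongside the image.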

## Quick Start

```bash
# Clone and setup
git clone <repo-url>
cd introduction-to-computer-vision

# Choose a module
cd 002-ultralytics

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run demo
python demo_detect.py
```

## Requirements

- Python 3.8+
- PyTorch 2.0+ (for 001)
- Ultralytics 8.3+ (for 002-004)
- OpenCV 4.8+
- CUDA (optional, for GPU acceleration)
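To confirm whether GPU acceleration will actually be used, a quick check (requires PyTorch installed; both Ultralytics and the from-scratch trainer pick up CUDA automatically when it is available):

```python
import torch

# "cuda" if a usable GPU is visible to PyTorch, otherwise "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
```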

## Project Structure

```
introduction-to-computer-vision/
├── 001-object-classification/   # From-scratch detector
│   ├── model.py                # CNN architecture
│   ├── train.py                # Training loop
│   └── predict.py              # Inference
│
├── 002-ultralytics/            # YOLO11 image demos
│   ├── demo_detect.py
│   ├── demo_segment.py
│   ├── demo_classify.py
│   ├── demo_pose.py
│   └── demo_obb.py
│
├── 003-ultralytics-video/      # YOLO11 video demos
│   ├── demo_detect.py
│   ├── demo_save_video.py
│   └── data/                   # Sample videos
│
├── 004-reid/                   # Tracking with ReID
│   ├── demo_track_basic.py
│   ├── demo_track_reid.py
│   ├── demo_track_trails.py
│   └── demo_track_persons.py
│
└── 005-image-caption/          # Image Captioning
    ├── demo_caption_blip.py
    ├── demo_caption_yolo_blip.py
    ├── demo_vqa.py
    └── demo_caption_video.py
```

## Progression Path

```
┌─────────────────────────────────────────────────────────────────┐
│  001: Fundamentals        →  Understand how detection works     │
│       (From scratch)                                            │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  002: Image Inference     →  Use pre-trained YOLO11 models      │
│       (Static images)                                           │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  003: Video Processing    →  Real-time inference on video       │
│       (Webcam/Files)                                            │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  004: Tracking & ReID     →  Persistent IDs across frames       │
│       (Multi-object)                                            │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  005: Image Captioning    →  Vision-Language understanding      │
│       (BLIP/VQA)                                                │
└─────────────────────────────────────────────────────────────────┘
```

## Key Concepts by Module

| Module | Core Concepts |
|--------|---------------|
| 001 | CNN, Grid detection, IoU, NMS, mAP, Multi-task loss |
| 002 | Pre-trained models, Inference API, Task types |
| 003 | VideoCapture, Frame processing, VideoWriter |
| 004 | MOT, BoT-SORT, ByteTrack, ReID, Track persistence |
| 005 | Vision-Language Models, BLIP, Captioning, VQA |

## Sample Videos (003/004)

Located in `003-ultralytics-video/data/`:

| Video | Resolution | Best For |
|-------|------------|----------|
| video_city_4k.mp4 | 4K | Detection, Segmentation |
| video_people_walking.mp4 | 1080p | Pose, Person tracking |
| video_aerial_roundabout.mp4 | 1080p | OBB |
| video_broadway_nyc.mp4 | 1080p | Urban scenes |


## License

MIT License - Educational use encouraged.
