A progressive learning path for computer vision, from fundamentals to production-ready solutions.
| Module | Topic | Approach |
|---|---|---|
| 001 | Object Detection Fundamentals | From scratch with PyTorch |
| 002 | YOLO11 Image Tasks | Pre-trained models |
| 003 | YOLO11 Video Processing | Real-time inference |
| 004 | Object Tracking & ReID | Multi-object tracking |
| 005 | Image Captioning | Vision-Language Models (BLIP) |
## Build an object detector from scratch
Learn the fundamentals by implementing a simplified YOLO-style detector:
- Grid-based detection architecture
- Multi-task learning (objectness, localization, classification)
- Loss functions and evaluation metrics
- Non-Maximum Suppression (NMS)
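Two of the pieces above, IoU and greedy NMS, fit in a few lines of plain Python (a minimal illustration of the idea, not the module's actual PyTorch implementation):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring overlap survives: `nms([(0,0,10,10), (1,1,11,11), (20,20,30,30)], [0.9, 0.8, 0.7])` keeps indices `[0, 2]`.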
```bash
cd 001-object-classification
python train.py
python predict.py --image test.jpg
```

## YOLO11 for static images
Introduction to Ultralytics YOLO11 with all 5 task types:
- Object Detection
- Instance Segmentation
- Image Classification
- Pose Estimation
- Oriented Bounding Boxes (OBB)
```bash
cd 002-ultralytics
python demo_detect.py
python demo_pose.py
```

## Real-time video processing
Process video streams (webcam, files, RTSP) in real-time:
- Frame-by-frame inference
- Multiple video sources
- Saving annotated videos
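The frame-by-frame pattern is the same for every source. Stripped of OpenCV specifics, the loop looks like this (a schematic sketch where `read_frame` stands in for `cv2.VideoCapture.read` and `write_frame` for `cv2.VideoWriter.write`; both are placeholders, not this module's code):

```python
def process_stream(read_frame, infer, write_frame=None, max_frames=None):
    """Generic frame loop: read -> infer -> optionally write, until exhausted.

    read_frame() returns (ok, frame), mirroring cv2.VideoCapture.read().
    """
    count = 0
    results = []
    while max_frames is None or count < max_frames:
        ok, frame = read_frame()
        if not ok:                    # end of file or closed stream
            break
        result = infer(frame)
        if write_frame is not None:
            write_frame(result)       # mirrors cv2.VideoWriter.write()
        results.append(result)
        count += 1
    return results
```

In the real demos, `infer` would be a YOLO11 model call and each annotated frame would also be displayed on screen.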
```bash
cd 003-ultralytics-video
python demo_detect.py                               # Webcam
python demo_pose.py data/video_people_walking.mp4   # Video file
python demo_save_video.py input.mp4 output.mp4      # Save output
```

## Object tracking with re-identification
Track objects across video frames with persistent IDs:
- Multi-Object Tracking (MOT)
- Re-Identification (ReID) after occlusion
- Movement trail visualization
- People counting
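BoT-SORT and ByteTrack add motion models and appearance features, but the core of ID persistence can be shown with a toy greedy IoU matcher (an illustrative sketch, not the trackers this module uses):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

class GreedyTracker:
    """Assigns persistent IDs by matching detections to last-frame tracks."""
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}       # track id -> last known box
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = list(detections)
        # Match each existing track to its best-overlapping detection.
        for tid, box in list(self.tracks.items()):
            best = max(unmatched, key=lambda d: iou(box, d), default=None)
            if best is not None and iou(box, best) >= self.iou_thresh:
                assigned[tid] = best
                unmatched.remove(best)
        # Detections with no matching track start new tracks.
        for det in unmatched:
            assigned[self.next_id] = det
            self.next_id += 1
        self.tracks = assigned
        return assigned
```

A box that drifts a little between frames keeps its ID; a box appearing far from any track gets a fresh one. Real trackers also keep lost tracks alive for a while so ReID can re-attach them after occlusion.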
```bash
cd 004-reid
python demo_track_basic.py data/video_city_4k.mp4
python demo_track_persons.py data/video_people_walking.mp4
```

## Image captioning with Vision-Language Models
Generate natural language descriptions of images:
- BLIP model for captioning
- Two-stage pipeline (YOLO + BLIP)
- Visual Question Answering (VQA)
- Video captioning
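The two-stage idea, detect regions first and then caption each crop, is independent of the specific models. Structurally it reduces to the following (a shape-only sketch with stand-in `detect` and `caption` callables, not the YOLO/BLIP code in this module):

```python
def two_stage_caption(image, detect, caption):
    """Stage 1: detect(image) -> [((x1, y1, x2, y2), label), ...].
    Stage 2: caption each cropped region for a per-object description."""
    results = []
    for (x1, y1, x2, y2), label in detect(image):
        # Crop from a 2-D list-of-rows "image"; real code would slice an array.
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append({"label": label, "caption": caption(crop)})
    return results
```

The benefit over whole-image captioning is granularity: each detected object gets its own description instead of one sentence for the entire scene.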
```bash
cd 005-image-caption
python demo_caption_blip.py image.jpg
python demo_caption_yolo_blip.py image.jpg   # Two-stage
python demo_vqa.py image.jpg                 # Ask questions
```

## Quick start

```bash
# Clone and set up
git clone <repo-url>
cd introduction-to-computer-vision

# Choose a module
cd 002-ultralytics

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run a demo
python demo_detect.py
```

## Requirements

- Python 3.8+
- PyTorch 2.0+ (for 001)
- Ultralytics 8.3+ (for 002-004)
- OpenCV 4.8+
- CUDA (optional, for GPU acceleration)
## Project structure

```
introduction-to-computer-vision/
├── 001-object-classification/   # From-scratch detector
│   ├── model.py                 # CNN architecture
│   ├── train.py                 # Training loop
│   └── predict.py               # Inference
│
├── 002-ultralytics/             # YOLO11 image demos
│   ├── demo_detect.py
│   ├── demo_segment.py
│   ├── demo_classify.py
│   ├── demo_pose.py
│   └── demo_obb.py
│
├── 003-ultralytics-video/       # YOLO11 video demos
│   ├── demo_detect.py
│   ├── demo_save_video.py
│   └── data/                    # Sample videos
│
├── 004-reid/                    # Tracking with ReID
│   ├── demo_track_basic.py
│   ├── demo_track_reid.py
│   ├── demo_track_trails.py
│   └── demo_track_persons.py
│
└── 005-image-caption/           # Image captioning
    ├── demo_caption_blip.py
    ├── demo_caption_yolo_blip.py
    ├── demo_vqa.py
    └── demo_caption_video.py
```
## Learning path

```
┌─────────────────────────────────────────────────────────────────┐
│  001: Fundamentals     →  Understand how detection works        │
│  (From scratch)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  002: Image Inference  →  Use pre-trained YOLO11 models         │
│  (Static images)                                                │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  003: Video Processing →  Real-time inference on video          │
│  (Webcam/Files)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  004: Tracking & ReID  →  Persistent IDs across frames          │
│  (Multi-object)                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────┐
│  005: Image Captioning →  Vision-Language understanding         │
│  (BLIP/VQA)                                                     │
└─────────────────────────────────────────────────────────────────┘
```
## Core concepts by module

| Module | Core Concepts |
|---|---|
| 001 | CNN, Grid detection, IoU, NMS, mAP, Multi-task loss |
| 002 | Pre-trained models, Inference API, Task types |
| 003 | VideoCapture, Frame processing, VideoWriter |
| 004 | MOT, BoT-SORT, ByteTrack, ReID, Track persistence |
| 005 | Vision-Language Models, BLIP, Captioning, VQA |
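Of the module 001 concepts, mAP is the least obvious: it averages per-class average precision, and one simple non-interpolated AP form reduces to a short loop (a sketch of the idea, not the module's evaluation code; `tp_flags` is a hypothetical input, already sorted by descending confidence):

```python
def average_precision(tp_flags, num_gt):
    """AP over detections sorted by descending confidence.

    tp_flags[i] is True if detection i matched a ground-truth box;
    num_gt is the number of ground-truth boxes for this class.
    Uses the raw area under the precision-recall curve (no interpolation).
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    ap = 0.0
    for flag in tp_flags:
        if flag:
            tp += 1
            # Each new true positive adds a recall step of 1/num_gt,
            # weighted by the precision at that point in the ranking.
            ap += (tp / (tp + fp)) / num_gt
        else:
            fp += 1
    return ap
```

mAP is then the mean of this value over all classes (and, in COCO-style evaluation, over several IoU thresholds as well).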
## Sample videos

Located in `003-ultralytics-video/data/`:

| Video | Resolution | Best For |
|---|---|---|
| `video_city_4k.mp4` | 4K | Detection, Segmentation |
| `video_people_walking.mp4` | 1080p | Pose, Person tracking |
| `video_aerial_roundabout.mp4` | 1080p | OBB |
| `video_broadway_nyc.mp4` | 1080p | Urban scenes |
## License

MIT License - Educational use encouraged.