TDT17 -- Visual Intelligence Mini Project
- 1. Background & Motivation
- 2. Approach & Strategy
- 3. Data Analysis (EDA)
- 4. Methods & Models
- 5. Real-World Feasibility
- 6. Results
- 7. Discussion
- 8. Sustainability & Compute
- 9. Key Learning Points
Autonomous driving (AD) systems rely heavily on lane markings.
In Nordic winters, roads are often covered in snow, making standard
lane detection unreliable.
Snow poles act as the ground truth for road boundaries in winter.
Accurate detection of these poles is therefore critical for safe
autonomous driving.
Thin Objects
- Snow poles are extremely thin.
- At distance they can appear only 1--2 pixels wide.

Real-time Constraint
- Detection must run on edge hardware with low latency.
Develop a robust object detection pipeline that maximizes mAP (Mean Average Precision) while maintaining real-time performance.
We adopted a Data-Centric AI approach rather than only tuning model hyperparameters.
The provided dataset (~1k images) was insufficient for robust generalization.
We therefore collected additional data by:
- Scraping 10+ hours of YouTube winter driving footage
- Extracting frames to create a large pseudo-dataset.
We used SAM 3 (Segment Anything Model) to automatically generate labels.
This produced ~36,000 pseudo-labeled frames.
Stage 1 -- Pre-training
Train on large noisy YouTube dataset
→ provides general visual knowledge.
Stage 2 -- Fine-tuning
Train on high-quality iPhone / Roadpoles dataset
→ provides domain specificity.
To increase robustness we combined:
- CNN detectors (YOLO)
- Transformer detectors (RF-DETR)
This architectural diversity improves performance across different scenarios.
- Size: ~1,000 labeled images
- Quality: high resolution (1920 × 1080)
- Issue: dataset splits were sequential, causing data leakage where train and test images looked very similar.
- Size: ~15,000 processed frames
- Environmental variety:
  - Sunny
  - Overcast
  - Heavy snow
  - Highway vs rural roads
- Labeling: auto-labeled using SAM 3 with the prompt "snowpole"
- Filtering: low-confidence detections were removed to avoid training on incorrect labels.
YOLOv9 introduces PGI (Programmable Gradient Information).
This mechanism helps preserve fine-grained visual details, which is crucial for detecting thin snow poles.
Our experiments showed that YOLOv9 preserved faint pole structures better than nano-scale models like YOLO11n.
Native resolution training:
- imgsz = 1280
- imgsz = 1920
Higher resolution was required because poles become invisible at low resolution.
CNN detectors focus mainly on local features.
Transformers instead use global attention, allowing the model to reason about scene context.
RF-DETR can understand that:
A vertical white line inside a tree is not a snow pole.
YOLO detectors sometimes hallucinate poles in forest backgrounds, while transformers reduce such errors.
We implemented a custom Python pipeline (pipeline.py) to scale data generation:

- Input -- YouTube URL
- Extract -- ffmpeg extracts frames at 1 FPS (high quality)
- Rotate -- handle orientation differences
- Label -- SAM 3 inference with prompt "snowpole"
- Filter -- remove low-confidence detections (< 0.30)
- Output -- COCO-formatted dataset
This increased dataset size by ~15× without manual labeling.
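A condensed sketch of the two deterministic stages of this pipeline -- frame extraction and confidence filtering. The SAM 3 inference call itself is omitted, `CONF_THRESHOLD` mirrors the 0.30 cut-off above, and ffmpeg is assumed to be on PATH:

```python
import subprocess
from pathlib import Path

CONF_THRESHOLD = 0.30  # pseudo-labels below this confidence are discarded

def extract_frames(video_path, out_dir, fps=1):
    """Extract frames at the given FPS as high-quality JPEGs via ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video_path),
         "-vf", f"fps={fps}",          # sample one frame per second
         "-qscale:v", "2",             # near-lossless JPEG quality
         f"{out_dir}/frame_%06d.jpg"],
        check=True,
    )

def filter_detections(detections, threshold=CONF_THRESHOLD):
    """Keep only pseudo-labels at or above the confidence threshold."""
    return [d for d in detections if d["score"] >= threshold]
```

The filtering step is what keeps noisy teacher outputs from dominating Stage 1 training.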
To achieve top leaderboard performance we applied two techniques.
Inference is run on:
- original image
- horizontally flipped image
Predictions are averaged.
Effect:
~1.5% improvement in mAP
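The flip half of this scheme can be sketched as below (pure NumPy; how the two prediction sets are matched before averaging is left out, and `average_paired` assumes a 1:1 box correspondence):

```python
import numpy as np

def flip_boxes_back(boxes, img_w):
    """Map [x1, y1, x2, y2] boxes predicted on a horizontally
    flipped image back into original-image coordinates."""
    boxes = np.asarray(boxes, dtype=float)
    out = boxes.copy()
    out[:, 0] = img_w - boxes[:, 2]   # new x1 mirrors old x2
    out[:, 2] = img_w - boxes[:, 0]   # new x2 mirrors old x1
    return out

def average_paired(boxes_a, boxes_b):
    """Average two aligned prediction sets (assumes matched ordering)."""
    return (np.asarray(boxes_a, float) + np.asarray(boxes_b, float)) / 2
```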
Instead of suppressing overlapping boxes as Non-Max Suppression (NMS) does, WBF fuses them into a single box via a confidence-weighted average.
Final model combines:
- YOLOv9t (shape expert)
- YOLO11n (generalist)
- RF-DETR (context expert)
Result:
More accurate bounding boxes and higher mAP@50:95.
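A minimal pure-NumPy sketch of the WBF idea -- greedy IoU clustering plus confidence-weighted averaging. This is illustrative only; a tuned library implementation would normally be used in practice:

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def wbf(boxes, scores, iou_thr=0.55):
    """Weighted boxes fusion: overlapping boxes are merged into a
    confidence-weighted average instead of being suppressed."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    clusters = []  # each: member boxes/scores plus the current fused box
    for i in order:
        box = np.asarray(boxes[i], dtype=float)
        for c in clusters:
            if iou(c["fused"], box) >= iou_thr:
                c["boxes"].append(box)
                c["scores"].append(scores[i])
                w = np.asarray(c["scores"], dtype=float)
                w = w / w.sum()
                c["fused"] = (np.stack(c["boxes"]) * w[:, None]).sum(axis=0)
                break
        else:
            clusters.append({"boxes": [box], "scores": [scores[i]], "fused": box})
    fused_boxes = [c["fused"] for c in clusters]
    fused_scores = [float(np.mean(c["scores"])) for c in clusters]
    return fused_boxes, fused_scores
```

Because the fused box is a weighted average over all ensemble members, it tends to sit tighter on the object than any single prediction, which is what lifts mAP@50:95.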
Although ensembles increase compute cost, modern edge AI hardware (e.g., NVIDIA Orin) supports asynchronous parallel execution.
Reference latencies:
| Model | Latency |
|---|---|
| YOLOv9t | 18.0 ms |
| YOLO11n | 18.5 ms |
| RF-DETR | 36.4 ms |
Parallel inference means system latency is determined by the slowest model, not the sum.
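The latency argument in numbers, using the values from the table above:

```python
latencies_ms = {"YOLOv9t": 18.0, "YOLO11n": 18.5, "RF-DETR": 36.4}

# Running the three models back-to-back costs the sum of their latencies.
serial_ms = sum(latencies_ms.values())

# With asynchronous parallel execution, the system waits only for the
# slowest member, so the ensemble adds no latency beyond RF-DETR's.
parallel_ms = max(latencies_ms.values())
```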
| Model | mAP@50 | mAP@50:95 | Notes |
|---|---|---|---|
| Baseline (YOLOv11n) | 92.0% | 65.0% | Fast but loose boxes |
| RF-DETR (Stage 1) | 89.8% | 66.6% | Robust but missed domain specifics |
| RF-DETR (Stage 2) | 95.0% | 74.5% | Fine-tuned on iPhone data |
| Final Ensemble (WBF) | 97.6% | 79.5% | Rank #1 / #2 |
Key finding: mAP@50 saturated quickly (~97%), while improving mAP@50:95 required noticeably tighter bounding boxes.
Training at 640px made distant poles invisible.
Training at 1280px and above was necessary.
SAM-generated labels acted as a teacher, improving generalization to new weather conditions.
YOLO:
- Faster
- Higher recall on simple cases

RF-DETR:
- Slower
- Higher precision in complex backgrounds

Combining them compensated for both weaknesses.
Training was performed on:
- IDUN Cluster (A100 GPUs)
- Cybele Lab (RTX 4090)
~25 GPU hours
Breakdown:
| Task | Time |
|---|---|
| SAM3 Pipeline | ~5 hours |
| RF-DETR Training | ~12 hours |
| YOLO Experiments | ~8 hours |
Average GPU power:
~350 W
Total energy:
25h × 0.35kW ≈ 8.75 kWh
Tesla Model Y consumption:
~16 kWh / 100km
Project energy (8.75 kWh) ≈ 54 km driving distance.
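The energy arithmetic above, spelled out (all figures taken from this section):

```python
gpu_hours = 25            # total GPU time across all experiments
avg_power_kw = 0.35       # ~350 W average GPU draw
energy_kwh = gpu_hours * avg_power_kw

ev_kwh_per_100km = 16     # Tesla Model Y consumption estimate
equivalent_km = energy_kwh / ev_kwh_per_100km * 100   # ~54.7 km
```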
- Data Engineering > Model Tuning: building the SAM3 pipeline yielded larger gains than hyperparameter tuning.
- Pseudo-Labeling Risks: too many epochs on pseudo-labels cause memorization of teacher mistakes.
- Smart Ensembling: combining different architectures (CNN + Transformer) works better than identical models.
- Infrastructure Skills: handling slurm queues, rsync transfers, and distributed training.


