TDT17 -- Visual Intelligence Mini Project
- 1. Background & Motivation
- 2. Approach & Strategy
- 3. Data Analysis (EDA)
- 4. Methods & Models
- 5. Real-World Feasibility
- 6. Results
- 7. Discussion
- 8. Sustainability & Compute
- 9. Key Learning Points
Autonomous driving (AD) systems rely heavily on lane markings.
In Nordic winters, roads are often covered in snow, making standard
lane detection unreliable.
Snow poles act as the ground truth for road boundaries in winter.
Accurate detection of these poles is therefore critical for safe
autonomous driving.
Thin Objects
- Snow poles are extremely thin.
- At distance they can appear only 1--2 pixels wide.

Real-time Constraint
- Detection must run on edge hardware with low latency.
Develop a robust object detection pipeline that maximizes mAP (Mean Average Precision) while maintaining real-time performance.
We adopted a Data-Centric AI approach rather than only tuning model hyperparameters.
The provided dataset (~1k images) was insufficient for robust generalization.
We therefore collected additional data by:
- Scraping 10+ hours of YouTube winter driving footage
- Extracting frames to create a large pseudo-dataset.
We used SAM 3 (Segment Anything Model) to automatically generate labels.
This produced ~36,000 pseudo-labeled frames.
Stage 1 -- Pre-training
Train on large noisy YouTube dataset
→ provides general visual knowledge.
Stage 2 -- Fine-tuning
Train on high-quality iPhone / Roadpoles dataset
→ provides domain specificity.
To increase robustness we combined:
- CNN detectors (YOLO)
- Transformer detectors (RF-DETR)
This architectural diversity improves performance across different scenarios.
- Size: ~1,000 labeled images
- Quality: high resolution (1920 × 1080)
- Issue: dataset splits were sequential, causing data leakage where train and test images looked very similar.
- Size: ~15,000 processed frames
- Environmental variety:
  - Sunny
  - Overcast
  - Heavy snow
  - Highway vs rural roads
- Labeling: auto-labeled using SAM 3 with the prompt "snowpole"
- Filtering: low-confidence detections were removed to avoid training on incorrect labels.
YOLOv9 introduces PGI (Programmable Gradient Information).
This mechanism helps preserve fine-grained visual details, which is crucial for detecting thin snow poles.
Our experiments showed that YOLOv9 preserved faint pole structures better than nano-scale models like YOLO11n.
Native resolution training:
- imgsz = 1280
- imgsz = 1920
Higher resolution was required because poles become invisible at low resolution.
CNN detectors focus mainly on local features.
Transformers instead use global attention, allowing the model to reason about scene context.
RF-DETR can understand that:
A vertical white line inside a tree is not a snow pole.
YOLO detectors sometimes hallucinate poles in forest backgrounds, while transformers reduce such errors.
We implemented a custom Python pipeline (pipeline.py) to scale data generation:

- Input -- YouTube URL
- Extract -- ffmpeg extracts frames at 1 FPS (high quality)
- Rotate -- handle orientation differences
- Label -- SAM 3 inference with prompt "snowpole"
- Filter -- remove low-confidence detections (< 0.30)
- Output -- COCO-formatted dataset
This increased dataset size by ~15× without manual labeling.
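A condensed sketch of the two deterministic stages of this pipeline -- frame extraction and confidence filtering. The SAM 3 inference call itself is omitted, `CONF_THRESHOLD` mirrors the 0.30 cut-off above, and ffmpeg is assumed to be on PATH:

```python
import subprocess
from pathlib import Path

CONF_THRESHOLD = 0.30  # pseudo-labels below this confidence are discarded

def extract_frames(video_path, out_dir, fps=1):
    """Extract frames at the given FPS as high-quality JPEGs via ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video_path),
         "-vf", f"fps={fps}",          # sample one frame per second
         "-qscale:v", "2",             # near-lossless JPEG quality
         f"{out_dir}/frame_%06d.jpg"],
        check=True,
    )

def filter_detections(detections, threshold=CONF_THRESHOLD):
    """Keep only pseudo-labels at or above the confidence threshold."""
    return [d for d in detections if d["score"] >= threshold]
```

The filtering step is what keeps noisy teacher outputs from dominating Stage 1 training.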
To achieve top leaderboard performance we applied two techniques.
Inference is run on:
- original image
- horizontally flipped image
Predictions are averaged.
Effect:
~1.5% improvement in mAP
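The flip half of this scheme can be sketched as below (pure NumPy; how the two prediction sets are matched before averaging is left out, and `average_paired` assumes a 1:1 box correspondence):

```python
import numpy as np

def flip_boxes_back(boxes, img_w):
    """Map [x1, y1, x2, y2] boxes predicted on a horizontally
    flipped image back into original-image coordinates."""
    boxes = np.asarray(boxes, dtype=float)
    out = boxes.copy()
    out[:, 0] = img_w - boxes[:, 2]   # new x1 mirrors old x2
    out[:, 2] = img_w - boxes[:, 0]   # new x2 mirrors old x1
    return out

def average_paired(boxes_a, boxes_b):
    """Average two aligned prediction sets (assumes matched ordering)."""
    return (np.asarray(boxes_a, float) + np.asarray(boxes_b, float)) / 2
```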
Instead of suppressing overlapping boxes as Non-Max Suppression (NMS) does, WBF fuses them into a single box via a confidence-weighted average.
Final model combines:
- YOLOv9t (shape expert)
- YOLO11n (generalist)
- RF-DETR (context expert)
Result:
More accurate bounding boxes and higher mAP@50:95.
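A minimal pure-NumPy sketch of the WBF idea -- greedy IoU clustering plus confidence-weighted averaging. This is illustrative only; a tuned library implementation would normally be used in practice:

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def wbf(boxes, scores, iou_thr=0.55):
    """Weighted boxes fusion: overlapping boxes are merged into a
    confidence-weighted average instead of being suppressed."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    clusters = []  # each: member boxes/scores plus the current fused box
    for i in order:
        box = np.asarray(boxes[i], dtype=float)
        for c in clusters:
            if iou(c["fused"], box) >= iou_thr:
                c["boxes"].append(box)
                c["scores"].append(scores[i])
                w = np.asarray(c["scores"], dtype=float)
                w = w / w.sum()
                c["fused"] = (np.stack(c["boxes"]) * w[:, None]).sum(axis=0)
                break
        else:
            clusters.append({"boxes": [box], "scores": [scores[i]], "fused": box})
    fused_boxes = [c["fused"] for c in clusters]
    fused_scores = [float(np.mean(c["scores"])) for c in clusters]
    return fused_boxes, fused_scores
```

Because the fused box is a weighted average over all ensemble members, it tends to sit tighter on the object than any single prediction, which is what lifts mAP@50:95.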
Although ensembles increase compute cost, modern edge AI hardware (e.g., NVIDIA Orin) supports asynchronous parallel execution.
Reference latencies:
| Model | Latency |
|---|---|
| YOLOv9t | 18.0 ms |
| YOLO11n | 18.5 ms |
| RF-DETR | 36.4 ms |
Parallel inference means system latency is determined by the slowest model, not the sum.
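The latency argument in numbers, using the values from the table above:

```python
latencies_ms = {"YOLOv9t": 18.0, "YOLO11n": 18.5, "RF-DETR": 36.4}

# Running the three models back-to-back costs the sum of their latencies.
serial_ms = sum(latencies_ms.values())

# With asynchronous parallel execution, the system waits only for the
# slowest member, so the ensemble adds no latency beyond RF-DETR's.
parallel_ms = max(latencies_ms.values())
```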
| Model | mAP@50 | mAP@50:95 | Notes |
|---|---|---|---|
| Baseline (YOLOv11n) | 92.0% | 65.0% | Fast but loose boxes |
| RF-DETR (Stage 1) | 89.8% | 66.6% | Robust but missed domain specifics |
| RF-DETR (Stage 2) | 95.0% | 74.5% | Fine-tuned on iPhone data |
| Final Ensemble (WBF) | 97.6% | 79.5% | Rank #1 / #2 |
Key finding: mAP@50 saturated quickly (~97%), while improving mAP@50:95 required noticeably tighter bounding boxes.
Training at 640px made distant poles invisible.
Training at 1280px and above was necessary.
SAM-generated labels acted as a teacher, improving generalization to new weather conditions.
YOLO:
- Faster
- Higher recall on simple cases

RF-DETR:
- Slower
- Higher precision in complex backgrounds

Combining them compensated for both weaknesses.
Training was performed on:
- IDUN Cluster (A100 GPUs)
- Cybele Lab (RTX 4090)
~25 GPU hours
Breakdown:
| Task | Time |
|---|---|
| SAM3 Pipeline | ~5 hours |
| RF-DETR Training | ~12 hours |
| YOLO Experiments | ~8 hours |
Average GPU power:
~350 W
Total energy:
25h × 0.35kW ≈ 8.75 kWh
Tesla Model Y consumption:
~16 kWh / 100km
Project energy (8.75 kWh) ≈ 54 km driving distance.
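The energy arithmetic above, spelled out (all figures taken from this section):

```python
gpu_hours = 25            # total GPU time across all experiments
avg_power_kw = 0.35       # ~350 W average GPU draw
energy_kwh = gpu_hours * avg_power_kw

ev_kwh_per_100km = 16     # Tesla Model Y consumption estimate
equivalent_km = energy_kwh / ev_kwh_per_100km * 100   # ~54.7 km
```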
- Data Engineering > Model Tuning: building the SAM3 pipeline yielded larger gains than hyperparameter tuning.
- Pseudo-Labeling Risks: too many epochs on pseudo-labels cause memorization of teacher mistakes.
- Smart Ensembling: combining different architectures (CNN + Transformer) works better than identical models.
- Infrastructure Skills: handling slurm queues, rsync transfers, and distributed training.


