Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
This repository contains the official implementation for "Semantic-Drive", a local-first framework designed to mine safety-critical edge cases from raw autonomous vehicle video logs using "System 2" neuro-symbolic reasoning.
A project by Antonio Guillen-Perez | Portfolio | LinkedIn | Google Scholar
Figure 1: Semantic-Drive Framework Overview. The system combines real-time object detection (YOLOE-11) with Neuro-Symbolic reasoning VLMs (Qwen3-VL, Kimi-VL) to mine safety-critical edge cases from raw autonomous vehicle video logs without cloud costs.
Figure 2: The "Dark Data" Crisis. While 99% of logs represent nominal driving, the critical 1% lies in the "Long Tail" (e.g., erratic VRUs, sensor degradation). Semantic-Drive automates the mining of this region without cloud costs.
The development of Autonomous Vehicles (AVs) is currently hampered by a scarcity of long-tail training data. Semantic-Drive is an offline DataOps engine designed to mine safety-critical scenarios, specifically rare events like erratic jaywalking or complex construction diversions, from unlabelled data lakes. It provides an accessible and privacy-preserving alternative to cloud-based auto-labelers by running entirely on consumer-grade hardware (NVIDIA RTX 3090) without transmitting data to external APIs.
The system employs a Neuro-Symbolic Architecture that separates perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis, where an ensemble of reasoning Vision-Language Models (VLMs) performs forensic scene analysis. By enforcing a strict "Scenario DNA" schema aligned with the Waymo Open Dataset for End-to-End Driving (WOD-E2E), Semantic-Drive transforms unstructured video into a queryable semantic database.
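For illustration, a single "Scenario DNA" record in the generated .jsonl index might look like the sketch below; the field names and the 0-10 risk scale are assumptions for this example, and the authoritative schema is defined in src/model/prompts.py.
```python
# Hypothetical "Scenario DNA" record (illustrative fields only; see src/model/prompts.py).
scenario_dna = {
    "scene_token": "cc8c0bf57f984915a77078b10eb33198",  # NuScenes scene ID
    "taxonomy": ["Construction", "VRU Interaction"],     # WOD-E2E categories
    "risk_score": 7,                                     # assumed 0-10 criticality scale
    "objects": ["construction barrel", "worker", "pedestrian"],
    "reasoning": "Lane diversion forces the ego vehicle toward a jaywalking pedestrian.",
}
```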
Key Achievements:
- High Recall: Semantic-Drive achieves 0.966 recall on safety-critical scenarios, versus 0.331 for OWL-v2 and 0.271 for Grounding DINO.
- Trustworthy Reasoning: The system reduces Risk Assessment Error (MAE) by 40% compared to single-model baselines via a multi-model consensus mechanism.
- Hardware Accessible: Designed to fit within a 24GB VRAM compute budget, enabling local execution on a single RTX 3090.
- Cost Efficiency: An estimated ~97% cost reduction compared to commercial cloud APIs such as GPT-4o.
```
Semantic-Drive/
├── assets/ # Architecture diagrams and paper figures
├── llama.cpp/ # Setup location for llama.cpp inference engine
├── models/ # GGUF Quantized Models & Vision Projectors
├── notebooks/ # Interactive experiments (Grounding checks, Dashboard)
├── nuscenes_data/ # Local dataset storage (blobs + metadata)
├── output/ # Generated semantic indexes (.jsonl) and execution logs
├── src/ # Core Neuro-Symbolic Framework
│ ├── data/
│ │ ├── loader.py # NuScenes API wrapper (Sparse/Dense sampling)
│ │ └── visuals.py # Multi-view image stitching utilities
│ ├── model/
│ │ ├── detector.py # YOLOE-11 Open-Vocabulary Segmentor wrapper
│ │ ├── prompts.py # System Prompts & WOD-E2E Schema definition
│ │ └── vlm_client.py # Robust API client (OpenAI/Gemini compatible)
│ ├── analytics.py # Cost/Latency/Token usage analysis
│ ├── benchmark.py # Precision/Recall ablation scripts
│ ├── config.py # Global configuration (Paths, Camera selection)
│ ├── main.py # Neuro-Symbolic Pipeline Orchestrator (The Scout)
│ ├── judge.py # Multi-Model Consensus Engine (The Judge)
│ └── reward.py # Inference-Time Symbolic Verification logic
├── download_nuscenes.sh # Automated downloader for NuScenes TrainVal
├── requirements.txt # Python dependencies
└── README.md # Project documentation
```
The system employs a "Judge-Scout" architecture that separates perception into four distinct stages to mitigate the hallucinations common in pure Vision-Language Models.
We utilize YOLOE-11 (Real-Time Open-Vocabulary Segmentation) to perform an initial visual sweep. It detects objects from the WOD-E2E Taxonomy (e.g., "construction barrel", "debris") at a deliberately low confidence threshold (0.15) to maximize recall. This "Object Inventory" is converted to text and injected into the VLM's context window.
Figure 3: Stage 1 - Symbolic Grounding with YOLOE-11. The object inventory is extracted and formatted for VLM consumption.
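The snippet below is a minimal sketch of this stage using the Ultralytics YOLOE API; the checkpoint name, prompt list, and inventory formatting are illustrative stand-ins (the project's own wrapper lives in src/model/detector.py).
```python
from ultralytics import YOLOE  # pip install ultralytics

# Open-vocabulary prompts drawn from the WOD-E2E taxonomy (illustrative subset).
names = ["construction barrel", "debris", "pedestrian", "traffic cone"]

model = YOLOE("yoloe-11l-seg.pt")                   # checkpoint name assumed
model.set_classes(names, model.get_text_pe(names))  # bind the text prompts

# A deliberately low confidence threshold (0.15) trades precision for recall.
results = model.predict("samples/CAM_FRONT/frame.jpg", conf=0.15)

# Convert detections into the textual "Object Inventory" injected into the VLM.
inventory = [
    f"{results[0].names[int(box.cls)]} (conf={float(box.conf):.2f})"
    for box in results[0].boxes
]
print("Detected objects: " + ", ".join(inventory))
```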
An ensemble of Reasoning VLMs (Qwen3-VL, Kimi-VL, Gemma-3) processes the images and the symbolic inventory. They execute a Chain-of-Thought (CoT) process to verify detections ("Skepticism Policy"), assess environmental conditions, and determine causal risks (e.g., "Is the pedestrian interacting with the scene?").
Figure 4: Stage 2 - Cognitive Analysis with Reasoning VLMs. The VLMs verify detections and assess scenario risks using Chain-of-Thought reasoning.
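Because llama.cpp server exposes an OpenAI-compatible endpoint, a single scout call can be sketched as below; the system prompt and user instruction are placeholders for the real definitions in src/model/prompts.py, and inventory_text is the grounding output from Stage 1.
```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")

SYSTEM_PROMPT = "You are a forensic driving-scene analyst..."  # placeholder
inventory_text = "Detected objects: construction barrel (conf=0.42), pedestrian (conf=0.31)"

with open("samples/CAM_FRONT/frame.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3-30b-local",  # llama.cpp serves whichever model it loaded
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            # "Skepticism Policy": the VLM must verify, not trust, each detection.
            {"type": "text", "text": inventory_text + "\nVerify each detection and assess causal risk."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]},
    ],
)
print(resp.choices[0].message.content)
```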
A separate local LLM (Mistral-Small-24B) aggregates the reports from multiple scouts. It performs an Inference-Time Search (Best-of-N), generating candidate scenarios and scoring them with a deterministic Symbolic Reward Model ($R(y)$). This filters out hallucinations that are not grounded in the YOLO inventory.
Figure 5: Stage 3 - Inference-Time Consensus with the Judge. The LLM aggregates multiple scout reports and selects the most consistent scenario.
To ensure logical consistency, the system generates a deterministic symbolic reward for every candidate report, cross-checking each claimed object against the YOLOE inventory (src/reward.py); candidates whose claims are ungrounded are penalized before entering the final index.
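A minimal sketch of this verification, assuming a reward that simply counts grounded versus ungrounded object claims (the actual weighting and schema live in src/reward.py):
```python
def symbolic_reward(candidate: dict, inventory: set[str]) -> float:
    """Deterministic R(y): reward claims present in the YOLOE inventory,
    penalize hallucinated ones. Weights here are illustrative."""
    claimed = set(candidate.get("objects", []))
    grounded = claimed & inventory
    hallucinated = claimed - inventory
    return len(grounded) - 2.0 * len(hallucinated)

def best_of_n(candidates: list[dict], inventory: set[str]) -> dict:
    """Inference-time search: keep the highest-reward candidate scenario."""
    return max(candidates, key=lambda c: symbolic_reward(c, inventory))
```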
Performance was evaluated on a verified Gold Set of NuScenes scenes:
| Method | Prec. (Stress) ↑ | Rec. (Stress) ↑ | Risk Error (MAE) ↓ | Latency |
|---|---|---|---|---|
| Metadata Search | 0.406 | 0.602 | 5.70 | 0.0s |
| Grounding DINO | 0.182 | 0.271 | 5.70 | 0.4s |
| OWL-v2 | 0.386 | 0.331 | 3.96 | 0.5s |
| Single Scout (Qwen3) | 0.714 | 0.932 | 1.13 | 31.5s |
| Semantic-Drive (Full) | 0.712 | 0.966 | 0.67 | ~60s |
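For readers reproducing these numbers, the metrics reduce to standard definitions; a minimal sketch, assuming each scene is scored as a (binary criticality, numeric risk) pair (the actual ablation scripts are in src/benchmark.py):
```python
def evaluate(preds, gold):
    """preds, gold: per-scene lists of (is_critical: bool, risk: float)."""
    tp = sum(p and g for (p, _), (g, _) in zip(preds, gold))
    fp = sum(p and not g for (p, _), (g, _) in zip(preds, gold))
    fn = sum(not p and g for (p, _), (g, _) in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    mae = sum(abs(pr - gr) for (_, pr), (_, gr) in zip(preds, gold)) / len(gold)
    return precision, recall, mae
```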
To facilitate reproducibility, we have released the data and an interactive explorer:
- Live Demo (Hugging Face Space): Semantic-Drive Explorer
- Dataset (Hugging Face): Semantic-Drive Results (N=2,550)
- OS: Linux (Ubuntu 22.04+ Recommended) or Windows WSL2.
- Hardware: NVIDIA GPU with 24GB VRAM (RTX 3090/Similar) for local inference.
- Software:
- Docker & NVIDIA Container Toolkit.
- Python 3.10+.
- Data: NuScenes v1.0-trainval (the full blob set fetched by download_nuscenes.sh is roughly 300GB; v1.0-mini (~4GB) works for a quick start).
```bash
# Clone repository
git clone https://github.com/AntonioAlgaida/Semantic-Drive.git
cd Semantic-Drive
# Install dependencies
pip install -r requirements.txt
```
Edit `src/config.py` to point to your local NuScenes dataset path:
```python
NUSCENES_DATAROOT = "/path/to/your/nuscenes"
```
For consumer hardware with 24GB VRAM, we use 4-bit quantized (Q4_K_M) models. This retains reasoning performance while fitting the model (~19GB) and image context within memory limits.
We use llama.cpp server for high-throughput local inference.
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
cd ..
```
Create a models directory:
```bash
mkdir -p models
```
Option A: Qwen3-VL-30B (Thinking)
```bash
# Main Model (4-bit, ~19GB)
wget -O models/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf?download=true"
# Vision Projector
wget -O models/mmproj-F16.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/mmproj-F16.gguf?download=true"
```
Option B: Kimi-VL-Thinking
```bash
wget -O models/Kimi-VL-A3B-Thinking-2506-Q4_K_M.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/Kimi-VL-A3B-Thinking-2506-Q4_K_M.gguf?download=true"
wget -O models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf?download=true"
```
The Judge (Mistral-Small-24B-Instruct)
```bash
wget -O models/Mistral-Small-24B-Instruct-Q4_K_M.gguf "https://huggingface.co/maziyarpanahi/Mistral-Small-24B-Instruct-2501-GGUF/resolve/main/Mistral-Small-24B-Instruct-2501.Q4_K_M.gguf?download=true"
```
- Download `v1.0-mini` or `v1.0-trainval` from NuScenes.org.
- Extract it to `./nuscenes_data`.
- Update `src/config.py` if your path differs (a quick devkit sanity check is sketched below).
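To confirm the path is wired up correctly, you can run a quick sanity check with the nuscenes-devkit (pip install nuscenes-devkit); the dataroot below assumes the default ./nuscenes_data location:
```python
from nuscenes.nuscenes import NuScenes

# Loads the metadata tables; fails loudly if the dataroot is wrong.
nusc = NuScenes(version="v1.0-trainval", dataroot="./nuscenes_data", verbose=True)
print(f"Loaded {len(nusc.scene)} scenes")

# Resolve the front-camera frame of the first scene as a smoke test.
sample = nusc.get("sample", nusc.scene[0]["first_sample_token"])
print("Front camera:", nusc.get("sample_data", sample["data"]["CAM_FRONT"])["filename"])
```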
Run this in a separate terminal. Note the reduced context size (8192) to save VRAM for the 3090.
```bash
# Example for Qwen3 (Adjust filenames for Kimi)
docker run --rm -it --gpus all \
-v $(pwd)/models:/models \
-p 1234:1234 \
local/llama.cpp:server-cuda \
-m /models/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf \
--mmproj /models/mmproj-F16.gguf \
--port 1234 --host 0.0.0.0 \
--ctx-size 8192 \
--n-gpu-layers 999
```
With the server up, run the Scout in another terminal:
```bash
# The script will detect the local server on port 1234
python -m src.main \
--model "qwen3-30b-local" \
--output_name "qwen3_local_run" \
--verbose
```
- Stop previous server:
```bash
docker stop $(docker ps -q)
```
- Start Judge Server (Text-Only):
```bash
docker run --rm -d --gpus all -v $(pwd)/models:/models -p 1234:1234 \
  local/llama.cpp:server-cuda \
  -m /models/Mistral-Small-24B-Instruct-Q4_K_M.gguf \
  --port 1234 --host 0.0.0.0 --ctx-size 32768 --n-gpu-layers 999
```
- Run Consensus:
```bash
python -m src.judge \
  --files output/logs_qwen_run.jsonl \
  --output output/consensus_final.jsonl \
  --n 3
```
Launch the local Streamlit app to manually verify the results and build your own Gold Set.
```bash
streamlit run src/tools/gold_curator_app.py
```
Access the dashboard at http://localhost:8501.
For large-scale mining, we recommend using Docker with llama.cpp server to bypass GUI overhead.
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build server with CUDA support
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
cd ..
```
Use the provided script to download and extract the NuScenes blobs (trainval):
```bash
chmod +x download_nuscenes.sh
# Download blobs 1 through 10 (Full Dataset)
./download_nuscenes.sh ./nuscenes_data 1 10
```
We use Q8_0 (high-precision) models for maximum reasoning fidelity. Ensure you create a models/ directory first.
Qwen3-VL-30B-A3B-Thinking (Q8_0):
```bash
# 1. Download Model & Projector
wget -O models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf?download=true"
wget -O models/mmproj-F16.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/mmproj-F16.gguf?download=true"
# 2. Start Server (Background)
# Note: Adjust --gpus device=1 based on available hardware
docker run --rm -it --gpus '"device=1"' \
-v $(pwd)/models:/models -p 1234:1234 \
local/llama.cpp:server-cuda \
-m /models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf \
--mmproj /models/mmproj-F16.gguf \
--port 1234 --host 0.0.0.0 \
--ctx-size 32768 --n-gpu-layers 999 --batch-size 2048
# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
--model "qwen3-30b-docker" --output_name "qwen3_run" \
--verbose --port 1234
```
Kimi-VL-Thinking (Q8_0):
```bash
# 1. Download
wget -O models/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf?download=true"
wget -O models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf?download=true"
# 2. Start Server
docker run --rm -it --gpus '"device=1"' \
-v $(pwd)/models:/models -p 1234:1234 \
local/llama.cpp:server-cuda \
-m /models/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf \
--mmproj /models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf \
--port 1234 --host 0.0.0.0 \
--ctx-size 32768 --n-gpu-layers 999 --batch-size 2048
# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
--model "kimi-thinking-q8" --output_name "kimi_run" \
--verbose --port 1234
```
Gemma-3-27B (Q8_0):
```bash
# 1. Download
wget -O models/gemma-3-27b-it-Q8_0.gguf "https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-Q8_0.gguf?download=true"
wget -O models/mmproj-gemma-3-27b-f16.gguf "https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true"
# 2. Start Server
docker run --rm -it --gpus '"device=1"' \
-v $(pwd)/models:/models -p 1234:1234 \
local/llama.cpp:server-cuda \
-m /models/gemma-3-27b-it-Q8_0.gguf \
--mmproj /models/mmproj-gemma-3-27b-f16.gguf \
--port 1234 --host 0.0.0.0 \
--ctx-size 32768 --n-gpu-layers 999 --batch-size 2048
# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
--model "gemma-3-27b-q8" --output_name "gemma_run" \
--verbose --port 1234
```
Once all scouts have finished, merge the results using the Multi-Model Judge (it supports both local LLMs and cloud models). For 3 scouts (e.g., Qwen3, Kimi, Gemma):
```bash
python -m src.judge --files output/index_qwen3_run.jsonl output/index_kimi_run.jsonl output/index_gemma_run.jsonl --n 3
```
For 2 scouts (e.g., Qwen3 and Kimi):
```bash
python -m src.judge --files output/index_kimi_run.jsonl output/index_qwen3_run.jsonl --n 3
```
The system is engineered to detect specific long-tail categories defined in the Waymo Open Dataset for End-to-End Driving (a query sketch follows the list):
- Construction: Lane diversions, orange drums, workers.
- VRU Interaction: Jaywalking, hesitation at crosswalks.
- Foreign Object Debris (FOD): Trash, rocks, lost cargo.
- Adverse Weather: Hydroplaning risks, glare, sensor occlusion.
- Special Vehicles: Emergency vehicles, school buses.
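As noted above, once the index is built these categories become simple filters over the .jsonl records. A hedged sketch (the "taxonomy" and "risk_score" field names are assumptions; match them to the schema in src/model/prompts.py):
```python
import json

# Pull high-risk construction scenes out of the consensus index.
with open("output/consensus_final.jsonl") as f:
    scenes = [json.loads(line) for line in f]

construction = [
    s for s in scenes
    if "Construction" in s.get("taxonomy", []) and s.get("risk_score", 0) >= 7
]
print(f"{len(construction)} high-risk construction scenes found")
```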
If you use Semantic-Drive in your research, please cite our work:
@article{guillen2026semantic,
title={Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus},
author={Guillen-Perez, Antonio},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2026},
url={https://openreview.net/forum?id=qN2oN36L3k},
note={Published in TMLR (04/2026)}
}
@misc{guillen2025semanticdrive,
author = {Guillen-Perez, Antonio},
title = {{Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus}},
year = {2025},
month = dec,
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/AntonioAlgaida/Semantic-Drive}}
}
This project is licensed under the MIT License - see the LICENSE file for details.