
Semantic-Drive

Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

This repository contains the official implementation for "Semantic-Drive", a local-first framework designed to mine safety-critical edge cases from raw autonomous vehicle video logs using "System 2" neuro-symbolic reasoning.


A project by Antonio Guillen-Perez | Portfolio | LinkedIn | Google Scholar


Figure 1: Semantic-Drive Framework Overview. The system combines real-time object detection (YOLOE-11) with Neuro-Symbolic reasoning VLMs (Qwen3-VL, Kimi-VL) to mine safety-critical edge cases from raw autonomous vehicle video logs without cloud costs.


Figure 2: The "Dark Data" Crisis. While 99% of logs represent nominal driving, the critical 1% lies in the "Long Tail" (e.g., erratic VRUs, sensor degradation). Semantic-Drive automates the mining of this region without cloud costs.

Abstract

The development of Autonomous Vehicles (AVs) is currently hampered by a scarcity of long-tail training data. Semantic-Drive is an offline DataOps engine designed to mine safety-critical scenarios, specifically rare events like erratic jaywalking or complex construction diversions, from unlabelled data lakes. It provides an accessible and privacy-preserving alternative to cloud-based auto-labelers by running entirely on consumer-grade hardware (NVIDIA RTX 3090) without transmitting data to external APIs.

The system employs a Neuro-Symbolic Architecture that separates perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis where an ensemble of reasoning Vision-Language Models (VLMs) performs forensic scene analysis. By enforcing a strict "Scenario DNA" schema aligned with the Waymo Open Dataset (WOD-E2E), Semantic-Drive transforms unstructured video into a queryable semantic database.
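For illustration, a single line of the resulting semantic index might look like the sketch below. The field names are hypothetical; the actual "Scenario DNA" schema is defined in src/model/prompts.py and aligned with WOD-E2E.

import json

# Hypothetical "Scenario DNA" record (illustrative field names only;
# the real schema lives in src/model/prompts.py).
scenario = {
    "scene_token": "scene-0103",   # placeholder NuScenes identifier
    "category": "VRU Interaction", # WOD-E2E taxonomy bucket
    "risk_score": 8,               # 1-10 risk assessment
    "grounded_objects": ["pedestrian", "crosswalk", "sedan"],
    "description": "Pedestrian steps off the curb mid-block toward the ego lane.",
}

# Each analyzed scene becomes one line of a queryable .jsonl index.
print(json.dumps(scenario))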

Key Achievements:

  • High Recall: Semantic-Drive achieves 0.966 recall on safety-critical scenarios (vs. 0.331 for OWL-v2 and 0.271 for Grounding DINO).
  • Trustworthy Reasoning: The system reduces Risk Assessment Error (MAE) by 40% compared to single-model baselines via a multi-model consensus mechanism.
  • Hardware Accessible: Designed to fit within a 24GB VRAM compute budget, enabling local execution on a single RTX 3090.
  • Cost Efficiency: An estimated ~97% cost reduction compared to commercial cloud APIs such as GPT-4o.

Folder Structure

Semantic-Drive/
├── assets/                     # Architecture diagrams and paper figures
├── llama.cpp/                  # Setup location for llama.cpp inference engine
├── models/                     # GGUF Quantized Models & Vision Projectors
├── notebooks/                  # Interactive experiments (Grounding checks, Dashboard)
├── nuscenes_data/              # Local dataset storage (blobs + metadata)
├── output/                     # Generated semantic indexes (.jsonl) and execution logs
├── src/                        # Core Neuro-Symbolic Framework
│   ├── data/                   
│   │   ├── loader.py           # NuScenes API wrapper (Sparse/Dense sampling)
│   │   └── visuals.py          # Multi-view image stitching utilities
│   ├── model/                  
│   │   ├── detector.py         # YOLOE-11 Open-Vocabulary Segmentor wrapper
│   │   ├── prompts.py          # System Prompts & WOD-E2E Schema definition
│   │   └── vlm_client.py       # Robust API client (OpenAI/Gemini compatible)
│   ├── analytics.py            # Cost/Latency/Token usage analysis
│   ├── benchmark.py            # Precision/Recall ablation scripts
│   ├── config.py               # Global configuration (Paths, Camera selection)
│   ├── main.py                 # Neuro-Symbolic Pipeline Orchestrator (The Scout)
│   ├── judge.py                # Multi-Model Consensus Engine (The Judge)
│   └── reward.py               # Inference-Time Symbolic Verification logic
├── download_nuscenes.sh        # Automated downloader for NuScenes TrainVal
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation

Methodology: The Neuro-Symbolic Pipeline

The system employs a "Judge-Scout" architecture that separates perception and reasoning into four distinct stages to mitigate the hallucinations common in pure Vision-Language Models.

Stage 1: Symbolic Grounding (The Eye)

We utilize YOLOE-11 (Real-Time Open-Vocabulary Segmentation) to perform an initial visual sweep. It detects objects from the WOD-E2E Taxonomy (e.g., "construction barrel", "debris") at a low confidence threshold (0.15) tuned for high recall. This "Object Inventory" is converted to text and injected into the VLM's context window.


Figure 3: Stage 1 - Symbolic Grounding with YOLOE-11. The object inventory is extracted and formatted for VLM consumption.
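A minimal sketch of Stage 1, assuming the ultralytics YOLOE API; the serialization logic below is illustrative, not the exact wrapper in src/model/detector.py:

from ultralytics import YOLOE

# Prompt the open-vocabulary detector with WOD-E2E classes (subset shown).
TAXONOMY = ["pedestrian", "construction barrel", "debris", "traffic cone"]

model = YOLOE("yoloe-11l-seg.pt")
model.set_classes(TAXONOMY, model.get_text_pe(TAXONOMY))

# High-recall sweep: low confidence threshold (0.15) to minimize misses.
results = model.predict("frame.jpg", conf=0.15)

# Serialize detections into a text "Object Inventory" for the VLM prompt.
lines = [
    f"- {results[0].names[int(box.cls)]} (conf={float(box.conf):.2f})"
    for box in results[0].boxes
]
print("DETECTED OBJECTS:\n" + "\n".join(lines))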

Stage 2: Cognitive Analysis (The Brain)

An ensemble of Reasoning VLMs (Qwen3-VL, Kimi-VL, Gemma-3) processes the images and the symbolic inventory. They execute a Chain-of-Thought (CoT) process to verify detections ("Skepticism Policy"), assess environmental conditions, and determine causal risks (e.g., "Is the pedestrian interacting with the scene?").


Figure 4: Stage 2 - Cognitive Analysis with Reasoning VLMs. The VLMs verify detections and assess scenario risks using Chain-of-Thought reasoning.
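A minimal sketch of a single scout query, assuming the OpenAI-compatible endpoint exposed by the llama.cpp server started in the setup sections below; the prompt text is illustrative, and the repo's actual client is src/model/vlm_client.py:

import base64
from openai import OpenAI

# The llama.cpp server (see setup below) speaks the OpenAI API on port 1234.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Symbolic inventory from Stage 1, injected as text alongside the image.
inventory = "DETECTED OBJECTS:\n- pedestrian (conf=0.81)\n- debris (conf=0.22)"

response = client.chat.completions.create(
    model="local",  # llama.cpp serves whichever GGUF it was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Verify each detection below before trusting it "
                "(Skepticism Policy), then assess the scenario risk.\n"
                + inventory)},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)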

Stage 3: Inference-Time Consensus (The Judge)

A separate Local LLM (Mistral-Small-24B) aggregates the reports from the multiple scouts. It performs an Inference-Time Search (Best-of-N), generating candidate scenarios and scoring them with a deterministic Symbolic Reward Model ($R(y)$). This filters out hallucinations that are not grounded in the YOLOE inventory.


Figure 5: Stage 3 - Inference-Time Consensus with the Judge. The LLM aggregates multiple scout reports and selects the most consistent scenario.

Stage 4: Symbolic Verification

To ensure logical consistency, the system generates $N$ candidate scenarios and selects the optimal one using the deterministic Symbolic Reward Model ($R(y)$), which penalizes any object or event mention that is not grounded in the detector inventory.
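A minimal, self-contained sketch of the Best-of-N selection described in Stages 3 and 4. The reward here is a toy stand-in for the repo's $R(y)$ (src/reward.py), and draft_scenario stands in for a Judge LLM call (src/judge.py):

def symbolic_reward(candidate: dict, inventory: set[str]) -> float:
    # Toy R(y): reward grounded object mentions, penalize ungrounded ones.
    grounded = sum(obj in inventory for obj in candidate["objects"])
    ungrounded = len(candidate["objects"]) - grounded
    return grounded - 2.0 * ungrounded

def select_consensus(draft_scenario, inventory: set[str], n: int = 3) -> dict:
    # Best-of-N: sample N candidates from the Judge, keep the argmax of R(y).
    candidates = [draft_scenario() for _ in range(n)]
    return max(candidates, key=lambda y: symbolic_reward(y, inventory))

# Toy usage with hard-coded drafts in place of LLM calls:
inventory = {"pedestrian", "traffic cone"}
drafts = iter([
    {"objects": ["pedestrian", "fire truck"]},    # "fire truck" ungrounded
    {"objects": ["pedestrian", "traffic cone"]},  # fully grounded, wins
    {"objects": ["pedestrian"]},
])
print(select_consensus(lambda: next(drafts), inventory))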

Key Quantitative Results

Performance was evaluated on a verified Gold Set ($N=108$) targeting rare edge cases and an Unbiased Blind Split ($N=107$) to measure average-case false positives.

Method                 Prec. (Stress)  Rec. (Stress) ↑  Risk Error (MAE) ↓  Latency
Metadata Search        0.406           0.602            5.70                0.0s
Grounding DINO         0.182           0.271            5.70                0.4s
OWL-v2                 0.386           0.331            3.96                0.5s
Single Scout (Qwen3)   0.714           0.932            1.13                31.5s
Semantic-Drive (Full)  0.712           0.966            0.67                ~60s

Interactive Demo & Dataset

To facilitate reproducibility, we have released the dataset and an interactive explorer.

Setup and Reproducibility

1. Prerequisites

2. Installation

# Clone repository
git clone https://github.com/AntonioAlgaida/Semantic-Drive.git
cd Semantic-Drive

# Install dependencies
pip install -r requirements.txt

3. Configuration

Edit src/config.py to point to your local NuScenes dataset path:

NUSCENES_DATAROOT = "/path/to/your/nuscenes"

Local Setup (RTX 3090 / 4090 / Similar)

For consumer hardware with 24GB VRAM, we use 4-bit Quantized (Q4_K_M) models. This retains reasoning performance while fitting the model (~19GB) and image context within memory limits.

1. Build the Inference Engine

We use llama.cpp server for high-throughput local inference.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
cd ..

2. Download Optimized Models (Q4_K_M)

Create a models directory: mkdir -p models

Option A: Qwen3-VL-30B (Thinking)

# Main Model (4-bit, ~19GB)
wget -O models/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf?download=true"
# Vision Projector
wget -O models/mmproj-F16.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/mmproj-F16.gguf?download=true"

Option B: Kimi-VL-Thinking

wget -O models/Kimi-VL-A3B-Thinking-2506-Q4_K_M.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/Kimi-VL-A3B-Thinking-2506-Q4_K_M.gguf?download=true"
wget -O models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf?download=true"

The Judge (Mistral-Small-24B-Instruct)

wget -O models/Mistral-Small-24B-Instruct-Q4_K_M.gguf "https://huggingface.co/maziyarpanahi/Mistral-Small-24B-Instruct-2501-GGUF/resolve/main/Mistral-Small-24B-Instruct-2501.Q4_K_M.gguf?download=true"

3. Prepare NuScenes Data

  1. Download v1.0-mini or v1.0-trainval from NuScenes.org.
  2. Extract it to ./nuscenes_data.
  3. Update src/config.py if your path differs.

4. Run Inference Server (Docker)

Run this in a separate terminal. Note the reduced context size (8192), which saves VRAM on the RTX 3090.

# Example for Qwen3 (Adjust filenames for Kimi)
docker run --rm -it --gpus all \
    -v $(pwd)/models:/models \
    -p 1234:1234 \
    local/llama.cpp:server-cuda \
    -m /models/Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf \
    --mmproj /models/mmproj-F16.gguf \
    --port 1234 --host 0.0.0.0 \
    --ctx-size 8192 \
    --n-gpu-layers 999

5. Run the Scouts (Mining)

# The script will detect the local server on port 1234
python -m src.main \
    --model "qwen3-30b-local" \
    --output_name "qwen3_local_run" \
    --verbose

6. Run the Judge (Consensus)

  1. Stop previous server: docker stop $(docker ps -q)
  2. Start Judge Server (Text-Only):
    docker run --rm -d --gpus all -v $(pwd)/models:/models -p 1234:1234 \
        local/llama.cpp:server-cuda \
        -m /models/Mistral-Small-24B-Instruct-Q4_K_M.gguf \
        --port 1234 --host 0.0.0.0 --ctx-size 32768 --n-gpu-layers 999
  3. Run Consensus:
    python -m src.judge \
    --files output/index_qwen3_local_run.jsonl \
        --output output/consensus_final.jsonl \
        --n 3

7. Human Verification (Optional)

Launch the local Streamlit app to manually verify the results and build your own Gold Set.

streamlit run src/tools/gold_curator_app.py

Access the dashboard at http://localhost:8501.


Enterprise / Server Setup

For large-scale mining, we recommend the Dockerized llama.cpp server, which avoids the overhead of GUI-based inference apps.

1. Build Inference Engine (Docker)

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build server with CUDA support
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
cd ..

2. Download Dataset

Use the provided script to download and extract the NuScenes blobs (trainval).

chmod +x download_nuscenes.sh
# Download blobs 1 through 10 (Full Dataset)
./download_nuscenes.sh ./nuscenes_data 1 10

3. Run Scouts (Mining Agents)

We use Q8_0 (High Precision) models for maximum reasoning fidelity. Ensure you create a models/ directory first.

Option A: Qwen3-VL-30B (Thinking)

# 1. Download Model & Projector
wget -O models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf?download=true"
wget -O models/mmproj-F16.gguf "https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF/resolve/main/mmproj-F16.gguf?download=true"

# 2. Start Server (Background)
# Note: Adjust --gpus device=1 based on available hardware
docker run --rm -it --gpus '"device=1"' \
    -v $(pwd)/models:/models -p 1234:1234 \
    local/llama.cpp:server-cuda \
    -m /models/Qwen3-VL-30B-A3B-Thinking-Q8_0.gguf \
    --mmproj /models/mmproj-F16.gguf \
    --port 1234 --host 0.0.0.0 \
    --ctx-size 32768 --n-gpu-layers 999 --batch-size 2048

# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
    --model "qwen3-30b-docker" --output_name "qwen3_run" \
    --verbose --port 1234

Option B: Kimi-VL-Thinking

# 1. Download
wget -O models/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf?download=true"
wget -O models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf "https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF/resolve/main/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf?download=true"

# 2. Start Server
docker run --rm -it --gpus '"device=1"' \
    -v $(pwd)/models:/models -p 1234:1234 \
    local/llama.cpp:server-cuda \
    -m /models/Kimi-VL-A3B-Thinking-2506-Q8_0.gguf \
    --mmproj /models/mmproj-Kimi-VL-A3B-Thinking-2506-f16.gguf \
    --port 1234 --host 0.0.0.0 \
    --ctx-size 32768 --n-gpu-layers 999 --batch-size 2048

# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
    --model "kimi-thinking-q8" --output_name "kimi_run" \
    --verbose --port 1234

Option C: Gemma-3-27B

# 1. Download
wget -O models/gemma-3-27b-it-Q8_0.gguf "https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-Q8_0.gguf?download=true"
wget -O models/mmproj-gemma-3-27b-f16.gguf "https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true"

# 2. Start Server
docker run --rm -it --gpus '"device=1"' \
    -v $(pwd)/models:/models -p 1234:1234 \
    local/llama.cpp:server-cuda \
    -m /models/gemma-3-27b-it-Q8_0.gguf \
    --mmproj /models/mmproj-gemma-3-27b-f16.gguf \
    --port 1234 --host 0.0.0.0 \
    --ctx-size 32768 --n-gpu-layers 999 --batch-size 2048

# 3. Run Miner
CUDA_VISIBLE_DEVICES=1 python -m src.main \
    --model "gemma-3-27b-q8" --output_name "gemma_run" \
    --verbose --port 1234

4. Run Consensus (The Judge)

Once all scouts have finished, merge the results using the Multi-Model Judge (which supports both local LLMs and cloud models). For 3 scouts (e.g., Qwen3, Kimi, Gemma):

python -m src.judge --files output/index_qwen3_run.jsonl output/index_kimi_run.jsonl output/index_gemma_run.jsonl --n 3

For 2 scouts (e.g., Qwen3 and Kimi):

python -m src.judge --files output/index_kimi_run.jsonl output/index_qwen3_run.jsonl --n 3

Semantic Taxonomy (WOD-E2E)

The system is engineered to detect specific long-tail categories defined in the Waymo Open Dataset for End-to-End Driving; a sketch of how these categories map to detector prompts follows the list:

  • Construction: Lane diversions, orange drums, workers.
  • VRU Interaction: Jaywalking, hesitation at crosswalks.
  • Foreign Object Debris (FOD): Trash, rocks, lost cargo.
  • Adverse Weather: Hydroplaning risks, glare, sensor occlusion.
  • Special Vehicles: Emergency vehicles, school buses.
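As a sketch, these categories can map to the open-vocabulary prompts handed to YOLOE in Stage 1. The strings below are illustrative; the real prompt lists live in src/model/prompts.py:

# Hypothetical category-to-prompt mapping (illustrative strings only).
WOD_E2E_PROMPTS = {
    "Construction": ["construction barrel", "orange drum", "construction worker"],
    "VRU Interaction": ["pedestrian", "cyclist", "crosswalk"],
    "Foreign Object Debris": ["debris", "rock on road", "lost cargo"],
    "Adverse Weather": ["puddle", "fog bank", "lens glare"],
    "Special Vehicles": ["ambulance", "fire truck", "school bus"],
}

# Flatten into the class list given to the detector in Stage 1.
DETECTOR_CLASSES = sorted({p for prompts in WOD_E2E_PROMPTS.values() for p in prompts})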

Citation

If you use Semantic-Drive in your research, please cite our work:

@article{guillen2026semantic,
  title={Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus},
  author={Guillen-Perez, Antonio},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=qN2oN36L3k},
  note={Published in TMLR (04/2026)}
}

@misc{guillen2025semanticdrive,
  author = {Guillen-Perez, Antonio},
  title = {{Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus}},
  year = {2025},
  month = dec,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AntonioAlgaida/Semantic-Drive}}
}

@article{Guillen-Perez2025Dec,
  author = {Guillen-Perez, Antonio},
  title = {{Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus}},
  journal = {On Review},
  year = {2026},
  month = jan
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
