
VANTA Research Reasoning Evaluation (VRRE)


A semantic understanding framework for measuring Large Language Model reasoning capabilities beyond format compliance.

Developed by VANTA Research - Aligned AI.

Quick Start

git clone https://github.com/vanta-research/vrre.git
cd vrre
pip install -r requirements.txt
python vrre_eval.py --models your-model-name

Why VRRE?

Standard LLM benchmarks often fail to capture true reasoning ability. VRRE solves this with:

  • Semantic Understanding: Parses meaning, not just format
  • Partial Credit: Rewards reasoning process even with wrong answers
  • Multi-Domain: Boolean logic, math, reading comprehension, formal reasoning
  • Confidence Scoring: Reliability metrics for extracted answers

Real Impact: 2.5x Improvement Detection

VRRE detected a 2.5x reasoning improvement (22% → 56% accuracy) between Apollo model variants where standard benchmarks showed identical scores.

Example Results

VANTA RESEARCH REASONING EVALUATION SUMMARY
============================================================

Overall Performance (apollo-reasoning-enhanced):
   Accuracy: 55.6% (5/9)
   Average Score: 0.656

By Task Type:
   Boolean: 50.0% (3/6) | Avg Score: 0.600
   Mathematical: 100.0% (2/2) | Avg Score: 1.000
   Logical: 0.0% (0/1) | Avg Score: 0.300

Core Innovation

Intelligent Response Parsing

# Instead of requiring exact "Yes" responses, VRRE understands:
"No, that's a logical fallacy called affirming the consequent"
"We cannot conclude that because..."
"This reasoning is invalid due to..."
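A minimal sketch of how such semantic matching can work (illustrative only, not VRRE's actual extractor): map each concept to several linguistic indicators and let the number of matching indicators drive a confidence score.

```python
import re

# Hypothetical indicator lists; the real framework's patterns are richer.
NO_PATTERNS = [r"\bno\b", r"cannot conclude", r"logical fallacy",
               r"\binvalid\b", r"does not follow"]
YES_PATTERNS = [r"\byes\b", r"we can conclude", r"\bvalid\b",
                r"follows logically"]

def extract_answer(response):
    """Return (answer, confidence); confidence grows with each distinct
    matching indicator and is capped at 1.0."""
    text = response.lower()
    no_hits = sum(bool(re.search(p, text)) for p in NO_PATTERNS)
    yes_hits = sum(bool(re.search(p, text)) for p in YES_PATTERNS)
    if no_hits == yes_hits:
        # No signal either way: report an unclear answer with zero confidence
        return "unclear", 0.0
    answer = "no" if no_hits > yes_hits else "yes"
    return answer, min(1.0, 0.5 * max(no_hits, yes_hits))
```

On the examples above, `extract_answer` maps "No, that's a logical fallacy..." to `("no", 1.0)` (two indicators matched) rather than failing on the absence of a bare "Yes"/"No" token.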

The Roses/Flowers Test

Question: "All roses are flowers. Some flowers are red. Can we conclude that some roses are red?"

  • Standard Model: "yes" (incorrect - commits logical fallacy)
  • Enhanced Model: "unclear" (correct - avoids fallacy)
  • Standard Benchmarks: Both score 22% (identical)
  • VRRE: 22% vs 56% (reveals true difference)

Installation & Usage

Prerequisites

  • Python 3.8+
  • Ollama running locally
  • Any Ollama-compatible model

Quick Install

pip install -r requirements.txt

Basic Usage

from vrre_eval import VRREvaluator

evaluator = VRREvaluator("your-model-name")
results = evaluator.run_evaluation()
evaluator.print_summary(results)

Command Line

# Single model evaluation
python vrre_eval.py --models model-name

# Comparative evaluation
python vrre_eval.py --models model-a model-b --compare

# Quiet mode for automation
python vrre_eval.py --models model-name --quiet

Task Categories

Boolean Logic

  • Syllogisms and logical fallacies
  • Deductive and inductive reasoning
  • Premise evaluation

Mathematical Reasoning

  • Arithmetic and geometry
  • Word problems
  • Multi-step calculations

Reading Comprehension

  • Passage-based inference
  • Yes/no questions with context
  • Information extraction

Formal Logic

  • Validity assessment
  • Disjunctive syllogisms
  • Logical operator handling

Research Impact

Academic Validation

  • Technical Paper: Comprehensive methodology documentation
  • Empirical Results: Validated on Apollo model variants
  • Open Source: Full framework available for research community

Key Findings

  1. Format vs. Meaning: Semantic understanding reveals hidden capabilities
  2. Process vs. Product: Reasoning quality matters beyond final answers
  3. Comprehensive Assessment: Multi-domain evaluation provides complete picture

Extending VRRE

Custom Tasks

from vrre_eval import ReasoningTask  # assumed to live alongside VRREvaluator

custom_task = ReasoningTask(
    question="Your reasoning question",
    correct_answer="expected_answer",
    explanation="Why this answer is correct",
    task_type="boolean",
    difficulty="medium"
)

New Model Integration

class YourModelEvaluator(VRREvaluator):
    def query_model(self, prompt):
        # Implement your model's API
        return response
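A sketch of such a subclass targeting Ollama's REST API (the `/api/generate` endpoint and payload fields follow Ollama's public API; the class is shown standalone here so the sketch is self-contained, where the real version would subclass `VRREvaluator`):

```python
import json
import urllib.request

class OllamaEvaluator:  # would be `class OllamaEvaluator(VRREvaluator)`
    def __init__(self, model, host="http://localhost:11434"):
        self.model = model
        self.host = host

    def build_payload(self, prompt):
        # stream=False so the full completion arrives as one JSON object
        return {"model": self.model, "prompt": prompt, "stream": False}

    def query_model(self, prompt):
        req = urllib.request.Request(
            f"{self.host}/api/generate",
            data=json.dumps(self.build_payload(prompt)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # Ollama returns the completion text under the "response" key
            return json.load(resp)["response"]
```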

Comparison with Standard Benchmarks

| Framework             | BoolQ      | PIQA       | ARC        | VRRE Detection   |
|-----------------------|------------|------------|------------|------------------|
| lm-evaluation-harness | 22% vs 22% | 56% vs 56% | 18% vs 18% | No difference    |
| VRRE                  | -          | -          | -          | 2.5x improvement |

VRRE's semantic approach reveals reasoning capabilities that probability-based scoring misses.

Technical Innovation

Semantic Answer Extraction

  • Pattern Recognition: Multiple linguistic indicators per concept
  • Confidence Scoring: Reliability metrics for each extraction
  • Context Awareness: Understanding reasoning vs. guessing

Partial Credit Algorithm

  • Process Recognition: Credit for logical thinking steps
  • Reasoning Indicators: Detection of explanation attempts
  • Graduated Scoring: Nuanced assessment beyond binary correct/incorrect
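These three ideas can be combined in a simple graduated scorer (a sketch with illustrative weights, not VRRE's actual algorithm): full credit for a correct extracted answer, plus bounded partial credit for each reasoning indicator the response contains.

```python
# Hypothetical reasoning indicators; real detection would be more nuanced.
REASONING_MARKERS = ["because", "therefore", "premise", "follows",
                     "fallacy", "implies"]

def score_response(extracted, expected, response):
    """Graduated score in [0, 1]: correctness plus capped process credit."""
    base = 1.0 if extracted == expected else 0.0
    # Credit the reasoning process: 0.1 per marker, at most 0.3 total
    markers = sum(m in response.lower() for m in REASONING_MARKERS)
    process_credit = min(0.3, 0.1 * markers)
    return min(1.0, base + process_credit)
```

Under this scheme a wrong answer accompanied by genuine reasoning steps still earns a nonzero score, while a bare correct token and a well-explained correct answer both cap at 1.0.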

Research Paper

See TECHNICAL_PAPER.md for comprehensive methodology, validation results, and research implications.

Citation

@software{vrre2025,
  title={VANTA Research Reasoning Evaluation (VRRE): A Semantic Understanding Framework for LLM Reasoning Assessment},
  author={VANTA Research},
  year={2025},
  url={https://github.com/vanta-research/vrre},
  version={1.0}
}

Contributing

We welcome contributions! Areas of interest:

  • New Task Types: Temporal reasoning, moral reasoning, causal inference
  • Model Integrations: OpenAI, Anthropic, HuggingFace transformers
  • Extraction Improvements: Better semantic understanding patterns
  • Validation Studies: Testing on additional model types

Roadmap

  • Multi-language support
  • Real-time evaluation APIs
  • Automated task generation
  • Integration with popular ML frameworks
  • Confidence calibration improvements

Contact

License

Apache 2.0 License - see LICENSE for details.


VRRE: Aligned AI

"Measuring what models think, not just how they format answers"
