
VANTA Research Reasoning Evaluation (VRRE)


A semantic understanding framework for measuring Large Language Model reasoning capabilities beyond format compliance.

Developed by VANTA Research - Aligned AI.

Quick Start

git clone https://github.com/vanta-research/vrre.git
cd vrre
pip install -r requirements.txt
python vrre_eval.py --models your-model-name

Why VRRE?

Standard LLM benchmarks often fail to capture true reasoning ability. VRRE solves this with:

  • Semantic Understanding: Parses meaning, not just format
  • Partial Credit: Rewards reasoning process even with wrong answers
  • Multi-Domain: Boolean logic, math, reading comprehension, formal reasoning
  • Confidence Scoring: Reliability metrics for extracted answers

Real Impact: 2.5x Improvement Detection

VRRE detected a 2.5x reasoning improvement (22% → 56% accuracy) between Apollo model variants where standard benchmarks showed identical scores.

Example Results

VANTA RESEARCH REASONING EVALUATION SUMMARY
============================================================

Overall Performance (apollo-reasoning-enhanced):
   Accuracy: 55.6% (5/9)
   Average Score: 0.656

By Task Type:
   Boolean: 50.0% (3/6) | Avg Score: 0.600
   Mathematical: 100.0% (2/2) | Avg Score: 1.000
   Logical: 0.0% (0/1) | Avg Score: 0.300

Core Innovation

Intelligent Response Parsing

# Instead of requiring exact "Yes" responses, VRRE understands:
"No, that's a logical fallacy called affirming the consequent"
"We cannot conclude that because..."
"This reasoning is invalid due to..."
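A minimal sketch of how such semantic matching can work (illustrative only, not VRRE's actual extractor): map each concept to several linguistic indicators and let the number of matching indicators drive a confidence score.

```python
import re

# Hypothetical indicator lists; the real framework's patterns are richer.
NO_PATTERNS = [r"\bno\b", r"cannot conclude", r"logical fallacy",
               r"\binvalid\b", r"does not follow"]
YES_PATTERNS = [r"\byes\b", r"we can conclude", r"\bvalid\b",
                r"follows logically"]

def extract_answer(response):
    """Return (answer, confidence); confidence grows with each distinct
    matching indicator and is capped at 1.0."""
    text = response.lower()
    no_hits = sum(bool(re.search(p, text)) for p in NO_PATTERNS)
    yes_hits = sum(bool(re.search(p, text)) for p in YES_PATTERNS)
    if no_hits == yes_hits:
        # No signal either way: report an unclear answer with zero confidence
        return "unclear", 0.0
    answer = "no" if no_hits > yes_hits else "yes"
    return answer, min(1.0, 0.5 * max(no_hits, yes_hits))
```

On the examples above, `extract_answer` maps "No, that's a logical fallacy..." to `("no", 1.0)` (two indicators matched) rather than failing on the absence of a bare "Yes"/"No" token.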

The Roses/Flowers Test

Question: "All roses are flowers. Some flowers are red. Can we conclude that some roses are red?"

  • Standard Model: "yes" (incorrect - commits logical fallacy)
  • Enhanced Model: "unclear" (correct - avoids fallacy)
  • Standard Benchmarks: Both score 22% (identical)
  • VRRE: 22% vs 56% (reveals true difference)

Installation & Usage

Prerequisites

  • Python 3.8+
  • Ollama running locally
  • Any Ollama-compatible model

Quick Install

pip install -r requirements.txt

Basic Usage

from vrre_eval import VRREvaluator

evaluator = VRREvaluator("your-model-name")
results = evaluator.run_evaluation()
evaluator.print_summary(results)

Command Line

# Single model evaluation
python vrre_eval.py --models model-name

# Comparative evaluation
python vrre_eval.py --models model-a model-b --compare

# Quiet mode for automation
python vrre_eval.py --models model-name --quiet

Task Categories

Boolean Logic

  • Syllogisms and logical fallacies
  • Deductive and inductive reasoning
  • Premise evaluation

Mathematical Reasoning

  • Arithmetic and geometry
  • Word problems
  • Multi-step calculations

Reading Comprehension

  • Passage-based inference
  • Yes/no questions with context
  • Information extraction

Formal Logic

  • Validity assessment
  • Disjunctive syllogisms
  • Logical operator handling

Research Impact

Academic Validation

  • Technical Paper: Comprehensive methodology documentation
  • Empirical Results: Validated on Apollo model variants
  • Open Source: Full framework available for research community

Key Findings

  1. Format vs. Meaning: Semantic understanding reveals hidden capabilities
  2. Process vs. Product: Reasoning quality matters beyond final answers
  3. Comprehensive Assessment: Multi-domain evaluation provides complete picture

Extending VRRE

Custom Tasks

from vrre_eval import ReasoningTask  # assumed to live alongside VRREvaluator

custom_task = ReasoningTask(
    question="Your reasoning question",
    correct_answer="expected_answer",
    explanation="Why this answer is correct",
    task_type="boolean",
    difficulty="medium"
)

New Model Integration

class YourModelEvaluator(VRREvaluator):
    def query_model(self, prompt):
        # Implement your model's API
        return response
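A sketch of such a subclass targeting Ollama's REST API (the `/api/generate` endpoint and payload fields follow Ollama's public API; the class is shown standalone here so the sketch is self-contained, where the real version would subclass `VRREvaluator`):

```python
import json
import urllib.request

class OllamaEvaluator:  # would be `class OllamaEvaluator(VRREvaluator)`
    def __init__(self, model, host="http://localhost:11434"):
        self.model = model
        self.host = host

    def build_payload(self, prompt):
        # stream=False so the full completion arrives as one JSON object
        return {"model": self.model, "prompt": prompt, "stream": False}

    def query_model(self, prompt):
        req = urllib.request.Request(
            f"{self.host}/api/generate",
            data=json.dumps(self.build_payload(prompt)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # Ollama returns the completion text under the "response" key
            return json.load(resp)["response"]
```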

Comparison with Standard Benchmarks

| Framework             | BoolQ      | PIQA       | ARC        | VRRE Detection   |
|-----------------------|------------|------------|------------|------------------|
| lm-evaluation-harness | 22% vs 22% | 56% vs 56% | 18% vs 18% | No difference    |
| VRRE                  | -          | -          | -          | 2.5x improvement |

VRRE's semantic approach reveals reasoning capabilities that probability-based scoring misses.

Technical Innovation

Semantic Answer Extraction

  • Pattern Recognition: Multiple linguistic indicators per concept
  • Confidence Scoring: Reliability metrics for each extraction
  • Context Awareness: Understanding reasoning vs. guessing

Partial Credit Algorithm

  • Process Recognition: Credit for logical thinking steps
  • Reasoning Indicators: Detection of explanation attempts
  • Graduated Scoring: Nuanced assessment beyond binary correct/incorrect
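These three ideas can be combined in a simple graduated scorer (a sketch with illustrative weights, not VRRE's actual algorithm): full credit for a correct extracted answer, plus bounded partial credit for each reasoning indicator the response contains.

```python
# Hypothetical reasoning indicators; real detection would be more nuanced.
REASONING_MARKERS = ["because", "therefore", "premise", "follows",
                     "fallacy", "implies"]

def score_response(extracted, expected, response):
    """Graduated score in [0, 1]: correctness plus capped process credit."""
    base = 1.0 if extracted == expected else 0.0
    # Credit the reasoning process: 0.1 per marker, at most 0.3 total
    markers = sum(m in response.lower() for m in REASONING_MARKERS)
    process_credit = min(0.3, 0.1 * markers)
    return min(1.0, base + process_credit)
```

Under this scheme a wrong answer accompanied by genuine reasoning steps still earns a nonzero score, while a bare correct token and a well-explained correct answer both cap at 1.0.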

Research Paper

See TECHNICAL_PAPER.md for comprehensive methodology, validation results, and research implications.

Citation

@software{vrre2025,
  title={VANTA Research Reasoning Evaluation (VRRE): A Semantic Understanding Framework for LLM Reasoning Assessment},
  author={VANTA Research},
  year={2025},
  url={https://github.com/vanta-research/vrre},
  version={1.0}
}

Contributing

We welcome contributions! Areas of interest:

  • New Task Types: Temporal reasoning, moral reasoning, causal inference
  • Model Integrations: OpenAI, Anthropic, HuggingFace transformers
  • Extraction Improvements: Better semantic understanding patterns
  • Validation Studies: Testing on additional model types

Roadmap

  • Multi-language support
  • Real-time evaluation APIs
  • Automated task generation
  • Integration with popular ML frameworks
  • Confidence calibration improvements

Contact

License

Apache 2.0 License - see LICENSE for details.


VRRE: Aligned AI

"Measuring what models think, not just how they format answers"
