A semantic understanding framework for measuring Large Language Model reasoning capabilities beyond format compliance.
Developed by VANTA Research - Aligned AI.
```
git clone https://github.com/vanta-research/vrre.git
cd vrre
pip install -r requirements.txt
python vrre_eval.py --models your-model-name
```

Standard LLM benchmarks often fail to capture true reasoning ability. VRRE addresses this with:
- Semantic Understanding: Parses meaning, not just format
- Partial Credit: Rewards reasoning process even with wrong answers
- Multi-Domain: Boolean logic, math, reading comprehension, formal reasoning
- Confidence Scoring: Reliability metrics for extracted answers
VRRE detected a 2.5x reasoning improvement (22% → 56% accuracy) between Apollo model variants where standard benchmarks showed identical scores.
```
VANTA RESEARCH REASONING EVALUATION SUMMARY
============================================================
Overall Performance (apollo-reasoning-enhanced):
  Accuracy: 55.6% (5/9)
  Average Score: 0.656

By Task Type:
  Boolean: 50.0% (3/6) | Avg Score: 0.600
  Mathematical: 100.0% (2/2) | Avg Score: 1.000
  Logical: 0.0% (0/1) | Avg Score: 0.300
```
```
# Instead of requiring exact "Yes" responses, VRRE understands:
"No, that's a logical fallacy called affirming the consequent"
"We cannot conclude that because..."
"This reasoning is invalid due to..."
```

Question: "All roses are flowers. Some flowers are red. Can we conclude that some roses are red?"
- Standard Model: "yes" (incorrect — the conclusion does not follow from the premises)
- Enhanced Model: "unclear" (correct — avoids the fallacy)
- Standard Benchmarks: Both score 22% (identical)
- VRRE: 22% vs 56% (reveals true difference)
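The semantic extraction behind this comparison can be sketched as a pattern-matching normalizer. This is a minimal illustration, not the actual VRRE implementation; the phrase lists and confidence formula are assumptions:

```python
import re

# Illustrative phrase patterns mapped to canonical answers
# (assumed lists, not VRRE's actual patterns).
PATTERNS = {
    "no": [r"\bno\b", r"logical fallacy", r"reasoning is invalid"],
    "unclear": [r"cannot conclude", r"not necessarily", r"insufficient"],
    "yes": [r"\byes\b", r"we can conclude"],
}

def extract_answer(response):
    """Map a free-form response to a canonical answer plus a rough confidence."""
    text = response.lower()
    for answer, patterns in PATTERNS.items():
        hits = sum(1 for p in patterns if re.search(p, text))
        if hits:
            # More matching indicators -> higher confidence, capped at 1.0
            return answer, min(1.0, 0.5 + 0.25 * hits)
    return "unknown", 0.0

print(extract_answer("No, that's a logical fallacy called affirming the consequent"))
# → ('no', 1.0)
```

In this sketch, "We cannot conclude that because..." normalizes to `unclear` rather than being scored as a formatting failure, which is the behavior the comparison above depends on.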
- Python 3.8+
- Ollama running locally
- Any Ollama-compatible model
```
pip install -r requirements.txt
```

```python
from vrre_eval import VRREvaluator

evaluator = VRREvaluator("your-model-name")
results = evaluator.run_evaluation()
evaluator.print_summary(results)
```

```
# Single model evaluation
python vrre_eval.py --models model-name

# Comparative evaluation
python vrre_eval.py --models model-a model-b --compare

# Quiet mode for automation
python vrre_eval.py --models model-name --quiet
```

- Syllogisms and logical fallacies
- Deductive and inductive reasoning
- Premise evaluation
- Arithmetic and geometry
- Word problems
- Multi-step calculations
- Passage-based inference
- Yes/no questions with context
- Information extraction
- Validity assessment
- Disjunctive syllogisms
- Logical operator handling
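A formal-reasoning task of this kind could be encoded in the same shape as the `ReasoningTask` objects shown for custom tasks below. This is a sketch: the dataclass here is a stand-in for the framework's own class, and the field values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTask:
    # Stand-in for the framework's ReasoningTask; field names follow
    # the custom-task example elsewhere in this README.
    question: str
    correct_answer: str
    explanation: str
    task_type: str
    difficulty: str

# Example disjunctive syllogism: "either P or Q; not Q; therefore P"
disjunctive_task = ReasoningTask(
    question=("Either the cache is stale or the server is down. "
              "The server is not down. What can we conclude?"),
    correct_answer="the cache is stale",
    explanation="A disjunctive syllogism eliminates one disjunct, leaving the other.",
    task_type="logical",
    difficulty="medium",
)
```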
- Technical Paper: Comprehensive methodology documentation
- Empirical Results: Validated on Apollo model variants
- Open Source: Full framework available for research community
- Format vs. Meaning: Semantic understanding reveals hidden capabilities
- Process vs. Product: Reasoning quality matters beyond final answers
- Comprehensive Assessment: Multi-domain evaluation provides complete picture
```python
custom_task = ReasoningTask(
    question="Your reasoning question",
    correct_answer="expected_answer",
    explanation="Why this answer is correct",
    task_type="boolean",
    difficulty="medium"
)
```

```python
class YourModelEvaluator(VRREvaluator):
    def query_model(self, prompt):
        # Implement your model's API call and return its text response
        return response
```

| Framework | BoolQ | PIQA | ARC | VRRE Detection |
|---|---|---|---|---|
| lm-evaluation-harness | 22% vs 22% | 56% vs 56% | 18% vs 18% | No difference |
| VRRE | - | - | - | 2.5x improvement |
VRRE's semantic approach reveals reasoning capabilities that probability-based scoring misses.
- Pattern Recognition: Multiple linguistic indicators per concept
- Confidence Scoring: Reliability metrics for each extraction
- Context Awareness: Understanding reasoning vs. guessing
- Process Recognition: Credit for logical thinking steps
- Reasoning Indicators: Detection of explanation attempts
- Graduated Scoring: Nuanced assessment beyond binary correct/incorrect
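Graduated scoring can be sketched as a weighted blend of answer correctness, extraction confidence, and reasoning-process signals. The weights and indicator list below are illustrative assumptions, not VRRE's actual values:

```python
# Illustrative reasoning indicators (assumed, not VRRE's actual list)
REASONING_INDICATORS = ("because", "therefore", "premise", "follows", "implies")

def graded_score(extracted, correct, response, confidence):
    """Blend correctness, extraction confidence, and reasoning indicators
    into a score in [0, 1]. Weights are illustrative."""
    correctness = 1.0 if extracted == correct else 0.0
    text = response.lower()
    indicators = sum(1 for word in REASONING_INDICATORS if word in text)
    process = min(1.0, indicators / 3)  # partial credit for shown reasoning
    return round(0.6 * correctness * confidence + 0.4 * process, 3)

# A wrong answer with visible reasoning still earns partial credit:
print(graded_score("yes", "unclear",
                   "Yes, because the premise implies it follows.", 0.9))
# → 0.4
```

This is the "process vs. product" idea in miniature: a flatly wrong answer with no reasoning scores 0.0, while the same wrong answer accompanied by an explanation attempt earns nonzero credit.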
See TECHNICAL_PAPER.md for comprehensive methodology, validation results, and research implications.
```bibtex
@software{vrre2025,
  title={VANTA Research Reasoning Evaluation (VRRE): A Semantic Understanding Framework for LLM Reasoning Assessment},
  author={VANTA Research},
  year={2025},
  url={https://github.com/vanta-research/vrre},
  version={1.0}
}
```

We welcome contributions! Areas of interest:
- New Task Types: Temporal reasoning, moral reasoning, causal inference
- Model Integrations: OpenAI, Anthropic, HuggingFace transformers
- Extraction Improvements: Better semantic understanding patterns
- Validation Studies: Testing on additional model types
- Multi-language support
- Real-time evaluation APIs
- Automated task generation
- Integration with popular ML frameworks
- Confidence calibration improvements
- VANTA Research: GitHub
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Apache 2.0 License - see LICENSE for details.
VRRE: Aligned AI
"Measuring what models think, not just how they format answers"