An automated evaluation framework for benchmarking LLMs across reasoning accuracy, hallucination propensity, emotional alignment consistency, and code correctness. Designed for reproducible experiments and deployment-oriented model selection.
This framework runs entirely offline using local Ollama models, making it perfect for private evaluation and reproducible research.
The framework evaluates LLMs across four key axes:
- Reasoning Accuracy: Logical reasoning, math problems, syllogisms
  - Uses exact match, numeric tolerance, and optional LLM-judge rubrics
- Hallucination Propensity: Tendency to invent facts, names, or unsupported claims
  - Tests unknown facts, fictional entities, and citation requirements
  - Uses LLM-judge rubrics with failure tags like `hallucinated_entity`, `unsupported_claim`, `overconfident`
- Emotional Alignment Consistency: Empathy, validation, and consistency across prompt paraphrases
  - Uses LLM-judge rubrics per response
  - Measures consistency across prompt variant groups (3-5 paraphrases) and across multiple repeats
- Code Correctness: Executable Python code with unit test assertions
  - Extracts code blocks, executes them in a subprocess with a timeout
  - Runs provided test cases and scores partial credit
Additional Analyses:
- Consistency gaps: Variance and standard deviation across repeats and prompt variants (see the sketch after this list)
- Failure modes: Systematic vs sporadic failures, categorized by failure tags
- Behavioral tradeoffs: Cross-axis correlations (e.g., reasoning vs hallucination)
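The consistency-gap metrics above are plain dispersion statistics over repeated scores. The real implementation lives in `analyzers/consistency.py`; the snippet below is only a minimal sketch, and the function name and the `1 - std dev` consistency score are illustrative assumptions.

```python
from statistics import mean, pstdev

def consistency_stats(scores):
    """Sketch: dispersion of one test's scores (each in [0, 1]) across repeats.

    Treating 'consistency = 1 - population std dev' is an assumption;
    the framework's own formula may differ.
    """
    std = pstdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": mean(scores),
        "variance": std ** 2,
        "std_dev": std,
        "consistency": max(0.0, 1.0 - std),  # 1.0 means identical scores on every repeat
    }

# Example: three repeats of the same prompt on one model
print(consistency_stats([1.0, 0.5, 1.0]))
```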
- Python: 3.8 or higher
- Ollama: Installed and running locally (ollama.com)
- Models: Pull the models you want to evaluate (see Setup below)
```bash
pip install -r requirements.txt
```

Pull the models you want to evaluate:

```bash
ollama pull phi3:3.8b
ollama pull mistral:7b
ollama pull samantha-mistral:7b
```

You can evaluate any Ollama model: just pull it and add it to your config (see "Evaluate Any LLM" below).
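The experiment configuration is a JSON file. Its exact schema is not reproduced in this README; the sketch below is an assumption pieced together from fields referenced elsewhere in this document (`models`, `timeout_s`, repeats, the suite path, and the temperature/seed sampling settings), so check `configs/run_config.json` for the real keys.

```json
{
  "models": ["phi3:3.8b", "mistral:7b", "samantha-mistral:7b"],
  "suite": "prompts/suites/core_suite.json",
  "repeats": 3,
  "timeout_s": 120,
  "sampling": {"temperature": 0.0, "seed": 42}
}
```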
```bash
python main.py --config configs/run_config.json
```

This will:
- Load the test suite from `prompts/suites/core_suite.json` (40+ tests)
- Run each test across all models in the config with configurable repeats
- Evaluate responses using appropriate methods (`exact_match`, `numeric_tolerance`, `llm_judge`, `python_exec`)
- Generate results in `data/runs/<run_id>/`
Options:
- `--config`: Path to config file (default: `configs/run_config.json`)
- `--run_id`: Custom run ID (default: timestamp-based)
- `--out_dir`: Custom output directory
- `--models`: Comma-separated list to override models (e.g., `--models "llama3:8b,custom-model:latest"`)
- `--suite`: Override suite path (e.g., `--suite prompts/suites/core_suite.json`)
Example:
```bash
python main.py --config configs/run_config.json --run_id my_experiment_001
```

To explore results interactively, launch the dashboard:

```bash
streamlit run dashboard.py
```

The dashboard provides:
- Overview: Per-model and per-category performance metrics
- Heatmap: Model × test score visualization
- Tradeoffs: Cross-axis correlations and scatter plots
- Failure Modes: Failure tag analysis and worst tests
- Consistency: Variance analysis and consistency scores
Select your run_id from the sidebar to explore results.
To rank models and generate recommendations, run:

```bash
python select_model.py --run_id <run_id>
```

This creates `data/runs/<run_id>/recommendations.json` with:
- Overall model ranking (weighted scores; see the weighting sketch after this list)
- Best model per category
- Deployment-oriented use case recommendations
- Consistency-based recommendations
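The overall ranking combines per-category scores using the category weights. How `select_model.py` actually aggregates them is not spelled out in this README; the snippet below is a minimal sketch of one plausible weighted average, and the default weight of 1.0 for unlisted categories is an assumption.

```python
def weighted_overall_score(category_scores, weights=None):
    """Sketch: fold per-category averages (0-1) into a single ranking score.

    category_scores: e.g. {"reasoning": 0.82, "hallucination": 0.64, ...}
    weights: e.g. {"reasoning": 2.0, "hallucination": 1.5}; categories not listed
    default to 1.0 here (an assumption, not the tool's documented behavior).
    """
    weights = weights or {}
    total = sum(weights.get(cat, 1.0) * score for cat, score in category_scores.items())
    norm = sum(weights.get(cat, 1.0) for cat in category_scores)
    return total / norm if norm else 0.0

print(weighted_overall_score(
    {"reasoning": 0.82, "hallucination": 0.64, "emotion": 0.71, "code": 0.55},
    weights={"reasoning": 2.0, "hallucination": 1.5},
))
```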
Custom weights:
```bash
python select_model.py --run_id <run_id> --weights '{"reasoning": 2.0, "hallucination": 1.5}'
```

To produce a human-readable report, run:

```bash
python reports/generate_report.py --run_id <run_id>
```

This generates `data/runs/<run_id>/REPORT.md` with:
- Topline metrics
- Tradeoffs narrative
- Failure mode insights
- Consistency gaps
- Deployment recommendations
- Prompt strategy suggestions
To evaluate a different LLM, you don't need to change any code:
- Pull the model in Ollama:

  ```bash
  ollama pull my-model:latest
  ```

- Update the config (`configs/run_config.json`):

  ```
  { "models": ["my-model:latest", "other-model:7b"], ... }
  ```

  OR use the CLI override:

  ```bash
  python main.py --config configs/run_config.json --models "my-model:latest,other-model:7b"
  ```

- Run the experiment as usual; no code changes needed!
Each test case in `prompts/suites/core_suite.json` follows this schema:

```json
{
  "id": "reasoning-001",
  "category": "reasoning|hallucination|emotion|code",
  "prompt": "Your test prompt here...",
  "expected": { ... } or "..." or number,
  "rubric": { ... },
  "eval": {
    "method": "exact_match|numeric_tolerance|contains|llm_judge|python_exec",
    "params": { ... }
  },
  "meta": {
    "difficulty": "easy|med|hard",
    "tags": ["..."],
    "prompt_variant_group": "emotion-set-01" or null
  }
}
```

Evaluation Methods:

- `exact_match`: Normalized string comparison
- `numeric_tolerance`: Numeric comparison within tolerance
- `contains`: Checks for required/forbidden terms
- `llm_judge`: LLM-based evaluation with rubric
- `python_exec`: Executes code and runs unit tests
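The deterministic methods are simple string and number checks. The real implementations live in `analyzers/evaluator.py`; the functions below are only an illustrative sketch, and their names, normalization rules, and default tolerance are assumptions.

```python
import re

def exact_match(output: str, expected: str) -> float:
    """Sketch of a normalized string comparison: lowercase, collapse whitespace,
    trim trailing punctuation. The framework's actual normalization may differ."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower()).strip(" .!?")
    return 1.0 if norm(output) == norm(expected) else 0.0

def numeric_tolerance(output: str, expected: float, tol: float = 1e-6) -> float:
    """Sketch: pull the last number out of the response and compare within tol."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - expected) <= tol else 0.0

print(exact_match("  The answer is YES. ", "the answer is yes"))  # 1.0
print(numeric_tolerance("So the total is 42.", 42))               # 1.0
```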
For code tests, `expected` must include:

```json
{
  "entrypoint": "function_name",
  "tests": [
    {"input": {"args": [1, 2]}, "output": 3},
    {"input": {"args": [5, 7]}, "output": 12}
  ]
}
```
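Conceptually, `python_exec` extracts the model's code block, runs it, and calls the entrypoint on each test case's args, awarding partial credit per passing test. The sketch below shows the general idea only; it is not the framework's actual executor, and its in-process `exec` shortcut skips the subprocess isolation and timeout the real runner uses.

```python
def run_code_tests(code: str, entrypoint: str, tests: list) -> float:
    """Sketch: execute model-written code and score the fraction of tests passed.

    The real runner isolates execution in a subprocess with a timeout;
    this simplified version does not and should only be run on trusted input.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        func = namespace[entrypoint]
    except Exception:
        return 0.0
    passed = 0
    for case in tests:
        try:
            if func(*case["input"]["args"]) == case["output"]:
                passed += 1
        except Exception:
            pass  # a crashing test case simply scores 0
    return passed / len(tests) if tests else 0.0

score = run_code_tests(
    "def add(a, b):\n    return a + b",
    entrypoint="add",
    tests=[{"input": {"args": [1, 2]}, "output": 3},
           {"input": {"args": [5, 7]}, "output": 12}],
)
print(score)  # 1.0
```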
Each experiment run creates `data/runs/<run_id>/` with:

- `results.jsonl`: Incremental results (JSONL format, one line per run)
- `results.json`: Full results (JSON array)
- `results.csv`: CSV export with `output_preview` (truncated to 200 chars)
- `summary.json`: Aggregated statistics including:
  - Per-model and per-category averages
  - Consistency metrics (variance, std dev, consistency scores)
  - Failure mode analysis (systematic/sporadic failures, top tags)
  - Tradeoff analysis (correlations, strengths/weaknesses)
- `recommendations.json`: Model selection recommendations (if `select_model.py` was run)
- `REPORT.md`: Human-readable evaluation report (if `generate_report.py` was run)
- Aggregated CSVs: `aggregated_model_category.csv`, `aggregated_model.csv`, `aggregated_category.csv`
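For custom analysis outside the dashboard, the JSONL file can be loaded one record per line. The snippet below is a generic sketch; it assumes nothing about the field names inside each record.

```python
import json
from pathlib import Path

run_dir = Path("data/runs/<run_id>")  # substitute your actual run ID

with (run_dir / "results.jsonl").open() as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} result records")
```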
Note: `data/runs/` is gitignored, so your experiment outputs stay local.
The examples/sample_run/ folder contains a tiny, committed example from a real run:
- `summary.json`: Aggregated metrics from run `20260106_025634`
- `example_results.csv`: Results for test `reasoning-001` only (2 models, truncated outputs)
- `recommendations.json`: Model selection output
This example is small (~7KB) and safe to commit. It demonstrates the output format without including large model outputs. Real runs are stored in data/runs/ and are not committed.
To export your own example snapshot:
```bash
python reports/export_example.py --run_id <run_id> --test_id reasoning-001 --out_dir examples/sample_run
```

Based on evaluation results, the framework suggests prompt engineering strategies. Here are the canonical templates:
Template 1: Explicit Uncertainty Acknowledgment
[Your task here]
Important: If you are unsure about any fact, name, date, or detail, explicitly state that you don't know. Do not invent information. If the information is not available or you cannot verify it, say so clearly.
Use when: Hallucination failure rate > 30%
Example:
Who is the current president of Mars?
Important: If you are unsure about any fact, name, date, or detail, explicitly state that you don't know. Do not invent information. If the information is not available or you cannot verify it, say so clearly.
Template 2: Citation Required Format
[Your task here]
Format your response as JSON with the following structure:
{
"answer": "...",
"claims": ["claim1", "claim2"],
"evidence": ["source1", "source2"],
"confidence": "high|medium|low"
}
If you cannot provide evidence for a claim, set confidence to "low" and note uncertainty in the evidence field.
Use when: Need verifiable outputs with source tracking
Template 1: Mental Tracing Instruction
[Your coding task here]
Before writing code, mentally trace through the logic:
1. What are the edge cases?
2. What happens with empty inputs?
3. What happens with invalid inputs?
4. Does the function handle all expected cases?
Write the code, then verify it handles all cases above.
Use when: Code execution failure rate > 30%
Template 2: Unit Test First
[Your coding task here]
First, think about what test cases your function should pass:
- Test case 1: [describe]
- Test case 2: [describe]
- Edge case: [describe]
Now write the function to pass all these tests.
Use when: Partial test pass rate is low
Template 1: Deterministic Instructions
[Your task here]
Provide a clear, direct answer. Be consistent in your reasoning approach.
Use when: Consistency score < 0.7
Note: Also ensure temperature=0.0 and seed is set in sampling config.
Template 2: Structured Output
[Your task here]
Format your response as:
1. [First element]
2. [Second element]
3. [Conclusion]
This structure helps ensure consistent formatting across runs.
Use when: Variant group consistency is low
Template 1: Validation-First Approach
[User's emotional concern]
I understand this is difficult. [Acknowledge the emotion first]
[Then provide helpful response]
Use when: Emotion alignment scores are low
Template 2: Non-Judgmental Tone
[User's situation]
There's no right or wrong way to feel about this. [Acknowledge without judgment]
[Provide supportive response]
Use when: Emotion tests show judgmental language
Template 1: Step-by-Step Reasoning
[Your reasoning task]
Think through this step by step:
1. [First step]
2. [Second step]
3. [Conclusion]
Show your work.
Use when: Reasoning accuracy is low
Template 2: Explicit Answer Format
[Your reasoning task]
Answer with: "Yes" or "No" followed by your reasoning.
Use when: Exact match failures are high
For best results, combine multiple strategies:
[Task]
Important guidelines:
- If uncertain, say so explicitly (hallucination prevention)
- Show your reasoning step-by-step (reasoning accuracy)
- Format as JSON with evidence field (verifiability)
- Consider edge cases (code quality)
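Nothing in the framework requires applying these templates programmatically, but if you want to prepend the combined guidelines to prompts in your own scripts, a trivial helper like the one below is enough. The function is hypothetical and not part of this repo.

```python
GUIDELINES = """Important guidelines:
- If uncertain, say so explicitly (hallucination prevention)
- Show your reasoning step-by-step (reasoning accuracy)
- Format as JSON with evidence field (verifiability)
- Consider edge cases (code quality)"""

def with_guidelines(task: str) -> str:
    """Hypothetical helper: append the combined-strategy guidelines to a task prompt."""
    return f"{task}\n\n{GUIDELINES}"

print(with_guidelines("Summarize the causes of the 2008 financial crisis."))
```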
```
Model-Behavior-Lab/
├── README.md                 # This file (single source of truth)
├── LICENSE                   # MIT License
├── .gitignore                # Git ignore rules
├── requirements.txt          # Python dependencies
│
├── main.py                   # Experiment runner
├── analyze_results.py        # CSV export and aggregation
├── select_model.py           # Model ranking and recommendations
├── dashboard.py              # Interactive Streamlit dashboard
│
├── configs/
│   └── run_config.json       # Experiment configuration
│
├── prompts/
│   └── suites/
│       └── core_suite.json   # Test suite (40+ tests)
│
├── models/
│   └── ollama_runner.py      # Model execution (Ollama CLI)
│
├── analyzers/
│   ├── evaluator.py          # Evaluation methods
│   ├── consistency.py        # Consistency analysis
│   ├── failure_modes.py      # Failure categorization
│   └── tradeoff_analysis.py  # Cross-axis tradeoffs
│
├── reports/
│   ├── generate_report.py    # Auto-generated report
│   └── export_example.py     # Export example snapshot
│
├── examples/
│   └── sample_run/           # Committed example (CSV/JSON only)
│       ├── summary.json
│       ├── example_results.csv
│       └── recommendations.json
│
└── tests/
    ├── test_config_loading.py
    ├── test_consistency.py
    └── test_evaluator.py
```
Error: ollama: command not found
Solution: Install Ollama from ollama.com and ensure it's in your PATH.
Error: model not found or model does not exist
Solution: Pull the model first:
```bash
ollama pull <model-name>
```

Error: Timeout after X seconds
Solution: Increase timeout_s in configs/run_config.json or use --timeout_s if supported.
Warning: sampling_supported: false in results
Solution: Your Ollama version may not support all sampling flags. The framework will fall back to basic execution. Update Ollama or ignore this warning if results are still reproducible.
Error: ModuleNotFoundError
Solution: Ensure all dependencies are installed:
```bash
pip install -r requirements.txt
```

Run the test suite:
```bash
python -m pytest
```

Tests cover:
- JSON schema loading
- Evaluation methods on known outputs
- Code executor safety and timeout behavior
- Consistency analysis functions
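As an illustration of the style used in `tests/test_evaluator.py`, a check of an evaluation method against known outputs might look like the sketch below. It is self-contained and tests a local stand-in function, not the repo's actual `analyzers.evaluator` API, whose names and signatures may differ.

```python
import re

def exact_match(output: str, expected: str) -> float:
    """Local stand-in for the evaluator under test (see the earlier sketch)."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower()).strip(" .!?")
    return 1.0 if norm(output) == norm(expected) else 0.0

def test_exact_match_ignores_case_and_whitespace():
    assert exact_match("  Yes. ", "yes") == 1.0

def test_exact_match_rejects_wrong_answer():
    assert exact_match("No", "yes") == 0.0
```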
MIT License
Aditya Chunduri
🔗 github.com/Chunduri-Aditya
🎓 M.S. Applied Data Science, USC