An LLM benchmark for Christian scripture accuracy and theological understanding
BibleBench is a comprehensive evaluation suite designed to assess Large Language Models (LLMs) on their knowledge of Christian scripture, theological accuracy, and ability to apply biblical principles with wisdom and nuance. Built with Evalite and the AI SDK v5, it provides rigorous, reproducible testing across multiple dimensions of biblical and theological competence.
As LLMs become increasingly used for religious education, pastoral care, and theological discussion, there is a critical need for standardized benchmarks that evaluate their:
- Scripture Knowledge: Accurate recall of verses, references, and biblical context
- Theological Accuracy: Understanding of core Christian doctrines and orthodoxy
- Heresy Detection: Ability to identify and reject heterodox teachings
- Denominational Fairness: Representing diverse Christian traditions without bias
- Pastoral Wisdom: Applying theology to real-world situations with grace and truth
BibleBench fills this gap by providing a rigorous, multi-dimensional benchmark grounded in historic Christian orthodoxy while respecting legitimate theological diversity.
- Evalite (beta): Modern TypeScript testing framework for AI applications
- AI SDK v5: Unified interface for multiple LLM providers
- Vitest: Fast unit testing framework (underlying Evalite)
- TypeScript: Type-safe evaluation development
- Autoevals: Pre-built evaluation scorers
biblebench/
├── evals/
│   ├── scripture/                          # Scripture accuracy evaluations
│   │   ├── scripture-matching.eval.ts      # Exact verse recall across translations
│   │   ├── reference-knowledge.eval.ts
│   │   └── context-understanding.eval.ts
│   ├── theology/                           # Theological concept evaluations
│   │   ├── core-doctrines.eval.ts
│   │   ├── heresy-detection.eval.ts
│   │   ├── denominational-nuance.eval.ts
│   │   ├── pastoral-application.eval.ts
│   │   ├── sect-theology.eval.ts           # Sect/cult theology evaluation
│   │   ├── theological-orientation.eval.ts # Theological spectrum analysis
│   │   └── steering-compliance.eval.ts     # Bias asymmetry detection
│   └── lib/                                # Shared utilities
│       ├── models.ts                       # AI model configurations
│       ├── scorers.ts                      # Custom scoring functions
│       └── README.md
├── evalite.config.ts                       # Evalite configuration
├── tsconfig.json                           # TypeScript configuration
└── package.json                            # Dependencies and scripts
Tests LLMs' foundational knowledge of the Bible itself.
Exact Scripture Matching (scripture/scripture-matching.eval.ts)
- Precise recall of Bible verses with exact wording across multiple translations
- Tests the same verses in KJV, NIV, ESV, and NASB to verify translation-specific accuracy
- 49 test cases covering 16 different verses (both well-known and less common)
- Requires perfect matches: every word, comma, and punctuation mark must be correct
- Includes famous verses (John 3:16, Psalm 23:1) and lesser-known passages (Micah 6:8, Lamentations 3:22-23)
- Measured with an exact match scorer: no fuzzy matching, since scripture is sacred
- Each test case includes translation-specific key phrases for verification
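For illustration, a translation-aware test case might be shaped roughly like this (the type and field names are hypothetical, not the exact structure used in scripture-matching.eval.ts):

```ts
// Hypothetical test-case shape, shown for illustration only; the real dataset in
// scripture-matching.eval.ts may use different field names.
type ScriptureTestCase = {
  input: string;                               // the prompt given to the model
  expected: string;                            // the exact verse text in that translation
  translation: "KJV" | "NIV" | "ESV" | "NASB"; // which translation is being tested
  keyPhrases: string[];                        // translation-specific phrases for verification
};

const exampleCase: ScriptureTestCase = {
  input: "Quote John 3:16 in the KJV, word for word.",
  expected:
    "For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.",
  translation: "KJV",
  keyPhrases: ["only begotten Son", "whosoever believeth"],
};
```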
Reference Knowledge (scripture/reference-knowledge.eval.ts)
- Correctly identifying where verses are found
- Understanding of Bible book/chapter/verse structure
- Validated against standard reference formats
Context Understanding (scripture/context-understanding.eval.ts)
- Authorship and historical background
- Purpose and audience of biblical books
- Understanding of scriptural context
- Uses LLM-as-judge for nuanced evaluation
Tests comprehension of Christian doctrine and theology.
Core Doctrines (theology/core-doctrines.eval.ts)
- Trinity, Incarnation, Justification, Atonement
- Original sin, Image of God, Gospel
- Resurrection and eschatology
- Evaluated with theological accuracy judge and completeness scoring
Heresy Detection (theology/heresy-detection.eval.ts)
- Identifying historical heresies (Arianism, Modalism, Pelagianism, Docetism, Gnosticism)
- Distinguishing orthodoxy from heterodoxy
- Understanding why certain teachings are problematic
- Tests both identification and explanation
Denominational Nuance (theology/denominational-nuance.eval.ts)
- Fair representation of Catholic, Protestant, Orthodox perspectives
- Understanding of legitimate theological diversity
- Avoiding denominational bias
- Measured with custom bias detection and balance scoring
Pastoral Application (theology/pastoral-application.eval.ts)
- Applying theology to real-world situations
- Balancing truth with grace
- Pastoral sensitivity and wisdom
- Biblical grounding in practical advice
- Most complex evaluation using multi-dimensional LLM-as-judge
Sect Theology (theology/sect-theology.eval.ts)
- Identifying teachings of groups outside historic Christian orthodoxy
- Tests knowledge of Mormonism (LDS), Jehovah's Witnesses, Christian Science, Oneness Pentecostalism, and Unitarian Universalism
- Evaluates ability to articulate how sect teachings depart from orthodoxy
- Measures respectful tone while maintaining theological accuracy
- Includes 18 test cases covering core doctrines (Trinity, Christology, salvation, resurrection, etc.)
- Scorers: Theological accuracy judge, sect identification, orthodox defense, respectful tone
Theological Orientation Spectrum (theology/theological-orientation.eval.ts)
- Descriptive assessment of where models fall on the theological spectrum (progressive to conservative)
- Covers Biblical Authority, Gender & Ministry, Sexual Ethics, Gender Identity, Social Issues, and Ecclesiology
- Not pass/fail - measures theological positioning on contested issues
- Tests 23 questions across categories like inerrancy, women in leadership, LGBTQ+ issues, abortion, social justice
- Scorers: Orientation classifier (0=progressive, 0.5=moderate, 1=conservative), position clarity detector, scripture usage analyzer
- Provides insight into models' theological default positions and handling of diverse Christian perspectives
Steering Compliance & Bias Asymmetry (theology/steering-compliance.eval.ts)
- Tests whether models comply symmetrically with system prompts adopting different theological perspectives
- Each test case includes both conservative and progressive persona prompts with the same question
- Measures compliance asymmetry - do models refuse, hedge, or add disclaimers more for one perspective?
- Covers 10 contentious topics: same-sex marriage, transgender identity, women in ministry, abortion, biblical authority, etc.
- Scorers: Pure compliance (binary pass/fail for clean adoption), persona compliance, refusal detection, viewpoint expression
- Reveals potential bias in model guardrails and safety systems
- Descriptive study of model behavior, not endorsement of any theological position
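As a rough sketch of the paired-persona structure (field names and wording are illustrative assumptions, not the actual test data in steering-compliance.eval.ts):

```ts
// Illustrative sketch of the paired-persona structure; field names and wording are
// assumptions, not the actual test data in steering-compliance.eval.ts.
type SteeringTestCase = {
  question: string;            // the same question is asked under both personas
  conservativePersona: string; // system prompt adopting a conservative perspective
  progressivePersona: string;  // system prompt adopting a progressive perspective
};

const exampleSteeringCase: SteeringTestCase = {
  question: "Should women serve as senior pastors?",
  conservativePersona:
    "You are a complementarian pastor. Answer from that theological perspective.",
  progressivePersona:
    "You are an egalitarian pastor. Answer from that theological perspective.",
};
```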
BibleBench employs multiple scoring approaches:
- Exact Match: Binary match of expected output
- Contains: Substring matching
- Levenshtein Distance: Edit distance similarity
- Reference Format Validation: Regex-based format checking
- Word Overlap: Percentage of expected words present
- Key Points Coverage: Presence of critical theological terms
- Multiple Perspectives: Counting denominational views represented
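A scorer like Word Overlap can be implemented in a few lines; here is a sketch assuming Evalite's createScorer API (the actual code in evals/lib/scorers.ts may differ):

```ts
// Sketch of a simple programmatic scorer (word overlap), assuming Evalite's createScorer API.
// The actual implementation in evals/lib/scorers.ts may differ.
import { createScorer } from "evalite";

export const wordOverlap = createScorer<string, string, string>({
  name: "Word Overlap",
  description: "Fraction of expected words that appear in the output",
  scorer: ({ output, expected }) => {
    if (!expected) return { score: 0 };
    const outputWords = new Set(output.toLowerCase().split(/\W+/));
    const expectedWords = expected.toLowerCase().split(/\W+/).filter(Boolean);
    const hits = expectedWords.filter((word) => outputWords.has(word)).length;
    return {
      score: expectedWords.length === 0 ? 0 : hits / expectedWords.length,
      metadata: { hits, total: expectedWords.length },
    };
  },
});
```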
- Exact Match: Binary scorer for precise scripture text matching (used in scripture-matching evaluation)
- Translation Phrase Match: Checks for translation-specific key phrases (e.g., "begotten" in KJV)
- Translation Vocabulary Fidelity: Validates use of appropriate vocabulary for each translation
- Theological Accuracy Judge: Evaluates doctrinal soundness, biblical grounding, and nuance
- Heresy Detection Judge: Identifies heterodox teaching with severity ratings
- Denominational Bias Detector: Measures ecumenical balance
- Pastoral Wisdom Judge: Multi-dimensional evaluation of pastoral responses
- Translation Identification Judge: Evaluates ability to correctly identify Bible translations based on distinctive vocabulary
All LLM-as-judge scorers use structured output (via AI SDK's generateObject) with detailed rationales, providing transparency and debuggability.
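The general pattern looks roughly like the following sketch (the prompt, schema, and judgeModel import are illustrative assumptions, not the exact scorer implementation):

```ts
// Sketch of the LLM-as-judge pattern with structured output via generateObject.
// The prompt, schema, and judgeModel import are illustrative assumptions,
// not the exact scorer code in evals/lib/scorers.ts.
import { createScorer } from "evalite";
import { generateObject } from "ai";
import { z } from "zod";
import { judgeModel } from "./models.js"; // hypothetical export of the default judge model

export const theologicalAccuracyJudge = createScorer<string, string, string>({
  name: "Theological Accuracy Judge",
  description: "LLM judge of doctrinal soundness, with a rationale for transparency",
  scorer: async ({ input, output }) => {
    const { object } = await generateObject({
      model: judgeModel,
      schema: z.object({
        score: z.number().min(0).max(1),
        rationale: z.string(),
      }),
      prompt:
        `Question: ${input}\n\nAnswer: ${output}\n\n` +
        `Rate the answer's doctrinal soundness, biblical grounding, and nuance ` +
        `from 0 to 1, and explain your rating.`,
    });
    return { score: object.score, metadata: { rationale: object.rationale } };
  },
});
```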
- Node.js 18+
- pnpm (recommended) or npm
- API keys for LLM providers you want to test
# Clone the repository
git clone https://github.com/yourusername/biblebench.git
cd biblebench
# Install dependencies
pnpm install
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
BibleBench uses OpenRouter exclusively for accessing all LLM models. This means you only need one API key to access hundreds of models from multiple providers.
Create a .env file with your OpenRouter API key:
# OpenRouter API Key (REQUIRED)
# Get your key at: https://openrouter.ai/keys
OPENROUTER_API_KEY=your_openrouter_key
Benefits of using OpenRouter:
- One API key for all models (GPT, Claude, Llama, Grok, Gemini, etc.)
- Pay-as-you-go pricing with transparent per-token costs
- Automatic failover for reliability
- Immediate access to newly released models
- Unified billing across all providers
See available models at OpenRouter Models
# Run in development mode with UI
pnpm eval:dev
# Run all evaluations
pnpm eval
# View results in UI
pnpm eval:ui
The Evalite UI will be available at http://localhost:3006, providing:
- Real-time evaluation progress
- Detailed score breakdowns
- Trace inspection
- Metadata exploration
# Run only scripture evaluations
pnpm eval evals/scripture/
# Run specific test file
pnpm eval evals/theology/core-doctrines.eval.ts
BibleBench is currently configured to test 20 cutting-edge models across 10 different providers, all accessed through OpenRouter:
- GPT-5 Mini - Default judge model (efficient and cost-effective)
- GPT-5.2 - Latest generation with enhanced capabilities
- GPT-5.1 - Advanced reasoning model
- GPT-5 Nano - Efficient compact model
- GPT-OSS-120B - Open-source 120B parameter model
- GPT-OSS-20B - Open-source 20B parameter model
- Claude Haiku 4.5 - Fast, efficient Claude variant
- Claude Sonnet 4.5 - Balanced quality and speed
- Claude Opus 4.5 - Maximum capability model
- Grok 4.1 Fast - Speed-optimized Grok
- Grok 4 - Full Grok model
- Gemini 3 Flash Preview - Fast preview model
- Gemini 3 Pro Preview - Advanced preview model
- Mistral Large 2512 (Mistral AI)
- DeepSeek V3.2 (DeepSeek)
- Intellect-3 (Prime Intellect)
- OLMo 3.1 32B Think (AllenAI)
- Nemotron 3 Nano 30B (NVIDIA)
- GLM-4.7 (Zhipu AI)
- MiniMax M2.1 (MiniMax)
All models are accessed through a single OpenRouter API key, making it easy to test across diverse architectures, training approaches, and capabilities.
You can easily run evaluations on specific models using the MODELS environment variable - no code changes needed!
Use comma-separated patterns to match model names (case-insensitive):
# Run only GPT models
MODELS="gpt" pnpm eval
# Run only Claude models
MODELS="claude" pnpm eval
# Run GPT and Claude models
MODELS="gpt,claude" pnpm eval
# Run specific models by partial name match
MODELS="opus,sonnet" pnpm eval
# Run a single specific model
MODELS="gpt-5.2" pnpm eval- Case-insensitive:
MODELS="gpt"matches "GPT-5.2", "GPT-5.1", etc. - Partial matching:
MODELS="claude"matches "Claude Haiku 4.5", "Claude Sonnet 4.5", "Claude Opus 4.5" - Multiple patterns:
MODELS="gpt-5,opus"matches models containing "gpt-5" OR "opus" - Comma-separated: Use commas to specify multiple patterns
# Run only OpenAI models
MODELS="gpt" pnpm eval:dev
# Run only Anthropic Opus and Sonnet
MODELS="opus,sonnet" pnpm eval
# Run Google Gemini models
MODELS="gemini" pnpm eval
# Run a specific evaluation with specific models
MODELS="claude haiku,grok" pnpm eval evals/theology/core-doctrines.eval.ts
# Run without caching on specific models
MODELS="gpt-5.2" pnpm eval --no-cacheIf you specify a pattern that doesn't match any models, the system will show you all available model names:
MODELS="invalid" pnpm eval
# Shows warning with list of all available models
Tip: By default (without MODELS set), all 20+ configured models will run. Use MODELS to save time and API costs during development!
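Under the hood, the filtering can be as simple as a case-insensitive substring match over the configured model names. A minimal sketch (the actual logic in evals/lib/models.ts may differ):

```ts
// Minimal sketch of MODELS-based filtering: comma-separated, case-insensitive
// substring patterns matched against configured model names. The actual logic
// in evals/lib/models.ts may differ.
// benchmarkModels is the array of { name, model } entries defined earlier in that file.
const patterns = (process.env.MODELS ?? "")
  .split(",")
  .map((pattern) => pattern.trim().toLowerCase())
  .filter(Boolean);

export const selectedModels =
  patterns.length === 0
    ? benchmarkModels // no filter set: run every configured model
    : benchmarkModels.filter(({ name }) =>
        patterns.some((pattern) => name.toLowerCase().includes(pattern))
      );
```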
All models are accessed through OpenRouter. Simply add any model from the OpenRouter catalog:
// In evals/lib/models.ts
import { wrapAISDKModel } from "evalite/ai-sdk";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
const openrouter = createOpenRouter({
apiKey: process.env.OPENROUTER_API_KEY,
});
// Add any model from OpenRouter's catalog
export const newModel = wrapAISDKModel(
openrouter.chat("provider/model-name")
);
// Add to benchmarkModels array
export const benchmarkModels = [
// ... existing models
{ name: "New Model", model: newModel },
];
Examples:
- openrouter.chat("openai/gpt-4o") - GPT-4o
- openrouter.chat("anthropic/claude-3.5-sonnet") - Claude Sonnet
- openrouter.chat("meta-llama/llama-3.1-405b-instruct") - Llama 3.1
- openrouter.chat("x-ai/grok-beta") - Grok
- openrouter.chat("google/gemini-pro-1.5") - Gemini Pro
Add to evals/lib/scorers.ts:
import { createScorer } from "evalite";

export const myCustomScorer = createScorer<string, string, string>({
  name: "My Custom Scorer",
  description: "Description of what it scores",
  scorer: ({ input, output, expected }) => {
    // Your scoring logic (example: simple exact match against the expected answer)
    const score = output.trim() === expected?.trim() ? 1 : 0;
    return {
      score,
      metadata: {
        // Additional debugging info
      },
    };
  },
});
Create a new .eval.ts file in the appropriate directory using Evalite's A/B testing feature:
import { evalite } from "evalite";
import { generateText } from "ai";
import { selectedModels } from "../lib/models.js";
import { myScorer } from "../lib/scorers.js";
const myData = [
{ input: "question", expected: "answer" },
// ... more test cases
];
// Use evalite.each() for side-by-side model comparison
evalite.each(
selectedModels.map(({ name, model }) => ({ name, input: { model } }))
)("My Evaluation", {
data: async () => myData,
task: async (input, variant) => {
const result = await generateText({
model: variant.input.model,
prompt: `Your prompt here: ${input}`,
});
return result.text;
},
scorers: [myScorer],
});
Why use evalite.each()?
- Side-by-side comparison: All models are compared within a single evaluation run
- Per-model scores: Each model's performance is clearly visible and comparable
- Better UI: Evalite's interface shows direct model comparisons
- Easier analysis: Instantly see which models perform best on each test case
All evaluations use Evalite's A/B testing feature (evalite.each()) to enable direct model comparison. This means:
- Side-by-side comparison: Models are tested together in a single evaluation run
- Per-model scores: Each model gets its own column showing performance across all test cases
- Direct comparisons: Instantly see which models excel or struggle on specific questions
- Detailed metrics for each scorer, with model-specific breakdowns
- Metadata including rationales from LLM-as-judge scorers
- Traces of model inputs and outputs for every test case
- Unified results: All model results in one view instead of separate evaluations
Results are stored in node_modules/.evalite and can be exported as static HTML for CI/CD integration.
Use the MODELS environment variable to run evaluations on specific models while maintaining the A/B comparison structure:
# Compare only GPT models against each other
MODELS="gpt" pnpm eval
# Compare Claude Opus vs Sonnet
MODELS="opus,sonnet" pnpm evalThe A/B testing structure is preserved regardless of how many models you filter to.
- Benchmark your models against established theological standards
- Identify weaknesses in scripture knowledge or theological reasoning
- Track improvements across model versions
- Evaluate LLMs before deploying them in educational or pastoral contexts
- Ensure models align with your theological positions
- Test for heresy detection and denominational fairness
- Study how different LLM architectures handle theological reasoning
- Compare performance on factual recall vs. nuanced application
- Analyze bias in religious content generation
- Select the best LLM for your Christian education app
- Validate that your fine-tuned model maintains theological accuracy
- Monitor for theological drift in deployed systems
We welcome contributions! Areas for expansion:
- More Test Cases: Additional verses, doctrines, scenarios
- Additional Categories: Church history, apologetics, biblical languages
- More Scorers: Novel evaluation approaches
- Other Faiths: Adaptations for Judaism, Islam, etc.
- Denominational Extensions: Specific evaluations for particular traditions
Please open an issue or pull request on GitHub.
BibleBench is grounded in historic Christian orthodoxy as expressed in:
- The Apostles' Creed
- The Nicene Creed
- The Chalcedonian Definition
- Core Reformation principles (sola fide, sola gratia, sola scriptura)
We recognize legitimate theological diversity among Christians while maintaining commitment to core orthodoxy:
- Non-negotiable: Trinity, deity of Christ, salvation by grace, biblical authority, resurrection
- Denominational differences: Baptism, church governance, eschatology, spiritual gifts
- Evaluations test for fair representation of different views, not adherence to one
Historical heresies are defined according to ecumenical church councils and historic Christian consensus:
- Arianism, Modalism, Nestorianism, Docetism, Pelagianism, Gnosticism, etc.
- Scorers detect these patterns while allowing for legitimate theological diversity
- Not a replacement for human judgment: Especially in pastoral care
- Western/Protestant bias possible: We strive for ecumenism but acknowledge potential blind spots
- English-only: Currently focused on English Bible translations
- Cultural context: Designed primarily for Western Christian contexts
- Don't use benchmark scores to make definitive claims about model "theological soundness"
- Recognize that high scores don't qualify an LLM to replace pastors or theologians
- Be aware of potential biases in training data and evaluation design
- Use results to inform, not replace, human theological oversight
BibleBench includes comprehensive testing of multiple Bible translations:
The benchmark explicitly tests models on these major English translations:
- KJV (King James Version, 1611) - Traditional language with "thee/thou/thy"
- NIV (New International Version, 1978/2011) - Widely-used modern translation
- ESV (English Standard Version, 2001) - Literal, modern English
- NASB (New American Standard Bible, 1971/1995) - Very literal translation
- Exact Scripture Matching (scripture-matching.eval.ts): Tests precise recall of verses with exact wording across multiple translations
- Each verse is tested in 2-4 different translations to verify translation-specific accuracy
- Requires perfect matches: since scripture is sacred, no fuzzy matching is used
- Tests both well-known verses (John 3:16, Psalm 23:1) and less common passages (Micah 6:8, Lamentations 3:22-23)
This approach ensures models are evaluated on their ability to recall scripture with precision and distinguish between translation variations accurately.
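For example, the translation-specific verification described above could be approximated with a simple phrase check (a sketch, assuming each test case carries a keyPhrases list as in the earlier example; not the exact scorer in evals/lib/scorers.ts):

```ts
// Sketch of translation-specific phrase checking; assumes each test case carries a
// keyPhrases list as in the earlier example. Not the exact scorer in evals/lib/scorers.ts.
function translationPhraseScore(output: string, keyPhrases: string[]): number {
  if (keyPhrases.length === 0) return 0;
  const text = output.toLowerCase();
  // e.g. a KJV response should contain "only begotten Son"; an NIV response "one and only Son"
  const found = keyPhrases.filter((phrase) => text.includes(phrase.toLowerCase()));
  return found.length / keyPhrases.length;
}
```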
MIT License - see LICENSE file for details.
This benchmark is provided for educational and evaluative purposes. It represents an attempt to create rigorous standards for LLM theological knowledge, but does not claim to be the definitive measure of an LLM's theological accuracy.
- Built with Evalite by the Evalite team
- Powered by Vercel AI SDK
- Inspired by existing LLM benchmarks: MMLU, TruthfulQA, HumanEval, etc.
- Theological input from various Christian traditions and scholars
"Do your best to present yourself to God as one approved, a worker who does not need to be ashamed and who correctly handles the word of truth." - 2 Timothy 2:15 (NIV)
