Geotechnical LLM Benchmark Suite -- domain-specific model evaluation that generic benchmarks miss.
Python 3.11+ | MIT License | 39 questions | 9 disciplines | 8 difficulty tiers | 48 agent capabilities
GroundTruth is a benchmark suite for evaluating LLMs on geotechnical and civil engineering tasks. It ships 39 curated questions across 9 engineering disciplines and 8 difficulty tiers, from basic recall ("What does SPT stand for?") through multi-hop reasoning chains and adversarial trick questions designed to catch hallucination.
Beyond domain knowledge, GroundTruth includes a 48-capability agent bench, an RFPQ document generation test, a context saturation benchmark, and a codebase comprehension test -- measuring not just what a model knows, but how it performs as a working tool.
The benchmark runs against any OpenAI-compatible API endpoint. It supports automated Docker model swapping, HuggingFace model card sampling parameters, crash-safe resume, a live dashboard, and energy/cost metrics per token.
Generic benchmarks (MMLU, HumanEval, ARC, etc.) don't test domain expertise. A model that scores 90% on MMLU might hallucinate soil properties, apply the wrong bearing capacity equation, or confidently classify a CL as a CH. It might calculate settlement using Terzaghi's equation when it should use Schmertmann's method, or produce a number with the right magnitude but wrong units.
GroundTruth fills that gap. Every question was written by an engineer who knows where models break on technical content. The adversarial tier specifically tests whether a model refuses to answer when given insufficient parameters -- because in geotechnical engineering, a confident wrong answer is worse than no answer at all.
If you run local models and care about domain accuracy, this gives you data that MMLU never will.
Each model goes through up to five benchmark phases. The standard automated pipeline (tools/bench_models.py) runs them in sequence, swapping models and context sizes between phases.
The core of GroundTruth. Tests domain knowledge across geotechnical engineering.
Matrix: 39 questions x 2 modes (bare, grounded) x 2 thinking modes (think, nothink) = 156 items per model
Test Modes:
| Mode | Context Source | What It Tests |
|---|---|---|
| `bare` | None | Raw model knowledge. No hints, no context. What does the model actually know? |
| `grounded` | Live RAG retrieval | Real-world RAG performance. Retrieves from a frozen corpus via vector search + reranking. |
Two additional modes exist but are not in the standard pipeline:
- `open_book` -- curated context from the question author (13/39 questions have this)
- `grounded_prebaked` -- pre-retrieved chunks stored in question JSON (deterministic, no infra needed)
Disciplines (9):
| Code | Discipline |
|---|---|
| `soil_mechanics` | Stress, strain, consolidation, shear strength |
| `foundation_engineering` | Bearing capacity, settlement, pile design |
| `site_investigation` | SPT, CPT, field methods |
| `geotechnical_lab` | Atterberg limits, compaction, permeability |
| `slope_stability` | Factor of safety, failure modes, remediation |
| `earth_structures` | Embankments, retaining walls, reinforced earth |
| `groundwater` | Flow nets, dewatering, seepage |
| `standards_specs` | ASTM, AASHTO, Eurocode |
| `professional_practice` | Reports, ethics, project management |
Difficulty Tiers (8):
| Tier | Name | Description |
|---|---|---|
| 1 | Recall | Direct factual recall. "What does SPT stand for?" |
| 2 | Comprehension | Understanding concepts. "Explain drained vs. undrained conditions." |
| 3 | Application | Apply a method to given data. Calculate bearing capacity from parameters. |
| 4 | Analysis | Interpret results, identify patterns. "What does this CPT profile indicate?" |
| 5 | Synthesis | Combine multiple concepts. Design a foundation considering multiple constraints. |
| 6 | Multi-hop | Chain of reasoning across topics. Classify soil, then select foundation, then calculate. |
| 7 | Expert Judgment | No single correct answer. Requires engineering judgment and experience. |
| 8 | Adversarial | Trick questions with missing parameters, contradictory data, or impossible scenarios. Must refuse. |
Grading Types (7):
| Type | Auto-graded | How It Works |
|---|---|---|
| `numerical` | Yes | Extracts numbers from the response, compares the closest match to the expected value within tolerance (default 5%). Handles LaTeX, scientific notation, and comma-separated numbers. |
| `classification` | Yes | Checks if the expected classification code (e.g., "CL", "GP-GM") appears in the response. Supports LaTeX stripping, formula fingerprinting, and synonym lists. |
| `keyword` | Yes | Checks what fraction of required keywords appear in the response, with a configurable threshold. Includes a synonym dictionary for geotechnical terms (e.g., "shaft resistance" = "skin friction" = "side friction"). |
| `adversarial` | Yes | Verifies the model refuses to answer when it should. Negation-aware -- won't penalize "this is NOT safe" for containing "safe". Detects refusal-then-educate patterns. |
| `engineering_multi_step` | Yes | 5-tier composite scoring: identification (concepts/assumptions), methodology (method selection), execution (intermediate checkpoints), reasonableness (asymmetric penalty for unconservative answers), precision (final value accuracy). Weighted composite with a 0.60 pass threshold. |
| `judgment` | No | Requires manual expert review. Auto-grade returns null. |
| `subjective` | No | Open-ended, requires human scoring. Auto-grade returns null. |
The engineering_multi_step grader deserves special mention: it penalizes unconservative answers more harshly than conservative ones. In geotechnical engineering, overestimating bearing capacity is dangerous; underestimating it is merely expensive. The grader reflects this asymmetry.
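A minimal sketch of such an asymmetric penalty (illustrative only -- the 2x factor and the linear falloff are assumptions, not the shipped weights):

```python
def reasonableness(predicted: float, expected: float,
                   tolerance_pct: float = 5.0,
                   unconservative_factor: float = 2.0) -> float:
    """Score in [0, 1]. Overestimates of a capacity value (the
    unconservative, dangerous direction) are penalized harder than
    underestimates of equal magnitude."""
    rel_error = (predicted - expected) / expected  # signed relative error
    if abs(rel_error) * 100.0 <= tolerance_pct:
        return 1.0
    penalty = abs(rel_error)
    if rel_error > 0:  # overestimated capacity: dangerous direction
        penalty *= unconservative_factor
    return max(0.0, 1.0 - penalty)
```

With this shape, an answer 10% high scores noticeably worse than one 10% low.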
Tests the model as a general-purpose home/office agent across 48 capabilities in 8 categories. Uses real read-only API calls against a live homelab stack (Plex, Sonarr, Radarr, SABnzbd, Overseerr) with mock fallback.
48 capabilities, 480 max points, 8 categories:
| Category | Caps | What It Tests |
|---|---|---|
| Agentic | 8 | Multi-step planning, calendar scheduling, email triage, safety guardrails, tool error recovery, agent trace debugging, prompt ambiguity, over-refusal |
| Coding | 7 | Code generation, TDD, codebase exploration, code review, JSON reliability, math reasoning, self-repair |
| Context | 8 | Data transformation, instruction following (+ verifiable), multi-turn coherence, long-context needle, adversarial context, multi-hop, pass^k reliability |
| Documents | 5 | Web summarization, financial processing, document processing/routing, news intelligence |
| Engineering | 6 | Document synthesis, code-from-spec, feature implementation, engineering reports, RFPQ generation, seed reverse engineering |
| Homelab | 6 | Cron simulation, Docker stack challenges, troubleshooting, SSH deployment, log analysis, HA automation |
| Media | 4 | Overseerr API, autonomous discovery (easy + hard), homelab stack orchestration |
| Squire | 5 | Context compression, brainstorm capture, knowledge acquisition, extended multi-turn, self-organization |
Each capability is scored 0-10. The agent bench runs in both thinking and non-thinking modes where applicable.
Standalone long-document generation test, extracted from the agent bench engineering category. The model is reloaded at 64k context and asked to produce a complete Request for Proposal/Qualification (RFPQ) response document for a geotechnical engineering project.
Tests: long-form structured writing, technical accuracy, document organization, and the ability to maintain coherence across a multi-page professional document.
Measures tok/s degradation as KV cache fills. Two modes:
| Mode | Method | What It Shows |
|---|---|---|
| `single` | Independent requests at increasing prompt sizes | Raw model scaling with context length |
| `multiturn` | Accumulating conversation that grows turn-by-turn | Real-world degradation with prefix caching |
The model is reloaded at its native maximum context (e.g., 262k for Qwen3.5) for this phase. Supports using real corpus content (PDF textbooks) instead of synthetic filler.
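In essence, `single` mode is just timing independent completions at growing prompt sizes. A stripped-down sketch (the `generate` callable and the word-to-token approximation are placeholders, not the real harness):

```python
import time

def context_scaling(generate, prompt_sizes=(1_000, 8_000, 32_000, 128_000)):
    """Issue one independent request per prompt size and record
    generation throughput, making degradation vs. context visible.
    `generate(prompt)` is assumed to return the generated token count."""
    filler_word = "soil "  # synthetic filler, roughly one token per word
    results = []
    for size in prompt_sizes:
        prompt = filler_word * size
        start = time.perf_counter()
        generated_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append({"prompt_tokens": size,
                        "tok_per_s": generated_tokens / elapsed})
    return results
```

Plotting `tok_per_s` against `prompt_tokens` shows how throughput falls off as the KV cache fills.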
Dumps curated GroundTruth source files (~196k tokens, ~45 files) into a single prompt and asks "explain this codebase." The response is keyword-graded against a 20-item architectural checklist of facts we know to be true.
Tests: extreme long-context comprehension, cross-file reasoning, and architecture understanding. Smaller models will choke.
```bash
git clone https://github.com/3spky5u-oss/GroundTruth.git
cd GroundTruth
pip install -e .

# Optional: rich CLI output
pip install -e ".[cli]"

# Optional: PySide6 GUI
pip install -e ".[gui]"
```

```bash
# Quick mode: 1 trial, temperature 0.0, bare mode only
groundtruth run --endpoint http://localhost:10100/v1 --model "my-model" --quick

# Automated multi-model pipeline (recommended)
python tools/bench_models.py                   # All models in queue
python tools/bench_models.py --models 0.8b 2b  # Specific models
python tools/bench_models.py --geotech-only    # Geotech phase only
python tools/bench_models.py --agent-only      # Agent bench only
python tools/bench_models.py --smoke           # Quick validation

# Live dashboard
python tools/dashboard_server.py --no-browser  # Start on port 8777

# Reddit markdown report
groundtruth export results/my-model.json --format reddit

# Compare two models
groundtruth compare results/model_a.json results/model_b.json

# CSV for spreadsheets
groundtruth export results/my-model.json --format csv
```

Questions live in `groundtruth/questions/` as versioned JSON files (`v1.json`, `v2.json`, etc.). Each question follows a strict schema (`schema.json`). Versions are frozen on release -- once a version ships, it never changes, so results are always comparable.
```json
{
  "id": "v1-sm-001",
  "version": "v1",
  "discipline": "soil_mechanics",
  "tier": 3,
  "grading_type": "numerical",
  "question": "A strip footing 1.5m wide carries a load of 200 kN/m...",
  "expected": {
    "value": 445.5,
    "tolerance_pct": 5
  },
  "open_book_context": "Terzaghi bearing capacity factors for phi=30...",
  "grounded_context": null,
  "tags": ["bearing-capacity", "terzaghi", "strip-footing"],
  "adversarial_notes": null,
  "multi_hop_chain": ["identify footing type", "select equation", "look up Nc", "calculate"]
}
```

- Create or edit `groundtruth/questions/vN.json` (never modify a released version)
- Follow the schema in `groundtruth/questions/schema.json`
- Run validation: `groundtruth list --versions vN` (will fail if schema is invalid)
- Each question needs a unique ID matching `vN-XX-NNN` (e.g., `v2-fe-015`)
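A quick pre-flight check for the ID pattern can be written as a one-line regex (a hypothetical helper, assuming `vN-XX-NNN` means version number, two-letter discipline abbreviation, three-digit question number):

```python
import re

# Assumed interpretation of vN-XX-NNN: "v" + version digits, two
# lowercase discipline letters, zero-padded three-digit number.
ID_PATTERN = re.compile(r"^v\d+-[a-z]{2}-\d{3}$")

def valid_question_id(question_id: str) -> bool:
    """Return True if the ID matches the vN-XX-NNN convention."""
    return bool(ID_PATTERN.match(question_id))
```

Full schema validation via `groundtruth list --versions vN` remains the authoritative check.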
GroundTruth includes an isolated RAG pipeline for the grounded test mode. It runs its own Qdrant and Infinity containers on different ports from any other local setup.
```bash
# Start the isolated Qdrant (port 6334) + Infinity (port 7998) stack
docker compose up -d

# Populate corpus/ with your reference material (.txt, .md, .json)

# Ingest into Qdrant
python -m groundtruth.ingest
```

- Ingest (`ingest.py`): Reads files from `corpus/`, chunks text (512 words, 64-word overlap), embeds via Infinity (Qwen3-Embedding-0.6B), upserts into Qdrant
- Retrieve (`rag.py`): Embeds the question, vector-searches Qdrant for top-20 candidates, reranks via Infinity (Jina-Reranker-v2-base-multilingual) down to top-5
- Inject: Formats retrieved chunks as numbered context and injects into the system prompt
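The chunking step amounts to a sliding word window: 512-word chunks where each chunk shares its first 64 words with the tail of the previous one. A sketch, assuming word-based splitting (the shipped `ingest.py` may differ in details):

```python
def chunk_words(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `size` words, each overlapping the
    previous chunk by `overlap` words (so the stride is size - overlap)."""
    words = text.split()
    step = size - overlap  # 448 new words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.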
The corpus is gitignored. The repo ships the harness and questions, not the reference material.
| Service | Port | Notes |
|---|---|---|
| Qdrant | 6334 | Separate from any other Qdrant instance |
| Infinity | 7998 | Runs Qwen3-Embedding-0.6B + Jina-Reranker-v2 |
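For orientation, the isolated stack presumably looks something like the following sketch; the image tags, internal ports, and volume setup here are assumptions, and the repo's actual `docker-compose.yml` is authoritative:

```yaml
# Hypothetical sketch only -- check the shipped docker-compose.yml.
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6334:6333"   # host 6334 to avoid clashing with a default Qdrant
  infinity:
    image: michaelf34/infinity:latest
    ports:
      - "7998:7997"   # host 7998; Infinity's default internal port is 7997
```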
docker_swap.py automates restarting a llama.cpp Docker container with a different model. Used by the pipeline for overnight multi-model runs.
```python
from groundtruth.docker_swap import restart_llama_server, warmup

restart_llama_server(
    "/models/Qwen3-30B-A3B-Q4_K_M.gguf",
    alias="qwen3-30b",
    thinking=True,
    context_size=32768,
    gpu_layers=999,
)
warmup("http://localhost:10100/v1")
```

GroundTruth tracks GPU power draw during inference and derives energy efficiency metrics for every model run.
| Metric | Unit | Description |
|---|---|---|
| Avg GPU Power | W | Mean GPU power draw across all inference calls |
| J/token | J | Energy per generated token |
| BMU/MTok | -- | Big Mac Units per million tokens (see below) |
| $/MTok | CAD | Electricity cost per million tokens at $0.11 CAD/kWh |
| tokens/Wh | -- | Token generation efficiency (higher is better) |
1 BMU = 563 kcal (the caloric content of a standard McDonald's Big Mac).
BMU/MTok expresses the energy consumed by your GPU to produce one million tokens, normalized to Big Macs. Same idea as The Economist's Big Mac Index -- an immediately intuitive reference quantity that makes an abstract number tangible.
Formula: BMU/MTok = (cal_per_token * 1,000,000) / 563,000
Electricity rate: All $/MTok figures use $0.11 CAD/kWh (representative Canadian residential rate).
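Plugging numbers into the formula makes the conversion concrete; in the sketch below, the 30 J/token input is an arbitrary example, not a measured result:

```python
CAL_PER_JOULE = 1 / 4.184   # thermochemical calorie
BIG_MAC_CAL = 563_000       # 563 kcal expressed in calories
RATE_CAD_PER_KWH = 0.11     # the benchmark's reference electricity rate

def energy_metrics(joules_per_token: float) -> dict:
    """Derive BMU/MTok and $/MTok (CAD) from a measured J/token figure."""
    cal_per_token = joules_per_token * CAL_PER_JOULE
    bmu_per_mtok = cal_per_token * 1_000_000 / BIG_MAC_CAL
    kwh_per_mtok = joules_per_token * 1_000_000 / 3_600_000  # J -> kWh
    return {"bmu_per_mtok": bmu_per_mtok,
            "cad_per_mtok": kwh_per_mtok * RATE_CAD_PER_KWH}

# e.g. 30 J/token works out to roughly 12.7 BMU/MTok and ~$0.92 CAD/MTok
```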
The multi-page dashboard (tools/dashboard_server.py, port 8777) provides real-time monitoring during benchmark runs and a leaderboard for completed results.
- Model cards with 12 live metric boxes per model
- Leaderboard with expandable sub-rows (by mode, discipline, tier)
- Compare tab with Chart.js radar + bar charts
- Rolling totals banner (tokens generated, GPU-hours, BMU consumed, electricity cost)
- Response viewer with LaTeX rendering (marked.js + KaTeX)
```bash
python tools/dashboard_server.py --no-browser
```

```
groundtruth/                      # Python package
  runner.py                       # CLI entry point + benchmark orchestration
  bank.py                         # Question bank loader + validator
  grader.py                       # Auto-grading engine (7 grading types)
  rag.py                          # RAG pipeline (embed, search, rerank)
  ingest.py                       # Corpus ingestion into Qdrant
  docker_swap.py                  # Docker llama-server model swapping
  export.py                       # Reddit markdown, CSV, JSON export
  db.py                           # SQLite cache layer over result JSON files
  questions/
    v1.json                       # Frozen V1 question set (30 questions)
    v2.json                       # Frozen V2 question set (9 questions)
    schema.json                   # JSON Schema for question validation
  templates/
    reddit.md                     # Standardized Reddit report template
results/                          # Benchmark output (gitignored)
gui/                              # PySide6 desktop app (optional)
tests/                            # pytest test suite
tools/
  bench_models.py                 # 2-phase benchmark pipeline (geotech + agent per model)
  bench_queue.py                  # 5-phase queue orchestrator (swap, geotech, agent, rfpq, context, codebase)
  agent_bench.py                  # Agent bench entry point (shim)
  agent_bench/                    # Agent bench package (48 capabilities, 8 categories)
  rfpq_bench.py                   # RFPQ document generation benchmark
  context_bench.py                # Context saturation benchmark
  codebase_bench.py               # Codebase comprehension benchmark
  dashboard_server.py             # Live dashboard (port 8777)
  moe_optimizer.py                # MoE expert offload optimizer
  semantic_grader_experiment.py   # Experimental semantic similarity grader
corpus/                           # Reference material for RAG (gitignored)
docker-compose.yml                # Isolated Qdrant + Infinity stack
pyproject.toml                    # Package config
```
All commands are available via the `groundtruth` entry point or `python -m groundtruth.runner`.
```bash
# Run benchmark
groundtruth run --endpoint URL --model NAME [--versions v1,v2] [--modes bare,grounded]
                [--quick] [--resume] [--top-p 0.95] [--top-k 20] [--presence-penalty 1.5]

# List / stats
groundtruth list [--versions v1,v2] [--discipline soil_mechanics]
groundtruth stats

# Export
groundtruth export results.json --format reddit|csv|json

# Compare
groundtruth compare results_a.json results_b.json

# Regrade (fix false negatives with updated synonyms or think-runaway recovery)
groundtruth regrade [--dry-run] [--recover-thinking]

# GUI
groundtruth gui
```

```bash
pip install -e ".[dev]"
python -m pytest groundtruth/tests/ -v
```

The most useful contribution is new questions. Good questions have:
- A clear, unambiguous prompt
- A well-defined expected answer with appropriate grading type
- Correct tier assignment (see the tier descriptions above)
- Proper discipline classification
- For numerical questions: a reasonable tolerance (default 5%)
- For adversarial questions: clear forbidden patterns and a `must_refuse` flag
Never edit a released version file. Create a new version (v2.json, etc.) for new questions.
The codebase is intentionally minimal. No ML dependencies, no frameworks, no abstractions for the sake of abstraction. It talks to external HTTP endpoints and grades the responses. If you want to add a feature, look at the existing patterns first.
MIT