Geotechnical LLM Benchmark Suite -- domain-specific model evaluation that generic benchmarks miss.
Python 3.11+ | MIT License | 39 questions | 9 disciplines | 8 difficulty tiers | 48 agent capabilities
GroundTruth is a benchmark suite for evaluating LLMs on geotechnical and civil engineering tasks. It ships 39 curated questions across 9 engineering disciplines and 8 difficulty tiers, from basic recall ("What does SPT stand for?") through multi-hop reasoning chains and adversarial trick questions designed to catch hallucination.
Beyond domain knowledge, GroundTruth includes a 48-capability agent bench, an RFPQ document generation test, a context saturation benchmark, and a codebase comprehension test -- measuring not just what a model knows, but how it performs as a working tool.
The benchmark runs against any OpenAI-compatible API endpoint. It supports automated Docker model swapping, HuggingFace model card sampling parameters, crash-safe resume, a live dashboard, and energy/cost metrics per token.
Generic benchmarks (MMLU, HumanEval, ARC, etc.) don't test domain expertise. A model that scores 90% on MMLU might hallucinate soil properties, apply the wrong bearing capacity equation, or confidently classify a CL as a CH. It might calculate settlement using Terzaghi's equation when it should use Schmertmann's method, or produce a number with the right magnitude but wrong units.
GroundTruth fills that gap. Every question was written by an engineer who knows where models break on technical content. The adversarial tier specifically tests whether a model refuses to answer when given insufficient parameters -- because in geotechnical engineering, a confident wrong answer is worse than no answer at all.
If you run local models and care about domain accuracy, this gives you data that MMLU never will.
Each model goes through up to five benchmark phases. The standard automated pipeline (tools/bench_models.py) runs them in sequence, swapping models and context sizes between phases.
The core of GroundTruth. Tests domain knowledge across geotechnical engineering.
Matrix: 39 questions x 2 modes (bare, grounded) x 2 thinking modes (think, nothink) = 156 items per model
Test Modes:
| Mode | Context Source | What It Tests |
|---|---|---|
| `bare` | None | Raw model knowledge. No hints, no context. What does the model actually know? |
| `grounded` | Live RAG retrieval | Real-world RAG performance. Retrieves from a frozen corpus via vector search + reranking. |
Two additional modes exist but are not in the standard pipeline:
- `open_book` -- curated context from the question author (13/39 questions have this)
- `grounded_prebaked` -- pre-retrieved chunks stored in question JSON (deterministic, no infra needed)
Disciplines (9):
| Code | Discipline |
|---|---|
| `soil_mechanics` | Stress, strain, consolidation, shear strength |
| `foundation_engineering` | Bearing capacity, settlement, pile design |
| `site_investigation` | SPT, CPT, field methods |
| `geotechnical_lab` | Atterberg limits, compaction, permeability |
| `slope_stability` | Factor of safety, failure modes, remediation |
| `earth_structures` | Embankments, retaining walls, reinforced earth |
| `groundwater` | Flow nets, dewatering, seepage |
| `standards_specs` | ASTM, AASHTO, Eurocode |
| `professional_practice` | Reports, ethics, project management |
Difficulty Tiers (8):
| Tier | Name | Description |
|---|---|---|
| 1 | Recall | Direct factual recall. "What does SPT stand for?" |
| 2 | Comprehension | Understanding concepts. "Explain drained vs. undrained conditions." |
| 3 | Application | Apply a method to given data. Calculate bearing capacity from parameters. |
| 4 | Analysis | Interpret results, identify patterns. "What does this CPT profile indicate?" |
| 5 | Synthesis | Combine multiple concepts. Design a foundation considering multiple constraints. |
| 6 | Multi-hop | Chain of reasoning across topics. Classify soil, then select foundation, then calculate. |
| 7 | Expert Judgment | No single correct answer. Requires engineering judgment and experience. |
| 8 | Adversarial | Trick questions with missing parameters, contradictory data, or impossible scenarios. Must refuse. |
Grading Types (7):
| Type | Auto-graded | How It Works |
|---|---|---|
| `numerical` | Yes | Extracts numbers from the response, compares the closest match to the expected value within tolerance (default 5%). Handles LaTeX, scientific notation, and comma-separated numbers. |
| `classification` | Yes | Checks if the expected classification code (e.g., "CL", "GP-GM") appears in the response. Supports LaTeX stripping, formula fingerprinting, and synonym lists. |
| `keyword` | Yes | Checks what fraction of required keywords appear in the response, with a configurable threshold. Includes a synonym dictionary for geotechnical terms (e.g., "shaft resistance" = "skin friction" = "side friction"). |
| `adversarial` | Yes | Verifies the model refuses to answer when it should. Negation-aware -- won't penalize "this is NOT safe" for containing "safe". Detects refusal-then-educate patterns. |
| `engineering_multi_step` | Yes | 5-tier composite scoring: identification (concepts/assumptions), methodology (method selection), execution (intermediate checkpoints), reasonableness (asymmetric penalty for unconservative answers), precision (final value accuracy). Weighted composite with a 0.60 pass threshold. |
| `judgment` | No | Requires manual expert review. Auto-grade returns null. |
| `subjective` | No | Open-ended, requires human scoring. Auto-grade returns null. |
The engineering_multi_step grader deserves special mention: it penalizes unconservative answers more harshly than conservative ones. In geotechnical engineering, overestimating bearing capacity is dangerous; underestimating it is merely expensive. The grader reflects this asymmetry.
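A minimal sketch of such an asymmetric penalty (illustrative only -- the 2x factor and the linear falloff are assumptions, not the shipped weights):

```python
def reasonableness(predicted: float, expected: float,
                   tolerance_pct: float = 5.0,
                   unconservative_factor: float = 2.0) -> float:
    """Score in [0, 1]. Overestimates of a capacity value (the
    unconservative, dangerous direction) are penalized harder than
    underestimates of equal magnitude."""
    rel_error = (predicted - expected) / expected  # signed relative error
    if abs(rel_error) * 100.0 <= tolerance_pct:
        return 1.0
    penalty = abs(rel_error)
    if rel_error > 0:  # overestimated capacity: dangerous direction
        penalty *= unconservative_factor
    return max(0.0, 1.0 - penalty)
```

With this shape, an answer 10% high scores noticeably worse than one 10% low.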
Tests the model as a general-purpose home/office agent across 48 capabilities in 8 categories. Uses real read-only API calls against a live homelab stack (Plex, Sonarr, Radarr, SABnzbd, Overseerr) with mock fallback.
48 capabilities, 480 max points, 8 categories:
| Category | Caps | What It Tests |
|---|---|---|
| Agentic | 8 | Multi-step planning, calendar scheduling, email triage, safety guardrails, tool error recovery, agent trace debugging, prompt ambiguity, over-refusal |
| Coding | 7 | Code generation, TDD, codebase exploration, code review, JSON reliability, math reasoning, self-repair |
| Context | 8 | Data transformation, instruction following (+ verifiable), multi-turn coherence, long-context needle, adversarial context, multi-hop, pass^k reliability |
| Documents | 5 | Web summarization, financial processing, document processing/routing, news intelligence |
| Engineering | 6 | Document synthesis, code-from-spec, feature implementation, engineering reports, RFPQ generation, seed reverse engineering |
| Homelab | 6 | Cron simulation, Docker stack challenges, troubleshooting, SSH deployment, log analysis, HA automation |
| Media | 4 | Overseerr API, autonomous discovery (easy + hard), homelab stack orchestration |
| Squire | 5 | Context compression, brainstorm capture, knowledge acquisition, extended multi-turn, self-organization |
Each capability is scored 0-10. The agent bench runs in both thinking and non-thinking modes where applicable.
Standalone long-document generation test, extracted from the agent bench engineering category. The model is reloaded at 64k context and asked to produce a complete Request for Proposal/Qualification (RFPQ) response document for a geotechnical engineering project.
Tests: long-form structured writing, technical accuracy, document organization, and the ability to maintain coherence across a multi-page professional document.
Measures tok/s degradation as KV cache fills. Two modes:
| Mode | Method | What It Shows |
|---|---|---|
| `single` | Independent requests at increasing prompt sizes | Raw model scaling with context length |
| `multiturn` | Accumulating conversation that grows turn-by-turn | Real-world degradation with prefix caching |
The model is reloaded at its native maximum context (e.g., 262k for Qwen3.5) for this phase. Supports using real corpus content (PDF textbooks) instead of synthetic filler.
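In essence, `single` mode is just timing independent completions at growing prompt sizes. A stripped-down sketch (the `generate` callable and the word-to-token approximation are placeholders, not the real harness):

```python
import time

def context_scaling(generate, prompt_sizes=(1_000, 8_000, 32_000, 128_000)):
    """Issue one independent request per prompt size and record
    generation throughput, making degradation vs. context visible.
    `generate(prompt)` is assumed to return the generated token count."""
    filler_word = "soil "  # synthetic filler, roughly one token per word
    results = []
    for size in prompt_sizes:
        prompt = filler_word * size
        start = time.perf_counter()
        generated_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append({"prompt_tokens": size,
                        "tok_per_s": generated_tokens / elapsed})
    return results
```

Plotting `tok_per_s` against `prompt_tokens` shows how throughput falls off as the KV cache fills.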
Dumps curated GroundTruth source files (~196k tokens, ~45 files) into a single prompt and asks "explain this codebase." The response is keyword-graded against a 20-item architectural checklist of facts we know to be true.
Tests: extreme long-context comprehension, cross-file reasoning, and architecture understanding. Smaller models will choke.
```bash
git clone https://github.com/3spky5u-oss/GroundTruth.git
cd GroundTruth
pip install -e .

# Optional: rich CLI output
pip install -e ".[cli]"

# Optional: PySide6 GUI
pip install -e ".[gui]"
```

```bash
# Quick mode: 1 trial, temperature 0.0, bare mode only
groundtruth run --endpoint http://localhost:10100/v1 --model "my-model" --quick

# Automated multi-model pipeline (recommended)
python tools/bench_models.py                   # All models in queue
python tools/bench_models.py --models 0.8b 2b  # Specific models
python tools/bench_models.py --geotech-only    # Geotech phase only
python tools/bench_models.py --agent-only      # Agent bench only
python tools/bench_models.py --smoke           # Quick validation

# Live dashboard
python tools/dashboard_server.py --no-browser  # Start on port 8777

# Reddit markdown report
groundtruth export results/my-model.json --format reddit

# Compare two models
groundtruth compare results/model_a.json results/model_b.json

# CSV for spreadsheets
groundtruth export results/my-model.json --format csv
```

Questions live in `groundtruth/questions/` as versioned JSON files (`v1.json`, `v2.json`, etc.). Each question follows a strict schema (`schema.json`). Versions are frozen on release -- once a version ships, it never changes, so results are always comparable.
```json
{
  "id": "v1-sm-001",
  "version": "v1",
  "discipline": "soil_mechanics",
  "tier": 3,
  "grading_type": "numerical",
  "question": "A strip footing 1.5m wide carries a load of 200 kN/m...",
  "expected": {
    "value": 445.5,
    "tolerance_pct": 5
  },
  "open_book_context": "Terzaghi bearing capacity factors for phi=30...",
  "grounded_context": null,
  "tags": ["bearing-capacity", "terzaghi", "strip-footing"],
  "adversarial_notes": null,
  "multi_hop_chain": ["identify footing type", "select equation", "look up Nc", "calculate"]
}
```

- Create or edit `groundtruth/questions/vN.json` (never modify a released version)
- Follow the schema in `groundtruth/questions/schema.json`
- Run validation: `groundtruth list --versions vN` (will fail if schema is invalid)
- Each question needs a unique ID matching `vN-XX-NNN` (e.g., `v2-fe-015`)
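A quick pre-flight check for the ID pattern can be written as a one-line regex (a hypothetical helper, assuming `vN-XX-NNN` means version number, two-letter discipline abbreviation, three-digit question number):

```python
import re

# Assumed interpretation of vN-XX-NNN: "v" + version digits, two
# lowercase discipline letters, zero-padded three-digit number.
ID_PATTERN = re.compile(r"^v\d+-[a-z]{2}-\d{3}$")

def valid_question_id(question_id: str) -> bool:
    """Return True if the ID matches the vN-XX-NNN convention."""
    return bool(ID_PATTERN.match(question_id))
```

Full schema validation via `groundtruth list --versions vN` remains the authoritative check.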
GroundTruth includes an isolated RAG pipeline for the grounded test mode. It runs its own Qdrant and Infinity containers on different ports from any other local setup.
```bash
# Start the isolated Qdrant (port 6334) + Infinity (port 7998) stack
docker compose up -d

# Populate corpus/ with your reference material (.txt, .md, .json)

# Ingest into Qdrant
python -m groundtruth.ingest
```

- Ingest (`ingest.py`): Reads files from `corpus/`, chunks text (512 words, 64-word overlap), embeds via Infinity (Qwen3-Embedding-0.6B), upserts into Qdrant
- Retrieve (`rag.py`): Embeds the question, vector-searches Qdrant for top-20 candidates, reranks via Infinity (Jina-Reranker-v2-base-multilingual) down to top-5
- Inject: Formats retrieved chunks as numbered context and injects into the system prompt
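The chunking step amounts to a sliding word window: 512-word chunks where each chunk shares its first 64 words with the tail of the previous one. A sketch, assuming word-based splitting (the shipped `ingest.py` may differ in details):

```python
def chunk_words(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `size` words, each overlapping the
    previous chunk by `overlap` words (so the stride is size - overlap)."""
    words = text.split()
    step = size - overlap  # 448 new words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.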
The corpus is gitignored. The repo ships the harness and questions, not the reference material.
| Service | Port | Notes |
|---|---|---|
| Qdrant | 6334 | Separate from any other Qdrant instance |
| Infinity | 7998 | Runs Qwen3-Embedding-0.6B + Jina-Reranker-v2 |
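For orientation, the isolated stack presumably looks something like the following sketch; the image tags, internal ports, and volume setup here are assumptions, and the repo's actual `docker-compose.yml` is authoritative:

```yaml
# Hypothetical sketch only -- check the shipped docker-compose.yml.
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6334:6333"   # host 6334 to avoid clashing with a default Qdrant
  infinity:
    image: michaelf34/infinity:latest
    ports:
      - "7998:7997"   # host 7998; Infinity's default internal port is 7997
```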
docker_swap.py automates restarting a llama.cpp Docker container with a different model. Used by the pipeline for overnight multi-model runs.
```python
from groundtruth.docker_swap import restart_llama_server, warmup

restart_llama_server(
    "/models/Qwen3-30B-A3B-Q4_K_M.gguf",
    alias="qwen3-30b",
    thinking=True,
    context_size=32768,
    gpu_layers=999,
)
warmup("http://localhost:10100/v1")
```

GroundTruth tracks GPU power draw during inference and derives energy efficiency metrics for every model run.
| Metric | Unit | Description |
|---|---|---|
| Avg GPU Power | W | Mean GPU power draw across all inference calls |
| J/token | J | Energy per generated token |
| BMU/MTok | -- | Big Mac Units per million tokens (see below) |
| $/MTok | CAD | Electricity cost per million tokens at $0.11 CAD/kWh |
| tokens/Wh | -- | Token generation efficiency (higher is better) |
1 BMU = 563 kcal (the caloric content of a standard McDonald's Big Mac).
BMU/MTok expresses the energy consumed by your GPU to produce one million tokens, normalized to Big Macs. Same idea as The Economist's Big Mac Index -- an immediately intuitive reference quantity that makes an abstract number tangible.
Formula: BMU/MTok = (cal_per_token * 1,000,000) / 563,000
Electricity rate: All $/MTok figures use $0.11 CAD/kWh (representative Canadian residential rate).
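Plugging numbers into the formula makes the conversion concrete; in the sketch below, the 30 J/token input is an arbitrary example, not a measured result:

```python
CAL_PER_JOULE = 1 / 4.184   # thermochemical calorie
BIG_MAC_CAL = 563_000       # 563 kcal expressed in calories
RATE_CAD_PER_KWH = 0.11     # the benchmark's reference electricity rate

def energy_metrics(joules_per_token: float) -> dict:
    """Derive BMU/MTok and $/MTok (CAD) from a measured J/token figure."""
    cal_per_token = joules_per_token * CAL_PER_JOULE
    bmu_per_mtok = cal_per_token * 1_000_000 / BIG_MAC_CAL
    kwh_per_mtok = joules_per_token * 1_000_000 / 3_600_000  # J -> kWh
    return {"bmu_per_mtok": bmu_per_mtok,
            "cad_per_mtok": kwh_per_mtok * RATE_CAD_PER_KWH}

# e.g. 30 J/token works out to roughly 12.7 BMU/MTok and ~$0.92 CAD/MTok
```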
The multi-page dashboard (tools/dashboard_server.py, port 8777) provides real-time monitoring during benchmark runs and a leaderboard for completed results.
- Model cards with 12 live metric boxes per model
- Leaderboard with expandable sub-rows (by mode, discipline, tier)
- Compare tab with Chart.js radar + bar charts
- Rolling totals banner (tokens generated, GPU-hours, BMU consumed, electricity cost)
- Response viewer with LaTeX rendering (marked.js + KaTeX)
```bash
python tools/dashboard_server.py --no-browser
```

```
groundtruth/                      # Python package
  runner.py                       # CLI entry point + benchmark orchestration
  bank.py                         # Question bank loader + validator
  grader.py                       # Auto-grading engine (7 grading types)
  rag.py                          # RAG pipeline (embed, search, rerank)
  ingest.py                       # Corpus ingestion into Qdrant
  docker_swap.py                  # Docker llama-server model swapping
  export.py                       # Reddit markdown, CSV, JSON export
  db.py                           # SQLite cache layer over result JSON files
  questions/
    v1.json                       # Frozen V1 question set (30 questions)
    v2.json                       # Frozen V2 question set (9 questions)
    schema.json                   # JSON Schema for question validation
  templates/
    reddit.md                     # Standardized Reddit report template
results/                          # Benchmark output (gitignored)
gui/                              # PySide6 desktop app (optional)
tests/                            # pytest test suite
tools/
  bench_models.py                 # 2-phase benchmark pipeline (geotech + agent per model)
  bench_queue.py                  # 5-phase queue orchestrator (swap, geotech, agent, rfpq, context, codebase)
  agent_bench.py                  # Agent bench entry point (shim)
  agent_bench/                    # Agent bench package (48 capabilities, 8 categories)
  rfpq_bench.py                   # RFPQ document generation benchmark
  context_bench.py                # Context saturation benchmark
  codebase_bench.py               # Codebase comprehension benchmark
  dashboard_server.py             # Live dashboard (port 8777)
  moe_optimizer.py                # MoE expert offload optimizer
  semantic_grader_experiment.py   # Experimental semantic similarity grader
corpus/                           # Reference material for RAG (gitignored)
docker-compose.yml                # Isolated Qdrant + Infinity stack
pyproject.toml                    # Package config
```
All commands are available via the `groundtruth` entry point or `python -m groundtruth.runner`.
```bash
# Run benchmark
groundtruth run --endpoint URL --model NAME [--versions v1,v2] [--modes bare,grounded]
                [--quick] [--resume] [--top-p 0.95] [--top-k 20] [--presence-penalty 1.5]

# List / stats
groundtruth list [--versions v1,v2] [--discipline soil_mechanics]
groundtruth stats

# Export
groundtruth export results.json --format reddit|csv|json

# Compare
groundtruth compare results_a.json results_b.json

# Regrade (fix false negatives with updated synonyms or think-runaway recovery)
groundtruth regrade [--dry-run] [--recover-thinking]

# GUI
groundtruth gui
```

```bash
pip install -e ".[dev]"
python -m pytest groundtruth/tests/ -v
```

The most useful contribution is new questions. Good questions have:
- A clear, unambiguous prompt
- A well-defined expected answer with appropriate grading type
- Correct tier assignment (see the tier descriptions above)
- Proper discipline classification
- For numerical questions: a reasonable tolerance (default 5%)
- For adversarial questions: clear forbidden patterns and a `must_refuse` flag
Never edit a released version file. Create a new version (v2.json, etc.) for new questions.
The codebase is intentionally minimal. No ML dependencies, no frameworks, no abstractions for the sake of abstraction. It talks to external HTTP endpoints and grades the responses. If you want to add a feature, look at the existing patterns first.
MIT