3spky5u-oss/GroundTruth

GroundTruth

Geotechnical LLM Benchmark Suite -- domain-specific model evaluation that generic benchmarks miss.

Python 3.11+ | MIT License | 39 questions | 9 disciplines | 8 difficulty tiers | 48 agent capabilities


What Is This

GroundTruth is a benchmark suite for evaluating LLMs on geotechnical and civil engineering tasks. It ships 39 curated questions across 9 engineering disciplines and 8 difficulty tiers, from basic recall ("What does SPT stand for?") through multi-hop reasoning chains and adversarial trick questions designed to catch hallucination.

Beyond domain knowledge, GroundTruth includes a 48-capability agent bench, an RFPQ document generation test, a context saturation benchmark, and a codebase comprehension test -- measuring not just what a model knows, but how it performs as a working tool.

The benchmark runs against any OpenAI-compatible API endpoint. It supports automated Docker model swapping, HuggingFace model card sampling parameters, crash-safe resume, a live dashboard, and energy/cost metrics per token.

Why

Generic benchmarks (MMLU, HumanEval, ARC, etc.) don't test domain expertise. A model that scores 90% on MMLU might hallucinate soil properties, apply the wrong bearing capacity equation, or confidently classify a CL as a CH. It might calculate settlement using Terzaghi's equation when it should use Schmertmann's method, or produce a number with the right magnitude but wrong units.

GroundTruth fills that gap. Every question was written by an engineer who knows where models break on technical content. The adversarial tier specifically tests whether a model refuses to answer when given insufficient parameters -- because in geotechnical engineering, a confident wrong answer is worse than no answer at all.

If you run local models and care about domain accuracy, this gives you data that MMLU never will.


Testing Phases

Each model goes through up to five benchmark phases. The standard automated pipeline (tools/bench_models.py) runs them in sequence, swapping models and context sizes between phases.

Phase 1: Geotech Domain Benchmark

The core of GroundTruth. Tests domain knowledge across geotechnical engineering.

Matrix: 39 questions x 2 modes (bare, grounded) x 2 thinking modes (think, nothink) = 156 items per model

Test Modes:

| Mode | Context Source | What It Tests |
|------|----------------|---------------|
| bare | None | Raw model knowledge. No hints, no context. What does the model actually know? |
| grounded | Live RAG retrieval | Real-world RAG performance. Retrieves from a frozen corpus via vector search + reranking. |

Two additional modes exist but are not in the standard pipeline:

  • open_book -- curated context from the question author (13/39 questions have this)
  • grounded_prebaked -- pre-retrieved chunks stored in question JSON (deterministic, no infra needed)

Disciplines (9):

| Code | Topics Covered |
|------|----------------|
| soil_mechanics | Stress, strain, consolidation, shear strength |
| foundation_engineering | Bearing capacity, settlement, pile design |
| site_investigation | SPT, CPT, field methods |
| geotechnical_lab | Atterberg limits, compaction, permeability |
| slope_stability | Factor of safety, failure modes, remediation |
| earth_structures | Embankments, retaining walls, reinforced earth |
| groundwater | Flow nets, dewatering, seepage |
| standards_specs | ASTM, AASHTO, Eurocode |
| professional_practice | Reports, ethics, project management |

Difficulty Tiers (8):

| Tier | Name | Description |
|------|------|-------------|
| 1 | Recall | Direct factual recall. "What does SPT stand for?" |
| 2 | Comprehension | Understanding concepts. "Explain drained vs. undrained conditions." |
| 3 | Application | Apply a method to given data. Calculate bearing capacity from parameters. |
| 4 | Analysis | Interpret results, identify patterns. "What does this CPT profile indicate?" |
| 5 | Synthesis | Combine multiple concepts. Design a foundation considering multiple constraints. |
| 6 | Multi-hop | Chain of reasoning across topics. Classify soil, then select foundation, then calculate. |
| 7 | Expert Judgment | No single correct answer. Requires engineering judgment and experience. |
| 8 | Adversarial | Trick questions with missing parameters, contradictory data, or impossible scenarios. Must refuse. |

Grading Types (7):

| Type | Auto-graded | How It Works |
|------|-------------|--------------|
| numerical | Yes | Extracts numbers from the response, compares the closest match to the expected value within tolerance (default 5%). Handles LaTeX, scientific notation, and comma-separated numbers. |
| classification | Yes | Checks if the expected classification code (e.g., "CL", "GP-GM") appears in the response. Supports LaTeX stripping, formula fingerprinting, and synonym lists. |
| keyword | Yes | Checks what fraction of required keywords appear in the response, with a configurable threshold. Includes a synonym dictionary for geotechnical terms (e.g., "shaft resistance" = "skin friction" = "side friction"). |
| adversarial | Yes | Verifies the model refuses to answer when it should. Negation-aware -- won't penalize "this is NOT safe" for containing "safe". Detects refusal-then-educate patterns. |
| engineering_multi_step | Yes | 5-tier composite scoring: identification (concepts/assumptions), methodology (method selection), execution (intermediate checkpoints), reasonableness (asymmetric penalty for unconservative answers), precision (final value accuracy). Weighted composite with 0.60 pass threshold. |
| judgment | No | Requires manual expert review. Auto-grade returns null. |
| subjective | No | Open-ended, requires human scoring. Auto-grade returns null. |
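The adversarial grader's negation awareness can be sketched roughly as follows. This is an illustrative simplification, not the shipped grader: the cue lists, the 20-character look-behind window, and the function name are all assumptions.

```python
import re

# Illustrative refusal cues -- the real grader's lists differ.
REFUSAL_CUES = ["cannot be determined", "insufficient", "not enough information",
                "refuse", "cannot answer", "missing parameter"]

def grade_adversarial(response: str, forbidden=("safe",)) -> bool:
    """Pass only if the model refuses AND no forbidden word appears
    without a negation immediately before it."""
    text = response.lower()
    refused = any(cue in text for cue in REFUSAL_CUES)
    violated = False
    for word in forbidden:
        for m in re.finditer(re.escape(word), text):
            # Look at the few characters before the match:
            # "not safe" / "unsafe" should not count as claiming "safe".
            window = text[max(0, m.start() - 20):m.start()]
            if not re.search(r"\b(not|isn't|is not|un)\s*$", window):
                violated = True
    return refused and not violated
```

A refusal that goes on to say "this is not safe to estimate" still passes, while a confident answer containing the forbidden claim fails.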

The engineering_multi_step grader deserves special mention: it penalizes unconservative answers more harshly than conservative ones. In geotechnical engineering, overestimating bearing capacity is dangerous; underestimating it is merely expensive. The grader reflects this asymmetry.
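That asymmetry can be illustrated with a minimal sketch. The five tier names and the 0.60 pass threshold come from the description above; the weights and the 2:1 penalty slope are assumptions chosen for illustration, not the shipped values.

```python
def composite_score(identification, methodology, execution, precision,
                    predicted, expected, pass_threshold=0.60):
    """Illustrative engineering_multi_step-style composite.
    All tier inputs are floats in [0, 1]."""
    # Asymmetric reasonableness: overestimating capacity (unconservative)
    # is penalized twice as hard here as underestimating it.
    ratio = predicted / expected
    if ratio > 1.0:
        reasonableness = max(0.0, 1.0 - 2.0 * (ratio - 1.0))
    else:
        reasonableness = max(0.0, 1.0 - (1.0 - ratio))
    weighted = {
        "identification": (identification, 0.15),
        "methodology":    (methodology, 0.20),
        "execution":      (execution, 0.25),
        "reasonableness": (reasonableness, 0.20),
        "precision":      (precision, 0.20),
    }
    score = sum(value * weight for value, weight in weighted.values())
    return score, score >= pass_threshold
```

With these example weights, a 10% overestimate of bearing capacity scores lower than a 10% underestimate, all else equal.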

Phase 2: Agent Bench

Tests the model as a general-purpose home/office agent across 48 capabilities in 8 categories. Uses real read-only API calls against a live homelab stack (Plex, Sonarr, Radarr, SABnzbd, Overseerr) with mock fallback.

48 capabilities, 480 max points, 8 categories:

| Category | Caps | What It Tests |
|----------|------|---------------|
| Agentic | 8 | Multi-step planning, calendar scheduling, email triage, safety guardrails, tool error recovery, agent trace debugging, prompt ambiguity, over-refusal |
| Coding | 7 | Code generation, TDD, codebase exploration, code review, JSON reliability, math reasoning, self-repair |
| Context | 8 | Data transformation, instruction following (+ verifiable), multi-turn coherence, long-context needle, adversarial context, multi-hop, pass^k reliability |
| Documents | 5 | Web summarization, financial processing, document processing/routing, news intelligence |
| Engineering | 6 | Document synthesis, code-from-spec, feature implementation, engineering reports, RFPQ generation, seed reverse engineering |
| Homelab | 6 | Cron simulation, Docker stack challenges, troubleshooting, SSH deployment, log analysis, HA automation |
| Media | 4 | Overseerr API, autonomous discovery (easy + hard), homelab stack orchestration |
| Squire | 5 | Context compression, brainstorm capture, knowledge acquisition, extended multi-turn, self-organization |

Each capability is scored 0-10. The agent bench runs in both think and nothink modes where applicable.

Phase 3: RFPQ Bench

Standalone long-document generation test, extracted from the agent bench engineering category. The model is reloaded at 64k context and asked to produce a complete Request for Proposal/Qualification (RFPQ) response document for a geotechnical engineering project.

Tests: long-form structured writing, technical accuracy, document organization, and the ability to maintain coherence across a multi-page professional document.

Phase 4: Context Saturation Bench

Measures tok/s degradation as KV cache fills. Two modes:

| Mode | Method | What It Shows |
|------|--------|---------------|
| single | Independent requests at increasing prompt sizes | Raw model scaling with context length |
| multiturn | Accumulating conversation that grows turn-by-turn | Real-world degradation with prefix caching |

The model is reloaded at its native maximum context (e.g., 262k for Qwen3.5) for this phase. Supports using real corpus content (PDF textbooks) instead of synthetic filler.
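The single-mode sweep reduces to timing generation at growing prompt sizes. A minimal sketch, assuming a `generate` callable that wraps the actual OpenAI-compatible request and returns the number of tokens produced (the callable's shape and names are illustrative, not GroundTruth's API):

```python
import time

def throughput_sweep(generate, prompt_sizes, filler="lorem "):
    """Measure tokens/second at each prompt size.
    `generate(prompt) -> int` returns the completion token count."""
    results = []
    for size in prompt_sizes:
        prompt = filler * size  # synthetic filler; the real bench can
                                # substitute corpus text (PDF textbooks)
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard /0
        results.append((size, n_tokens / elapsed))
    return results  # [(prompt_size, tokens_per_second), ...]
```

Plotting tokens/second against prompt size then shows how throughput degrades as the KV cache fills.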

Phase 5: Codebase Comprehension Bench

Dumps curated GroundTruth source files (~196k tokens, ~45 files) into a single prompt and asks "explain this codebase." The response is keyword-graded against a 20-item architectural checklist of facts we know to be true.

Tests: extreme long-context comprehension, cross-file reasoning, and architecture understanding. Smaller models will choke.
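Checklist grading here is the same fraction-of-keywords idea as the keyword grading type. A minimal sketch (the threshold, names, and the omission of synonym handling are all simplifications):

```python
def checklist_score(response, checklist, threshold=0.6):
    """Fraction of checklist items mentioned in the response,
    pass/fail against a threshold."""
    text = response.lower()
    hits = [item for item in checklist if item.lower() in text]
    fraction = len(hits) / len(checklist)
    return fraction, fraction >= threshold
```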


Quick Start

Install

git clone https://github.com/3spky5u-oss/GroundTruth.git
cd GroundTruth
pip install -e .

# Optional: rich CLI output
pip install -e ".[cli]"

# Optional: PySide6 GUI
pip install -e ".[gui]"

Run a Quick Benchmark

# Quick mode: 1 trial, temperature 0.0, bare mode only
groundtruth run --endpoint http://localhost:10100/v1 --model "my-model" --quick

Run the Full Pipeline

# Automated multi-model pipeline (recommended)
python tools/bench_models.py                        # All models in queue
python tools/bench_models.py --models 0.8b 2b       # Specific models
python tools/bench_models.py --geotech-only          # Geotech phase only
python tools/bench_models.py --agent-only             # Agent bench only
python tools/bench_models.py --smoke                  # Quick validation

View Results

# Live dashboard
python tools/dashboard_server.py --no-browser   # Start on port 8777

# Reddit markdown report
groundtruth export results/my-model.json --format reddit

# Compare two models
groundtruth compare results/model_a.json results/model_b.json

# CSV for spreadsheets
groundtruth export results/my-model.json --format csv

Question Bank

Questions live in groundtruth/questions/ as versioned JSON files (v1.json, v2.json, etc.). Each question follows a strict schema (schema.json). Versions are frozen on release -- once a version ships, it never changes, so results are always comparable.

Question Schema

{
  "id": "v1-sm-001",
  "version": "v1",
  "discipline": "soil_mechanics",
  "tier": 3,
  "grading_type": "numerical",
  "question": "A strip footing 1.5m wide carries a load of 200 kN/m...",
  "expected": {
    "value": 445.5,
    "tolerance_pct": 5
  },
  "open_book_context": "Terzaghi bearing capacity factors for phi=30...",
  "grounded_context": null,
  "tags": ["bearing-capacity", "terzaghi", "strip-footing"],
  "adversarial_notes": null,
  "multi_hop_chain": ["identify footing type", "select equation", "look up Nc", "calculate"]
}
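Given the expected block above, numerical grading reduces to something like the following sketch. It handles comma-separated numbers and scientific notation but omits the LaTeX stripping the real grader performs; the function name is illustrative.

```python
import re

def grade_numerical(response, expected_value, tolerance_pct=5.0):
    """Extract all numbers, take the one closest to the expected
    value, and pass if it falls within the tolerance band."""
    numbers = [float(m.replace(",", ""))
               for m in re.findall(r"-?\d[\d,]*\.?\d*(?:[eE][+-]?\d+)?",
                                   response)]
    if not numbers:
        return False
    closest = min(numbers, key=lambda n: abs(n - expected_value))
    return abs(closest - expected_value) <= abs(expected_value) * tolerance_pct / 100
```

So a response of "approximately 447 kPa" passes against an expected 445.5 at 5% tolerance, while "about 600 kPa" does not.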

Adding Questions

  1. Create or edit groundtruth/questions/vN.json (never modify a released version)
  2. Follow the schema in groundtruth/questions/schema.json
  3. Run validation: groundtruth list --versions vN (will fail if schema is invalid)
  4. Each question needs a unique ID matching vN-XX-NNN (e.g., v2-fe-015)

RAG Stack

GroundTruth includes an isolated RAG pipeline for the grounded test mode. It runs its own Qdrant and Infinity containers on different ports from any other local setup.

Setup

# Start the isolated Qdrant (port 6334) + Infinity (port 7998) stack
docker compose up -d

# Populate corpus/ with your reference material (.txt, .md, .json)
# Ingest into Qdrant
python -m groundtruth.ingest

How It Works

  1. Ingest (ingest.py): Reads files from corpus/, chunks text (512 words, 64-word overlap), embeds via Infinity (Qwen3-Embedding-0.6B), upserts into Qdrant
  2. Retrieve (rag.py): Embeds the question, vector-searches Qdrant for top-20 candidates, reranks via Infinity (Jina-Reranker-v2-base-multilingual) down to top-5
  3. Inject: Formats retrieved chunks as numbered context and injects into the system prompt
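Step 1's sliding-window chunking works like this in outline. The 512/64 sizes match the description above; splitting on whitespace and the helper name are simplifications of whatever ingest.py actually does.

```python
def chunk_words(text, size=512, overlap=64):
    """Split text into word windows of `size` words,
    each overlapping the previous by `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks
```

Each chunk repeats the last 64 words of its predecessor, so a sentence straddling a chunk boundary still appears whole in at least one chunk.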

The corpus is gitignored. The repo ships the harness and questions, not the reference material.

| Service | Port | Notes |
|---------|------|-------|
| Qdrant | 6334 | Separate from any other Qdrant instance |
| Infinity | 7998 | Runs Qwen3-Embedding-0.6B + Jina-Reranker-v2 |

Docker Model Swapping

docker_swap.py automates restarting a llama.cpp Docker container with a different model. Used by the pipeline for overnight multi-model runs.

from groundtruth.docker_swap import restart_llama_server, warmup

restart_llama_server(
    "/models/Qwen3-30B-A3B-Q4_K_M.gguf",
    alias="qwen3-30b",
    thinking=True,
    context_size=32768,
    gpu_layers=999,
)
warmup("http://localhost:10100/v1")

Energy & Cost Metrics

GroundTruth tracks GPU power draw during inference and derives energy efficiency metrics for every model run.

| Metric | Unit | Description |
|--------|------|-------------|
| Avg GPU Power | W | Mean GPU power draw across all inference calls |
| J/token | J | Energy per generated token |
| BMU/MTok | -- | Big Mac Units per million tokens (see below) |
| $/MTok | CAD | Electricity cost per million tokens at $0.11 CAD/kWh |
| tokens/Wh | -- | Token generation efficiency (higher is better) |

BMU/MTok -- Big Mac Units

1 BMU = 563 kcal (the caloric content of a standard McDonald's Big Mac).

BMU/MTok expresses the energy consumed by your GPU to produce one million tokens, normalized to Big Macs. Same idea as The Economist's Big Mac Index -- an immediately intuitive reference quantity that makes an abstract number tangible.

Formula: BMU/MTok = (cal_per_token * 1,000,000) / 563,000

Electricity rate: All $/MTok figures use $0.11 CAD/kWh (representative Canadian residential rate).
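The conversions above chain together like so. The constants are the ones stated (563 kcal per BMU, $0.11 CAD/kWh) plus standard unit factors; the function name is illustrative.

```python
J_PER_KCAL = 4184.0   # joules per kilocalorie
BMU_KCAL = 563.0      # one Big Mac, as defined above
CAD_PER_KWH = 0.11    # representative Canadian residential rate
J_PER_KWH = 3_600_000.0

def energy_metrics(joules_per_token):
    """Derive BMU/MTok and $/MTok from measured J/token.
    The J/token figure itself comes from GPU power sampling."""
    kcal_per_token = joules_per_token / J_PER_KCAL
    bmu_per_mtok = kcal_per_token * 1_000_000 / BMU_KCAL
    kwh_per_mtok = joules_per_token * 1_000_000 / J_PER_KWH
    cad_per_mtok = kwh_per_mtok * CAD_PER_KWH
    return bmu_per_mtok, cad_per_mtok
```

For example, a model drawing 1 J/token burns a bit under half a Big Mac per million tokens.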

Live Dashboard

The multi-page dashboard (tools/dashboard_server.py, port 8777) provides real-time monitoring during benchmark runs and a leaderboard for completed results.

  • Model cards with 12 live metric boxes per model
  • Leaderboard with expandable sub-rows (by mode, discipline, tier)
  • Compare tab with Chart.js radar + bar charts
  • Rolling totals banner (tokens generated, GPU-hours, BMU consumed, electricity cost)
  • Response viewer with LaTeX rendering (marked.js + KaTeX)

python tools/dashboard_server.py --no-browser

Project Structure

groundtruth/                    # Python package
  runner.py                     # CLI entry point + benchmark orchestration
  bank.py                       # Question bank loader + validator
  grader.py                     # Auto-grading engine (7 grading types)
  rag.py                        # RAG pipeline (embed, search, rerank)
  ingest.py                     # Corpus ingestion into Qdrant
  docker_swap.py                # Docker llama-server model swapping
  export.py                     # Reddit markdown, CSV, JSON export
  db.py                         # SQLite cache layer over result JSON files
  questions/
    v1.json                     # Frozen V1 question set (30 questions)
    v2.json                     # Frozen V2 question set (9 questions)
    schema.json                 # JSON Schema for question validation
  templates/
    reddit.md                   # Standardized Reddit report template
  results/                      # Benchmark output (gitignored)
  gui/                          # PySide6 desktop app (optional)
  tests/                        # pytest test suite

tools/
  bench_models.py               # 2-phase benchmark pipeline (geotech + agent per model)
  bench_queue.py                # 5-phase queue orchestrator (model swap, then the geotech, agent, rfpq, context, and codebase phases)
  agent_bench.py                # Agent bench entry point (shim)
  agent_bench/                  # Agent bench package (48 capabilities, 8 categories)
  rfpq_bench.py                 # RFPQ document generation benchmark
  context_bench.py              # Context saturation benchmark
  codebase_bench.py             # Codebase comprehension benchmark
  dashboard_server.py           # Live dashboard (port 8777)
  moe_optimizer.py              # MoE expert offload optimizer
  semantic_grader_experiment.py # Experimental semantic similarity grader

corpus/                         # Reference material for RAG (gitignored)
docker-compose.yml              # Isolated Qdrant + Infinity stack
pyproject.toml                  # Package config

CLI Reference

All commands are available via the groundtruth entry point or python -m groundtruth.runner.

# Run benchmark
groundtruth run --endpoint URL --model NAME [--versions v1,v2] [--modes bare,grounded]
                [--quick] [--resume] [--top-p 0.95] [--top-k 20] [--presence-penalty 1.5]

# List / stats
groundtruth list [--versions v1,v2] [--discipline soil_mechanics]
groundtruth stats

# Export
groundtruth export results.json --format reddit|csv|json

# Compare
groundtruth compare results_a.json results_b.json

# Regrade (fix false negatives with updated synonyms or think-runaway recovery)
groundtruth regrade [--dry-run] [--recover-thinking]

# GUI
groundtruth gui

Running Tests

pip install -e ".[dev]"
python -m pytest groundtruth/tests/ -v

Contributing

Adding Questions

The most useful contribution is new questions. Good questions have:

  • A clear, unambiguous prompt
  • A well-defined expected answer with appropriate grading type
  • Correct tier assignment (see the tier descriptions above)
  • Proper discipline classification
  • For numerical questions: a reasonable tolerance (default 5%)
  • For adversarial questions: clear forbidden patterns and a must_refuse flag

Never edit a released version file. Create a new version (v2.json, etc.) for new questions.

Code

The codebase is intentionally minimal. No ML dependencies, no frameworks, no abstractions for the sake of abstraction. It talks to external HTTP endpoints and grades the responses. If you want to add a feature, look at the existing patterns first.

License

MIT
