ARENA is a statistical evaluation engine that determines whether ADS-B receiver hardware changes actually improved performance — or whether observed differences are just noise.
The core difficulty is that observed improvements are easily confounded: traffic varies by time-of-day and season, observation conditions are non-stationary, and metrics become unstable under sparse or bursty data. Simple before/after comparisons cannot separate real gains from these confounders. ARENA is designed to make that separation explicit through multiple complementary statistical methods (Bayesian NB-GLM with NumPyro/NUTS, frequentist NB-GLM, Mann-Whitney U, change-point detection, OpenSky-normalized capture ratios, distance-band analysis) and draws conclusions from convergence or divergence across models — no single method is treated as authoritative.
```
Raspberry Pi (edge)                  WSL2 / Linux (analysis)
┌──────────────────┐                 ┌────────────────────────────┐
│ readsb → PLAO    │   rsync/pull    │                            │
│      → adsb-eval │────────────────>│  pipeline (8 stages,       │
└──────────────────┘                 │  wave-parallel)            │
                                     │     │                      │
                                     │     ├──> /output           │
                                     │     │    (PNGs, reports)   │
                                     │     │                      │
                                     │  artifact run              │
                                     │     │                      │
                                     │     └──> /output/payload   │
                                     │          (CSVs, bundles)   │
                                     └────────────┬───────────────┘
                                                  │
                                      CSVs (~55 files) + prompt
                                                  │
                                                  ▼
                                     ┌────────────────────────────┐
                                     │  Multiple LLMs             │
                                     │  → structured JSON claims  │
                                     │  → raw/{ai_name}/          │
                                     │     YYYYMMDD.json          │
                                     └────────────┬───────────────┘
                                                  │
                                                  ▼
                                     ┌────────────────────────────┐
                                     │  synthesis                 │
                                     │  ingest → triage →         │
                                     │  proposition review        │
                                     │  (SQLite, two-layer DB)    │
                                     └────────────────────────────┘
```
Three subsystems:
- Pipeline — 8-stage orchestration with wave-parallel scheduling, failure-resilient execution, and append-only audit logging. Outputs human-readable graphs and reports to `/output`.
- Artifacts — Converts pipeline outputs into verifiable, LLM-ready evidence bundles. Content identity (SHA-256), schema validation, and provenance/lineage ensure that downstream analysis operates on auditable evidence, not implicit assumptions. Integrity verification carries through to synthesis ingestion.
- Synthesis — Cross-model claim ingestion from multiple LLMs, enrichment, baseline clustering, proposition mapping, automated triage, and human review queue. Two-layer DB design (proposition + claim layers with convergence judgments). SQLite-backed, path-isolated.
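The content-identity idea behind the artifact subsystem can be sketched as hashing every file in a bundle and comparing against a manifest. This is a hypothetical manifest layout for illustration, not ARENA's actual bundle schema:

```python
# Hedged sketch: verify a bundle directory against a manifest.json that
# maps relative file paths to expected SHA-256 hex digests. The manifest
# format here is invented for illustration.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large CSVs don't load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_bundle(bundle_dir: Path) -> bool:
    """True iff every manifest entry matches its file's current digest."""
    manifest = json.loads((bundle_dir / "manifest.json").read_text())
    return all(
        sha256_of(bundle_dir / rel) == digest
        for rel, digest in manifest.items()
    )
```

Because digests travel with the bundle, any tampering or silent corruption between pipeline output and synthesis ingestion is detectable rather than assumed away.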
ARENA treats LLMs as hypothesis generators, not truth sources — claims are validated through structured evidence and cross-model convergence. The full catalogue of 31 engineering decisions is in `docs/principles/engineering-decisions.md`.
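In spirit, cross-model convergence reduces to tallying verdicts per proposition across models. The claim shape below (`proposition`/`verdict` dicts) is a deliberately simplified stand-in for ARENA's richer two-layer schema:

```python
# Illustrative only: mark a proposition "converged" when every model
# that addressed it issued the same verdict. The claim shape here is
# a simplification, not ARENA's actual schema.
from collections import Counter, defaultdict

def convergence(claims_by_model):
    """claims_by_model: {model_name: [{"proposition": str, "verdict": str}, ...]}"""
    votes = defaultdict(Counter)
    for model, claims in claims_by_model.items():
        for claim in claims:
            votes[claim["proposition"]][claim["verdict"]] += 1
    # One distinct verdict => unanimous among models that weighed in.
    return {prop: len(verdicts) == 1 for prop, verdicts in votes.items()}
```

Divergent propositions are not discarded; in ARENA's workflow they are exactly what gets routed to triage and the human review queue.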
```
python -m venv .venv && source .venv/bin/activate
pip install -U pip && pip install -e .[dev]
python -m arena.cli validate
pytest
```

```
arena validate                                      # check settings/paths
arena run --only 1 --dry-run --no-gpu --skip-plao   # pipeline dry run
arena artifacts verify <bundle>                     # verify artifact bundle
arena synthesis run --path sample_data/synthesis/raw --db ./tmp/s.db \
    --enriched-dir ./tmp/enriched --review-dir ./tmp/review \
    --raw-original-dir ./tmp/orig --raw-repaired-dir ./tmp/repaired \
    --repair-log-dir ./tmp/logs                     # synthesis smoke run
```

```
docker compose -f docker/docker-compose.yml run --rm arena-tests
docker compose -f docker/docker-compose.yml run --rm arena-validate
docker compose -f docker/docker-compose.yml run --rm arena-synthesis-smoke
```

Detailed docs live in `docs/`. Start at `docs/README.md`.
| Category | Key Documents |
|---|---|
| Operations | Architecture · Reproducibility · Synthesis |
| Design | Engineering Decisions · Artifact Design · AI-Assisted Analysis · AEME |
| Context | System Context · Statistical Assumptions |
Python 3.11+, NumPyro/JAX, PyMC, statsmodels, pandas, SQLite, Docker, GitHub Actions CI (lint + test matrix + coverage + smoke + Docker).
- Secrets, credentials, and private data are excluded by design.
- Use environment variables or CLI overrides for local paths.
- Sample data under `sample_data/` is synthetic and deterministic.