yukimurata0421/arena-eval-engine

ARENA — ADS-B Receiver Evaluation Engine

ARENA is a statistical evaluation engine that determines whether ADS-B receiver hardware changes actually improved performance — or whether observed differences are just noise.

The core difficulty is that observed improvements are easily confounded: traffic varies by time of day and season, observation conditions are non-stationary, and metrics become unstable under sparse or bursty data. Simple before/after comparisons cannot separate real gains from these confounders. ARENA makes that separation explicit through multiple complementary statistical methods (Bayesian NB-GLM with NumPyro/NUTS, frequentist NB-GLM, Mann-Whitney U, change-point detection, OpenSky-normalized capture ratios, distance-band analysis) and draws conclusions from convergence or divergence across models — no single method is treated as authoritative.
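To make one of the listed methods concrete, here is a minimal, pure-Python sketch of the Mann-Whitney U statistic for a before/after comparison. This is illustrative only, not ARENA's implementation (which would handle p-values and larger samples via a statistics library):

```python
from itertools import chain

def mann_whitney_u(before, after):
    """Rank-based comparison of two samples; ties share the average rank.

    Returns the U statistic for `after`. Values far from
    len(before) * len(after) / 2 suggest a real shift rather than noise.
    """
    pooled = sorted(chain(before, after))
    # Assign each distinct value its average 1-based rank, handling ties.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + j) / 2 + 1
        i = j + 1
    rank_sum_after = sum(ranks[x] for x in after)
    n_after = len(after)
    return rank_sum_after - n_after * (n_after + 1) / 2
```

Because the statistic depends only on ranks, it is robust to the bursty, heavy-tailed message counts that make mean-based comparisons unstable.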

System Overview

Raspberry Pi (edge)                WSL2 / Linux (analysis)
┌──────────────────┐               ┌────────────────────────────┐
│  readsb → PLAO   │  rsync/pull   │                            │
│       → adsb-eval│──────────────>│  pipeline (8 stages,       │
└──────────────────┘               │           wave-parallel)   │
                                   │      │                     │
                                   │      ├──> /output          │
                                   │      │    (PNGs, reports)  │
                                   │      │                     │
                                   │  artifact run              │
                                   │      │                     │
                                   │      └──> /output/payload  │
                                   │           (CSVs, bundles)  │
                                   └────────────┬───────────────┘
                                                │
                                   CSVs (~55 files) + prompt
                                                │
                                                ▼
                                   ┌────────────────────────────┐
                                   │  Multiple LLMs             │
                                   │  → structured JSON claims  │
                                   │  → raw/{ai_name}/          │
                                   │    YYYYMMDD.json           │
                                   └────────────┬───────────────┘
                                                │
                                                ▼
                                   ┌────────────────────────────┐
                                   │  synthesis                 │
                                   │  ingest → triage →         │
                                   │  proposition review        │
                                   │  (SQLite, two-layer DB)    │
                                   └────────────────────────────┘

Three subsystems:

  • Pipeline — 8-stage orchestration with wave-parallel scheduling, failure-resilient execution, and append-only audit logging. Outputs human-readable graphs and reports to /output.
  • Artifacts — Converts pipeline outputs into verifiable, LLM-ready evidence bundles. Content identity (SHA-256), schema validation, and provenance/lineage ensure that downstream analysis operates on auditable evidence, not implicit assumptions. Integrity verification carries through to synthesis ingestion.
  • Synthesis — Cross-model claim ingestion from multiple LLMs, enrichment, baseline clustering, proposition mapping, automated triage, and human review queue. Two-layer DB design (proposition + claim layers with convergence judgments). SQLite-backed, path-isolated.
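The wave-parallel scheduling mentioned above can be sketched as repeated extraction of ready stages from a dependency graph. The stage names and dependencies below are hypothetical — the real graph lives in the pipeline configuration:

```python
def waves(deps):
    """Group stages into waves: every stage in a wave depends only on stages
    in earlier waves, so stages within one wave can run in parallel."""
    remaining = dict(deps)
    done, order = set(), []
    while remaining:
        ready = [s for s, d in remaining.items() if set(d) <= done]
        if not ready:
            raise ValueError("cycle in stage dependencies")
        order.append(sorted(ready))
        done.update(ready)
        for s in ready:
            remaining.pop(s)
    return order

# Hypothetical 8-stage graph, for illustration only.
deps = {
    "ingest": [],
    "clean": ["ingest"],
    "normalize": ["ingest"],
    "glm_bayes": ["clean"],
    "glm_freq": ["clean"],
    "mannwhitney": ["clean"],
    "changepoint": ["clean"],
    "report": ["glm_bayes", "glm_freq", "mannwhitney", "changepoint", "normalize"],
}
```

With this graph, `waves(deps)` yields four waves, with the four statistical stages sharing one wave — which is where the parallelism pays off.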

Design Philosophy

ARENA treats LLMs as hypothesis generators, not truth sources — claims are validated through structured evidence and cross-model convergence. The full catalogue of 31 engineering decisions is in docs/principles/engineering-decisions.md.
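A minimal sketch of the cross-model convergence idea: group claims by proposition and accept a verdict only when multiple models agree without dissent. The field names (`proposition`, `stance`, `model`) and the quorum rule are illustrative assumptions, not the actual synthesis schema:

```python
from collections import defaultdict

def judge(claims, quorum=2):
    """Mark a proposition's verdict only when at least `quorum` distinct
    models agree and none dissents; everything else goes to human review.
    Field names here are illustrative, not ARENA's schema."""
    by_prop = defaultdict(list)
    for c in claims:
        by_prop[c["proposition"]].append(c)
    verdicts = {}
    for prop, cs in by_prop.items():
        stances = {c["stance"] for c in cs}
        models = {c["model"] for c in cs}
        if len(stances) == 1 and len(models) >= quorum:
            verdicts[prop] = stances.pop()      # convergent across models
        else:
            verdicts[prop] = "needs_review"     # divergent or under-supported
    return verdicts
```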

Quick Start

python -m venv .venv && source .venv/bin/activate
pip install -U pip && pip install -e .[dev]
python -m arena.cli validate
pytest

Key Commands

arena validate                                    # check settings/paths
arena run --only 1 --dry-run --no-gpu --skip-plao # pipeline dry run
arena artifacts verify <bundle>                   # verify artifact bundle
arena synthesis run --path sample_data/synthesis/raw --db ./tmp/s.db \
  --enriched-dir ./tmp/enriched --review-dir ./tmp/review \
  --raw-original-dir ./tmp/orig --raw-repaired-dir ./tmp/repaired \
  --repair-log-dir ./tmp/logs                     # synthesis smoke run
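Conceptually, `arena artifacts verify` recomputes content identity against a manifest. The sketch below assumes a `manifest.json` mapping relative paths to SHA-256 digests — the filename and format are assumptions, not ARENA's actual bundle layout:

```python
import hashlib
import json
from pathlib import Path

def verify_bundle(bundle_dir):
    """Recompute SHA-256 for every file listed in the bundle manifest and
    return the paths whose digests no longer match (empty list = verified).
    The manifest name/format is an assumption, not ARENA's actual layout."""
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text())
    mismatches = []
    for rel_path, expected in manifest["files"].items():
        digest = hashlib.sha256((bundle / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            mismatches.append(rel_path)
    return mismatches
```

Hashing file content rather than trusting timestamps or sizes is what lets integrity checks carry through unchanged into synthesis ingestion.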

Docker

docker compose -f docker/docker-compose.yml run --rm arena-tests
docker compose -f docker/docker-compose.yml run --rm arena-validate
docker compose -f docker/docker-compose.yml run --rm arena-synthesis-smoke

Documentation

Detailed docs live in docs/. Start at docs/README.md.

Category    Key Documents
Operations  Architecture · Reproducibility · Synthesis
Design      Engineering Decisions · Artifact Design · AI-Assisted Analysis · AEME
Context     System Context · Statistical Assumptions

Tech Stack

Python 3.11+, NumPyro/JAX, PyMC, statsmodels, pandas, SQLite, Docker, GitHub Actions CI (lint + test matrix + coverage + smoke + Docker).

Notes

  • Secrets, credentials, and private data are excluded by design.
  • Use environment variables or CLI overrides for local paths.
  • Sample data under sample_data/ is synthetic and deterministic.

About

Statistical evaluation engine for ADS-B receiver performance. Measures real-world coverage using telemetry logs and coverage AUC instead of peak range.
