Lightweight experiment tracking for LLM evaluation.
Three days into your experiment sprint: models scattered across five directories, evaluation results in a notebook you can't find, and that one promising checkpoint you forgot to save. Sound familiar? LightML is a zero-config experiment tracker that turns that mess into structured, searchable, exportable knowledge -- in four lines of Python.
pip install light-ml-registry
lightml init --path ./my_registry --name main
- Why LightML
- Installation
- Quick Start (5 minutes)
- Core Concepts
- Python API Reference
- CLI Reference
- Database resolution (--db is optional)
- Interactive mode
- list — List models, runs, or families
- summary — Quick overview of the registry
- info — Detailed info for a model
- top — Leaderboard by metric
- metric-get — Read a single metric value
- diff — Compare N models side-by-side
- compare — Compare two models
- exists — Check if a model or metric exists
- stats — Statistical comparison
- notes — Read or write model notes
- rename — Rename a model
- prune — Remove empty models and runs
- watch — Continuous scan
- merge — Merge two registries
- migrate — Database migration
- version — Show version
- Dashboard (GUI)
- Excel Export
- Walkthrough: lm_eval pipeline
- Database Schema
- Project Structure
| Feature | LightML | MLflow | W&B |
|---|---|---|---|
| Setup | pip install light-ml-registry | Server + DB | Cloud signup |
| Storage | Single SQLite file | Postgres/MySQL | Cloud |
| Dependencies | 4 packages | 20+ packages | API key required |
| Dashboard | Built-in (lightml gui) | Separate server | Web app |
| Excel export | Built-in | No | No |
| Offline | ✅ | Partially | ❌ |
LightML is ideal when you need structured experiment tracking without the infrastructure.
pip install light-ml-registry

Or, for development:

git clone <repo-url> && cd LightML
pip install -e ".[dev]"

Dependencies (auto-installed):

- pydantic — schema validation
- fastapi + uvicorn — dashboard server
- openpyxl — Excel export

For the lm_eval example you also need:

pip install lm-eval pyyaml

lightml init --path ./my_registry --name main

This creates ./my_registry/main.db with all required tables.
from lightml.handle import LightMLHandle
# Connect to registry and create an experiment run
handle = LightMLHandle(db="./my_registry/main.db", run_name="gpt2-eval")
# Register the model
handle.register_model(
model_name="gpt2-eval",
path="openai-community/gpt2",
)
# Log metrics — family groups related metrics together
handle.log_model_metric(
model_name="gpt2-eval",
family="hellaswag_0shot",
metric_name="hellaswag_acc",
value=0.289,
)
handle.log_model_metric(
model_name="gpt2-eval",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.312,
)

lightml gui --db ./my_registry/main.db --port 5050
# or: export LIGHTML_DB=./my_registry/main.db && lightml gui

Open http://localhost:5050 in your browser.

lightml export --db ./my_registry/main.db --output report.xlsx

Generates one sheet per metric family with automatic color-scale formatting.
LightML organizes data around four entities:
Run (experiment)
└── Model
├── Metrics (family / metric_name / value)
└── Checkpoint (step N)
└── Metrics
An experiment context. Every model belongs to a run. Runs are created automatically when you instantiate LightMLHandle.
A trained model registered under a run. Supports parent-child lineage to track fine-tuning chains (e.g., base → SFT → DPO).
An intermediate training snapshot linked to a model. Identified by step number.
Numeric values attached to either a model or a checkpoint. Organized by family (a logical group like "hellaswag_0shot") and metric_name (like "hellaswag_acc").
The main entry point. All operations go through this handle.
from lightml.handle import LightMLHandle
handle = LightMLHandle(db="path/to/registry.db", run_name="my-experiment")

Register a model in the current run. Idempotent — calling twice with the same name is safe.
Parent linkage can be specified by name (parent_name) or by database id (parent_id). Using parent_id avoids name-mismatch issues when the parent was registered with a different name convention.
handle.register_model(
model_name="llama-sft",
path="/models/llama-3-8b-sft",
parent_name="llama-base", # optional: link to parent model
)
# Or link by id (useful in automation pipelines)
parent_id = handle.register_model(model_name="llama-base", path="/models/llama-base")
handle.register_model(
model_name="llama-sft",
path="/models/llama-sft",
parent_id=parent_id,
)

Register a training checkpoint.
ckpt_id = handle.register_checkpoint(
model_name="llama-sft",
step=5000,
path="/checkpoints/llama-sft/step-5000",
)

Look up a checkpoint id by model name and step. When multiple checkpoints share the same step (e.g. grid search), path_hint disambiguates by matching against the stored path.
ckpt_id = handle.find_checkpoint(
model_name="llama-sft",
step=5000,
path_hint="EXP1234T5678", # optional: disambiguate
)

Log a metric on a model. Returns a status code. Optionally attach per-sample scores (a list of 0/1 values) for statistical testing (see Statistical testing).
from lightml.metrics import METRIC_INSERTED, METRIC_UPDATED, METRIC_SKIPPED
rc = handle.log_model_metric(
model_name="llama-sft",
family="mmlu_5shot",
metric_name="mmlu_acc",
value=0.634,
scores=[1, 0, 1, 1, 0, ...], # optional: per-sample binary scores
force=False, # True = overwrite if exists
)
if rc == METRIC_INSERTED: print("New metric logged")
if rc == METRIC_SKIPPED: print("Already existed, skipped")
if rc == METRIC_UPDATED: print("Overwritten (force=True)")

Same as above, but attached to a checkpoint instead of a model.
handle.log_checkpoint_metric(
checkpoint_id=ckpt_id,
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.412,
scores=[1, 1, 0, 1, ...], # optional
)

When you log a metric with scores=[1, 0, 1, ...], LightML stores the per-sample binary vector in a dedicated detailed_scores table. This enables:
- McNemar's test — exact binomial test on discordant pairs
- Bootstrap confidence intervals — 95% CI on the accuracy delta
- Contingency tables — how many samples both models get right/wrong
Scores are stored as JSON and linked 1:1 to the metric row.
# Log with detailed scores
handle.log_model_metric(
model_name="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.75,
scores=[1, 1, 0, 1, 0, 1, 1, 1], # 6/8 = 0.75
)
# Retrieve stored scores
scores = handle.get_detailed_scores(
model_name="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
)

Compare two models with rigorous statistical tests using their stored per-sample scores:
result = handle.compare_stats(
model_a="llama-base",
model_b="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
)
print(result["contingency"]) # both_correct, only_a, only_b, both_wrong
print(result["mcnemar"]) # p_value, significant, winner
print(result["bootstrap"]) # delta, ci_lower, ci_upper
print(result["mean_a"], result["mean_b"])

The result dict contains:
| Key | Description |
|---|---|
| contingency | 2×2 contingency table: both_correct, only_a, only_b, both_wrong, n_discordant |
| mcnemar | McNemar's exact test: p_value, significant (p < 0.05), winner ("a" or "b") |
| bootstrap | Bootstrap CI (10k resamples): delta (A−B), ci_lower, ci_upper, confidence |
| mean_a / mean_b | Accuracy of each model |
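For intuition, both tests can be sketched with only the standard library. This is an illustrative approximation of what compare_stats computes, not LightML's actual implementation:

```python
import math
import random

def mcnemar_exact(scores_a, scores_b):
    """Two-sided exact McNemar test on paired 0/1 score vectors."""
    only_a = sum(1 for a, b in zip(scores_a, scores_b) if a and not b)
    only_b = sum(1 for a, b in zip(scores_a, scores_b) if b and not a)
    n = only_a + only_b                       # discordant pairs
    if n == 0:
        return 1.0                            # models never disagree
    k = max(only_a, only_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bootstrap_delta_ci(scores_a, scores_b, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval on the paired accuracy delta (A - B)."""
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample pairs with replacement
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    return deltas[round(0.025 * n_resamples)], deltas[round(0.975 * n_resamples)]
```

Pairing matters: both tests operate on the same samples evaluated by both models, which is why the scores vectors must come from the same benchmark split.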
Also available as an interactive CLI — see lightml stats.
Instead of calling log_model_metric() once per metric, use log_metrics() to log an entire evaluation result in one call:
# Nested dict: {family: {metric_name: value}}
counts = handle.log_metrics("llama-sft", {
"ENG 5-shot": {"MMLU": 56.2, "ARC": 48.7, "HellaSwag": 71.9},
"ITA 0-shot": {"MMLU": 52.8, "HellaSwag": 62.1},
})
print(counts)  # {"inserted": 5, "updated": 0, "skipped": 0}

For a single family, use the flat variant:
counts = handle.log_metrics_flat("llama-sft", {
"MMLU": 56.2,
"ARC": 48.7,
}, family="ENG 5-shot")

Both methods support force=True to overwrite existing metrics, and return a summary dict with insert/update/skip counts.
Compare two models side-by-side to see per-metric deltas:
from lightml.compare import compare_models
result = compare_models(
db="./registry/main.db",
model_a="llama-base", # baseline
model_b="llama-sft", # candidate
run_name="my-experiment", # optional filter
family="ENG 5-shot", # optional filter
)
# Convenience properties
print(f"Improved: {len(result.improved)}")
print(f"Regressed: {len(result.regressed)}")
print(f"Unchanged: {len(result.unchanged)}")
print(f"Missing: {len(result.missing)}")
# Pretty terminal output (color-coded)
print(result.to_text())
# JSON-serializable dict (for APIs)
data = result.to_dict()

Each delta contains family, metric_name, value_a, value_b, delta (B−A), and pct_change.
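The delta arithmetic is simple; here is a sketch, assuming pct_change is the relative change against the baseline A (an assumption worth checking if the sign convention matters to you):

```python
def metric_delta(value_a, value_b):
    """Delta (B - A) and percent change relative to baseline A."""
    delta = value_b - value_a
    pct_change = (delta / value_a) * 100 if value_a else float("inf")
    return delta, pct_change

# MMLU row from the compare example in this README: 52.10 -> 56.20
delta, pct = metric_delta(52.10, 56.20)
# delta ≈ +4.10, pct ≈ +7.9%
```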
Bulk-import eval results from a directory tree without writing any Python:
from lightml.scan import scan_and_import
stats = scan_and_import(
db="./registry/main.db",
run_name="lm-eval-run",
path="./eval_results", # each subfolder = one model
format="lm_eval", # or "json"
model_prefix="eval/", # optional prefix
force=False, # True = overwrite duplicates
)
print(f"Models: {stats.models_registered}")
print(f"Metrics: {stats.metrics_logged}")
print(f"Skipped: {stats.skipped_dirs}")

Directory layout expected:
eval_results/
├── model-alpha/
│   └── results_2026-01-15T10-30-00.json   # lm_eval format
├── model-beta/
│   └── results_2026-01-16T09-00-00.json
└── model-gamma/
    └── metrics.json                       # generic JSON format
Supported formats:
| Format | File pattern | Structure |
|---|---|---|
| lm_eval | results_*.json | {"results": {"task": {"metric": value}}} |
| json | metrics*.json / *.json | {"metric": value} or {"family": {"metric": value}} |
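For reference, a minimal stdlib parser for the lm_eval structure shown above. This is an illustrative sketch only; the scan module handles more edge cases (stderr fields, aliases, nested groups):

```python
import json
from pathlib import Path

def parse_lm_eval_results(path):
    """Yield (family, metric_name, value) triples from a results_*.json file."""
    data = json.loads(Path(path).read_text())
    for task, metrics in data.get("results", {}).items():
        for metric, value in metrics.items():
            # lm_eval result dicts mix numbers with strings (e.g. "alias");
            # keep only the numeric entries
            if isinstance(value, (int, float)):
                yield task, metric, value
```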
Check whether a model or metric exists in the database. All three methods return True/False:
# Does this model exist (any run)?
handle.model_exists("llama-sft") # True
# Does this specific metric exist for a model (any run)?
handle.metric_exists("llama-sft", "eng", "hellaswag_0shot_acc_norm") # True
# Does this metric exist for a model in the handle's run?
handle.run_metric_exists("llama-sft", "eng", "hellaswag_0shot_acc_norm")  # True

For exploratory queries, use search() with glob patterns (* and ?):
# All hellaswag metrics for any model starting with "MIIA"
results = handle.search(model="MIIA*", family="eng", metric="hella*")
for r in results:
print(f"{r['model']} {r['family']}/{r['metric']} = {r['value']:.4f}")
# MIIA14B-BASE eng/hellaswag_0shot_acc_norm = 0.7240
# MIIA14B-BASE eng/hellaswag_5shot_acc_norm = 0.7378
# MIIA7B eng/hellaswag_0shot_acc_norm = 0.7193
# ...
# Search models only
models = handle.search(model="llama*")
# [{"model": "llama-base"}, {"model": "llama-sft"}, ...]

search() returns a list of dicts with keys: model, family, metric, value, run (when searching metrics) or just model (when searching models only).
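The * and ? semantics match Python's fnmatch module. The filtering that search() performs can be approximated like this (a sketch over in-memory rows; the real query runs in SQL):

```python
from fnmatch import fnmatch

# Hypothetical rows in the shape search() returns
rows = [
    {"model": "MIIA14B-BASE", "family": "eng", "metric": "hellaswag_0shot_acc_norm", "value": 0.7240},
    {"model": "MIIA7B",       "family": "eng", "metric": "hellaswag_0shot_acc_norm", "value": 0.7193},
    {"model": "llama-base",   "family": "eng", "metric": "mmlu_acc",                 "value": 0.5210},
]

# Keep rows whose model matches "MIIA*" and whose metric matches "hella*"
hits = [
    r for r in rows
    if fnmatch(r["model"], "MIIA*") and fnmatch(r["metric"], "hella*")
]
# -> the two MIIA hellaswag rows
```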
Delete a model and all its associated data (checkpoints, metrics, detailed scores) in a single cascade operation:
from lightml.models.delete import DeleteResult
result = handle.delete_model(model_name="llama-sft")
print(result.model_name) # "llama-sft"
print(result.checkpoints_deleted) # 5
print(result.metrics_deleted)     # 80

The deletion:

- Removes the model row (cascade deletes checkpoints + metrics + detailed scores via foreign keys)
- Removes the symlink from the registry's models/ directory (if present)
- Raises ValueError if the model doesn't exist
- Does not delete child models — they keep their parent_id reference but become orphans
LightML prevents accidental duplicate metrics:
| Scenario | force=False (default) | force=True |
|---|---|---|
| Metric does not exist | INSERT → METRIC_INSERTED | INSERT → METRIC_INSERTED |
| Metric already exists | SKIP → METRIC_SKIPPED | UPDATE → METRIC_UPDATED |
This means you can safely re-run evaluation scripts without polluting your database.
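Conceptually, the force semantics amount to a guarded upsert. A sketch with illustrative constants and an in-memory store (not LightML's internals, which work against SQLite):

```python
# Illustrative status values; LightML exposes its own constants in lightml.metrics
METRIC_INSERTED, METRIC_UPDATED, METRIC_SKIPPED = "inserted", "updated", "skipped"

def log_metric(store, key, value, force=False):
    """Guarded upsert: insert if new, skip if present, overwrite only with force."""
    if key not in store:
        store[key] = value
        return METRIC_INSERTED
    if not force:
        return METRIC_SKIPPED
    store[key] = value
    return METRIC_UPDATED

db = {}
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.634) == METRIC_INSERTED
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.640) == METRIC_SKIPPED   # re-run is safe
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.640, force=True) == METRIC_UPDATED
```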
lightml <command> [options]
Every command that needs a database accepts --db, but you don't have to type it every time. LightML resolves the database in this order:
| Priority | Source | Example |
|---|---|---|
| 1 | --db argument | lightml summary --db ./registry/main.db |
| 2 | LIGHTML_DB environment variable | export LIGHTML_DB=./registry/main.db |
| 3 | .lightml config file in the current directory | echo 'db=./registry/main.db' > .lightml |
| 4 | Auto-detect single *.db file in the current directory | Just cd into the directory containing your .db file |
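The lookup order can be sketched as follows (an approximation for illustration, not LightML's actual resolver): flag, then environment, then config file, then auto-detect.

```python
import os
from pathlib import Path

def resolve_db(cli_db=None, cwd="."):
    """Return the database path using the four-step resolution order, or None."""
    if cli_db:                                        # 1. --db argument
        return cli_db
    if os.environ.get("LIGHTML_DB"):                  # 2. environment variable
        return os.environ["LIGHTML_DB"]
    cfg = Path(cwd) / ".lightml"                      # 3. .lightml config file
    if cfg.is_file():
        for line in cfg.read_text().splitlines():
            if line.startswith("db="):
                return line.split("=", 1)[1].strip()
    candidates = list(Path(cwd).glob("*.db"))         # 4. single *.db in cwd
    return str(candidates[0]) if len(candidates) == 1 else None
```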
Once configured, all commands become shorter:
# Before
lightml summary --db ./registry/main.db
lightml top --db ./registry/main.db --family hellaswag --metric acc
# After (with LIGHTML_DB or .lightml set)
lightml summary
lightml top --family hellaswag --metric acc

Many commands support interactive selection: if you omit arguments like --model, --family, or --metric, the CLI shows a numbered list and lets you pick.
$ lightml info
Select a model:
1. gemma-9b
2. llama-base
3. llama-dpo
4. llama-sft
Model: 2
Model : llama-base
Run : eval-1
Path : meta-llama/Llama-3-8B
...

Commands with interactive support: info, top, metric-get, notes, rename, diff, compare, model-delete, stats.
All these commands still accept explicit arguments for scripting.
lightml init --path ./registry --name main [--overwrite]

lightml model-register \
--run my-experiment \
--name llama-sft \
--path /models/llama-sft \
--parent llama-base    # optional

lightml checkpoint-register \
--run my-experiment \
--model llama-sft \
--step 5000 \
--path /checkpoints/step-5000

lightml metric-log \
--run my-experiment \
--model llama-sft \
--family mmlu_5shot \
--metric mmlu_acc \
--value 0.634 \
--force    # optional: overwrite

lightml export [--output report.xlsx]

Scan a directory tree and bulk-import models + metrics:
# --format accepts "lm_eval" or "json"; --prefix (optional) prepends a
# model-name prefix; --force (optional) overwrites duplicates
lightml scan \
--run lm-eval-run \
--path ./eval_results \
--format lm_eval \
--prefix "eval/" \
--force

Each immediate subdirectory of --path is treated as one model.
Browse the contents of your registry without opening the dashboard.
# List all models (default)
lightml list
# List runs
lightml list runs
# List metric families
lightml list families
# Filter models by run, include hidden
lightml list models --run my-experiment --include-hidden

Output:
Model Run Parent Notes
─────────────────────────────────── ──────────────────── ───────────────────────── ────────────────────
gemma-9b eval-1 —
llama-base eval-1 —
llama-dpo eval-1 llama-sft DPO with UltraFeedback
llama-sft eval-1 llama-base
4 model(s)
lightml summary

Output:
main.db
─────────────────────────────────────────────
Runs : 2
Models : 8 (1 hidden)
Checkpoints : 3
Families : 4
Metrics : 156
Updated : 2026-03-15
Runs:
eval-1 (5 models)
eval-2 (3 models)
Families:
gsm8k (3 metrics · 5 models)
hellaswag (4 metrics · 8 models)
mmlu (6 metrics · 7 models)
winogrande (2 metrics · 4 models)
# Specify directly
lightml info --model llama-sft
# Or pick interactively
lightml info

Output:
Model : llama-sft
Run : eval-1
Path : my-org/llama-sft
Parent : llama-base
Children : llama-dpo
Checkpoints (2):
step 1000 /checkpoints/llama-sft/step-1000
step 5000 /checkpoints/llama-sft/step-5000
Metrics (12):
[hellaswag]
acc 0.6300
acc_norm 0.6600
[gsm8k]
exact_match 0.5500
Rank models by a specific metric. Family and metric can be selected interactively.
# Specify directly
lightml top --family hellaswag --metric acc --n 5
# Or pick interactively
lightml top

Output:
Leaderboard hellaswag / acc
# Model Score Run
─── ─────────────────── ──────── ────────────────────
#1 llama-dpo 0.6500 eval-1
#2 llama-sft 0.6300 eval-1
#3 gemma-9b 0.6100 eval-1
#4 llama-base 0.5900 eval-1
Read one metric value. Useful for scripting with --raw.
# Interactive
lightml metric-get
# Explicit
lightml metric-get --model llama-sft --family hellaswag --metric acc
# llama-sft hellaswag/acc = 0.6300
# Scriptable (just the number)
lightml metric-get --model llama-sft --family hellaswag --metric acc --raw
# 0.63

Colorized table comparing metrics across two or more models — like git diff but for metrics.
# Interactive — pick models from a list
lightml diff
# Explicit
# --run and --family are optional filters; --no-color disables
# colors (useful when piping)
lightml diff \
--models llama-base llama-sft gemma-9b \
--run my-experiment \
--family "ENG 5-shot" \
--no-color

Output:
lightml diff — 3 models (run: my-experiment)
══════════════════════════════════════════════════════════════════════
Family Metric llama-base llama-sft gemma-9b
──────────────────────────────────────────────────────────────────────
ENG 5-shot ARC 0.4430 0.4870 0.5120
ENG 5-shot HellaSwag 0.6950 0.7190 0.7340
ENG 5-shot MMLU 0.5210 0.5620 0.5480
──────────────────────────────────────────────────────────────────────
AVG (3 metrics) 0.5530 0.5893 0.5980
- Best value per metric is highlighted in green, worst in red (when 3+ models)
- An AVG row summarizes all metrics where every model has a value
- Missing metrics are shown as —
Also available as a Python API:
from lightml.diff import diff_models, format_diff
data = diff_models(
db="./registry/main.db",
model_names=["llama-base", "llama-sft", "gemma-9b"],
run_name="my-experiment",
family="ENG 5-shot",
)
print(format_diff(data))

Print a side-by-side metric delta table. Models can be selected interactively.
# Interactive
lightml compare
# Explicit
# --run and --family are optional filters
lightml compare \
--model-a llama-base \
--model-b llama-sft \
--run my-experiment \
--family "ENG 5-shot"

Output:
Compare: llama-base vs llama-sft
Run: my-experiment
──────────────────────────────────────────────────────────────────────────
Family Metric A B Δ %
──────────────────────────────────────────────────────────────────────────
ENG 5-shot MMLU 52.10 56.20 +4.10 +7.9%
ENG 5-shot ARC 44.30 48.70 +4.40 +9.9%
ENG 5-shot HellaSwag 69.50 71.90 +2.40 +3.5%
──────────────────────────────────────────────────────────────────────────
Check existence with exact names or glob patterns (*, ?). Exit code 0 = found, 1 = not found.
# Check if a model exists
lightml exists --model llama-sft
# Check if a specific metric exists
lightml exists --model llama-sft --family eng --metric hellaswag_0shot_acc_norm
# Glob search — find all hellaswag metrics for MIIA models
# (quote the patterns so the shell doesn't expand them)
lightml exists --model 'MIIA*' --family eng --metric 'hella*'
# ✓ MIIA14B-BASE eng/hellaswag_0shot_acc_norm = 0.7240 (run: MIIA14B)
# ✓ MIIA7B eng/hellaswag_0shot_acc_norm = 0.7193 (run: MIIA14B)
# 8 match(es)

# Restrict to a specific run
lightml exists --model 'llama*' --family eng --metric 'mmlu*' --run my-experiment

Usable in scripts:
if lightml exists --model my-model 2>/dev/null; then
echo "Model already registered, skipping"
fi

Interactively compare two models using McNemar's test and bootstrap confidence intervals. Requires detailed scores (logged with the scores parameter).
# Fully interactive
lightml stats
# Explicit
lightml stats \
--model-a llama-base \
--model-b llama-sft \
--family hellaswag_0shot \
--metric hellaswag_acc_norm

Output:
Statistical comparison: llama-base vs llama-sft
Family: hellaswag_0shot Metric: hellaswag_acc_norm
──────────────────────────────────────────────────────
Both correct: 7234
Only llama-base: 312
Only llama-sft: 487
Both wrong: 1967
Discordant: 799
──────────────────────────────────────────────────────
Mean llama-base: 0.7546
Mean llama-sft: 0.7721
Delta (A - B): -0.0175
95% CI: [-0.0264, -0.0087]
──────────────────────────────────────────────────────
McNemar p-value: 0.000012
Result: Significant (p < 0.05), llama-sft is better
When multiple metrics are selected, an overview table summarizes all tests at the end.
Attach text notes to a model (without opening the dashboard).
# Read notes (interactive model selection if --model omitted)
lightml notes --model llama-sft
# llama-sft: trained on 10k examples, lr=1e-4
# Write notes
lightml notes --model llama-sft --set "trained on 10k examples, lr=1e-4"

Fix typos or update naming conventions.
# Interactive
lightml rename
# Explicit
lightml rename --old llama-sfft --new llama-sft

Remove models with no metrics and no checkpoints, and runs with no models.
# Preview what would be removed
lightml prune --dry-run
# Actually remove
lightml prune

Output:
Models (2):
- empty-test-model
- orphan-model
Runs (1):
- abandoned-run
Pruned: 2 model(s), 1 run(s)
Like scan, but runs in a loop — useful during training/evaluation to auto-import new results as they appear.
# --interval: poll every N seconds (default: 30); --format: "lm_eval" or
# "json"; --prefix and --force are optional (prefix model names / overwrite)
lightml watch \
--path ./eval_results \
--run training-run \
--interval 60 \
--format lm_eval \
--prefix "eval/" \
--force

Output:
Watching ./eval_results
DB : ./registry/main.db
Run : training-run | Format: lm_eval | Interval: 60s
Ctrl+C to stop
[14:32:10] +2 models, +24 metrics
[14:33:10] no new data
[14:34:10] +1 models, +12 metrics
Merge all runs, models, and metrics from a source registry into your current one. Useful for combining results from different machines or teammates.
lightml merge --src ./other_registry.db
# Overwrite existing metrics (instead of skipping)
lightml merge --src ./other_registry.db --force

Output:
Merging ./other_registry.db → ./registry/main.db
run 'eval-gpu1': 3 model(s), 45 metric(s)
run 'eval-gpu2': 2 model(s), 30 metric(s)
Done. 5 model(s), 75 metric(s) processed.
Models with the same name are deduplicated (idempotent). Metrics are skipped if they already exist (use --force to overwrite).
Delete a model and all related data (checkpoints, metrics, detailed scores). Interactive if --name is omitted.
# Interactive
lightml model-delete
# Explicit
lightml model-delete --name llama-sft

Apply pending schema migrations to an older database (e.g. add the detailed_scores table introduced in v1.1.0):
lightml migrate

lightml version

lightml gui [--port 5050] [--host 0.0.0.0]

LightML ships with an interactive web dashboard — no external tools needed.
lightml gui --db ./registry/main.db

Pivoted metrics table with:
- Family tabs — one tab per metric family, plus "All Families" (properly scoped — same metric name across different families shows distinct values)
- Sorting — click any column header
- Search — filter models by name
- Color coding — best values highlighted in green, worst in red
- Checkpoints toggle — show/hide checkpoint rows
- Run filter — dropdown to isolate a specific run
- Model selection — checkbox column for selecting models
D3.js force-directed graph showing model lineage:
- Nodes = models, colored by run
- Edges = parent → child relationships
- Checkpoints hidden by default — toggle "Show checkpoints" in the control bar to reveal them
- Hover = tooltip with green/red dots showing which benchmarks have been evaluated
- Search — filter nodes by name, path, or run
- Drag & zoom — fully interactive
Select models from either view and compare them side-by-side:
- Select: click checkboxes in the table, or click nodes in the graph — selections sync across both views
- Selection bar: appears at the top showing count and actions
- Filter table: click "Filter table" to show only selected models
- Compare: select exactly 2 models, click "Compare" → a modal shows per-metric deltas with color-coded improvements (green) and regressions (red)
- Clear: reset selection in both views
Click ⬇ Excel in the header to download a formatted .xlsx report directly from the dashboard.
The export engine creates professional Excel reports from the database:
- One sheet per metric family — keeps related metrics grouped
- Automatic color scales — red → yellow → green formatting on all metric columns
- Frozen headers — first row + model name column stay visible while scrolling
- Models (Phase F) and Checkpoints (Phase S) on the same sheet
from pathlib import Path
from lightml.export import export_excel
export_excel(
db_path=Path("./registry/main.db"),
output_path=Path("./report.xlsx"),
)

Or via CLI:
lightml export --db ./registry/main.db --output report.xlsx

This walkthrough shows how to use LightML with lm-evaluation-harness to evaluate an LLM and track results. The complete example is in examples/lm_eval/.
Edit examples/lm_eval/config.yaml:
# ── LightML settings ──────────────────────────────
db: ./my_registry/main.db
run_name: llama-3-eval
# ── Model to evaluate ────────────────────────────
model_path: meta-llama/Llama-3-8B
# ── Evaluation matrix ────────────────────────────
lang: [eng]
benchmarks: [hellaswag, mmlu]
shots: [0, 5]
num_gpus: 1

Every field is explained inline. The key LightML fields are db (path to registry) and run_name (experiment name).
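The (benchmark × language × shots) matrix expands with itertools.product. A sketch of how a run script might derive metric families from this config; the family naming scheme here is an assumption modeled on the eng_hellaswag_0shot families used elsewhere in this walkthrough:

```python
from itertools import product

# Mirrors the lang / benchmarks / shots fields of config.yaml
cfg = {
    "lang": ["eng"],
    "benchmarks": ["hellaswag", "mmlu"],
    "shots": [0, 5],
}

# One evaluation job (and one metric family) per combination
families = [
    f"{lang}_{bench}_{n}shot"
    for lang, bench, n in product(cfg["lang"], cfg["benchmarks"], cfg["shots"])
]
# 1 language x 2 benchmarks x 2 shot settings = 4 jobs
```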
cd examples/lm_eval
python run_eval.py

The script does three things:
- Connects to LightML and registers the model (2 lines of setup)
- Runs lm_eval for each (benchmark × language × shots) combination
- Logs every metric to the registry with handle.log_model_metric()
Here's the core LightML integration — it's just 4 API calls:
from lightml.handle import LightMLHandle
# Setup — 2 lines
handle = LightMLHandle(db=cfg["db"], run_name=cfg["run_name"])
handle.register_model(model_name=cfg["run_name"], path=cfg["model_path"])
# After each benchmark completes — 1 call per metric
handle.log_model_metric(
model_name=handle.run_name,
family="eng_hellaswag_0shot",
metric_name="hellaswag_acc",
value=0.452,
)

lightml gui --db ./my_registry/main.db

Click ⬇ Excel in the dashboard header, or:
lightml export --db ./my_registry/main.db

LightML uses a single SQLite file with 6 tables:
-- Experiment container
CREATE TABLE run (
id INTEGER PRIMARY KEY AUTOINCREMENT,
run_name TEXT UNIQUE NOT NULL,
description TEXT,
metadata TEXT -- JSON blob
);
-- Trained model, scoped to a run
CREATE TABLE model (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_name TEXT NOT NULL,
path TEXT,
parent_id INTEGER REFERENCES model(id),
run_id INTEGER NOT NULL REFERENCES run(id),
UNIQUE(model_name, run_id)
);
-- Training checkpoint, linked to a model
CREATE TABLE checkpoint (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_id INTEGER NOT NULL REFERENCES model(id),
step INTEGER NOT NULL,
path TEXT,
created_at TEXT DEFAULT (datetime('now'))
);
-- Metric value, linked to a model OR a checkpoint
CREATE TABLE metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_id INTEGER REFERENCES model(id),
checkpoint_id INTEGER REFERENCES checkpoint(id),
family TEXT NOT NULL,
metric_name TEXT NOT NULL,
value REAL NOT NULL
);
-- Per-sample binary scores for statistical testing (v1.1.0+)
CREATE TABLE detailed_scores (
metric_id INTEGER NOT NULL PRIMARY KEY,
scores TEXT NOT NULL, -- JSON array of 0/1 values
n_samples INTEGER NOT NULL,
FOREIGN KEY(metric_id) REFERENCES metrics(id) ON DELETE CASCADE
);
-- Optional: restrict allowed metrics
CREATE TABLE registry_schema (
id INTEGER PRIMARY KEY AUTOINCREMENT,
family TEXT NOT NULL,
metric_name TEXT NOT NULL
);

Databases created with v1.0.x can be upgraded with lightml migrate --db <path> to add the detailed_scores table.
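A quick sanity check of the core tables with Python's built-in sqlite3, exercising a subset of the DDL above (run, model, metrics only) in memory:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    run_name TEXT UNIQUE NOT NULL
);
CREATE TABLE model (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    run_id     INTEGER NOT NULL REFERENCES run(id),
    UNIQUE(model_name, run_id)
);
CREATE TABLE metrics (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    model_id    INTEGER REFERENCES model(id),
    family      TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    value       REAL NOT NULL
);
""")
conn.execute("INSERT INTO run (run_name) VALUES ('eval-1')")
conn.execute("INSERT INTO model (model_name, run_id) VALUES ('llama-sft', 1)")
conn.execute(
    "INSERT INTO metrics (model_id, family, metric_name, value) "
    "VALUES (1, 'mmlu_5shot', 'mmlu_acc', 0.634)"
)
row = conn.execute(
    "SELECT m.model_name, x.value FROM metrics x JOIN model m ON m.id = x.model_id"
).fetchone()
# row == ('llama-sft', 0.634)
```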
LightML/
├── pyproject.toml # Package config, CLI entry point
├── README.md # This file
│
├── lightml/ # Library source
│ ├── __init__.py
│ ├── handle.py # LightMLHandle — main API (incl. bulk log_metrics)
│ ├── registry.py # Run & model registration logic
│ ├── checkpoints.py # Checkpoint registration + find_checkpoint
│ ├── metrics.py # Metric logging + deduplication + detailed scores
│ ├── database.py # SQLite schema initialization + migration
│ ├── stats.py # Statistical testing (McNemar, Bootstrap CI)
│ ├── export.py # Excel export engine
│ ├── compare.py # Model comparison (Pydantic models + compare_models)
│ ├── diff.py # N-model side-by-side diff (terminal table)
│ ├── scan.py # Auto-import from eval result directories
│ ├── gui.py # FastAPI dashboard server + /api/compare
│ ├── cli.py # CLI entry point (lightml command)
│ ├── models/ # Pydantic schemas (incl. DeleteResult)
│ ├── templates/
│ │ └── dashboard.html # Single-file SPA dashboard
│ └── tests/
│ ├── test_bugfix.py # Core regression tests (41 tests)
│ ├── test_compare.py # Compare feature tests (15 tests)
│ ├── test_diff.py # Diff feature tests (17 tests)
│ ├── test_scan.py # Scan / auto-import tests (17 tests)
│ ├── test_bulk.py # Bulk metric API tests (15 tests)
│ ├── test_delete.py # Model deletion tests (6 tests)
│ └── conftest.py # Shared fixtures
│
├── examples/
│ └── lm_eval/ # End-to-end evaluation example
│ ├── run_eval.py # lm_eval + LightML pipeline
│ └── config.yaml # Example configuration
│
└── docs/
└── gifs/ # GIF recordings for README





