Lightweight experiment tracking for LLM evaluation.
Three days into your experiment sprint: models scattered across five directories, evaluation results in a notebook you can't find, and that one promising checkpoint you forgot to save. Sound familiar? LightML is a zero-config experiment tracker that turns that mess into structured, searchable, exportable knowledge -- in four lines of Python.
pip install light-ml-registry
lightml init --path ./my_registry --name main
- Why LightML
- Installation
- Quick Start (5 minutes)
- Core Concepts
- Python API Reference
- CLI Reference
- Database resolution (--db is optional)
- Interactive mode
- list — List models, runs, or families
- summary — Quick overview of the registry
- info — Detailed info for a model
- top — Leaderboard by metric
- metric-get — Read a single metric value
- diff — Compare N models side-by-side
- compare — Compare two models
- exists — Check if a model or metric exists
- stats — Statistical comparison
- notes — Read or write model notes
- rename — Rename a model
- prune — Remove empty models and runs
- watch — Continuous scan
- merge — Merge two registries
- migrate — Database migration
- version — Show version
- Dashboard (GUI)
- Excel Export
- Walkthrough: lm_eval pipeline
- Database Schema
- Project Structure
| Feature | LightML | MLflow | W&B |
|---|---|---|---|
| Setup | pip install light-ml-registry | Server + DB | Cloud signup |
| Storage | Single SQLite file | Postgres/MySQL | Cloud |
| Dependencies | 4 packages | 20+ packages | API key required |
| Dashboard | Built-in (lightml gui) | Separate server | Web app |
| Excel export | Built-in | No | No |
| Offline | ✅ | Partially | ❌ |
LightML is ideal when you need structured experiment tracking without the infrastructure.
pip install light-ml-registry

Or, for development:

git clone <repo-url> && cd LightML
pip install -e ".[dev]"

Dependencies (auto-installed):

- pydantic — schema validation
- fastapi + uvicorn — dashboard server
- openpyxl — Excel export

For the lm_eval example you also need:

pip install lm-eval pyyaml

lightml init --path ./my_registry --name main

This creates ./my_registry/main.db with all required tables.
from lightml.handle import LightMLHandle
# Connect to registry and create an experiment run
handle = LightMLHandle(db="./my_registry/main.db", run_name="gpt2-eval")
# Register the model
handle.register_model(
model_name="gpt2-eval",
path="openai-community/gpt2",
)
# Log metrics — family groups related metrics together
handle.log_model_metric(
model_name="gpt2-eval",
family="hellaswag_0shot",
metric_name="hellaswag_acc",
value=0.289,
)
handle.log_model_metric(
model_name="gpt2-eval",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.312,
)

lightml gui --db ./my_registry/main.db --port 5050
# or: export LIGHTML_DB=./my_registry/main.db && lightml gui

Open http://localhost:5050 in your browser.

lightml export --db ./my_registry/main.db --output report.xlsx

Generates one sheet per metric family with automatic color-scale formatting.
LightML organizes data around four entities:
Run (experiment)
└── Model
├── Metrics (family / metric_name / value)
└── Checkpoint (step N)
└── Metrics
An experiment context. Every model belongs to a run. Runs are created automatically when you instantiate LightMLHandle.
A trained model registered under a run. Supports parent-child lineage to track fine-tuning chains (e.g., base → SFT → DPO).
An intermediate training snapshot linked to a model. Identified by step number.
Numeric values attached to either a model or a checkpoint. Organized by family (a logical group like "hellaswag_0shot") and metric_name (like "hellaswag_acc").
The main entry point. All operations go through this handle.
from lightml.handle import LightMLHandle
handle = LightMLHandle(db="path/to/registry.db", run_name="my-experiment")

Register a model in the current run. Idempotent — calling twice with the same name is safe.
Parent linkage can be specified by name (parent_name) or by database id (parent_id). Using parent_id avoids name-mismatch issues when the parent was registered with a different name convention.
handle.register_model(
model_name="llama-sft",
path="/models/llama-3-8b-sft",
parent_name="llama-base", # optional: link to parent model
)
# Or link by id (useful in automation pipelines)
parent_id = handle.register_model(model_name="llama-base", path="/models/llama-base")
handle.register_model(
model_name="llama-sft",
path="/models/llama-sft",
parent_id=parent_id,
)

Register a training checkpoint.
ckpt_id = handle.register_checkpoint(
model_name="llama-sft",
step=5000,
path="/checkpoints/llama-sft/step-5000",
)

Look up a checkpoint id by model name and step. When multiple checkpoints share the same step (e.g. grid search), path_hint disambiguates by matching against the stored path.
ckpt_id = handle.find_checkpoint(
model_name="llama-sft",
step=5000,
path_hint="EXP1234T5678", # optional: disambiguate
)

Log a metric on a model. Returns a status code. Optionally attach per-sample scores (a list of 0/1 values) for statistical testing (see Statistical testing).
from lightml.metrics import METRIC_INSERTED, METRIC_UPDATED, METRIC_SKIPPED
rc = handle.log_model_metric(
model_name="llama-sft",
family="mmlu_5shot",
metric_name="mmlu_acc",
value=0.634,
scores=[1, 0, 1, 1, 0, ...], # optional: per-sample binary scores
force=False, # True = overwrite if exists
)
if rc == METRIC_INSERTED: print("New metric logged")
if rc == METRIC_SKIPPED: print("Already existed, skipped")
if rc == METRIC_UPDATED: print("Overwritten (force=True)")

Same as above, but attached to a checkpoint instead of a model.
handle.log_checkpoint_metric(
checkpoint_id=ckpt_id,
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.412,
scores=[1, 1, 0, 1, ...], # optional
)

When you log a metric with scores=[1, 0, 1, ...], LightML stores the per-sample binary vector in a dedicated detailed_scores table. This enables:
- McNemar's test — exact binomial test on discordant pairs
- Bootstrap confidence intervals — 95% CI on the accuracy delta
- Contingency tables — how many samples both models get right/wrong
Scores are stored as JSON and linked 1:1 to the metric row.
# Log with detailed scores
handle.log_model_metric(
model_name="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
value=0.75,
scores=[1, 1, 0, 1, 0, 1, 1, 1], # 6/8 = 0.75
)
# Retrieve stored scores
scores = handle.get_detailed_scores(
model_name="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
)

Compare two models with rigorous statistical tests using their stored per-sample scores:
result = handle.compare_stats(
model_a="llama-base",
model_b="llama-sft",
family="hellaswag_0shot",
metric_name="hellaswag_acc_norm",
)
print(result["contingency"]) # both_correct, only_a, only_b, both_wrong
print(result["mcnemar"]) # p_value, significant, winner
print(result["bootstrap"]) # delta, ci_lower, ci_upper
print(result["mean_a"], result["mean_b"])

The result dict contains:
| Key | Description |
|---|---|
| contingency | 2×2 contingency table: both_correct, only_a, only_b, both_wrong, n_discordant |
| mcnemar | McNemar's exact test: p_value, significant (p < 0.05), winner ("a" or "b") |
| bootstrap | Bootstrap CI (10k resamples): delta (A−B), ci_lower, ci_upper, confidence |
| mean_a / mean_b | Accuracy of each model |
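For intuition, both tests can be sketched with only the standard library. This is an illustrative approximation of what compare_stats computes, not LightML's actual implementation:

```python
import math
import random

def mcnemar_exact(scores_a, scores_b):
    """Two-sided exact McNemar test on paired 0/1 score vectors."""
    only_a = sum(1 for a, b in zip(scores_a, scores_b) if a and not b)
    only_b = sum(1 for a, b in zip(scores_a, scores_b) if b and not a)
    n = only_a + only_b                       # discordant pairs
    if n == 0:
        return 1.0                            # models never disagree
    k = max(only_a, only_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bootstrap_delta_ci(scores_a, scores_b, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval on the paired accuracy delta (A - B)."""
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample pairs with replacement
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    return deltas[round(0.025 * n_resamples)], deltas[round(0.975 * n_resamples)]
```

Pairing matters: both tests operate on the same samples evaluated by both models, which is why the scores vectors must come from the same benchmark split.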
Also available as an interactive CLI — see lightml stats.
Instead of calling log_model_metric() once per metric, use log_metrics() to log an entire evaluation result in one call:
# Nested dict: {family: {metric_name: value}}
counts = handle.log_metrics("llama-sft", {
"ENG 5-shot": {"MMLU": 56.2, "ARC": 48.7, "HellaSwag": 71.9},
"ITA 0-shot": {"MMLU": 52.8, "HellaSwag": 62.1},
})
print(counts)  # {"inserted": 5, "updated": 0, "skipped": 0}

For a single family, use the flat variant:
counts = handle.log_metrics_flat("llama-sft", {
"MMLU": 56.2,
"ARC": 48.7,
}, family="ENG 5-shot")

Both methods support force=True to overwrite existing metrics, and return a summary dict with insert/update/skip counts.
Compare two models side-by-side to see per-metric deltas:
from lightml.compare import compare_models
result = compare_models(
db="./registry/main.db",
model_a="llama-base", # baseline
model_b="llama-sft", # candidate
run_name="my-experiment", # optional filter
family="ENG 5-shot", # optional filter
)
# Convenience properties
print(f"Improved: {len(result.improved)}")
print(f"Regressed: {len(result.regressed)}")
print(f"Unchanged: {len(result.unchanged)}")
print(f"Missing: {len(result.missing)}")
# Pretty terminal output (color-coded)
print(result.to_text())
# JSON-serializable dict (for APIs)
data = result.to_dict()

Each delta contains family, metric_name, value_a, value_b, delta (B−A), and pct_change.
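The delta arithmetic is simple; here is a sketch, assuming pct_change is the relative change against the baseline A (an assumption worth checking if the sign convention matters to you):

```python
def metric_delta(value_a, value_b):
    """Delta (B - A) and percent change relative to baseline A."""
    delta = value_b - value_a
    pct_change = (delta / value_a) * 100 if value_a else float("inf")
    return delta, pct_change

# MMLU row from the compare example in this README: 52.10 -> 56.20
delta, pct = metric_delta(52.10, 56.20)
# delta ≈ +4.10, pct ≈ +7.9%
```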
Bulk-import eval results from a directory tree without writing any Python:
from lightml.scan import scan_and_import
stats = scan_and_import(
db="./registry/main.db",
run_name="lm-eval-run",
path="./eval_results", # each subfolder = one model
format="lm_eval", # or "json"
model_prefix="eval/", # optional prefix
force=False, # True = overwrite duplicates
)
print(f"Models: {stats.models_registered}")
print(f"Metrics: {stats.metrics_logged}")
print(f"Skipped: {stats.skipped_dirs}")

Directory layout expected:
eval_results/
├── model-alpha/
│   └── results_2026-01-15T10-30-00.json   # lm_eval format
├── model-beta/
│   └── results_2026-01-16T09-00-00.json
└── model-gamma/
    └── metrics.json                       # generic JSON format
Supported formats:
| Format | File pattern | Structure |
|---|---|---|
| lm_eval | results_*.json | {"results": {"task": {"metric": value}}} |
| json | metrics*.json / *.json | {"metric": value} or {"family": {"metric": value}} |
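For reference, a minimal stdlib parser for the lm_eval structure shown above. This is an illustrative sketch only; the scan module handles more edge cases (stderr fields, aliases, nested groups):

```python
import json
from pathlib import Path

def parse_lm_eval_results(path):
    """Yield (family, metric_name, value) triples from a results_*.json file."""
    data = json.loads(Path(path).read_text())
    for task, metrics in data.get("results", {}).items():
        for metric, value in metrics.items():
            # lm_eval result dicts mix numbers with strings (e.g. "alias");
            # keep only the numeric entries
            if isinstance(value, (int, float)):
                yield task, metric, value
```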
Check whether a model or metric exists in the database. All three methods return True/False:
# Does this model exist (any run)?
handle.model_exists("llama-sft") # True
# Does this specific metric exist for a model (any run)?
handle.metric_exists("llama-sft", "eng", "hellaswag_0shot_acc_norm") # True
# Does this metric exist for a model in the handle's run?
handle.run_metric_exists("llama-sft", "eng", "hellaswag_0shot_acc_norm")  # True

For exploratory queries, use search() with glob patterns (* and ?):
# All hellaswag metrics for any model starting with "MIIA"
results = handle.search(model="MIIA*", family="eng", metric="hella*")
for r in results:
print(f"{r['model']} {r['family']}/{r['metric']} = {r['value']:.4f}")
# MIIA14B-BASE eng/hellaswag_0shot_acc_norm = 0.7240
# MIIA14B-BASE eng/hellaswag_5shot_acc_norm = 0.7378
# MIIA7B eng/hellaswag_0shot_acc_norm = 0.7193
# ...
# Search models only
models = handle.search(model="llama*")
# [{"model": "llama-base"}, {"model": "llama-sft"}, ...]

search() returns a list of dicts with keys: model, family, metric, value, run (when searching metrics) or just model (when searching models only).
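The * and ? semantics match Python's fnmatch module. The filtering that search() performs can be approximated like this (a sketch over in-memory rows; the real query runs in SQL):

```python
from fnmatch import fnmatch

# Hypothetical rows in the shape search() returns
rows = [
    {"model": "MIIA14B-BASE", "family": "eng", "metric": "hellaswag_0shot_acc_norm", "value": 0.7240},
    {"model": "MIIA7B",       "family": "eng", "metric": "hellaswag_0shot_acc_norm", "value": 0.7193},
    {"model": "llama-base",   "family": "eng", "metric": "mmlu_acc",                 "value": 0.5210},
]

# Keep rows whose model matches "MIIA*" and whose metric matches "hella*"
hits = [
    r for r in rows
    if fnmatch(r["model"], "MIIA*") and fnmatch(r["metric"], "hella*")
]
# -> the two MIIA hellaswag rows
```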
Delete a model and all its associated data (checkpoints, metrics, detailed scores) in a single cascade operation:
from lightml.models.delete import DeleteResult
result = handle.delete_model(model_name="llama-sft")
print(result.model_name) # "llama-sft"
print(result.checkpoints_deleted) # 5
print(result.metrics_deleted)     # 80

The deletion:

- Removes the model row (cascade deletes checkpoints + metrics + detailed scores via foreign keys)
- Removes the symlink from the registry's models/ directory (if present)
- Raises ValueError if the model doesn't exist
- Does not delete child models — they keep their parent_id reference but become orphans
LightML prevents accidental duplicate metrics:
| Scenario | force=False (default) | force=True |
|---|---|---|
| Metric does not exist | INSERT → METRIC_INSERTED | INSERT → METRIC_INSERTED |
| Metric already exists | SKIP → METRIC_SKIPPED | UPDATE → METRIC_UPDATED |
This means you can safely re-run evaluation scripts without polluting your database.
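Conceptually, the force semantics amount to a guarded upsert. A sketch with illustrative constants and an in-memory store (not LightML's internals, which work against SQLite):

```python
# Illustrative status values; LightML exposes its own constants in lightml.metrics
METRIC_INSERTED, METRIC_UPDATED, METRIC_SKIPPED = "inserted", "updated", "skipped"

def log_metric(store, key, value, force=False):
    """Guarded upsert: insert if new, skip if present, overwrite only with force."""
    if key not in store:
        store[key] = value
        return METRIC_INSERTED
    if not force:
        return METRIC_SKIPPED
    store[key] = value
    return METRIC_UPDATED

db = {}
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.634) == METRIC_INSERTED
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.640) == METRIC_SKIPPED   # re-run is safe
assert log_metric(db, ("mmlu_5shot", "mmlu_acc"), 0.640, force=True) == METRIC_UPDATED
```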
lightml <command> [options]
Every command that needs a database accepts --db, but you don't have to type it every time. LightML resolves the database in this order:
| Priority | Source | Example |
|---|---|---|
| 1 | --db argument | lightml summary --db ./registry/main.db |
| 2 | LIGHTML_DB environment variable | export LIGHTML_DB=./registry/main.db |
| 3 | .lightml config file in the current directory | echo 'db=./registry/main.db' > .lightml |
| 4 | Auto-detect single *.db file in the current directory | Just cd into the directory containing your .db file |
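The lookup order can be sketched as follows (an approximation for illustration, not LightML's actual resolver): flag, then environment, then config file, then auto-detect.

```python
import os
from pathlib import Path

def resolve_db(cli_db=None, cwd="."):
    """Return the database path using the four-step resolution order, or None."""
    if cli_db:                                        # 1. --db argument
        return cli_db
    if os.environ.get("LIGHTML_DB"):                  # 2. environment variable
        return os.environ["LIGHTML_DB"]
    cfg = Path(cwd) / ".lightml"                      # 3. .lightml config file
    if cfg.is_file():
        for line in cfg.read_text().splitlines():
            if line.startswith("db="):
                return line.split("=", 1)[1].strip()
    candidates = list(Path(cwd).glob("*.db"))         # 4. single *.db in cwd
    return str(candidates[0]) if len(candidates) == 1 else None
```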
Once configured, all commands become shorter:
# Before
lightml summary --db ./registry/main.db
lightml top --db ./registry/main.db --family hellaswag --metric acc
# After (with LIGHTML_DB or .lightml set)
lightml summary
lightml top --family hellaswag --metric acc

Many commands support interactive selection: if you omit arguments like --model, --family, or --metric, the CLI shows a numbered list and lets you pick.
$ lightml info
Select a model:
1. gemma-9b
2. llama-base
3. llama-dpo
4. llama-sft
Model: 2
Model : llama-base
Run : eval-1
Path : meta-llama/Llama-3-8B
...

Commands with interactive support: info, top, metric-get, notes, rename, diff, compare, model-delete, stats.
All these commands still accept explicit arguments for scripting.
lightml init --path ./registry --name main [--overwrite]

lightml model-register \
--run my-experiment \
--name llama-sft \
--path /models/llama-sft \
--parent llama-base    # optional

lightml checkpoint-register \
--run my-experiment \
--model llama-sft \
--step 5000 \
--path /checkpoints/step-5000

lightml metric-log \
--run my-experiment \
--model llama-sft \
--family mmlu_5shot \
--metric mmlu_acc \
--value 0.634 \
--force    # optional: overwrite

lightml export [--output report.xlsx]

Scan a directory tree and bulk-import models + metrics:
# --format accepts "lm_eval" or "json"; --prefix (optional) prepends a
# model-name prefix; --force (optional) overwrites duplicates
lightml scan \
--run lm-eval-run \
--path ./eval_results \
--format lm_eval \
--prefix "eval/" \
--force

Each immediate subdirectory of --path is treated as one model.
Browse the contents of your registry without opening the dashboard.
# List all models (default)
lightml list
# List runs
lightml list runs
# List metric families
lightml list families
# Filter models by run, include hidden
lightml list models --run my-experiment --include-hidden

Output:
Model Run Parent Notes
─────────────────────────────────── ──────────────────── ───────────────────────── ────────────────────
gemma-9b eval-1 —
llama-base eval-1 —
llama-dpo eval-1 llama-sft DPO with UltraFeedback
llama-sft eval-1 llama-base
4 model(s)
lightml summary

Output:
main.db
─────────────────────────────────────────────
Runs : 2
Models : 8 (1 hidden)
Checkpoints : 3
Families : 4
Metrics : 156
Updated : 2026-03-15
Runs:
eval-1 (5 models)
eval-2 (3 models)
Families:
gsm8k (3 metrics · 5 models)
hellaswag (4 metrics · 8 models)
mmlu (6 metrics · 7 models)
winogrande (2 metrics · 4 models)
# Specify directly
lightml info --model llama-sft
# Or pick interactively
lightml info

Output:
Model : llama-sft
Run : eval-1
Path : my-org/llama-sft
Parent : llama-base
Children : llama-dpo
Checkpoints (2):
step 1000 /checkpoints/llama-sft/step-1000
step 5000 /checkpoints/llama-sft/step-5000
Metrics (12):
[hellaswag]
acc 0.6300
acc_norm 0.6600
[gsm8k]
exact_match 0.5500
Rank models by a specific metric. Family and metric can be selected interactively.
# Specify directly
lightml top --family hellaswag --metric acc --n 5
# Or pick interactively
lightml top

Output:
Leaderboard hellaswag / acc
# Model Score Run
─── ─────────────────── ──────── ────────────────────
#1 llama-dpo 0.6500 eval-1
#2 llama-sft 0.6300 eval-1
#3 gemma-9b 0.6100 eval-1
#4 llama-base 0.5900 eval-1
Read one metric value. Useful for scripting with --raw.
# Interactive
lightml metric-get
# Explicit
lightml metric-get --model llama-sft --family hellaswag --metric acc
# llama-sft hellaswag/acc = 0.6300
# Scriptable (just the number)
lightml metric-get --model llama-sft --family hellaswag --metric acc --raw
# 0.63

Colorized table comparing metrics across two or more models — like git diff but for metrics.
# Interactive — pick models from a list
lightml diff
# Explicit
# --run and --family are optional filters; --no-color disables
# colors (useful when piping)
lightml diff \
--models llama-base llama-sft gemma-9b \
--run my-experiment \
--family "ENG 5-shot" \
--no-color

Output:
lightml diff — 3 models (run: my-experiment)
══════════════════════════════════════════════════════════════════════
Family Metric llama-base llama-sft gemma-9b
──────────────────────────────────────────────────────────────────────
ENG 5-shot ARC 0.4430 0.4870 0.5120
ENG 5-shot HellaSwag 0.6950 0.7190 0.7340
ENG 5-shot MMLU 0.5210 0.5620 0.5480
──────────────────────────────────────────────────────────────────────
AVG (3 metrics) 0.5530 0.5893 0.5980
- Best value per metric is highlighted in green, worst in red (when 3+ models)
- An AVG row summarizes all metrics where every model has a value
- Missing metrics are shown as —
Also available as a Python API:
from lightml.diff import diff_models, format_diff
data = diff_models(
db="./registry/main.db",
model_names=["llama-base", "llama-sft", "gemma-9b"],
run_name="my-experiment",
family="ENG 5-shot",
)
print(format_diff(data))

Print a side-by-side metric delta table. Models can be selected interactively.
# Interactive
lightml compare
# Explicit
# --run and --family are optional filters
lightml compare \
--model-a llama-base \
--model-b llama-sft \
--run my-experiment \
--family "ENG 5-shot"

Output:
Compare: llama-base vs llama-sft
Run: my-experiment
──────────────────────────────────────────────────────────────────────────
Family Metric A B Δ %
──────────────────────────────────────────────────────────────────────────
ENG 5-shot MMLU 52.10 56.20 +4.10 +7.9%
ENG 5-shot ARC 44.30 48.70 +4.40 +9.9%
ENG 5-shot HellaSwag 69.50 71.90 +2.40 +3.5%
──────────────────────────────────────────────────────────────────────────
Check existence with exact names or glob patterns (*, ?). Exit code 0 = found, 1 = not found.
# Check if a model exists
lightml exists --model llama-sft
# Check if a specific metric exists
lightml exists --model llama-sft --family eng --metric hellaswag_0shot_acc_norm
# Glob search — find all hellaswag metrics for MIIA models
# (quote the patterns so the shell doesn't expand them)
lightml exists --model 'MIIA*' --family eng --metric 'hella*'
# ✓ MIIA14B-BASE eng/hellaswag_0shot_acc_norm = 0.7240 (run: MIIA14B)
# ✓ MIIA7B eng/hellaswag_0shot_acc_norm = 0.7193 (run: MIIA14B)
# 8 match(es)

# Restrict to a specific run
lightml exists --model 'llama*' --family eng --metric 'mmlu*' --run my-experiment

Usable in scripts:
if lightml exists --model my-model 2>/dev/null; then
echo "Model already registered, skipping"
fi

Interactively compare two models using McNemar's test and bootstrap confidence intervals. Requires detailed scores (logged with the scores parameter).
# Fully interactive
lightml stats
# Explicit
lightml stats \
--model-a llama-base \
--model-b llama-sft \
--family hellaswag_0shot \
--metric hellaswag_acc_norm

Output:
Statistical comparison: llama-base vs llama-sft
Family: hellaswag_0shot Metric: hellaswag_acc_norm
──────────────────────────────────────────────────────
Both correct: 7234
Only llama-base: 312
Only llama-sft: 487
Both wrong: 1967
Discordant: 799
──────────────────────────────────────────────────────
Mean llama-base: 0.7546
Mean llama-sft: 0.7721
Delta (A - B): -0.0175
95% CI: [-0.0264, -0.0087]
──────────────────────────────────────────────────────
McNemar p-value: 0.000012
Result: Significant (p < 0.05), llama-sft is better
When multiple metrics are selected, an overview table summarizes all tests at the end.
Attach text notes to a model (without opening the dashboard).
# Read notes (interactive model selection if --model omitted)
lightml notes --model llama-sft
# llama-sft: trained on 10k examples, lr=1e-4
# Write notes
lightml notes --model llama-sft --set "trained on 10k examples, lr=1e-4"

Fix typos or update naming conventions.
# Interactive
lightml rename
# Explicit
lightml rename --old llama-sfft --new llama-sft

Remove models with no metrics and no checkpoints, and runs with no models.
# Preview what would be removed
lightml prune --dry-run
# Actually remove
lightml prune

Output:
Models (2):
- empty-test-model
- orphan-model
Runs (1):
- abandoned-run
Pruned: 2 model(s), 1 run(s)
Like scan, but runs in a loop — useful during training/evaluation to auto-import new results as they appear.
# --interval: poll every N seconds (default: 30); --format: "lm_eval" or
# "json"; --prefix and --force are optional (prefix model names / overwrite)
lightml watch \
--path ./eval_results \
--run training-run \
--interval 60 \
--format lm_eval \
--prefix "eval/" \
--force

Output:
Watching ./eval_results
DB : ./registry/main.db
Run : training-run | Format: lm_eval | Interval: 60s
Ctrl+C to stop
[14:32:10] +2 models, +24 metrics
[14:33:10] no new data
[14:34:10] +1 models, +12 metrics
Merge all runs, models, and metrics from a source registry into your current one. Useful for combining results from different machines or teammates.
lightml merge --src ./other_registry.db
# Overwrite existing metrics (instead of skipping)
lightml merge --src ./other_registry.db --force

Output:
Merging ./other_registry.db → ./registry/main.db
run 'eval-gpu1': 3 model(s), 45 metric(s)
run 'eval-gpu2': 2 model(s), 30 metric(s)
Done. 5 model(s), 75 metric(s) processed.
Models with the same name are deduplicated (idempotent). Metrics are skipped if they already exist (use --force to overwrite).
Delete a model and all related data (checkpoints, metrics, detailed scores). Interactive if --name is omitted.
# Interactive
lightml model-delete
# Explicit
lightml model-delete --name llama-sft

Apply pending schema migrations to an older database (e.g. add the detailed_scores table introduced in v1.1.0):
lightml migrate

lightml version

lightml gui [--port 5050] [--host 0.0.0.0]

LightML ships with an interactive web dashboard — no external tools needed.
lightml gui --db ./registry/main.db

Pivoted metrics table with:
- Family tabs — one tab per metric family, plus "All Families" (properly scoped — same metric name across different families shows distinct values)
- Sorting — click any column header
- Search — filter models by name
- Color coding — best values highlighted in green, worst in red
- Checkpoints toggle — show/hide checkpoint rows
- Run filter — dropdown to isolate a specific run
- Model selection — checkbox column for selecting models
D3.js force-directed graph showing model lineage:
- Nodes = models, colored by run
- Edges = parent → child relationships
- Checkpoints hidden by default — toggle "Show checkpoints" in the control bar to reveal them
- Hover = tooltip with green/red dots showing which benchmarks have been evaluated
- Search — filter nodes by name, path, or run
- Drag & zoom — fully interactive
Select models from either view and compare them side-by-side:
- Select: click checkboxes in the table, or click nodes in the graph — selections sync across both views
- Selection bar: appears at the top showing count and actions
- Filter table: click "Filter table" to show only selected models
- Compare: select exactly 2 models, click "Compare" → a modal shows per-metric deltas with color-coded improvements (green) and regressions (red)
- Clear: reset selection in both views
Click ⬇ Excel in the header to download a formatted .xlsx report directly from the dashboard.
The export engine creates professional Excel reports from the database:
- One sheet per metric family — keeps related metrics grouped
- Automatic color scales — red → yellow → green formatting on all metric columns
- Frozen headers — first row + model name column stay visible while scrolling
- Models (Phase F) and Checkpoints (Phase S) on the same sheet
from pathlib import Path
from lightml.export import export_excel
export_excel(
db_path=Path("./registry/main.db"),
output_path=Path("./report.xlsx"),
)

Or via CLI:
lightml export --db ./registry/main.db --output report.xlsx

This walkthrough shows how to use LightML with lm-evaluation-harness to evaluate an LLM and track results. The complete example is in examples/lm_eval/.
Edit examples/lm_eval/config.yaml:
# ── LightML settings ──────────────────────────────
db: ./my_registry/main.db
run_name: llama-3-eval
# ── Model to evaluate ────────────────────────────
model_path: meta-llama/Llama-3-8B
# ── Evaluation matrix ────────────────────────────
lang: [eng]
benchmarks: [hellaswag, mmlu]
shots: [0, 5]
num_gpus: 1

Every field is explained inline. The key LightML fields are db (path to registry) and run_name (experiment name).
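The (benchmark × language × shots) matrix expands with itertools.product. A sketch of how a run script might derive metric families from this config; the family naming scheme here is an assumption modeled on the eng_hellaswag_0shot families used elsewhere in this walkthrough:

```python
from itertools import product

# Mirrors the lang / benchmarks / shots fields of config.yaml
cfg = {
    "lang": ["eng"],
    "benchmarks": ["hellaswag", "mmlu"],
    "shots": [0, 5],
}

# One evaluation job (and one metric family) per combination
families = [
    f"{lang}_{bench}_{n}shot"
    for lang, bench, n in product(cfg["lang"], cfg["benchmarks"], cfg["shots"])
]
# 1 language x 2 benchmarks x 2 shot settings = 4 jobs
```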
cd examples/lm_eval
python run_eval.py

The script does three things:
- Connects to LightML and registers the model (2 lines of setup)
- Runs lm_eval for each (benchmark × language × shots) combination
- Logs every metric to the registry with handle.log_model_metric()
Here's the core LightML integration — it's just 4 API calls:
from lightml.handle import LightMLHandle
# Setup — 2 lines
handle = LightMLHandle(db=cfg["db"], run_name=cfg["run_name"])
handle.register_model(model_name=cfg["run_name"], path=cfg["model_path"])
# After each benchmark completes — 1 call per metric
handle.log_model_metric(
model_name=handle.run_name,
family="eng_hellaswag_0shot",
metric_name="hellaswag_acc",
value=0.452,
)

lightml gui --db ./my_registry/main.db

Click ⬇ Excel in the dashboard header, or:
lightml export --db ./my_registry/main.db

LightML uses a single SQLite file with 6 tables:
-- Experiment container
CREATE TABLE run (
id INTEGER PRIMARY KEY AUTOINCREMENT,
run_name TEXT UNIQUE NOT NULL,
description TEXT,
metadata TEXT -- JSON blob
);
-- Trained model, scoped to a run
CREATE TABLE model (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_name TEXT NOT NULL,
path TEXT,
parent_id INTEGER REFERENCES model(id),
run_id INTEGER NOT NULL REFERENCES run(id),
UNIQUE(model_name, run_id)
);
-- Training checkpoint, linked to a model
CREATE TABLE checkpoint (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_id INTEGER NOT NULL REFERENCES model(id),
step INTEGER NOT NULL,
path TEXT,
created_at TEXT DEFAULT (datetime('now'))
);
-- Metric value, linked to a model OR a checkpoint
CREATE TABLE metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_id INTEGER REFERENCES model(id),
checkpoint_id INTEGER REFERENCES checkpoint(id),
family TEXT NOT NULL,
metric_name TEXT NOT NULL,
value REAL NOT NULL
);
-- Per-sample binary scores for statistical testing (v1.1.0+)
CREATE TABLE detailed_scores (
metric_id INTEGER NOT NULL PRIMARY KEY,
scores TEXT NOT NULL, -- JSON array of 0/1 values
n_samples INTEGER NOT NULL,
FOREIGN KEY(metric_id) REFERENCES metrics(id) ON DELETE CASCADE
);
-- Optional: restrict allowed metrics
CREATE TABLE registry_schema (
id INTEGER PRIMARY KEY AUTOINCREMENT,
family TEXT NOT NULL,
metric_name TEXT NOT NULL
);

Databases created with v1.0.x can be upgraded with lightml migrate --db <path> to add the detailed_scores table.
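A quick sanity check of the core tables with Python's built-in sqlite3, exercising a subset of the DDL above (run, model, metrics only) in memory:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    run_name TEXT UNIQUE NOT NULL
);
CREATE TABLE model (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    run_id     INTEGER NOT NULL REFERENCES run(id),
    UNIQUE(model_name, run_id)
);
CREATE TABLE metrics (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    model_id    INTEGER REFERENCES model(id),
    family      TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    value       REAL NOT NULL
);
""")
conn.execute("INSERT INTO run (run_name) VALUES ('eval-1')")
conn.execute("INSERT INTO model (model_name, run_id) VALUES ('llama-sft', 1)")
conn.execute(
    "INSERT INTO metrics (model_id, family, metric_name, value) "
    "VALUES (1, 'mmlu_5shot', 'mmlu_acc', 0.634)"
)
row = conn.execute(
    "SELECT m.model_name, x.value FROM metrics x JOIN model m ON m.id = x.model_id"
).fetchone()
# row == ('llama-sft', 0.634)
```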
LightML/
├── pyproject.toml # Package config, CLI entry point
├── README.md # This file
│
├── lightml/ # Library source
│ ├── __init__.py
│ ├── handle.py # LightMLHandle — main API (incl. bulk log_metrics)
│ ├── registry.py # Run & model registration logic
│ ├── checkpoints.py # Checkpoint registration + find_checkpoint
│ ├── metrics.py # Metric logging + deduplication + detailed scores
│ ├── database.py # SQLite schema initialization + migration
│ ├── stats.py # Statistical testing (McNemar, Bootstrap CI)
│ ├── export.py # Excel export engine
│ ├── compare.py # Model comparison (Pydantic models + compare_models)
│ ├── diff.py # N-model side-by-side diff (terminal table)
│ ├── scan.py # Auto-import from eval result directories
│ ├── gui.py # FastAPI dashboard server + /api/compare
│ ├── cli.py # CLI entry point (lightml command)
│ ├── models/ # Pydantic schemas (incl. DeleteResult)
│ ├── templates/
│ │ └── dashboard.html # Single-file SPA dashboard
│ └── tests/
│ ├── test_bugfix.py # Core regression tests (41 tests)
│ ├── test_compare.py # Compare feature tests (15 tests)
│ ├── test_diff.py # Diff feature tests (17 tests)
│ ├── test_scan.py # Scan / auto-import tests (17 tests)
│ ├── test_bulk.py # Bulk metric API tests (15 tests)
│ ├── test_delete.py # Model deletion tests (6 tests)
│ └── conftest.py # Shared fixtures
│
├── examples/
│ └── lm_eval/ # End-to-end evaluation example
│ ├── run_eval.py # lm_eval + LightML pipeline
│ └── config.yaml # Example configuration
│
└── docs/
└── gifs/ # GIF recordings for README





