Run lm-evaluation-harness at scale on Slurm clusters.
eval-hive manages the full evaluation lifecycle. Configure models and eval suites, plan work, submit Slurm jobs, track progress, and avoid duplicate work. Sibling project to inference-hive.
- Batch evaluation — evaluate many models and checkpoints against many eval suites with a single config file.
- Slurm native — automatic job script generation, array jobs, resource management, signal handling.
- Any lm-eval backend — works with vLLM, sglang, or serverless backends (hf, nemo). Any lm-eval task works out of the box.
- Flexible server deployment — single-node, multi-node, or multiple independent inference servers behind a built-in load balancer for high throughput eval.
- Parallel task execution — run multiple lm-eval processes concurrently to keep GPUs busy, with configurable task batching to reduce filesystem I/O on parallel filesystems.
- Request caching — pre-build lm-eval request caches for fast, race-free distribution to compute nodes.
- Result collection — collect lm-eval results into a parquet file with automatic score aggregation for task groups and eval suites.
- HuggingFace sync: configure `hf_result_repo` once and eval-hive handles the rest. `create-run` syncs coverage from HF to skip already-evaluated tasks, and `collect --upload` pushes results back.
- Deduplication: skips already-completed work at three levels: submit time (completed log + Slurm queue), HF coverage (synced at create-run), and runtime (existing result files). Resubmit safely after partial failures.
- Failure handling — per-task resumability, SIGUSR1 graceful shutdown, idempotent resubmission.
- Pydantic config validation — catches errors before any jobs are submitted.
Requires pixi. Clone the repo and let pixi handle the environment:
git clone git@github.com:ellamind/eval-hive.git
cd eval-hive
pixi install

All commands below use `pixi run eval-hive` (alias for `python -m eval_hive`).
# 1. Copy the template and fill in your cluster/model details
cp eh_config_template.yaml my_eh_config.yaml
# 2. Validate your config
pixi run eval-hive validate-config my_eh_config.yaml
# 3. Download datasets and build request caches (run on a login node with internet)
pixi run eval-hive prepare my_eh_config.yaml
# 4. Create a run directory with manifest and Slurm script
pixi run eval-hive create-run --config my_eh_config.yaml --output runs/my-run
# 5. Submit jobs
pixi run eval-hive submit runs/my-run
# 6. Monitor progress
pixi run eval-hive status runs/my-run
# 7. Collect results into parquet (with optional HF upload)
pixi run eval-hive collect runs/my-run
pixi run eval-hive collect runs/my-run --upload

 login node                                        compute nodes
 (has internet)                                    (no internet)
┌──────────┐    ┌────────────┐    ┌──────────┐    ┌───────────────────────────┐
│ prepare  │───►│ create_run │───►│ submit   │───►│ slurm job per model       │
│          │    │            │    │          │    │                           │
│ download │    │ plan work  │    │ sbatch   │    │ (optional) start server(s)│
│ datasets │    │ write      │    │ array    │    │ (optional) load balancer  │
│ + caches │    │ manifest   │    │ jobs     │    │ run lm_eval suites        │
│ + tarball│    │ HF sync    │    │          │    │ write results             │
└──────────┘    └────────────┘    └──────────┘    └───────────────────────────┘
                                                                │
                                                           ┌────┴─────┐
                                                           │ status   │
                                                           │ monitor  │
                                                           │ progress │
                                                           └────┬─────┘
                                                                │
                                                           ┌────┴─────┐
                                                           │ collect  │
                                                           │ parse +  │
                                                           │ aggregate│
                                                           │ → parquet│
                                                           │ (→ HF)   │
                                                           └──────────┘
# Validate config (optional — prepare and create-run also validate)
pixi run eval-hive validate-config eh_config.yaml
# Prepare: download datasets and build request caches (login node, has internet)
pixi run eval-hive prepare eh_config.yaml
pixi run eval-hive prepare eh_config.yaml --refresh # rebuild all caches
pixi run eval-hive prepare eh_config.yaml --workers 8 # parallel cache building
# Plan and generate run directory
pixi run eval-hive create-run --config eh_config.yaml --output runs/my-run
pixi run eval-hive create-run --config eh_config.yaml --output runs/my-run --force # overwrite
pixi run eval-hive create-run --config eh_config.yaml --output runs/my-run --update # update models/suites
# Submit jobs
pixi run eval-hive submit runs/my-run
pixi run eval-hive submit runs/my-run --dry # preview only
pixi run eval-hive submit runs/my-run --limit 5 # submit at most 5 jobs
pixi run eval-hive submit runs/my-run --retry-interval 10 # retry every 10 min until all submitted
# Monitor progress
pixi run eval-hive status runs/my-run
pixi run eval-hive status runs/my-run --detailed # per-suite/task breakdown
# Collect results into parquet
pixi run eval-hive collect runs/my-run # writes scores.parquet
pixi run eval-hive collect runs/my-run -o my_scores.parquet # custom output path
pixi run eval-hive collect runs/my-run --upload # merge + dedup + upload to HF (uses hf_result_repo from config)
pixi run eval-hive collect runs/my-run --upload org/other-scores # override HF repo
# Cancel all active jobs for a run
pixi run eval-hive cancel runs/my-run

Single YAML file per setup. Copy eh_config_template.yaml to get started. Models needing different server or Slurm settings get separate config files.
- Flat structure — one config = one set of server/Slurm/eval settings. Models needing different parallelism or backends get separate configs.
- `inference_server_command` is a plain string with placeholders, not a structured config. It supports any inference server backend (vLLM, sglang) and container wrapping (singularity, apptainer) without schema changes. Placeholders substituted at runtime: `${EH_PORT}`, `${EH_MODEL_PATH}`.
- `inference_server_command: null` disables the server lifecycle entirely. Use this for lm-eval backends that load models directly (`hf`, `nemo`, etc.). The job skips server start/health-check/stop and runs `lm_eval` immediately.
- `eval.model_args` dict is joined as `key=value,key=value` for `lm_eval --model_args`.
- `eval.lm_eval_args` is a passthrough dict: each key becomes a CLI flag (`--key value`). No schema changes needed when lm-eval adds new flags.
- Model entries carry metadata: `path`, `display_name`, optional `model_key` override, checkpoint info, and training metadata (`train_batch_size` / `tokens_trained`). A model and its checkpoints can share the same `model_key`; they differ only by `step` in the parquet composite key `(model, step, task, metric, metric_filter)`.
- Pydantic validation catches config errors before any jobs are submitted.
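The conversions described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not eval-hive's actual code: the function names are hypothetical, and `string.Template` stands in for whatever substitution mechanism the job script really uses.

```python
from string import Template


def join_model_args(model_args: dict) -> str:
    """Join a dict as key=value,key=value (the format lm_eval --model_args expects)."""
    return ",".join(f"{key}={value}" for key, value in model_args.items())


def to_cli_flags(lm_eval_args: dict) -> list:
    """Turn every key of the passthrough dict into a `--key value` pair."""
    flags = []
    for key, value in lm_eval_args.items():
        flags += [f"--{key}", str(value)]
    return flags


def render_server_command(command: str, port: int, model_path: str) -> str:
    """Fill in the ${EH_PORT} and ${EH_MODEL_PATH} placeholders (assumed mechanism)."""
    return Template(command).safe_substitute(EH_PORT=port, EH_MODEL_PATH=model_path)
```

Because the server command is an opaque string, wrapping it in a container launcher only changes the string, not the schema.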
Each model entry under models: defines a model or checkpoint series to evaluate:
models:
  # Config dict keys must be unique but are only used internally.
  # Use model_key to control the parquet 'model' column.
  HPLT2c_eng_main:
    model_key: HPLT2c_eng # parquet 'model' column (defaults to config key)
    path: "/path/to/model/main"
    display_name: "HPLT2c eng" # required: human-friendly name
    tokens_trained: 100B # recommended for non-checkpoint models (supports K, M, B, T)
  HPLT2c_eng_checkpoints:
    model_key: HPLT2c_eng # same model_key → same model in parquet, different steps
    path: "/path/to/model"
    display_name: "HPLT2c eng"
    checkpoint_pattern: "checkpoint_{step}"
    steps: [5000, 10000, 20000] # optional filter (default: discover all)
    train_batch_size: 2_097_152 # recommended for checkpoints (tokens_trained = batch_size × step)

| Field | Required | Description |
|---|---|---|
| `path` | yes | Local path or HuggingFace model ID |
| `display_name` | yes | Human-friendly name for result display |
| `model_key` | no | Parquet `model` column value. Defaults to config dict key. Set this to share a model key between a main model and its checkpoints. |
| `checkpoint_pattern` | no | Pattern for checkpoint subdirectories, e.g. `checkpoint_{step}` |
| `steps` | no | Filter to specific steps. Requires `checkpoint_pattern`. |
| `train_batch_size` | no | Batch size in tokens. Recommended for checkpoints (tokens_trained = train_batch_size × step). |
| `tokens_trained` | no | Total tokens trained. Supports human-readable suffixes: 100B, 2T, 500M. Recommended for non-checkpoint models. |
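To make the two training-metadata fields concrete, here is a small sketch of how a suffixed token count could be parsed and how `tokens_trained` follows from `train_batch_size × step`. The helper names are hypothetical and the parsing rules (decimal multipliers, case-insensitive suffix) are assumptions, not the validator eval-hive ships.

```python
import re
from decimal import Decimal

# Assumed decimal multipliers for the K, M, B, T suffixes.
_SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_token_count(value) -> int:
    """Parse 100B, 2T, 500M or a plain integer into a token count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMBT]?)", str(value).strip().upper())
    if match is None:
        raise ValueError(f"not a token count: {value!r}")
    number, suffix = match.groups()
    # Decimal avoids float rounding for inputs such as "2.3B".
    return int(Decimal(number) * _SUFFIXES[suffix])


def tokens_at_step(train_batch_size: int, step: int) -> int:
    """tokens_trained for a checkpoint = train_batch_size × step."""
    return train_batch_size * step
```

With the example config above, the checkpoint at step 5000 corresponds to `tokens_at_step(2_097_152, 5000)` tokens.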
The prepare step (prepare.py) runs on a node with internet access and does two things:
- Downloads datasets — loading each task triggers HuggingFace dataset downloads if not already in cache.
- Builds request caches: calls `task.build_all_requests(cache_requests=True)` to pre-build and pickle the prompt instances lm-eval will need at eval time.
At the end, it packs all cache files into a compressed tarball (<cache_dir>/cache.tar.gz) for fast distribution to compute nodes.
Lustre and similar parallel filesystems are slow at reading many small files because of metadata overhead. Eval jobs therefore extract the tarball to a local temp directory and point LM_HARNESS_CACHE_PATH at it.
This also prevents race conditions. lm-eval has no file locking on cache writes, so if multiple jobs hit a cache miss simultaneously they'd all race to write the same pickle file. With per-job temp directories, each job has its own isolated copy.
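The extract-and-point step might look like the following in Python. This is only a sketch: the generated Slurm script presumably does the equivalent in shell, and the function name and the `eh_cache_` prefix are made up for illustration.

```python
import os
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def unpack_cache(tarball: Path, tmp_root: Optional[Path] = None) -> Path:
    """Extract cache.tar.gz into a fresh per-job directory and export its path."""
    # mkdtemp gives every job its own directory, so jobs never share cache files.
    job_dir = Path(tempfile.mkdtemp(prefix="eh_cache_", dir=tmp_root))
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(job_dir)
    # lm-eval reads its request cache location from this environment variable.
    os.environ["LM_HARNESS_CACHE_PATH"] = str(job_dir)
    return job_dir
```

One large sequential read of the tarball replaces many small metadata-heavy reads, and the private directory means a cache miss in one job cannot corrupt another job's pickle.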
The tarball is only rebuilt when needed:
- New cache files were created
- `--refresh` was used (all caches rewritten)
- The tarball doesn't exist yet
If all caches are already up to date, the tarball step is skipped.
With --workers N (default 4), tasks are sharded across worker processes. Tasks sharing the same dataset are grouped together to avoid redundant downloads. A greedy bin-packing algorithm balances load across workers.
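A greedy bin-packing pass of this kind can be sketched as below. It assumes load is measured by the number of tasks per dataset group, which may differ from the cost measure `prepare.py` actually uses; `shard_tasks` is a hypothetical name.

```python
import heapq


def shard_tasks(groups: dict, workers: int) -> list:
    """Assign dataset groups to workers, largest group first, least-loaded worker first.

    `groups` maps a dataset name to the list of tasks that share it.
    """
    shards = [[] for _ in range(workers)]
    heap = [(0, index) for index in range(workers)]  # (current load, worker index)
    # Largest groups first; the name breaks ties so the result is deterministic.
    for dataset in sorted(groups, key=lambda name: (-len(groups[name]), name)):
        load, index = heapq.heappop(heap)
        shards[index].extend(groups[dataset])  # a group is never split across workers
        heapq.heappush(heap, (load + len(groups[dataset]), index))
    return shards
```

Keeping each group on a single worker is what avoids downloading the same dataset twice.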
The create_run step resolves the config into a concrete execution plan:
- Loads config and validates with Pydantic.
- Builds manifest: iterates all models, resolves checkpoint patterns, and produces a flat mapping of `{model_key, label, model_path, display_name, train_batch_size, tokens_trained}` dicts keyed by `{model_key}--{label}`. The `model_key` comes from the entry's `model_key` field (or the config dict key if not set).
- HF sync (when `hf_result_repo` is set): downloads the HF dataset parquet and writes `eh_hf_covered.json`, a skip list mapping manifest keys to their already-covered leaf tasks. The job script reads this at startup so individual tasks are skipped without running lm-eval. Manifest keys where all tasks are covered are also written to `jobs_completed.log` so `submit` skips them entirely. This is a one-time sync; HF data is assumed stable for the lifetime of the run.
- Generates run directory containing:
  - `eh_manifest.json`: the manifest (read by `jq` in the job script)
  - `eh_config.yaml`: frozen copy of the input config
  - `eh_job.slurm`: generated SBATCH script with all config baked in
  - `eh_hf_covered.json`: HF coverage skip list (if `hf_result_repo` is set)
  - `logs/` and `progress/` directories
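The manifest-building step can be illustrated with a simplified sketch. It takes plain dicts instead of validated config objects, uses the label `main` for non-checkpoint models, and only handles an explicit `steps` list (the default of discovering all checkpoint directories is left out). All of these are assumptions for the example.

```python
def build_manifest(models: dict) -> dict:
    """Flatten model entries into {model_key}--{label} → manifest entry."""
    manifest = {}
    for config_key, entry in models.items():
        # model_key falls back to the config dict key when not set.
        model_key = entry.get("model_key", config_key)
        pattern = entry.get("checkpoint_pattern")
        if pattern is None:
            targets = [("main", entry["path"])]
        else:
            labels = [pattern.format(step=step) for step in entry.get("steps", [])]
            targets = [(label, f"{entry['path']}/{label}") for label in labels]
        for label, model_path in targets:
            manifest[f"{model_key}--{label}"] = {
                "model_key": model_key,
                "label": label,
                "model_path": model_path,
                "display_name": entry["display_name"],
                "train_batch_size": entry.get("train_batch_size"),
                "tokens_trained": entry.get("tokens_trained"),
            }
    return manifest
```

Two config entries that share a `model_key` end up as sibling manifest keys that differ only in the label.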
The generated Slurm script handles the full job lifecycle:
- Reads its manifest entry via `jq .[${SLURM_ARRAY_TASK_ID}]`
- Sets `EH_MODEL_PATH`, `EH_PORT` (random 30000–59999), environment variables
- Optionally starts inference server, runs health checks, shuts down on completion
- Loops over all suites, skipping any with existing `results_*.json` or HF coverage (runtime dedup)
- Uses `--output_path` with `.json` suffix so lm-eval writes directly to the suite directory without creating model subdirectories
- Signal handling: SIGUSR1 (approaching timeout) triggers auto-resubmission, SIGTERM marks failure
- Records completion/failure to append-only logs
Each Slurm job evaluates one model (or checkpoint) against all its suites, amortizing the cost of server startup. The job array maps each index to a (model, checkpoint) pair via a manifest.
Job array index 0 → model-A → [suite_easy, suite_main]
Job array index 1 → model-A/ckpt-10000 → [suite_easy, suite_main]
Job array index 2 → model-A/ckpt-20000 → [suite_easy, suite_main]
Job array index 3 → model-B → [suite_easy, suite_main]
By default (parallel_tasks: 1), suites are evaluated sequentially — one lm_eval process at a time. lm-eval tokenizes all requests on a single CPU core before sending any API calls, which causes a multi-minute stall with GPUs idle for large suites.
With parallel_tasks: N (N > 1), suites are expanded into their individual leaf tasks (e.g., my_suite_easy → arc_easy_rc, hellaswag_rc, ...) and up to N lm_eval processes run concurrently. This spreads tokenization across multiple CPU cores while the inference server handles concurrent requests from all processes.
Tasks are grouped into batches of task_batch_size (default 8) and each batch is passed to a single lm_eval invocation as a comma-separated --tasks list. This reduces filesystem I/O on parallel filesystems (GPFS/Lustre) by minimizing the number of concurrent processes, directories, and result files. The effective batch size is automatically capped so that all parallel_tasks slots are utilized — e.g. with 4 remaining tasks and parallel_tasks: 8, each task gets its own slot (batch size 1).
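One way to express the capping rule is shown below. The formula (`ceil(remaining / parallel_tasks)`, bounded by `task_batch_size`) is an assumption that reproduces the example in the paragraph above; the generated job script may compute it differently.

```python
import math


def effective_batch_size(remaining: int, parallel_tasks: int, task_batch_size: int = 8) -> int:
    """Shrink the batch size when there are too few tasks to fill every slot."""
    per_slot = math.ceil(remaining / parallel_tasks)
    return max(1, min(task_batch_size, per_slot))


def make_batches(tasks: list, parallel_tasks: int, task_batch_size: int = 8) -> list:
    """Chunk tasks into comma-separated strings, one per lm_eval --tasks argument."""
    size = effective_batch_size(len(tasks), parallel_tasks, task_batch_size)
    return [",".join(tasks[start:start + size]) for start in range(0, len(tasks), size)]
```

With 4 remaining tasks and `parallel_tasks: 8`, this yields four single-task batches; with 100 tasks and `parallel_tasks: 4`, the configured batch size of 8 is used unchanged.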
create_run resolves suite/group names to leaf tasks and writes eh_task_map.json into the run directory. At runtime, the Slurm script reads it to build a deduplicated task list (tasks shared across suites are evaluated once), filters out already-completed tasks by scanning existing result files, and batches the remainder.
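The expand, deduplicate and filter steps can be sketched as a single pass over the task map. The function name is hypothetical, and treating a name that is missing from the map as a leaf task is an assumption.

```python
def pending_leaf_tasks(suites: list, task_map: dict, completed: set) -> list:
    """Expand suites to leaf tasks, keep each task once, drop finished ones."""
    seen = set(completed)
    ordered = []
    for suite in suites:
        # Assumption: a name without a task-map entry is already a leaf task.
        for task in task_map.get(suite, [suite]):
            if task not in seen:
                seen.add(task)
                ordered.append(task)
    return ordered
```

A task that appears in two suites is therefore evaluated once, and its result serves both suites at collection time.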
Results are stored in batch directories:
{output_path}/{model_key}/{label}/batch_{SLURM_JOB_ID}_{NNN}/results_*.json # combined scores
{output_path}/{model_key}/{label}/batch_{SLURM_JOB_ID}_{NNN}/samples_*.jsonl # per-task sample files
{output_path}/{model_key}/{label}/batch_{SLURM_JOB_ID}_{NNN}/lm_eval.log # batch log
The main Slurm log (logs/) only shows batch start/finish messages. Full lm-eval output (progress bars, warnings, result tables) is captured exclusively in the per-batch lm_eval.log files.
Two independent scaling axes control how inference servers are deployed:
| Axis | Purpose | When to use |
|---|---|---|
| `num_nodes_per_inference_server` | Capacity: fit a large model across multiple nodes | 70B model needs TP=8 across 2× 4-GPU nodes |
| `num_inference_servers` | Throughput: run independent instances behind load balancer | Fast eval with 4 parallel vLLM instances |
Total Slurm nodes per job = num_nodes_per_inference_server × num_inference_servers.
Single server, single node (default: 1×1):
Slurm job (--nodes=1)
Node 0:
├── vLLM server (:64444)
└── lm_eval ──► http://localhost:64444/v1/completions
Multi-node server for large models (2×1):
Slurm job (--nodes=2)
Node 0 (head):
├── vLLM server (:64444, TP=8 via Ray across nodes 0-1)
└── lm_eval ──► http://localhost:64444/v1/completions
Node 1: └── vLLM Ray worker (joins head node's server)
Multiple independent servers for throughput (1×4):
Slurm job (--nodes=4)
Node 0 (coordinator):
├── vLLM server (:64444)
├── Load balancer (:8000) ──► least-connections to all 4 backends
└── lm_eval ──► http://localhost:8000/v1/completions
Node 1: └── vLLM server (:64444)
Node 2: └── vLLM server (:64444)
Node 3: └── vLLM server (:64444)
Both combined — large model with throughput scaling (2×2):
Slurm job (--nodes=4)
Nodes 0-1: Server instance 0 (TP=8 across 2 nodes)
Node 0 (coordinator): vLLM head + Load balancer (:8000) + lm_eval
Node 1: vLLM Ray worker
Nodes 2-3: Server instance 1 (TP=8 across 2 nodes)
Node 2: vLLM head
Node 3: vLLM Ray worker
Why least-connections over round-robin: lm-eval requests are non-uniform — a BPB loglikelihood takes ~10ms while a CoT generation takes seconds. Least-connections routes to the server with fewest in-flight requests, naturally adapting to uneven request costs.
The load balancer is a small async Python module (eval_hive/load_balancer.py) using aiohttp. No external dependencies (nginx, traefik) needed on compute nodes.
When num_inference_servers: 1, no load balancer is started — lm-eval connects directly to the server instance.
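The least-connections policy itself is small. The sketch below shows only the selection logic, without the aiohttp proxying that `eval_hive/load_balancer.py` adds around it; the class and method names are hypothetical.

```python
class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends: list):
        self.in_flight = {backend: 0 for backend in backends}

    def acquire(self) -> str:
        # min() returns the first backend with the lowest count, so ties
        # resolve in the order the backends were listed.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call this when the proxied request finishes, whether it succeeded or failed.
        self.in_flight[backend] -= 1
```

A backend stuck on a long generation keeps its counter high, so short loglikelihood requests drift to the other servers without any explicit cost model.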
When inference_server_command is set to null, the job skips the entire server lifecycle (start, health check, load balancer, stop) and runs lm_eval directly. This supports lm-eval backends that load the model in-process (hf, nemo, etc.). Use a separate config with appropriate eval.lm_eval_args.model and eval.model_args.
Job script logic:
if inference_server_command is set:
    start_server → health_check → (load_balancer if num_inference_servers > 1)
fi
load eh_hf_covered.json → EH_HF_COVERED associative array # HF skip list
# parallel_tasks=1 (sequential, per-suite):
for suite in eval.suites_and_tasks:
    if results_*.json exists or suite in EH_HF_COVERED: skip
    lm_eval run --tasks $suite ... > {suite}/lm_eval.log 2>&1
done
# parallel_tasks>1 (parallel, batched):
expand suites → deduplicated leaf tasks via eh_task_map.json
scan existing results → filter out completed tasks
chunk remaining into batches of task_batch_size
for batch in batches (up to parallel_tasks concurrent):
    lm_eval run --tasks $batch_csv ... > batch_dir/lm_eval.log 2>&1 &
done; wait
if inference_server_command is set:
    stop_server
fi

Three levels of deduplication.
Submit-time dedup (submit.py): before submitting a job, checks progress/jobs_completed.log and the Slurm queue, skipping manifest keys that are already completed or active. Jobs fully covered by HF data are pre-marked as completed by create-run.
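The submit-time check reduces to a set difference. The sketch below assumes the completed log holds one manifest key per line and that the set of active keys has already been obtained from the Slurm queue; both helpers are hypothetical names rather than functions from `submit.py`.

```python
from pathlib import Path


def read_log(path: Path) -> set:
    """Return the non-empty lines of an append-only log (empty set if missing)."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def jobs_to_submit(manifest_keys: list, completed: set, active: set, limit=None) -> list:
    """Keep only manifest keys that are neither completed nor already queued."""
    pending = [key for key in manifest_keys if key not in completed and key not in active]
    return pending if limit is None else pending[:limit]
```

Because the result depends only on the logs and the queue, running `submit` twice in a row submits nothing the second time.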
HF-based dedup (create-run): when hf_result_repo is set, downloads the HF dataset parquet and writes eh_hf_covered.json — a per-manifest-key list of covered leaf tasks. The job script reads this file at startup and skips covered tasks. Manifest keys where all tasks are covered are also written to jobs_completed.log so submit skips them entirely. Respects HF_HUB_OFFLINE for cached/offline access.
Runtime dedup (inside each job): before batching, scans all existing result files to find completed task names (handles both per-task and batch directories). Completed tasks and HF-covered tasks are filtered out before creating batches. Sequential mode instead checks each suite directory for an existing results_*.json.
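The result scan can be approximated in Python as follows. The real job script does this in shell. The sketch assumes each lm-eval result file carries a top-level `results` object keyed by task name, and `completed_tasks` is a hypothetical helper.

```python
import json
from pathlib import Path


def completed_tasks(label_dir: Path) -> set:
    """Collect task names from every results_*.json one level below label_dir."""
    done = set()
    # Matches suite directories and batch_* directories alike.
    for result_file in label_dir.glob("*/results_*.json"):
        try:
            data = json.loads(result_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # a truncated file from a killed job counts as not completed
        done.update(data.get("results", {}))
    return done
```

Ignoring unreadable files keeps the scan safe to run after a job was killed mid-write.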
Results are written to:
# sequential mode (parallel_tasks=1):
{output_path}/{model_key}/{label}/{suite_name}/results_*.json
# parallel mode (parallel_tasks>1):
{output_path}/{model_key}/{label}/batch_{SLURM_JOB_ID}_{NNN}/results_*.json
Completion tracking uses a simple append-only log (progress/jobs_completed.log). Each completed job appends its task ID. Failed jobs are tracked in progress/jobs_failed.log with reasons.
The collect command parses lm-eval result JSON files from a run directory and produces a single parquet file.
Discovery: walks {output_path}/{model_key}/{label}/*/results_*.json using the manifest to find all result files. For each manifest entry, the training step is extracted from the label (e.g. checkpoint_0005000 → 5000; main → None).
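Step extraction from a label could be done with a single regular expression. This sketch assumes the step is the trailing run of digits in the label, which holds for the `checkpoint_{step}` pattern shown earlier but may not cover every pattern `collect.py` supports.

```python
import re


def step_from_label(label: str):
    """Return the trailing integer of a label, or None if there is none."""
    match = re.search(r"(\d+)$", label)
    return int(match.group(1)) if match else None
```

For example, `step_from_label("checkpoint_0005000")` gives `5000`, and `step_from_label("main")` gives `None`.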
Parsing: each result JSON is parsed into ScoreRow records capturing the score, metric, task metadata, language, formulation type, and subtask structure.
Aggregation: after parsing leaf benchmark scores, group and suite scores are computed bottom-up from the YAML task hierarchy. For each group defined in the task YAMLs:
- The group's `aggregate_metric_list` specifies which metrics to aggregate and how.
- Children's scores are collected from leaf results or already-computed sub-group scores.
- The aggregate is computed as a simple mean, or weighted by `n_samples` when `weight_by_size: true`.
- A `subtask_tree` adjacency map tracks which tasks contribute to each group score.
Groups reachable from the configured suites_and_tasks are tagged as eval_suite; intermediate groups as task_group.
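The bottom-up computation can be sketched as a short recursion over the group hierarchy. This is a simplified illustration for a single metric: the data shapes (`children_of`, `leaves`) and the choice to give a group the summed `n_samples` of its children are assumptions, not the structures used in `aggregate.py`.

```python
def group_score(name: str, children_of: dict, leaves: dict, weight_by_size: bool = False):
    """Return (score, n_samples) for a leaf task or a group, computed bottom-up.

    `leaves` maps a task to (score, n_samples); `children_of` maps a group to its children.
    """
    if name in leaves:
        return leaves[name]
    parts = [group_score(child, children_of, leaves, weight_by_size)
             for child in children_of[name]]
    total = sum(n_samples for _, n_samples in parts)
    if weight_by_size:
        score = sum(score * n_samples for score, n_samples in parts) / total
    else:
        score = sum(score for score, _ in parts) / len(parts)
    return score, total
```

With leaves `a = (0.5, 100)` and `b = (1.0, 300)`, the simple mean of the group is 0.75 while the size-weighted mean is 0.875, which is why the choice of `weight_by_size` matters for unevenly sized benchmarks.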
HF upload (--upload): downloads the existing parquet from the HuggingFace dataset repo (configured via hf_result_repo), merges with local results, deduplicates on (model, step, task, metric, metric_filter) keeping the latest eval_date, and re-uploads. The local parquet is also updated with the merged result. Use --upload org/other-repo to override the target repo.
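The merge-and-dedup rule can be shown on plain row dicts. This is a sketch of the rule only: the actual code works on parquet data, and it is assumed here that `eval_date` values compare correctly as stored (for example ISO 8601 strings).

```python
KEY_COLUMNS = ("model", "step", "task", "metric", "metric_filter")


def merge_scores(remote_rows: list, local_rows: list) -> list:
    """Merge two row lists, keeping the latest eval_date per composite key."""
    latest = {}
    for row in [*remote_rows, *local_rows]:
        key = tuple(row[column] for column in KEY_COLUMNS)
        # On equal dates the later row wins, so local results replace remote ones.
        if key not in latest or row["eval_date"] >= latest[key]["eval_date"]:
            latest[key] = row
    return list(latest.values())
```

Because `step` is part of the key, a main model and its checkpoints never overwrite each other even when they share a `model_key`.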
- Per-suite/batch resumability: each suite (sequential mode) or batch (parallel mode) completes independently. Failures in one don't affect others. On restart, completed tasks are detected by scanning existing result files and excluded from new batches.
- SIGUSR1 signal handling: Slurm sends SIGUSR1 before timeout. The job gracefully stops the server, marks progress, and can be resubmitted.
- Idempotent resubmission: `submit` only submits jobs for incomplete work. Run it again after partial completion.
Generated by create-run:
runs/my-run/
├── eh_config.yaml # frozen copy of config
├── eh_manifest.json # task_key → {model_key, label, model_path, display_name, ...}
├── eh_task_map.json # suite → [leaf_tasks] (used by HF sync, status, parallel mode)
├── eh_hf_covered.json # task_key → [covered_tasks] (HF skip list, if hf_result_repo set)
├── eh_job.slurm # generated sbatch script
├── logs/ # {model}-{checkpoint}-{jobid}.log (start/finish per task only)
└── progress/
├── jobs_completed.log # append-only completion tracking (includes HF-covered jobs)
└── jobs_failed.log # failure tracking with reasons
eval-hive/
├── eval_hive/
│ ├── __init__.py
│ ├── __main__.py # CLI entry point (subcommands)
│ ├── config.py # Pydantic config models (EhConfig, ModelEntry, EvalSection)
│ ├── create_run.py # Generate run directory, manifest, sbatch script
│ ├── prepare.py # Download datasets, build request caches, pack tarball
│ ├── validate_config.py # Validate config and display model/checkpoint table
│ ├── submit.py # Submit jobs from manifest with dedup
│ ├── collect.py # Collect results into parquet with aggregation (+ --upload to HF)
│ ├── status.py # Monitor run progress
│ ├── cancel.py # Cancel active Slurm jobs for a run
│ ├── load_balancer.py # Async least-connections reverse proxy
│ └── results/
│ ├── __init__.py
│ ├── schemas.py # ScoreRow, EvalConfig, TaskConfig (Pydantic models)
│ ├── parse.py # Parse lm-eval result JSONs into ScoreRows
│ ├── aggregate.py # Compute group/suite scores from YAML hierarchy
│ └── hf.py # HuggingFace parquet download, upload, merge + dedup
├── eh_config_template.yaml # Annotated config template
├── pixi.toml # Environment definition
└── README.md