Plan Evaluator

Plan Evaluator is a local-first experiment runner for evaluating plan datasets with deterministic heuristics and optional single-shot LLM judges. It is built to make evaluation work inspectable, repeatable, and easy to operate when you need to compare multiple datasets, model combinations, or metric selections.
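
As a rough illustration of what a deterministic heuristic means here, the sketch below scores a plan purely from its structure, with no model calls. The function name and thresholds are illustrative assumptions, not one of the project's built-in metrics.

```python
# Hypothetical deterministic heuristic: same plan in, same score out, no model calls.
# The name and thresholds are assumptions, not Plan Evaluator built-ins.
def step_count_score(steps: list[str], min_steps: int = 3, max_steps: int = 12) -> float:
    """Return 1.0 when the plan's step count falls in the expected range, else 0.0."""
    return 1.0 if min_steps <= len(steps) <= max_steps else 0.0
```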

System at a glance

The system has three cooperating parts:

  • a CLI for scripted and batch workflows,
  • a FastAPI web app with an intuitive control-panel-style UI,
  • a background worker that processes queued jobs from SQLite.

This separation keeps workflows simple: configure and launch runs from the UI or CLI, let the worker execute evaluator jobs, and inspect results through artifacts and run summaries.
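
A minimal sketch of how such a worker might drain a SQLite-backed queue is shown below. The table schema, column names, and the run_evaluator_job stub are assumptions made for illustration, not the project's actual implementation.

```python
import sqlite3
import time


def run_evaluator_job(job_id: int) -> None:
    """Stand-in for the real evaluation step."""
    print(f"evaluating job {job_id}")


def claim_next_job(conn: sqlite3.Connection):
    """Atomically claim the oldest queued job, or return None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY created_at LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        return row[0]


def worker_loop(db_path: str = "plan_evaluator.db") -> None:
    conn = sqlite3.connect(db_path)
    while True:
        job_id = claim_next_job(conn)
        if job_id is None:
            time.sleep(1.0)  # idle poll while the queue is empty
            continue
        run_evaluator_job(job_id)
        with conn:
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```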

Transparency and reproducibility

Plan Evaluator is designed so every run can be audited and reproduced. Each run writes a dedicated artifact tree under runs/<run_id>/ containing the resolved config snapshot, evaluator outputs, logs, and run summary. The evaluator pipeline normalizes inputs into a canonical record shape before scoring, so the same inputs and config produce the same results for deterministic metrics.
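
A rough sketch of what that normalization step might look like appears below; the PlanRecord fields and the field_map convention are assumptions for illustration, not the project's canonical schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PlanRecord:
    """Illustrative canonical record shape; field names are assumptions."""
    plan_id: str
    steps: list[str]
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize(raw: dict[str, Any], field_map: dict[str, str]) -> PlanRecord:
    """Map dataset-specific keys onto the canonical shape before scoring."""
    return PlanRecord(
        plan_id=str(raw[field_map["plan_id"]]),
        steps=list(raw[field_map["steps"]]),
        metadata={k: v for k, v in raw.items() if k not in field_map.values()},
    )
```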

The system also emphasizes provenance; a minimal sketch of this bookkeeping follows the list:

  • run state and job state are tracked in SQLite,
  • per-evaluator outputs are persisted as append-only JSONL artifacts,
  • aggregate CSV output is generated from those persisted artifacts,
  • configuration and prompt/request hashes are recorded to support cross-run comparison.
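
Assuming hypothetical helper names and file layout (neither is taken from the project's code), that bookkeeping might look roughly like this:

```python
import hashlib
import json
from pathlib import Path


def config_hash(resolved_config: dict) -> str:
    """Stable hash of the resolved config, suitable for cross-run comparison."""
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_result(run_dir: Path, evaluator: str, result: dict) -> None:
    """Append one evaluator output as a JSONL line; earlier rows are never rewritten."""
    path = run_dir / f"{evaluator}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(result, sort_keys=True) + "\n")
```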

Secrets are handled separately from artifacts. API keys are read from environment variables at runtime and are not written into run outputs.
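
In code terms this amounts to something like the following, where the variable name is an assumption rather than a documented requirement:

```python
import os

# Hypothetical variable name; the project may expect a different one.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("LLM judging is enabled but no API key is set in the environment")
```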

Control-panel workflow for multi-dataset evaluations

The web UI is built to make repeated experimentation straightforward. You can upload and preview JSONL inputs, map fields, choose heuristic groups, enable or disable LLM judging, and select models in a single control panel. From there, launching multiple runs against different datasets or evaluator setups is quick because the same workflow and config shape are reused each time.
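
As an illustration of how that reuse plays out, the dictionaries below mimic two run configurations that differ only in the dataset and evaluator fields. Every key, heuristic name, and the model id is an assumption made for this sketch, not the project's actual config schema.

```python
# Hypothetical run-config shape; keys, heuristic names, and the model id are assumptions.
baseline_run = {
    "dataset": {"path": "data/plans_a.jsonl", "field_map": {"plan_id": "id", "steps": "plan"}},
    "evaluators": {"heuristics": ["step_count", "coverage"], "llm_judge": {"enabled": False}},
    "output_dir": "runs",
}

mixed_run = {
    **baseline_run,
    "dataset": {"path": "data/plans_b.jsonl", "field_map": {"plan_id": "id", "steps": "plan"}},
    "evaluators": {
        "heuristics": ["step_count", "coverage"],
        "llm_judge": {"enabled": True, "model": "gpt-4o-mini"},  # assumed model id
    },
}
```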

In practice, this makes it easy to:

  • run baseline heuristics-only evaluations on one dataset,
  • run mixed heuristics plus LLM evaluations on another dataset,
  • compare results across runs using a consistent artifact structure and summary outputs.

Quick start

  1. Create and activate a Python 3.12+ virtual environment.
  2. Install dependencies:
    • pip install -e .[dev]
  3. Start the API server:
    • plan-evaluator-api
  4. Start the worker process in another terminal:
    • plan-evaluator-worker
  5. Open http://127.0.0.1:8000.

CLI

  • plan-evaluator-cli validate-config --config path/to/config.yaml
  • plan-evaluator-cli run --config path/to/config.yaml
  • plan-evaluator-cli resume --run-id <run_id>
  • plan-evaluator-cli list-runs
  • plan-evaluator-cli show-run-summary --run-id <run_id>

Notes

  • Run artifacts are written under runs/<run_id>/.
  • API keys are read from environment variables and never persisted to run artifacts.
