Plan Evaluator

Plan Evaluator is a local-first experiment runner for evaluating plan datasets with deterministic heuristics and optional single-shot LLM judges. It is built to make evaluation work inspectable, repeatable, and easy to operate when you need to compare multiple datasets, model combinations, or metric selections.
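
As a rough illustration of what a deterministic heuristic means here, the sketch below scores a plan purely from its structure, with no model calls. The function name and thresholds are illustrative assumptions, not one of the project's built-in metrics.

```python
# Hypothetical deterministic heuristic: same plan in, same score out, no model calls.
# The name and thresholds are assumptions, not Plan Evaluator built-ins.
def step_count_score(steps: list[str], min_steps: int = 3, max_steps: int = 12) -> float:
    """Return 1.0 when the plan's step count falls in the expected range, else 0.0."""
    return 1.0 if min_steps <= len(steps) <= max_steps else 0.0
```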

System at a glance

The system has three cooperating parts:

  • a CLI for scripted and batch workflows,
  • a FastAPI web app with an intuitive control-panel-style UI,
  • a background worker that processes queued jobs from SQLite.

This separation keeps workflows simple: configure and launch runs from the UI or CLI, let the worker execute evaluator jobs, and inspect results through artifacts and run summaries.
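
A minimal sketch of how such a worker might drain a SQLite-backed queue is shown below. The table schema, column names, and the run_evaluator_job stub are assumptions made for illustration, not the project's actual implementation.

```python
import sqlite3
import time


def run_evaluator_job(job_id: int) -> None:
    """Stand-in for the real evaluation step."""
    print(f"evaluating job {job_id}")


def claim_next_job(conn: sqlite3.Connection):
    """Atomically claim the oldest queued job, or return None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY created_at LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        return row[0]


def worker_loop(db_path: str = "plan_evaluator.db") -> None:
    conn = sqlite3.connect(db_path)
    while True:
        job_id = claim_next_job(conn)
        if job_id is None:
            time.sleep(1.0)  # idle poll while the queue is empty
            continue
        run_evaluator_job(job_id)
        with conn:
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```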

Transparency and reproducibility

Plan Evaluator is designed so every run can be audited and reproduced. Each run writes a dedicated artifact tree under runs/<run_id>/ containing the resolved config snapshot, evaluator outputs, logs, and run summary. The evaluator pipeline normalizes inputs into a canonical record shape before scoring, so the same inputs and config produce the same results for deterministic metrics.
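
A rough sketch of what that normalization step might look like appears below; the PlanRecord fields and the field_map convention are assumptions for illustration, not the project's canonical schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PlanRecord:
    """Illustrative canonical record shape; field names are assumptions."""
    plan_id: str
    steps: list[str]
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize(raw: dict[str, Any], field_map: dict[str, str]) -> PlanRecord:
    """Map dataset-specific keys onto the canonical shape before scoring."""
    return PlanRecord(
        plan_id=str(raw[field_map["plan_id"]]),
        steps=list(raw[field_map["steps"]]),
        metadata={k: v for k, v in raw.items() if k not in field_map.values()},
    )
```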

The system also emphasizes provenance; a minimal sketch of this bookkeeping follows the list:

  • run state and job state are tracked in SQLite,
  • per-evaluator outputs are persisted as append-only JSONL artifacts,
  • aggregate CSV output is generated from those persisted artifacts,
  • configuration and prompt/request hashes are recorded to support cross-run comparison.
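
Assuming hypothetical helper names and file layout (neither is taken from the project's code), that bookkeeping might look roughly like this:

```python
import hashlib
import json
from pathlib import Path


def config_hash(resolved_config: dict) -> str:
    """Stable hash of the resolved config, suitable for cross-run comparison."""
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_result(run_dir: Path, evaluator: str, result: dict) -> None:
    """Append one evaluator output as a JSONL line; earlier rows are never rewritten."""
    path = run_dir / f"{evaluator}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(result, sort_keys=True) + "\n")
```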

Secrets are handled separately from artifacts. API keys are read from environment variables at runtime and are not written into run outputs.
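
In code terms this amounts to something like the following, where the variable name is an assumption rather than a documented requirement:

```python
import os

# Hypothetical variable name; the project may expect a different one.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("LLM judging is enabled but no API key is set in the environment")
```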

Control-panel workflow for multi-dataset evaluations

The web UI is built to make repeated experimentation straightforward. You can upload and preview JSONL inputs, map fields, choose heuristic groups, enable or disable LLM judging, and select models in a single control panel. From there, launching multiple runs against different datasets or evaluator setups is quick because the same workflow and config shape are reused each time.
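
As an illustration of how that reuse plays out, the dictionaries below mimic two run configurations that differ only in the dataset and evaluator fields. Every key, heuristic name, and the model id is an assumption made for this sketch, not the project's actual config schema.

```python
# Hypothetical run-config shape; keys, heuristic names, and the model id are assumptions.
baseline_run = {
    "dataset": {"path": "data/plans_a.jsonl", "field_map": {"plan_id": "id", "steps": "plan"}},
    "evaluators": {"heuristics": ["step_count", "coverage"], "llm_judge": {"enabled": False}},
    "output_dir": "runs",
}

mixed_run = {
    **baseline_run,
    "dataset": {"path": "data/plans_b.jsonl", "field_map": {"plan_id": "id", "steps": "plan"}},
    "evaluators": {
        "heuristics": ["step_count", "coverage"],
        "llm_judge": {"enabled": True, "model": "gpt-4o-mini"},  # assumed model id
    },
}
```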

In practice, this makes it easy to:

  • run baseline heuristics-only evaluations on one dataset,
  • run mixed heuristics plus LLM evaluations on another dataset,
  • compare results across runs using a consistent artifact structure and summary outputs.

Quick start

  1. Create and activate a Python 3.12+ virtual environment.
  2. Install dependencies:
    • pip install -e .[dev]
  3. Start the API server:
    • plan-evaluator-api
  4. Start the worker process in another terminal:
    • plan-evaluator-worker
  5. Open http://127.0.0.1:8000.

CLI

  • plan-evaluator-cli validate-config --config path/to/config.yaml
  • plan-evaluator-cli run --config path/to/config.yaml
  • plan-evaluator-cli resume --run-id <run_id>
  • plan-evaluator-cli list-runs
  • plan-evaluator-cli show-run-summary --run-id <run_id>

Notes

  • Run artifacts are written under runs/<run_id>/.
  • API keys are read from environment variables and never persisted to run artifacts.
