Profile DPO preference datasets before training. Catches data quality issues that silently degrade RLHF performance.
## Checks

| Check | What it detects |
|---|---|
| Length bias | Whether "chosen" responses are systematically longer (a known reward-hacking signal) |
| Embedding similarity | Low-contrast pairs where chosen and rejected are nearly identical |
| Easy pairs | Trivially distinguishable pairs that add little learning signal |
| Prompt coverage | Clusters prompts to find underrepresented topic areas |
| Duplicates | Exact and near-duplicate prompts via embedding cosine similarity |
| LLM judge | Optional GPT/open-model scoring to find mislabeled or trivial pairs |
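As a concrete illustration of the first two checks, here is a minimal sketch of how length bias and pair contrast can be measured. This is not preflight's actual implementation; it assumes HuggingFace-style records with plain-string `chosen`/`rejected` fields.

```python
import json
import statistics

import numpy as np
from sentence_transformers import SentenceTransformer


def length_bias(path: str) -> tuple[float, float]:
    """Return (mean char-length gap, fraction of pairs where chosen is longer).

    A consistently positive gap means "chosen" correlates with verbosity,
    which reward models readily exploit.
    """
    gaps = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)  # assumes string-valued chosen/rejected
            gaps.append(len(row["chosen"]) - len(row["rejected"]))
    return statistics.mean(gaps), sum(g > 0 for g in gaps) / len(gaps)


def pair_similarity(chosen: list[str], rejected: list[str]) -> np.ndarray:
    """Per-pair cosine similarity; values near 1.0 flag low-contrast pairs."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # preflight's default embedding model
    a = model.encode(chosen, normalize_embeddings=True)
    b = model.encode(rejected, normalize_embeddings=True)
    return (a * b).sum(axis=1)  # dot product of unit vectors = cosine
```

Pairs with similarity above roughly 0.95 carry almost no preference signal and are good candidates for manual review or removal.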
## Installation

```bash
pip install .

# With LLM judge support:
pip install ".[judge]"
```

## Usage

```bash
preflight audit dataset.jsonl
```

This produces a `preflight_report.json` with per-check results and actionable recommendations, plus a printed summary.
## Options

```
--output PATH             Output JSON path (default: preflight_report.json)
--n-clusters N            Number of prompt clusters for coverage analysis (default: 8)
--embedding-model NAME    Sentence-transformers model (default: all-MiniLM-L6-v2)
--judge MODEL             Enable LLM judge scoring (e.g. gpt-4o-mini)
--judge-base-url URL      Custom API base URL for judge
--judge-api-key KEY       API key for judge (or set OPENAI_API_KEY)
```
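Flags can be combined. For example, a run with judge scoring enabled (API key read from the environment) and a finer-grained coverage analysis might look like:

```bash
export OPENAI_API_KEY=sk-...
preflight audit dataset.jsonl \
  --judge gpt-4o-mini \
  --n-clusters 16 \
  --output judged_report.json
```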
## Supported formats

- HuggingFace (`prompt`/`chosen`/`rejected` keys, string or message-list values)
- Together AI (`input.messages`/`preferred_output`/`non_preferred_output`)
- Anthropic HH (`chosen`/`rejected` with `\n\nHuman:`/`\n\nAssistant:` turns)
Format is auto-detected from the first line.
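For illustration, a single record in each format might look like the lines below (field names come from the list above; the values and message structure are made-up examples):

```jsonl
{"prompt": "What is DPO?", "chosen": "Direct Preference Optimization trains...", "rejected": "I'm not sure."}
{"input": {"messages": [{"role": "user", "content": "What is DPO?"}]}, "preferred_output": [{"role": "assistant", "content": "Direct Preference Optimization trains..."}], "non_preferred_output": [{"role": "assistant", "content": "I'm not sure."}]}
{"chosen": "\n\nHuman: What is DPO?\n\nAssistant: Direct Preference Optimization trains...", "rejected": "\n\nHuman: What is DPO?\n\nAssistant: I'm not sure."}
```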
## Example

```bash
preflight audit examples/sample_hf.jsonl --output report.json
```

See `experiments/dpo_validation.ipynb` for a reproducible experiment showing that filtering preflight-flagged pairs from UltraFeedback improves DPO reward accuracy on a held-out eval set (Qwen2.5-0.5B, LoRA, Colab T4).
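The filtering step in that experiment can be approximated in a few lines, again under the hypothetical report schema sketched earlier (each check listing the row indices it flagged):

```python
import json

with open("preflight_report.json") as f:
    report = json.load(f)

# Union of row indices flagged by any check (hypothetical schema).
flagged = {i for check in report["checks"] for i in check["flagged"]}

# Keep only rows that passed every check.
with open("dataset.jsonl") as src, open("dataset.filtered.jsonl", "w") as dst:
    for i, line in enumerate(src):
        if i not in flagged:
            dst.write(line)
```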
## License

MIT