61 changes: 47 additions & 14 deletions docs/benchmarks_and_metrics.md
@@ -64,36 +64,69 @@ Additional task documentation can be generated with the script `utils/generate-t

| Metrics Type | Metrics |
|-|-------------------------------|
| Completion Metrics | Accuracy
| Completion Metrics | Accuracy Completion
|| AidanBench |
|| **Text Similarity**
|| Bleu |
|| Linewise BLEU |
|| Response to Original BLEU |
|| Chrf |
|| Linewise chrF |
|| COMET |
|| Ter |
|| Linewise TER |
|| F1 |
|| Rouge 1 |
|| Rouge 2 |
|| Rouge-L |
|| Rouge-Geometric-Mean |
|| Exponential Similarity |
|| **Code Evaluation**
|| Code Assertion |
|| Code Execution Pass@1 |
|| **Language & Format**
|| Language Checker |
|| Length Checker |
|| Math Reasoning |
|| Placeholder Checker |
|| Text Counter |
|| German Completion Check |
|| Language Consistency |
|| Language Consistency Raw |
|| CSV Format |
|| JSON Format |
|| Postscript Format |
|| **Text Statistics & Control**
|| Length Checker |
|| Word Count |
|| Sentence Count |
|| Paragraph Count |
|| Response to Original Length Ratio |
|| Text Counter |
|| Word Repetition |
|| Placeholder Checker |
|| **Task-Specific**
|| Math Reasoning |
|| Concordance Index |
|| Custom IFEval Checker |
|| Custom CWE Checker |
|| Custom NIAH Checker |
|| Custom Grid Comparison Checker |
|| Repetition Checker |
|| StructMetric |
|| RenderableStructMetric |
| Loglikelihood Metrics | Accuracy Loglikelihood |
|| Normalized Accuracy Loglikelihood |
|| Accuracy Normalized Loglikelihood |
|| Confidence-weighted Accuracy |
|| Distributional Correctness Score |
|| Probability Mass |
| LLM Metrics | Chatbot Style Judge |
|| Completion Accuracy Judge
|| Conciseness Judge
|| Contains Names Judge
|| Instruction Judge
|| SQL Format
|| World Knowledge Judge
|| Probability Mass Normalized |
|| Ternary Score |
| LLM Judge Metrics | Chatbot Style |
|| Coherence |
|| Completion Accuracy |
|| Conciseness |
|| Avoids Names |
|| Format Correctness |
|| Instruction Following |
|| Pairwise Judgement (MTBench) |
|| Single Judgement (MTBench) |
|| Refusal Classifier |
|| SQL Quality |
|| World Knowledge |
| Efficiency Metrics | Bytes per Sequence Position |
1 change: 1 addition & 0 deletions docs/index.rst
@@ -67,6 +67,7 @@ Documentation Overview
controlling_upload_results.md
docker_guide.md
evaluate_huggingface_model.md
llm_as_judge_guide.md
loglikelihood_task_guide.md
model_arguments.md
overview_dataloading.md