Evaluation system for RAG submissions. Validates your search results and calculates recall metrics.
```bash
uv venv && uv sync
direnv allow  # or source .envrc
```
Evaluate your search results against the ground truth dataset.
```bash
uv run python -m moulinette evaluate_student_search_results \
    <student_results_path> \
    <dataset_path> \
    [--k K] \
    [--max_context_length MAX_LENGTH]
```
Arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| student_results_path | str | required | Path to student search results JSON |
| dataset_path | str | required | Path to ground truth dataset JSON |
| --k | int | 10 | Maximum number of sources per question |
| --max_context_length | int | 2000 | Maximum context length per source |
Example:
```bash
uv run python -m moulinette evaluate_student_search_results \
    ../data/output/search_results/dataset_code_public.json \
    ../data/datasets/AnsweredQuestions/dataset_code_public.json \
    --k 10 \
    --max_context_length 2000
```
Output:
- Validates student data format
- Calculates Recall@1, Recall@3, Recall@5, Recall@10 (see the sketch below)
- Returns `True` if Recall@5 >= 50%, `False` otherwise
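The matching rule is internal to the evaluator, but as a rough mental model Recall@k is the fraction of questions for which at least one of the top-k retrieved sources matches a ground-truth source. A minimal sketch, assuming a match means an overlapping character span in the same file (the moulinette's actual criterion may differ):

```python
# Sketch of Recall@k. Assumption: a "hit" is a retrieved span that overlaps
# a ground-truth span in the same file; the evaluator's real rule may differ.
def overlaps(retrieved: dict, truth: dict) -> bool:
    return (
        retrieved["file_path"] == truth["file_path"]
        and retrieved["first_character_index"] < truth["last_character_index"]
        and truth["first_character_index"] < retrieved["last_character_index"]
    )

def recall_at_k(results: list[dict], dataset: list[dict], k: int) -> float:
    # results: entries from "search_results"; dataset: entries from "rag_questions".
    truth_by_id = {q["question_id"]: q["sources"] for q in dataset}
    hits = 0
    for entry in results:
        truths = truth_by_id[entry["question_id"]]
        top_k = entry["retrieved_sources"][:k]
        if any(overlaps(r, t) for r in top_k for t in truths):
            hits += 1
    return hits / len(results) if results else 0.0
```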
Evaluate your generated answers (not yet implemented).
```bash
uv run python -m moulinette evaluate_student_answers <student_answer_path>
```
Passing thresholds:

| Dataset | Metric | Threshold |
|---------|--------|-----------|
| Code | Recall@5 | >= 50% |
| Docs | Recall@5 | >= 80% |
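A minimal sketch of the pass/fail rule these thresholds imply; the dataset keys are illustrative, not the evaluator's own identifiers:

```python
# Thresholds as listed in the table above; keys are illustrative names.
THRESHOLDS = {"code": 0.50, "docs": 0.80}

def passes(dataset_name: str, recall_at_5: float) -> bool:
    # True when Recall@5 meets the dataset's threshold.
    return recall_at_5 >= THRESHOLDS[dataset_name]
```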
Student search results format (the file passed as `student_results_path`):

```json
{
  "search_results": [
    {
      "question_id": "uuid",
      "question_str": "What is the question text?",
      "retrieved_sources": [
        {
          "file_path": "path/to/file",
          "first_character_index": 0,
          "last_character_index": 500
        }
      ]
    }
  ],
  "k": 10
}
```
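A minimal sketch of producing a file in this shape from your own retrieval output; `my_retrieved` and its layout are placeholders, not part of the moulinette:

```python
import json

# my_retrieved is a placeholder: it maps question_id -> (question_text,
# list of (file_path, start, end) spans ranked by relevance).
def write_search_results(my_retrieved: dict, out_path: str, k: int = 10) -> None:
    payload = {
        "search_results": [
            {
                "question_id": qid,
                "question_str": question,
                "retrieved_sources": [
                    {
                        "file_path": path,
                        "first_character_index": start,
                        "last_character_index": end,
                    }
                    for path, start, end in spans[:k]
                ],
            }
            for qid, (question, spans) in my_retrieved.items()
        ],
        "k": k,
    }
    with open(out_path, "w") as f:
        json.dump(payload, f, indent=2)
```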
Ground truth dataset format (the file passed as `dataset_path`):

```json
{
  "rag_questions": [
    {
      "question_id": "uuid",
      "question": "What is...",
      "answer": "The answer is...",
      "sources": [
        {
          "file_path": "path/to/file",
          "first_character_index": 0,
          "last_character_index": 500
        }
      ]
    }
  ]
}
```
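For driving your own retriever, the dataset can be read as plain JSON; a minimal sketch, assuming you only need the question id and text on the student side:

```python
import json

def load_questions(dataset_path: str):
    # Yields (question_id, question) pairs from the ground-truth dataset.
    # The answers and sources stay on the evaluator's side.
    with open(dataset_path) as f:
        dataset = json.load(f)
    for q in dataset["rag_questions"]:
        yield q["question_id"], q["question"]
```

These pairs can then be passed to your retriever and written out with a helper like the writer sketched after the search results format above.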