Skip to content

feat(ragas): RAGAS evaluation — LLM-as-judge for RAG pipelines#70

Merged
lianghsun merged 1 commit intomainfrom
feat/ragas
Mar 26, 2026
Merged

feat(ragas): RAGAS evaluation — LLM-as-judge for RAG pipelines#70
lianghsun merged 1 commit intomainfrom
feat/ragas

Conversation

@lianghsun
Copy link
Member

Summary

  • RAGASExtractor: parses judge LLM's JSON response for 4 metric scores
  • RAGASScorer: threshold-based scoring (avg of faithfulness, answer_relevancy, context_precision, context_recall >= 0.5)
  • Consolidated judge prompt: 1 LLM call per sample (vs official RAGAS's 6-8 multi-step calls), ~1/6 API cost
  • Example dataset: 10 rows from explodinggradients/WikiEval (5 good + 3 ungrounded + 2 poor answers)
  • Config template: config.ragas.template.yaml
  • Tests: 36 tests covering extractor, scorer, presets, example dataset
  • Docs: docs/evals/ragas.md

Key Design Decision

RAGAS is fundamentally different from other benchmarks — it's a metric framework that requires LLM-as-judge. The LLM in config.yaml serves as the judge (not the model being evaluated). The dataset contains pre-assembled judge prompts with pre-existing RAG outputs.

Judge Results (Devstral-Small-2-24B as judge)

Answer Type Faithfulness Relevancy Precision Recall Average
good (×5) 1.00 1.00 1.00 1.00 1.00
ungrounded (×3) 0.00–0.80 0.00–0.90 1.00 1.00 0.50–0.93
poor (×2) 0.50 0.70 1.00 1.00 0.80

Judge correctly distinguishes good/ungrounded/poor answers.

Test plan

  • python3 -m pytest tests/test_ragas.py -v — 36 passed
  • python3 -m pytest tests/ -v — 324 passed, 2 pre-existing failures, 1 skipped
  • Actual judge evaluation with Devstral 24B on WikiEval dataset

Closes #63, #64, #65, #66, #67, #68

🤖 Generated with Claude Code

- RAGASExtractor: parses judge LLM JSON response (4 metric scores)
- RAGASScorer: threshold-based scoring on faithfulness, answer_relevancy,
  context_precision, context_recall (avg >= 0.5 default)
- Consolidated judge prompt: 1 LLM call per sample (vs official 6-8)
- Example dataset: 10 rows from WikiEval (5 good, 3 ungrounded, 2 poor)
- Config template: config.ragas.template.yaml
- Tests: 36 tests covering extractor, scorer, presets, example dataset
- Docs: docs/evals/ragas.md with Devstral 24B judge results

Closes #63, closes #64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-project-automation github-project-automation bot moved this to Backlog in Backlog Mar 26, 2026
@lianghsun lianghsun merged commit f7bade6 into main Mar 26, 2026
2 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in Backlog Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

feat(ragas): prepare example dataset from RAGAS official

1 participant