feat(ragas): RAGAS evaluation — LLM-as-judge for RAG pipelines by lianghsun · Pull Request #70 · ai-twinkle/Eval

lianghsun · 2026-03-26T14:28:56Z

Summary

RAGASExtractor: parses judge LLM's JSON response for 4 metric scores
RAGASScorer: threshold-based scoring (avg of faithfulness, answer_relevancy, context_precision, context_recall >= 0.5)
Consolidated judge prompt: 1 LLM call per sample (vs official RAGAS's 6-8 multi-step calls), ~1/6 API cost
Example dataset: 10 rows from explodinggradients/WikiEval (5 good + 3 ungrounded + 2 poor answers)
Config template: config.ragas.template.yaml
Tests: 36 tests covering extractor, scorer, presets, example dataset
Docs: docs/evals/ragas.md

Key Design Decision

RAGAS is fundamentally different from other benchmarks — it's a metric framework that requires LLM-as-judge. The LLM in config.yaml serves as the judge (not the model being evaluated). The dataset contains pre-assembled judge prompts with pre-existing RAG outputs.

Judge Results (Devstral-Small-2-24B as judge)

Answer Type	Faithfulness	Relevancy	Precision	Recall	Average
good (×5)	1.00	1.00	1.00	1.00	1.00
ungrounded (×3)	0.00–0.80	0.00–0.90	1.00	1.00	0.50–0.93
poor (×2)	0.50	0.70	1.00	1.00	0.80

Judge correctly distinguishes good/ungrounded/poor answers.

Test plan

python3 -m pytest tests/test_ragas.py -v — 36 passed
python3 -m pytest tests/ -v — 324 passed, 2 pre-existing failures, 1 skipped
Actual judge evaluation with Devstral 24B on WikiEval dataset

Closes #63, #64, #65, #66, #67, #68

🤖 Generated with Claude Code

- RAGASExtractor: parses judge LLM JSON response (4 metric scores) - RAGASScorer: threshold-based scoring on faithfulness, answer_relevancy, context_precision, context_recall (avg >= 0.5 default) - Consolidated judge prompt: 1 LLM call per sample (vs official 6-8) - Example dataset: 10 rows from WikiEval (5 good, 3 ungrounded, 2 poor) - Config template: config.ragas.template.yaml - Tests: 36 tests covering extractor, scorer, presets, example dataset - Docs: docs/evals/ragas.md with Devstral 24B judge results Closes #63, closes #64 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-project-automation bot added this to Backlog Mar 26, 2026

github-project-automation bot moved this to Backlog in Backlog Mar 26, 2026

lianghsun merged commit f7bade6 into main Mar 26, 2026
2 checks passed

github-project-automation bot moved this from Backlog to Done in Backlog Mar 26, 2026

This was referenced Mar 26, 2026

feat(ragas): score comparison vs ragas official library #65

Closed

feat(ragas): speed benchmark vs ragas official library #66

Closed

feat(ragas): write docs/evals/ragas.md #67

Closed

feat(ragas): write tests/test_ragas.py #68

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ragas): RAGAS evaluation — LLM-as-judge for RAG pipelines#70

feat(ragas): RAGAS evaluation — LLM-as-judge for RAG pipelines#70
lianghsun merged 1 commit intomainfrom
feat/ragas

lianghsun commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lianghsun commented Mar 26, 2026

Summary

Key Design Decision

Judge Results (Devstral-Small-2-24B as judge)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant