Describe the feature
To evaluate search quality, we need to collect or generate a dataset. For the retrieval part, each entry consists of a query and its relevant documents (and may also contain negative examples). Beyond retrieval, we can assess the full RAG pipeline as well; that requires a dataset of question-answer pairs and an LLM judge.
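For illustration, a single entry might look like the sketch below; the field names (`relevant_docs`, `negative_docs`, `answer`) are assumptions, not a fixed schema:

```python
# Hypothetical dataset entry covering both the retrieval and the RAG case.
# Field names and values are illustrative only, not a prescribed schema.
entry = {
    "query": "How do I configure the indexing pipeline?",
    "relevant_docs": ["docs/indexing.md", "docs/configuration.md"],  # positives for retrieval metrics
    "negative_docs": ["docs/changelog.md"],                          # optional hard negatives
    "answer": "Indexing is configured via ...",                      # reference answer for the LLM judge
}
```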
- Collect a dataset of queries and their relevant documents. It may also fix the subset of documents involved in the search run, so that scores stay stable across different runs.
- Calculate offline metrics: HitRate@10 and MeanAveragePrecision@10 (see the sketch after this list).
- Save the metrics somewhere with a description: for example in repository issues, releases, or discussions.
- Iterate on improving the search pipeline and extending the collected dataset.
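A minimal sketch of how the two metrics could be computed, assuming each dataset entry exposes `query` and `relevant_docs` fields and that `search(query, k)` stands in for the pipeline under evaluation (both names are assumptions, not existing code):

```python
from typing import Callable, Iterable, Sequence


def hit_rate_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 10) -> float:
    """1.0 if at least one relevant document appears in the top-k results."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def average_precision_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 10) -> float:
    """Average precision over the top-k results, normalised by min(|relevant|, k)."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0


def evaluate(dataset: Iterable[dict], search: Callable[[str, int], list[str]], k: int = 10) -> dict:
    """Run the search pipeline over every query and aggregate both metrics."""
    hit_rates, aps = [], []
    for entry in dataset:
        retrieved = search(entry["query"], k)
        relevant = set(entry["relevant_docs"])
        hit_rates.append(hit_rate_at_k(retrieved, relevant, k))
        aps.append(average_precision_at_k(retrieved, relevant, k))
    return {
        f"HitRate@{k}": sum(hit_rates) / len(hit_rates),
        f"MeanAveragePrecision@{k}": sum(aps) / len(aps),
    }
```

The resulting dictionary could then be posted with a short description to an issue, release, or discussion, per the step above.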
Suggested solution
- Wiki: Relevance metrics
- Weaviate: article about metrics
- HuggingFace: RAG evaluation cookbook
Additional context
No response