Build an AI evaluator to rank the responses provided by our RAG system.

Here is a very crude prompt I used to evaluate the query response and the concept response for a query about Patch Tuesday.

We are evaluating a semantic search capability using two different methods. Compare the following two responses to the user's query and rate the completeness of each answer on a scale of 1-5.

Here is the user query: 
Summarize the Patch Tuesday releases from 2025

Response 1:
The information on the Patch Tuesday releases for 2025 includes notable updates from January 2025, where Microsoft issued a significant number of patches, totaling 159 CVEs. This marked one of the highest counts of patches seen in recent years, with 10 critical vulnerabilities and eight zero-days addressed. The most prevailing risk type was remote code execution (RCE), accounting for 36% of the issues. The updates indicated serious vulnerabilities that posed risks of unauthorized access and potential disruptions if exploited. Among them, Microsoft Office Access also received patches for three remote code execution vulnerabilities. Regarding the overall patch management approach, it was noted that Microsoft was implementing a new cycle that would reduce the number of required restarts for Windows updates from 12 to four per year through planned hotpatch updates. No further detailed summaries for subsequent months or other specific Patch Tuesdays in 2025 were provided in the relevant information.

Response 2:
The information regarding Patch Tuesday releases for 2025 is not available in the relevant details provided.
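For the evaluator, the hand-written prompt above could be assembled from a template so the same wording is reused for every query. The following is only a minimal sketch: `PROMPT_TEMPLATE` and `build_eval_prompt` are hypothetical names, not existing code in this project, and the template generalizes the wording to any number of responses.

```python
# Hypothetical template mirroring the crude prompt above.
PROMPT_TEMPLATE = """\
We are evaluating a semantic search capability using two different methods.
Compare the following responses to the user's query and rate the completeness
of each answer on a scale of 1-5.

Here is the user query:
{query}

{responses}"""


def build_eval_prompt(query: str, responses: list[str]) -> str:
    """Fill the template with the query and the numbered responses."""
    numbered = "\n\n".join(
        f"Response {i}:\n{text.strip()}"
        for i, text in enumerate(responses, start=1)
    )
    return PROMPT_TEMPLATE.format(query=query.strip(), responses=numbered)
```

Calling `build_eval_prompt(query, [query_response, concept_response])` would produce text in the same shape as the prompt shown here, ready to send to whichever model ends up acting as the judge.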

-----
Here is what ChatGPT said:
Here’s a comparative evaluation of the two responses based on how completely they address the user’s query:

⸻

User Query:

“Summarize the Patch Tuesday releases from 2025.”

⸻

Response 1 – Evaluation: 4/5

Strengths:
	•	Provides detailed information about the January 2025 release.
	•	Includes specific metrics: 159 CVEs, 10 critical, 8 zero-days, and 36% RCEs.
	•	Mentions a notable change in update policy: reduced restarts via hotpatching.
	•	Cites an example (Microsoft Access) to show types of patched vulnerabilities.

Weaknesses:
	•	Only covers January 2025 explicitly.
	•	States that no further detailed summaries were available, but doesn’t attempt to generalize trends or highlight any ongoing cadence for the rest of 2025.
	•	Does not clarify whether this is due to a lack of data or an incomplete search.

Rating: ★★★★☆ (4/5)
This is a strong, informative answer that addresses the question partially but meaningfully. It would earn a 5 if it covered more than just one month or clarified the status of the rest of the year more definitively.

⸻

Response 2 – Evaluation: 1/5

Strengths:
	•	Direct and honest: indicates no information was found.

Weaknesses:
	•	Offers no summary at all, despite at least some data being available (e.g., January 2025).
	•	Does not attempt to inform the user of any context, even to say that only limited data is available.
	•	Leaves the query completely unaddressed.

Rating: ★☆☆☆☆ (1/5)
This response is unhelpful and incomplete for the user’s query. It could have at least offered a partial summary like Response 1 or explained why no data was available.

⸻

Summary:
	•	Response 1 provides a solid partial answer: 4/5
	•	Response 2 fails to address the question in a meaningful way: 1/5
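To turn output like this into a ranking, the evaluator would need to pull the numeric scores out of the judge's text. The sketch below assumes the judge keeps using headings of the form "Response N – Evaluation: X/5" as in the transcript above; `parse_scores` and `rank_responses` are hypothetical helper names.

```python
import re

# Matches headings such as "Response 1 – Evaluation: 4/5"
# (either a plain hyphen or an en dash between the two parts).
SCORE_RE = re.compile(r"Response\s+(\d+)\s*[-–]\s*Evaluation:\s*([1-5])\s*/\s*5")


def parse_scores(judge_output: str) -> dict[int, int]:
    """Map response number -> score for every evaluation heading found."""
    return {int(n): int(score) for n, score in SCORE_RE.findall(judge_output)}


def rank_responses(scores: dict[int, int]) -> list[int]:
    """Return response numbers from highest to lowest score (ties by number)."""
    return sorted(scores, key=lambda n: (-scores[n], n))
```

Parsing free text this way is brittle, since the judge may change its formatting between runs. Asking the judge to return the scores in a fixed structure (for example JSON) is one possible way to make this step more reliable.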

Build an AI evaluator to rank the responses provided by our RAG system. #26
