
[WIP] Add experimental evaluations for timing and recall analysis#8

Draft
Copilot wants to merge 2 commits into evals from copilot/add-evaluation-experimentation
Copilot AI commented Dec 4, 2025

Experimental Evaluation Framework for Retrieval System

Plan

  • Explore repository structure and understand existing evaluation code
  • Identify requirements: CSV export, plotting capabilities, batch size vs latency analysis, recall/precision analysis, end-to-end metric evals
  • Create CSV export functionality for evaluation results
  • Create plotting utilities for visualizing metrics (batch size vs latency, precision/recall)
  • Create batch size experiment runner that exports to CSV
  • Create end-to-end evaluation script with simple metrics
  • Add visualization scripts for graphing results
  • Update documentation with usage examples
  • Test the complete workflow
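
As a rough illustration of the batch size experiment runner and CSV export steps in the plan above, the sketch below times a retrieval callable at several batch sizes and serializes the rows as CSV. It is not the code in this PR: `retrieve_batch`, `time_batches`, `rows_to_csv_text`, and the column names are hypothetical, and the retrieval function is assumed to accept a list of query strings.

```python
import csv
import io
import statistics
import time
from typing import Callable, Iterable, Sequence


def time_batches(
    retrieve_batch: Callable[[Sequence[str]], object],  # hypothetical retrieval entry point
    queries: Sequence[str],
    batch_sizes: Iterable[int],
    repeats: int = 3,
) -> list[dict]:
    """Measure wall-clock latency of retrieve_batch for each batch size."""
    if not queries:
        raise ValueError("queries must not be empty")
    rows = []
    for batch_size in batch_sizes:
        if batch_size < 1:
            raise ValueError("batch sizes must be >= 1")
        # If fewer queries than batch_size are available, the batch is truncated
        # and the actual size is what gets recorded.
        batch = list(queries[:batch_size])
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            retrieve_batch(batch)
            timings.append(time.perf_counter() - start)
        median = statistics.median(timings)
        rows.append(
            {
                "batch_size": len(batch),
                "median_latency_s": median,
                "latency_per_query_s": median / len(batch),
            }
        )
    return rows


def rows_to_csv_text(rows: list[dict]) -> str:
    """Serialize result rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the returned text to a file gives a CSV that any plotting tool can read, with `batch_size` on the x-axis and either latency column on the y-axis. The median is used here only to dampen outliers across repeats; a mean or percentile would work equally well.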
Original prompt

Ok, forgetting about langsmith for a moment. I need an easy way to run some experimental evaluations with regard to timing and recall. I need to get all the data as CSV and be able to plot different values on a graph i.e. batch size with respect to latency and also recall/precision to find the sweet spot for retrieval values. I also want some simple metric evals to test the retrieval end-to-end so that I know if changes to the graph actually improve it.

The user has attached the following file paths as relevant context:

  • test/agent/retrieval_evals
  • src/evals
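
The recall/precision trade-off mentioned in the prompt can be sketched as a sweep over the number of retrieved documents, k. The example below is an assumption about how such a metric might be computed, not the implementation in this PR: the function names are hypothetical, document IDs are assumed to be unique hashable values, and F1 is used only as one possible definition of the "sweet spot".

```python
from typing import Hashable, Iterable, Sequence


def precision_recall_at_k(
    retrieved: Sequence[Hashable],
    relevant: Iterable[Hashable],
    k: int,
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query's ranked results."""
    if k < 1:
        raise ValueError("k must be >= 1")
    top_k = list(retrieved)[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def sweep_k(
    retrieved: Sequence[Hashable],
    relevant: Iterable[Hashable],
    ks: Iterable[int],
) -> list[dict]:
    """Compute precision, recall, and F1 for each candidate k."""
    relevant_set = set(relevant)
    rows = []
    for k in ks:
        precision, recall = precision_recall_at_k(retrieved, relevant_set, k)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        rows.append({"k": k, "precision": precision, "recall": recall, "f1": f1})
    return rows
```

The rows have the same dictionary shape as a CSV record, so they can be exported and plotted with k on the x-axis. Comparing these numbers on a fixed query set before and after a change to the graph gives a simple end-to-end signal of whether retrieval improved.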

Created from VS Code via the GitHub Pull Request extension.


Co-authored-by: nemozak1 <60814891+nemozak1@users.noreply.github.com>