[WIP] Add experimental evaluations for timing and recall analysis by Copilot · Pull Request #8 · OpenBankProject/OBP-Opey-II

Copilot · 2025-12-04T08:41:22Z

Experimental Evaluation Framework for Retrieval System

Plan

Explore repository structure and understand existing evaluation code
Identify requirements: CSV export, plotting capabilities, batch size vs latency analysis, recall/precision analysis, end-to-end metric evals
Create CSV export functionality for evaluation results
Create plotting utilities for visualizing metrics (batch size vs latency, precision/recall)
Create batch size experiment runner that exports to CSV
Create end-to-end evaluation script with simple metrics
Add visualization scripts for graphing results
Update documentation with usage examples
Test the complete workflow
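
As an illustration of the recall/precision and CSV export items in the plan, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the function names, the document IDs, and the choice of `k` as the swept retrieval parameter are all assumptions made for the example.

```python
import csv
import io


def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sweep_k(retrieved, relevant, k_values):
    """Build one CSV-ready row per k value."""
    rows = []
    for k in k_values:
        precision, recall = precision_recall_at_k(retrieved, relevant, k)
        rows.append(
            {"k": k, "precision": round(precision, 4), "recall": round(recall, 4)}
        )
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text; write the string to a file to keep it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["k", "precision", "recall"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical ranked results and ground truth for a single query.
retrieved_ids = ["d1", "d7", "d3", "d9", "d4"]
relevant_ids = {"d1", "d3", "d4"}

rows = sweep_k(retrieved_ids, relevant_ids, k_values=[1, 3, 5])
print(rows_to_csv(rows))
```

With one row per `k`, the resulting CSV can be loaded into any plotting tool to look for the point where recall stops improving while precision keeps falling.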

Original prompt

Ok, forgetting about langsmith for a moment. I need an easy way to run some experimental evaluations with regard to timing and recall. I need to get all the data as CSV and be able to plot different values on a graph i.e. batch size with respect to latency and also recall/precision to find the sweet spot for retrieval values. I also want some simple metric evals to test the retrieval end-to-end so that I know if changes to the graph actually improve it.

The user has attached the following file paths as relevant context:

test/agent/retrieval_evals

src/evals
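
The timing half of the request above (batch size with respect to latency) could be measured with a sweep along the following lines. This is a hedged sketch, not the repository's implementation: `fake_process_batch` is a placeholder for whatever retrieval or embedding call is being timed, and the column names `batch_size` and `latency_s` are invented for the example.

```python
import csv
import statistics
import sys
import time


def time_batches(process_batch, items, batch_size, repeats=3):
    """Return the median wall-clock seconds to process all items at one batch size."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(items), batch_size):
            process_batch(items[i:i + batch_size])
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def latency_sweep(process_batch, items, batch_sizes, repeats=3):
    """Time the same workload once per batch size and collect CSV-ready rows."""
    rows = []
    for batch_size in batch_sizes:
        latency = time_batches(process_batch, items, batch_size, repeats)
        rows.append({"batch_size": batch_size, "latency_s": latency})
    return rows


def write_csv(rows, handle):
    """Write the sweep rows to any open text handle (file or stdout)."""
    writer = csv.DictWriter(handle, fieldnames=["batch_size", "latency_s"])
    writer.writeheader()
    writer.writerows(rows)


def fake_process_batch(batch):
    # Placeholder workload standing in for the real retrieval call (assumption).
    return [len(text) for text in batch]


documents = [f"document {n}" for n in range(200)]
results = latency_sweep(fake_process_batch, documents, batch_sizes=[8, 32, 128])
write_csv(results, sys.stdout)
```

Taking the median over a few repeats is one simple way to damp timing noise; a real run would likely need more repeats and a warm-up pass before the numbers are worth plotting.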

Created from VS Code via the GitHub Pull Request extension.

Co-authored-by: nemozak1 <60814891+nemozak1@users.noreply.github.com>

Initial plan

9c11c44

Copilot AI assigned Copilot and nemozak1 Dec 4, 2025

Copilot started work on behalf of nemozak1 December 4, 2025 08:41

Initial exploration of repository structure

64a2dfb

Co-authored-by: nemozak1 <60814891+nemozak1@users.noreply.github.com>

Copilot AI requested a review from nemozak1 December 4, 2025 08:48

Copilot stopped work on behalf of nemozak1 due to an error December 4, 2025 08:48
Copilot has encountered an error. See logs for additional details.

Copilot wants to merge 2 commits into evals from copilot/add-evaluation-experimentation

2 participants
