[WIP] Add experimental evaluations for timing and recall #10

Draft
Copilot wants to merge 1 commit into evals from copilot/add-experimental-evaluations

Copilot AI commented Dec 4, 2025

  • Add CSV export functionality to evaluation runner
  • Create experimental evaluation script with configurable parameters
  • Add plotting utilities for analyzing metrics (batch size vs latency, precision/recall)
  • Create end-to-end evaluation script with simple metric tracking
  • Add documentation for running experiments
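The CSV export and batch-size timing items in the checklist could look roughly like the sketch below. It is a minimal illustration using only the standard library, not the code in this PR: `fake_retrieve`, `time_batch_sizes`, `write_csv`, and the column names `batch_size`, `run`, and `latency_s` are all hypothetical, and the stand-in retriever would need to be replaced by the project's real retrieval call.

```python
import csv
import io
import time


def fake_retrieve(queries, batch_size):
    """Hypothetical stand-in for the real retrieval call (assumption)."""
    results = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        results.extend(q.upper() for q in batch)  # placeholder work only
    return results


def time_batch_sizes(queries, batch_sizes, repeats=3):
    """Time the retriever for each batch size; return one row per run."""
    rows = []
    for batch_size in batch_sizes:
        for run in range(repeats):
            started = time.perf_counter()
            fake_retrieve(queries, batch_size)
            latency_s = time.perf_counter() - started
            rows.append(
                {"batch_size": batch_size, "run": run, "latency_s": latency_s}
            )
    return rows


def write_csv(rows, handle):
    """Write timing rows to any file-like object as CSV with a header."""
    writer = csv.DictWriter(handle, fieldnames=["batch_size", "run", "latency_s"])
    writer.writeheader()
    writer.writerows(rows)


queries = [f"query {i}" for i in range(100)]
rows = time_batch_sizes(queries, batch_sizes=[1, 8, 32])

# An in-memory buffer keeps the sketch side-effect free; a real run would
# pass a file opened with open(path, "w", newline="") instead.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Keeping one row per run, rather than a pre-computed average, leaves the choice of aggregation (mean, median, percentiles) to the plotting step.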
Original prompt

Ok, forgetting about langsmith for a moment. I need an easy way to run some experimental evaluations with regard to timing and recall. I need to get all the data as CSV and be able to plot different values on a graph i.e. batch size with respect to latency and also recall/precision to find the sweet spot for retrieval values. I also want some simple metric evals to test the retrieval end-to-end so that I know if changes to the graph actually improve it.
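One way to look for the "sweet spot" described in the prompt is to compute precision and recall at several cutoffs `k` and compare them. The sketch below assumes each evaluation example is a pair of a ranked list of retrieved document IDs and a set of relevant IDs; the function names, the toy data, and the use of F1 as the tie-breaking score are illustrative assumptions, not something the PR specifies.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved IDs for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sweep_k(examples, k_values):
    """Average precision/recall across examples for each cutoff k."""
    rows = []
    for k in k_values:
        scores = [precision_recall_at_k(ret, rel, k) for ret, rel in examples]
        precision = sum(p for p, _ in scores) / len(scores)
        recall = sum(r for _, r in scores) / len(scores)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        rows.append({"k": k, "precision": precision, "recall": recall, "f1": f1})
    return rows


# Toy data (hypothetical): ranked retrieved IDs and the relevant set per query.
examples = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z", "w"], {"y"}),
]

sweep = sweep_k(examples, k_values=[1, 2, 4])
best = max(sweep, key=lambda row: row["f1"])

for row in sweep:
    print(row)
print("best k by F1:", best["k"])
```

The rows returned by `sweep_k` are plain dictionaries, so they can be written to CSV and plotted as a precision/recall curve against `k`.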

The user has attached the following file paths as relevant context:

  • test/agent/retrieval_evals
  • src/evals
[Chronological Review: The conversation began with the user expressing a need for an easy way to run experimental evaluations regarding timing and recall. The user specified the need for data in CSV format to plot various metrics, such as batch size against latency and recall/precision. The user also requested simple metric evaluations to test retrieval end-to-end. The user then attempted to delegate tasks to a cloud agent, indicating a desire to offload some work. The user repeated the delegation command multiple times, indicating a persistent issue with uncommitted changes that needed to be addressed before proceeding.]

[Intent Mapping:

  • "I need an easy way to run some experimental evaluations with regard to timing and recall."
  • "I need to get all the data as CSV and be able to plot different values on a graph i.e. batch size with respect to latency and also recall/precision."
  • "I also want some simple metric evals to test the retrieval end-to-end."
  • "Delegate to cloud agent" (repeatedly indicating a need for assistance).]
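For the end-to-end request above (knowing whether a change to the graph actually improves retrieval), one simple approach is to compare a baseline metrics CSV against a candidate one. The sketch below is an assumption about how that could work: the `metric,value` layout, the metric names, the numbers, and the 0.02 tolerance are all made up for illustration.

```python
import csv
import io

# Hypothetical metric files; real runs would read these from disk.
BASELINE_CSV = """metric,value
recall,0.70
precision,0.55
latency_s,1.20
"""

CANDIDATE_CSV = """metric,value
recall,0.78
precision,0.54
latency_s,1.05
"""

# Metrics where a smaller value is the better outcome (assumption).
LOWER_IS_BETTER = {"latency_s"}


def load_metrics(text):
    """Parse a metric,value CSV into a {name: float} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["metric"]: float(row["value"]) for row in reader}


def compare(baseline, candidate, tolerance=0.02):
    """Label each metric as improved, regressed, or unchanged."""
    report = {}
    for name, base_value in baseline.items():
        delta = candidate[name] - base_value
        if name in LOWER_IS_BETTER:
            delta = -delta  # flip sign so positive always means better
        if delta > tolerance:
            report[name] = "improved"
        elif delta < -tolerance:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report


report = compare(load_metrics(BASELINE_CSV), load_metrics(CANDIDATE_CSV))
print(report)
```

A tolerance band avoids flagging run-to-run noise as a regression; an appropriate value would have to be chosen from repeated runs of the real evaluation.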

[Technical Inventory: The conversation involves experimental evaluations, data handling (CSV), plotting metrics, and retrieval testing. The user is working within the OBP-Opey-II repository, specifically on the 'evals' branch.]

[Code Archaeology: No specific files or code changes were discussed in detail, but the user referenced the need for evaluations and metrics related to retrieval processes.]

[Progress Assessment: The user has articulated their requirements for evaluations but has not yet implemented any specific solutions or code changes. The delegation to the cloud agent indicates a desire to move forward but is currently stalled due to uncommitted changes.]

[Context Validation: The user’s goals for experimental evaluations, data collection, and plotting metrics are clearly articulated. The need for cloud assistance is also noted, along with the issue of uncommitted changes.]

[Recent Commands Analysis:

  • Last Agent Commands: The user issued commands to delegate tasks to a cloud agent and noted "Uncommitted changes detected."
  • Tool Results Summary: The system acknowledged the uncommitted changes but did not proceed with the delegation until those changes were addressed.
  • Pre-Summary State: The agent was actively trying to delegate tasks to the cloud agent but was unable to do so due to the presence of uncommitted changes.
  • Operation Context: The commands were executed in the context of preparing for experimental evaluations, indicating that the user is looking to streamline the process of running evaluations and gathering data.]
  1. Conversation Overview:
  • Primary Objectives: The user requested, "I need an easy way to run some experimental evaluations with regard to timing and recall," and specified the need for CSV data for plotting metrics related to batch size, latency, recall, and precision.
  • Session Context: The conversation focused on the user's need for experimental evaluations and the challenges faced in delegating tasks due to uncommitted changes.
  • User Intent Evolution: The user’s intent shifted from outlining requirements for evaluations to attempting to delegate tasks to a cloud agent, indicating a need for assistance in executing their plans.
  2. Technical Foundation:
  • Core Technology: The conversation revolves around experimental evaluations and data handling, specifically using CSV for data output.
  • Framework/Library: No specific frameworks or libraries were mentioned, but the context suggests a need for data visualization tools for plotting metrics.
  • Architectural Pattern: The user is looking for a structured approach to evaluate retrieval performance and gather metrics.
  • Environment Detail: The user is working within the OBP-Opey-II repository on the 'evals' branch.
  3. Codebase Status:
  • File Name: Not specified in the conversation.
  • Purpose: The user aims to implement evaluations and metrics related to retrieval processes.
  • Current State: No specific code changes were discussed, but the user is preparing to implement evaluations.
  • Key Code Segments: Not applicable as no code was provided.
  • Dependencies: The user’s work is dependent on resolving uncommitted changes before proceeding.
  4. Problem Resolution:
  • Issues Encountered: The user faced issues with uncommitted changes that prevented task delegation.
  • Solutions Implemented: No solu...

Created from VS Code via the GitHub Pull Request extension.



2 participants