[WIP] Add experimental evaluations for timing and recall #10
Draft
Conversation
Original prompt
Ok, forgetting about langsmith for a moment. I need an easy way to run some experimental evaluations with regard to timing and recall. I need to get all the data as CSV and be able to plot different values on a graph i.e. batch size with respect to latency and also recall/precision to find the sweet spot for retrieval values. I also want some simple metric evals to test the retrieval end-to-end so that I know if changes to the graph actually improve it.
The user has attached the following file paths as relevant context:
- test/agent/retrieval_evals
- src/evals
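The prompt above asks for timing data per batch size, exported as CSV for plotting. A minimal sketch of such a harness is below; `retrieve` is a hypothetical callable standing in for the repository's actual retrieval entry point (the real function name and signature are not specified in this PR):

```python
import csv
import statistics
import time

def time_batches(retrieve, queries, batch_sizes, out_path="latency_by_batch.csv"):
    """Time a retrieval callable at several batch sizes and write results as CSV.

    retrieve: hypothetical callable taking a list of query strings.
    queries: full list of evaluation queries.
    batch_sizes: batch sizes to sweep over.
    """
    rows = []
    for size in batch_sizes:
        # Split the query list into batches of the current size.
        batches = [queries[i:i + size] for i in range(0, len(queries), size)]
        latencies = []
        for batch in batches:
            start = time.perf_counter()
            retrieve(batch)
            latencies.append(time.perf_counter() - start)
        rows.append({
            "batch_size": size,
            "mean_latency_s": statistics.mean(latencies),
            # Nearest-rank p95; fine for the small sample counts used here.
            "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The resulting CSV has one row per batch size, so batch size vs. latency can be plotted directly with any tool that reads CSV.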
[Chronological Review: The conversation began with the user requesting an easy way to run experimental evaluations of timing and recall. The user asked for the data as CSV so that different values could be plotted, such as batch size against latency and recall/precision, and also requested simple metric evaluations to test retrieval end-to-end. The user then attempted to delegate tasks to a cloud agent, repeating the delegation command multiple times because uncommitted changes had to be resolved before proceeding.]
[Intent Mapping:
[Technical Inventory: The conversation involves experimental evaluations, data handling (CSV), plotting metrics, and retrieval testing. The user is working within the OBP-Opey-II repository, specifically on the 'evals' branch.]
[Code Archaeology: No specific files or code changes were discussed in detail, but the user referenced the need for evaluations and metrics related to retrieval processes.]
[Progress Assessment: The user has articulated their requirements for evaluations but has not yet implemented any specific solutions or code changes. The delegation to the cloud agent indicates a desire to move forward, but progress is currently stalled by uncommitted changes.]
[Context Validation: The user’s goals for experimental evaluations, data collection, and plotting metrics are clearly articulated. The need for cloud assistance is also noted, along with the issue of uncommitted changes.]
[Recent Commands Analysis:
1. Conversation Overview:
- Primary Objectives: The user requested, "I need an easy way to run some experimental evaluations with regard to timing and recall," and specified the need for CSV data for plotting metrics related to batch size, latency, recall, and precision.
- Session Context: The conversation focused on the user's need for experimental evaluations and the challenges faced in delegating tasks due to uncommitted changes.
- User Intent Evolution: The user's intent shifted from outlining requirements for evaluations to attempting to delegate tasks to a cloud agent, indicating a need for assistance in executing their plans.
- Technical Foundation:
- Core Technology: The conversation revolves around experimental evaluations and data handling, specifically using CSV for data output.
- Framework/Library: No specific frameworks or libraries were mentioned, but the context suggests a need for data visualization tools for plotting metrics.
- Architectural Pattern: The user is looking for a structured approach to evaluate retrieval performance and gather metrics.
- Environment Detail: The user is working within the OBP-Opey-II repository on the 'evals' branch.
- Codebase Status:
- File Name: Not specified in the conversation.
- Purpose: The user aims to implement evaluations and metrics related to retrieval processes.
- Current State: No specific code changes were discussed, but the user is preparing to implement evaluations.
- Key Code Segments: Not applicable as no code was provided.
- Dependencies: The user’s work is dependent on resolving uncommitted changes before proceeding.
- Problem Resolution:
- Issues Encountered: The user faced issues with uncommitted changes that prevented task delegation.
- Solutions Implemented: No solu...
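The end-to-end retrieval metrics requested in the original prompt (recall/precision to find the retrieval "sweet spot") can be sketched as a per-query precision@k / recall@k computation. This assumes gold relevant-document IDs exist for each evaluation query, which the PR does not yet specify:

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for a single query.

    retrieved_ids: ranked list of document IDs returned by the retriever.
    relevant_ids: set of gold relevant document IDs (assumed to exist).
    k: cutoff for the ranked list.
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Sweeping k over a range and averaging these per-query values yields one CSV row per k, which pairs naturally with the latency sweep when looking for the retrieval sweet spot.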
Created from VS Code via the GitHub Pull Request extension.