Labels
enhancement (New feature or request), priority-medium (Important but not blocking), testing (Testing and evaluation)
Description
Create a benchmark task suite that automatically evaluates the copilot's success or failure on standard microscopy workflows.
Requirements
1. Benchmark Categories
- Navigation: "Move to embryo X", "Center the brightest embryo"
- Acquisition: "Acquire a volume", "Start timelapse for 1 hour"
- Analysis: "Find the hatching embryo", "Measure embryo sizes"
- Multi-step: "Calibrate all embryos and start timelapse"
- Error Recovery: "Handle stage limit error", "Recover from failed detection"
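As a sketch of how one task per category might be declared in YAML (per the Technical Approach below), assuming a schema along these lines; the field names and file layout here are illustrative, not fixed:

```yaml
# benchmarks/tasks/navigation_center_brightest.yaml (hypothetical layout)
id: nav-002
category: navigation
prompt: "Center the brightest embryo"
optimal_tool_calls: 3          # hand-annotated, used for efficiency scoring
success_criteria:
  final_state:
    stage_centered_on: brightest_embryo
timeout_s: 120
```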
2. Evaluation Metrics
- Task completion (binary + partial credit)
- Tool call efficiency (optimal vs actual)
- Time to completion
- Error handling quality
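A minimal scoring sketch for the first two metrics, assuming each task run yields a step count and a tool-call count (the `TaskResult` shape is an assumption for illustration):

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    steps_total: int    # steps the task defines
    steps_passed: int   # steps the copilot completed
    optimal_calls: int  # hand-annotated optimal tool-call count
    actual_calls: int   # tool calls the copilot actually made


def completion_score(r: TaskResult) -> float:
    """Binary success with partial credit: fraction of steps passed."""
    return r.steps_passed / r.steps_total if r.steps_total else 0.0


def efficiency_score(r: TaskResult) -> float:
    """1.0 when the copilot matches the optimal call count, decaying toward 0."""
    if r.actual_calls <= 0:
        return 0.0
    return min(1.0, r.optimal_calls / r.actual_calls)
```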
3. Mock Hardware
- Simulated responses for all device operations
- Configurable failure scenarios
- Deterministic for reproducibility
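One way to meet all three requirements is a thin scripted stub: responses replay from a fixed script, failures inject at fixed call indices, and any randomness is seeded. The interface below (a generic `call` method) is an assumption, not the real client's API:

```python
import random


class MockQueueServerClient:
    """Scripted stand-in for the real queue-server client.

    Responses replay from a script keyed by operation name; an optional
    failure schedule raises at fixed call indices, so runs are reproducible.
    """

    def __init__(self, script, fail_at=None, seed=0):
        self.script = script            # {op_name: [response0, response1, ...]}
        self.fail_at = fail_at or {}    # {op_name: {call_index: Exception}}
        self.calls = []                 # audit log for the evaluator
        self.rng = random.Random(seed)  # seeded: deterministic if jitter is added

    def call(self, op_name, **kwargs):
        index = sum(1 for name, _ in self.calls if name == op_name)
        self.calls.append((op_name, kwargs))
        if index in self.fail_at.get(op_name, {}):
            raise self.fail_at[op_name][index]
        return self.script[op_name][index]
```

The audit log doubles as the evaluator's input for tool-call efficiency scoring.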
4. Reporting
- Per-task results
- Aggregate scores
- Regression tracking over time
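Per-task scores could roll up into a simple report, with regression tracking as a comparison against a stored baseline report; the dict layout and tolerance value here are assumptions:

```python
import statistics


def build_report(results):
    """results: {task_id: score in [0, 1]} -> report with an aggregate mean."""
    return {
        "per_task": results,
        "aggregate": statistics.mean(results.values()) if results else 0.0,
    }


def check_regression(report, baseline, tolerance=0.02):
    """True if the aggregate has not dropped past tolerance vs. a saved baseline."""
    return report["aggregate"] >= baseline["aggregate"] - tolerance
```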
Technical Approach
- Define benchmark tasks in YAML/JSON format
- Create `MockQueueServerClient` with scripted responses
- Run copilot in evaluation mode (no human input)
- Score based on tool calls and final state
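The steps above can be tied together in an evaluator loop; `run_copilot` and `score` are placeholders for however the agent is invoked without human input and however tool calls plus final state are judged:

```python
def evaluate(tasks, run_copilot, score):
    """Run every benchmark task against a fresh mock client and score it.

    tasks:       list of task dicts (e.g. loaded from the task definitions)
    run_copilot: callable(task, client) driving the agent in evaluation mode
    score:       callable(task, client) -> float in [0, 1], based on the
                 client's recorded tool calls and final state
    """
    results = {}
    for task in tasks:
        client = task["make_client"]()  # fresh scripted mock per task
        try:
            run_copilot(task, client)
        except Exception:
            pass  # failed runs still get scored, enabling partial credit
        results[task["id"]] = score(task, client)
    return results
```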
Key Files
- New: `benchmarks/tasks/`
- New: `benchmarks/mock_client.py`
- New: `benchmarks/evaluator.py`