[Feature] Automated benchmark suite for copilot evaluation #8

@pskeshu

Description

Create a benchmark task suite to automatically evaluate copilot success/failure on standard microscopy workflows.

Requirements

1. Benchmark Categories

  • Navigation: "Move to embryo X", "Center the brightest embryo"
  • Acquisition: "Acquire a volume", "Start timelapse for 1 hour"
  • Analysis: "Find the hatching embryo", "Measure embryo sizes"
  • Multi-step: "Calibrate all embryos and start timelapse"
  • Error Recovery: "Handle stage limit error", "Recover from failed detection"
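A rough sketch of how one task from these categories might be encoded, using the YAML format proposed under Technical Approach below (all field names are illustrative, not final):

```yaml
# Hypothetical navigation task; every field name here is a placeholder.
- id: nav-001
  category: navigation
  prompt: "Center the brightest embryo"
  optimal_tool_calls: 3            # baseline for the tool-call efficiency metric
  success_criteria:
    final_state:
      stage_centered_on: brightest_embryo
  failure_injection: null          # set to a scenario name to exercise error recovery
```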

2. Evaluation Metrics

  • Task completion (binary + partial credit)
  • Tool call efficiency (optimal vs actual)
  • Time to completion
  • Error handling quality
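A minimal sketch of how these metrics could be combined into a single per-task score (the dataclass fields and the weights are assumptions, not decided values):

```python
from dataclasses import dataclass

@dataclass
class TaskScore:
    completion: float       # 1.0 for full success, fractional for partial credit
    tool_efficiency: float  # optimal_calls / actual_calls, capped at 1.0
    wall_time_s: float      # time to completion (reported, not weighted here)
    error_handling: float   # 0.0-1.0 rubric score for error-handling quality

def overall_score(s: TaskScore, weights=(0.6, 0.2, 0.2)) -> float:
    """Weighted aggregate of the non-time metrics; weights are illustrative."""
    w_completion, w_efficiency, w_errors = weights
    return (w_completion * s.completion
            + w_efficiency * s.tool_efficiency
            + w_errors * s.error_handling)
```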

3. Mock Hardware

  • Simulated responses for all device operations
  • Configurable failure scenarios
  • Deterministic for reproducibility
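A sketch of what MockQueueServerClient could look like (the `call` method name and response shapes are assumptions; the real queue-server API may differ):

```python
class MockQueueServerClient:
    """Simulated queue-server client for benchmark runs.

    Responses are scripted per operation, failures can be injected by name,
    and there is no randomness, so every run is reproducible.
    """

    def __init__(self, script: dict, failure_scenarios: dict | None = None):
        self.script = script                     # e.g. {"move_stage": {"status": "ok"}}
        self.failures = failure_scenarios or {}  # e.g. {"move_stage": "stage_limit_exceeded"}
        self.calls = []                          # recorded for tool-call efficiency scoring

    def call(self, op: str, **kwargs) -> dict:
        self.calls.append((op, kwargs))
        if op in self.failures:
            return {"status": "error", "error": self.failures[op]}
        return self.script.get(op, {"status": "ok"})
```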

4. Reporting

  • Per-task results
  • Aggregate scores
  • Regression tracking over time
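As an illustration (file layout and field names assumed), per-task results could be aggregated into one timestamped JSON file per run, which gives a simple basis for regression tracking across runs:

```python
import json
import statistics
import time
from pathlib import Path

def write_report(results: list[dict], out_dir: str = "benchmarks/reports") -> Path:
    """Write per-task results plus aggregate scores for one benchmark run."""
    report = {
        "timestamp": time.time(),
        "tasks": results,  # one dict per task: id, completion, tool_calls, ...
        "aggregate": {
            "mean_completion": statistics.mean(r["completion"] for r in results),
            "pass_rate": sum(r["completion"] >= 1.0 for r in results) / len(results),
        },
    }
    path = Path(out_dir) / f"run-{int(report['timestamp'])}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path
```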

Technical Approach

  • Define benchmark tasks in YAML/JSON format
  • Create MockQueueServerClient with scripted responses
  • Run copilot in evaluation mode (no human input)
  • Score based on tool calls and final state
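A rough sketch of the evaluation loop that could live in benchmarks/evaluator.py (the copilot interface `copilot.run(...)` and the `check_success` helper are hypothetical):

```python
import yaml  # assumes tasks are stored as YAML; JSON would work the same way

def run_benchmarks(task_file: str, copilot, client) -> list[dict]:
    """Run every task against the mock client and score the outcome."""
    with open(task_file) as f:
        tasks = yaml.safe_load(f)

    results = []
    for task in tasks:
        client.calls.clear()                        # fresh tool-call log per task
        copilot.run(task["prompt"], client=client)  # evaluation mode: no human input
        results.append({
            "id": task["id"],
            "completion": check_success(task, client),  # hypothetical final-state checker
            "tool_calls": len(client.calls),
        })
    return results
```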

Key Files

  • New: benchmarks/tasks/
  • New: benchmarks/mock_client.py
  • New: benchmarks/evaluator.py

Labels

  • enhancement: New feature or request
  • priority-medium: Important but not blocking
  • testing: Testing and evaluation
