[Feature] Automated benchmark suite for copilot evaluation #8

@pskeshu

Description

Create a benchmark task suite to automatically evaluate copilot success/failure on standard microscopy workflows.

Requirements

1. Benchmark Categories

  • Navigation: "Move to embryo X", "Center the brightest embryo"
  • Acquisition: "Acquire a volume", "Start timelapse for 1 hour"
  • Analysis: "Find the hatching embryo", "Measure embryo sizes"
  • Multi-step: "Calibrate all embryos and start timelapse"
  • Error Recovery: "Handle stage limit error", "Recover from failed detection"
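A rough sketch of how one task from these categories might be encoded, using the YAML format proposed under Technical Approach below (all field names are illustrative, not final):

```yaml
# Hypothetical navigation task; every field name here is a placeholder.
- id: nav-001
  category: navigation
  prompt: "Center the brightest embryo"
  optimal_tool_calls: 3            # baseline for the tool-call efficiency metric
  success_criteria:
    final_state:
      stage_centered_on: brightest_embryo
  failure_injection: null          # set to a scenario name to exercise error recovery
```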

2. Evaluation Metrics

  • Task completion (binary + partial credit)
  • Tool call efficiency (optimal vs actual)
  • Time to completion
  • Error handling quality
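A minimal sketch of how these metrics could be combined into a single per-task score (the dataclass fields and the weights are assumptions, not decided values):

```python
from dataclasses import dataclass

@dataclass
class TaskScore:
    completion: float       # 1.0 for full success, fractional for partial credit
    tool_efficiency: float  # optimal_calls / actual_calls, capped at 1.0
    wall_time_s: float      # time to completion (reported, not weighted here)
    error_handling: float   # 0.0-1.0 rubric score for error-handling quality

def overall_score(s: TaskScore, weights=(0.6, 0.2, 0.2)) -> float:
    """Weighted aggregate of the non-time metrics; weights are illustrative."""
    w_completion, w_efficiency, w_errors = weights
    return (w_completion * s.completion
            + w_efficiency * s.tool_efficiency
            + w_errors * s.error_handling)
```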

3. Mock Hardware

  • Simulated responses for all device operations
  • Configurable failure scenarios
  • Deterministic for reproducibility
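A sketch of what MockQueueServerClient could look like (the `call` method name and response shapes are assumptions; the real queue-server API may differ):

```python
class MockQueueServerClient:
    """Simulated queue-server client for benchmark runs.

    Responses are scripted per operation, failures can be injected by name,
    and there is no randomness, so every run is reproducible.
    """

    def __init__(self, script: dict, failure_scenarios: dict | None = None):
        self.script = script                     # e.g. {"move_stage": {"status": "ok"}}
        self.failures = failure_scenarios or {}  # e.g. {"move_stage": "stage_limit_exceeded"}
        self.calls = []                          # recorded for tool-call efficiency scoring

    def call(self, op: str, **kwargs) -> dict:
        self.calls.append((op, kwargs))
        if op in self.failures:
            return {"status": "error", "error": self.failures[op]}
        return self.script.get(op, {"status": "ok"})
```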

4. Reporting

  • Per-task results
  • Aggregate scores
  • Regression tracking over time
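As an illustration (file layout and field names assumed), per-task results could be aggregated into one timestamped JSON file per run, which gives a simple basis for regression tracking across runs:

```python
import json
import statistics
import time
from pathlib import Path

def write_report(results: list[dict], out_dir: str = "benchmarks/reports") -> Path:
    """Write per-task results plus aggregate scores for one benchmark run."""
    report = {
        "timestamp": time.time(),
        "tasks": results,  # one dict per task: id, completion, tool_calls, ...
        "aggregate": {
            "mean_completion": statistics.mean(r["completion"] for r in results),
            "pass_rate": sum(r["completion"] >= 1.0 for r in results) / len(results),
        },
    }
    path = Path(out_dir) / f"run-{int(report['timestamp'])}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path
```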

Technical Approach

  • Define benchmark tasks in YAML/JSON format
  • Create MockQueueServerClient with scripted responses
  • Run copilot in evaluation mode (no human input)
  • Score based on tool calls and final state
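A rough sketch of the evaluation loop that could live in benchmarks/evaluator.py (the copilot interface `copilot.run(...)` and the `check_success` helper are hypothetical):

```python
import yaml  # assumes tasks are stored as YAML; JSON would work the same way

def run_benchmarks(task_file: str, copilot, client) -> list[dict]:
    """Run every task against the mock client and score the outcome."""
    with open(task_file) as f:
        tasks = yaml.safe_load(f)

    results = []
    for task in tasks:
        client.calls.clear()                        # fresh tool-call log per task
        copilot.run(task["prompt"], client=client)  # evaluation mode: no human input
        results.append({
            "id": task["id"],
            "completion": check_success(task, client),  # hypothetical final-state checker
            "tool_calls": len(client.calls),
        })
    return results
```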

Key Files

  • New: benchmarks/tasks/
  • New: benchmarks/mock_client.py
  • New: benchmarks/evaluator.py

Labels

  • enhancement: New feature or request
  • priority-medium: Important but not blocking
  • testing: Testing and evaluation
