Add cost and token tracking per benchmark run #2

@rajkumar42

Description

Summary

Add instrumentation to track token usage (input/output) and estimated cost per benchmark task and per full run. This enables cost comparison across providers and helps users estimate expenses before running large benchmarks.

What needs to happen

  • Capture input and output token counts from each LLM call
  • Map token counts to estimated cost using per-provider pricing
  • Aggregate totals per task and per full benchmark run
  • Output a cost summary at the end of each run (total tokens, total cost, avg cost/task)
  • Save cost data alongside benchmark results for later analysis
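The steps above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the `CostTracker` class name is hypothetical, and the pricing numbers are placeholders that would need to come from a maintained per-provider pricing table.

```python
from dataclasses import dataclass, field

# Hypothetical per-provider pricing in USD per 1M tokens -- placeholder
# values only; real rates would live in a maintained pricing table.
PRICING = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}

@dataclass
class CostTracker:
    """Accumulates token counts and estimated cost across LLM calls."""
    provider: str
    input_tokens: int = 0
    output_tokens: int = 0
    task_costs: list = field(default_factory=list)

    def record_call(self, input_tokens: int, output_tokens: int) -> float:
        """Record one LLM call; return the estimated cost of that call."""
        rates = PRICING[self.provider]
        cost = (input_tokens * rates["input"]
                + output_tokens * rates["output"]) / 1_000_000
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        return cost

    def record_task(self, cost: float) -> None:
        """Record the total estimated cost of one benchmark task."""
        self.task_costs.append(cost)

    def summary(self) -> str:
        """Render the end-of-run cost summary."""
        total = sum(self.task_costs)
        n = len(self.task_costs) or 1
        return (
            f"Benchmark complete: {len(self.task_costs)} tasks\n"
            f"Total tokens: {self.input_tokens + self.output_tokens:,} "
            f"(input: {self.input_tokens:,} / output: {self.output_tokens:,})\n"
            f"Estimated cost: ${total:.2f}\n"
            f"Avg cost/task: ${total / n:.2f}\n"
            f"Provider: {self.provider}"
        )
```

A per-task wrapper in the benchmark loop would call `record_call` once per LLM request and `record_task` once per task, then print `summary()` at the end of the run.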

Example output

Benchmark complete: 162 tasks
Total tokens: 1,245,000 (input: 980,000 / output: 265,000)
Estimated cost: $4.82
Avg cost/task: $0.03
Provider: claude-sonnet-4-6

Acceptance criteria

  • Token counts are captured per LLM call
  • Cost summary is printed at end of run
  • Cost data is saved to results file
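For the last criterion, one possible shape for persisting cost data alongside results, assuming results are serialized as JSON (the function name, file layout, and field names here are all illustrative):

```python
import json
from pathlib import Path

def save_results_with_costs(results: list, cost_summary: dict, path: str) -> None:
    """Write benchmark results plus a cost section to a single JSON file,
    so cost data can be analyzed later without re-running the benchmark."""
    payload = {"results": results, "cost": cost_summary}
    Path(path).write_text(json.dumps(payload, indent=2))
```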
