188 changes: 188 additions & 0 deletions BENCHMARK_RESULTS.md
# MCP Universe Benchmark Results

Comparison of Claude Code agent performance across different MCP transport configurations.

## Test Configuration

- **Test Suite**: Repository Management (10 GitHub tasks, 48 evaluations)
- **Agent**: claude-code-agent
- **Model**: claude-opus-4-5-20251101
- **Max Iterations**: 20

---

## Results Summary

| Run | Transport | Passed | Failed | Score | Total Time | Notes |
|-----|-----------|--------|--------|-------|------------|-------|
| **Run 1** | Direct API (GitHub MCP via Docker) | 15 | 33 | **31.25%** | ~50min | Baseline |
| **Run 2** | ContextBridge (mcp-remote stdio) | 15 | 33 | **31.25%** | ~50min | Same as baseline |
| **Run 3** | ContextBridge (HTTP transport) | 1 | 47 | **2.08%** | ~50min | Significant regression |

---

## Run 1: Direct API (Baseline)

**Date**: 2025-12-20 07:30
**Report**: `log/report_20251220_073022_060a7ee3-59a8-4c92-970b-5df96e9e5c81.md`
**Transport**: GitHub MCP server via Docker (stdio)

| Task | Passed | Failed | Score | Time |
|------|--------|--------|-------|------|
| github_task_0001 | 3 | 4 | 0.43 | 111s |
| github_task_0002 | 2 | 5 | 0.29 | - |
| github_task_0003 | 2 | 8 | 0.20 | - |
| github_task_0004 | 3 | 4 | 0.43 | - |
| github_task_0005 | 3 | 4 | 0.43 | - |
| github_task_0006 | 0 | 2 | 0.00 | - |
| github_task_0007 | 1 | 1 | 0.50 | - |
| github_task_0008 | 0 | 2 | 0.00 | - |
| github_task_0009 | 0 | 2 | 0.00 | - |
| github_task_0010 | 1 | 1 | 0.50 | - |

**Total**: 15/48 passed (31.25%)

---

## Run 2: ContextBridge (mcp-remote stdio)

**Date**: 2025-12-20 07:58
**Report**: `log/report_20251220_075853_f9e8c86a-8599-4c75-b3ce-5ffc73b6db91.md`
**Transport**: ContextBridge via mcp-remote (stdio proxy)

| Task | Passed | Failed | Score | Time |
|------|--------|--------|-------|------|
| github_task_0001 | 2 | 5 | 0.29 | 82s |
| github_task_0002 | 3 | 4 | 0.43 | - |
| github_task_0003 | 2 | 8 | 0.20 | - |
| github_task_0004 | 3 | 4 | 0.43 | - |
| github_task_0005 | 3 | 4 | 0.43 | - |
| github_task_0006 | 0 | 2 | 0.00 | - |
| github_task_0007 | 0 | 2 | 0.00 | - |
| github_task_0008 | 1 | 1 | 0.50 | - |
| github_task_0009 | 0 | 2 | 0.00 | - |
| github_task_0010 | 1 | 1 | 0.50 | - |

**Total**: 15/48 passed (31.25%)

---

## Run 3: ContextBridge (HTTP transport)

**Date**: 2025-12-20 09:58
**Report**: `log/report_20251220_095837_8ec88e24-a7d9-4a30-9e8f-e8d74c4783f3.md`
**Transport**: ContextBridge via Claude Code SDK HTTP transport

| Task | Passed | Failed | Score | Time |
|------|--------|--------|-------|------|
| github_task_0001 | 0 | 7 | 0.00 | 100s |
| github_task_0002 | 0 | 7 | 0.00 | 12s |
| github_task_0003 | 0 | 10 | 0.00 | 225s |
| github_task_0004 | 0 | 7 | 0.00 | 153s |
| github_task_0005 | 0 | 7 | 0.00 | 565s |
| github_task_0006 | 0 | 2 | 0.00 | 405s |
| github_task_0007 | 1 | 1 | 0.50 | 395s |
| github_task_0008 | 0 | 2 | 0.00 | 52s |
| github_task_0009 | 0 | 2 | 0.00 | 305s |
| github_task_0010 | 0 | 2 | 0.00 | 567s |

**Total**: 1/48 passed (2.08%)

### Run 3 Failure Analysis

Primary failure reasons:
- **"the repository doesn't exist"** - the most common failure; indicates the agent couldn't create repositories via ContextBridge
- **"the branches don't exist"** - secondary failure
- **"the file content is not found"** - tertiary failure
- **"the PR doesn't exist"** - downstream failure (depends on repositories/branches existing)

**Root Cause**: The Claude Code SDK HTTP transport to ContextBridge appears to have connectivity or authentication issues. The agent received the prompts but couldn't execute GitHub operations through the gateway.

---

## Analysis

### Performance Comparison

| Metric | Run 1 (Direct) | Run 2 (mcp-remote) | Run 3 (HTTP) |
|--------|----------------|--------------------|--------------|
| Success Rate | 31.25% | 31.25% | 2.08% |
| Total Passed | 15 | 15 | 1 |
| Total Failed | 33 | 33 | 47 |
| Task 1 Latency | 111s | 82s | 100s |
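The success rates in the table are easy to recompute from the pass/fail counts (a quick sanity check, not part of the benchmark harness):

```python
# Each run has 48 evaluations total; success rate = passed / (passed + failed).
runs = {
    "Run 1 (Direct)": (15, 33),
    "Run 2 (mcp-remote)": (15, 33),
    "Run 3 (HTTP)": (1, 47),
}
for name, (passed, failed) in runs.items():
    total = passed + failed
    rate = 100 * passed / total
    print(f"{name}: {passed}/{total} = {rate:.2f}%")
# Run 1 (Direct): 15/48 = 31.25%
# Run 2 (mcp-remote): 15/48 = 31.25%
# Run 3 (HTTP): 1/48 = 2.08%
```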

### Key Findings

1. **Run 1 vs Run 2**: Equivalent performance
- Both achieved 31.25% success rate
- mcp-remote stdio transport works correctly with ContextBridge
- Task-level variance exists but balances out

2. **Run 3: HTTP transport failure**
- Dramatic regression: 2.08% vs 31.25%
- Only github_task_0007 partially succeeded (1/2 evals)
- All other tasks failed to create repositories
- Suggests HTTP transport configuration or ContextBridge authentication issue

### Potential Run 3 Issues

1. **HTTP transport not fully supported** by Claude Code SDK for MCP
2. **Missing authentication headers** in HTTP config
3. **ContextBridge gateway** may require different authentication for HTTP vs SSE
4. **Tool discovery failure** - agent may not have received tool list from gateway

---

## Known Issues

1. **Evaluator Bug**: `IndexError` in `github__get_file_contents` (line 61 in functions.py)
- `output.content[1].resource.text` fails when content list is empty
- Affects all runs equally

2. **LLM Call Tracking**: Reports show 0 LLM calls for claude-code-agent
- Tracking issue only, doesn't affect actual execution

---

## Recommendations

1. **Investigate HTTP transport failure**
- Check ContextBridge logs for Run 3
- Verify HTTP authentication is working
- Consider using mcp-remote as the stable option

2. **Fix evaluator bug**
- Add bounds checking in `github__get_file_contents`
- Would likely improve reported success rates

3. **For production use**
- Use mcp-remote stdio transport until HTTP is debugged
- Both Run 1 and Run 2 show equivalent 31.25% success rate

---

## Quick Mode Comparison (Run 4 vs Run 5)

**Date**: 2025-12-21

| Transport | Task 0001 | Task 0007 | Task 0010 | Total | Score | Time |
|-----------|-----------|-----------|-----------|-------|-------|------|
| **Run 4: Direct GitHub MCP** | 0/7 | 1/2 | 0/2 | **1/11** | **9.09%** | ~2min |
| **Run 5: ContextBridge HTTP** | 0/7 | 0/2 | 0/2 | **0/11** | **0.00%** | ~2min |

### Key Finding

ContextBridge via HTTP transport performed worse than direct GitHub MCP:
- The agent trace shows `search_tools` as first action instead of actual GitHub operations
- Suggests tools aren't properly exposed via HTTP transport
- Authentication works (Bearer token accepted) but tool discovery/execution may be incomplete

### ContextBridge Connection Issues Encountered

1. **mcp-remote SSE errors** - Required Node.js 20.18.1+ (upgraded to 22)
2. **OAuth localhost callback** - ContextBridge only supports hosted callback, not localhost
3. **HTTP transport fallback** - Used direct HTTP with Bearer token from cached auth
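For reference, the HTTP fallback in item 3 amounted to an MCP server entry along these lines (a sketch only — the URL and token are placeholders, not the real ContextBridge endpoint or credentials):

```json
{
  "mcpServers": {
    "contextbridge": {
      "type": "http",
      "url": "https://<contextbridge-host>/mcp",
      "headers": {
        "Authorization": "Bearer <token-from-cached-auth>"
      }
    }
  }
}
```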

---

*Last Updated: 2025-12-21*
133 changes: 133 additions & 0 deletions claude.md
# MCP Universe - Fork for MCP Gateway Testing

## Project Overview

This is a fork of the original MCP Universe repository, specifically created to test and evaluate an MCP gateway implementation.

## Project Goals

### 1. Initial Testing Phase
- **Objective**: Run repository management tests using direct Anthropic API access
- **Approach**: Use personal Anthropic API key to establish baseline performance
- **Test Suite**: Repository management benchmark (34 GitHub-related tasks)
- **Models**: Testing with Claude 4.5 models (Sonnet, Opus, Haiku)

### 2. MCP Gateway Integration Phase
- **Objective**: Test the same benchmarks through an MCP gateway
- **Approach**: Configure the gateway URL and route requests through it
- **Purpose**: Validate gateway functionality and performance

### 3. Comparison & Analysis Phase
- **Objective**: Compare direct API vs. gateway performance
- **Metrics to Compare**:
- Test pass/fail rates
- Response times
- Token usage
- Cost efficiency
- Error rates
- Overall reliability

## Current Status

**Phase**: Initial Setup
**Next Step**: Run repository management tests with direct Anthropic API

## Implementation Plan

### Step 1: Direct Anthropic API Testing (Current)
Detailed implementation plan saved at: `REPO_MANAGEMENT_TEST_PLAN.md`

**Summary**:
1. Configure `.env` with `ANTHROPIC_API_KEY` and GitHub credentials
2. Update `mcpuniverse/benchmark/configs/test/repository_management.yaml`:
- Change `type: openai` → `type: claude`
- Set `model_name` to Claude 4.5 variant
3. Run benchmark: `pytest tests/benchmark/test_benchmark_repository_management.py`
4. Collect baseline metrics and results
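The LLM block in `repository_management.yaml` would change roughly as follows (a sketch — the exact key layout and the pre-existing OpenAI model name in the real config may differ):

```yaml
# Before (default OpenAI provider)
# type: openai
# model_name: <existing OpenAI model>

# After: call the Anthropic API directly
type: claude
model_name: claude-opus-4-5-20251101  # or claude-sonnet-4-5-20250929 / claude-haiku-4-5
```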

### Step 2: MCP Gateway Testing (Planned)
1. Configure MCP gateway URL in environment
2. Update configuration to route through gateway
3. Run the same benchmark suite
4. Collect gateway performance metrics

### Step 3: Comparison Analysis (Planned)
1. Compare direct API vs. gateway results
2. Document performance differences
3. Identify optimization opportunities
4. Generate comprehensive comparison report

## Repository Structure

Key files and directories:
- `REPO_MANAGEMENT_TEST_PLAN.md` - Detailed test execution plan
- `mcpuniverse/benchmark/configs/test/` - Benchmark configurations
- `tests/benchmark/` - Benchmark test suites
- `log/` - Test execution logs and reports
- `.env` - Environment configuration (not committed)

## Reference Documentation

### Previous Work
- **OpenRouter Migration Plan**: `/Users/hev/.claude/plans/soft-swimming-snowflake.md`
- Documents previous effort to consolidate LLM providers
- Not currently active for this fork

### Claude 4.5 Models
| Model | API Name | Use Case |
|-------|----------|----------|
| Sonnet 4.5 | `claude-sonnet-4-5-20250929` | Balanced performance/cost |
| Opus 4.5 | `claude-opus-4-5-20251101` | Maximum capability |
| Haiku 4.5 | `claude-haiku-4-5` | Speed/cost optimization |

## Testing Methodology

### Baseline Testing (Direct API)
- **Provider**: Anthropic (direct API)
- **Authentication**: `ANTHROPIC_API_KEY`
- **Configuration**: `type: claude` in YAML config
- **Benchmark**: Repository management (34 tasks)

### Gateway Testing (Upcoming)
- **Provider**: MCP Gateway
- **Authentication**: Gateway-specific credentials
- **Configuration**: Gateway URL + model routing
- **Benchmark**: Same 34 repository management tasks

### Comparison Metrics
1. **Functional Metrics**
- Task success rate
- Correctness of outputs
- Error handling

2. **Performance Metrics**
- Request latency
- Total execution time
- Throughput

3. **Cost Metrics**
- Token usage
- API costs
- Resource utilization

4. **Reliability Metrics**
- Error rates
- Retry counts
- Failure patterns

## Next Steps

1. ✅ Create implementation plan (REPO_MANAGEMENT_TEST_PLAN.md)
2. ⏳ Set up environment (.env file)
3. ⏳ Run baseline tests with direct Anthropic API
4. ⏳ Document baseline results
5. ⏳ Configure MCP gateway
6. ⏳ Run gateway tests
7. ⏳ Generate comparison analysis
8. ⏳ Document findings and recommendations

---

**Last Updated**: 2025-12-07
**Primary Contact**: [Your contact info]
**Original Repository**: [Link to upstream MCP-Universe]