42 changes: 36 additions & 6 deletions apps/docs/docs.json
@@ -185,6 +185,17 @@
],
"tab": "Developer Platform"
},
{
"icon": "book-open",
"anchors": [
{
"anchor": "API Reference",
"icon": "unplug",
"openapi": "https://api.supermemory.ai/v3/openapi"
}
],
"tab": "API Reference"
},
{
"icon": "plug",
"anchors": [
@@ -234,15 +245,35 @@
"tab": "SDKs"
},
{
"icon": "book-open",
"icon": "flask-conical",
"anchors": [
{
"anchor": "API Reference",
"icon": "unplug",
"openapi": "https://api.supermemory.ai/v3/openapi"
"anchor": "MemoryBench",
"icon": "flask-conical",
"pages": [
"memorybench/overview",
"memorybench/github",
{
"group": "Getting Started",
"pages": ["memorybench/installation", "memorybench/quickstart"]
},
{
"group": "Development",
"pages": [
"memorybench/architecture",
"memorybench/extend-provider",
"memorybench/extend-benchmark",
"memorybench/contributing"
]
},
{
"group": "Reference",
"pages": ["memorybench/cli", "memorybench/integrations"]
}
]
}
],
"tab": "API Reference"
"tab": "MemoryBench"
},
{
"icon": "chef-hat",
@@ -269,7 +300,6 @@
],
"tab": "Cookbook"
},

{
"icon": "list-ordered",
"anchors": [
99 changes: 99 additions & 0 deletions apps/docs/memorybench/architecture.mdx
@@ -0,0 +1,99 @@
---
title: "Architecture"
description: "Understanding MemoryBench's design and implementation"
sidebarTitle: "Architecture"
---

## System Overview

```mermaid
flowchart TB
B["Benchmarks<br/>(LoCoMo, LongMemEval..)"]
P["Providers<br/>(Supermemory, Mem0, Zep)"]
J["Judges<br/>(GPT-4o, Claude..)"]

B --> O[Orchestrator]
P --> O
J --> O

O --> Pipeline

subgraph Pipeline[" "]
direction LR
I[Ingest] --> IX[Indexing] --> S[Search] --> A[Answer] --> E[Evaluate]
end

style B fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style P fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style J fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style O fill:#0369A1,stroke:#0369A1,color:#fff
style I fill:#F1F5F9,stroke:#64748B,color:#334155
style IX fill:#F1F5F9,stroke:#64748B,color:#334155
style S fill:#F1F5F9,stroke:#64748B,color:#334155
style A fill:#F1F5F9,stroke:#64748B,color:#334155
style E fill:#F1F5F9,stroke:#64748B,color:#334155
```

## Core Components

| Component | Role |
|-----------|------|
| **Benchmarks** | Load test data and provide questions with ground truth answers |
| **Providers** | Memory services being evaluated (handle ingestion and search) |
| **Judges** | LLM-based evaluators that score answers against ground truth |

See [Integrations](/memorybench/integrations) for all supported benchmarks, providers, and models.

## Pipeline

```mermaid
flowchart LR
A[Ingest] --> B[Index] --> C[Search] --> D[Answer] --> E[Evaluate] --> F[Report]

style A fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style B fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style C fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style D fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style E fill:#E0F2FE,stroke:#0369A1,color:#0C4A6E
style F fill:#DCFCE7,stroke:#16A34A,color:#166534
```

| Phase | What Happens |
|-------|--------------|
| **Ingest** | Load benchmark sessions → Push to provider |
| **Index** | Wait for provider indexing |
| **Search** | Query provider → Retrieve context |
| **Answer** | Build prompt → Generate answer via LLM |
| **Evaluate** | Compare to ground truth → Score via judge |
| **Report** | Aggregate scores → Output accuracy + latency |

Each phase checkpoints independently, so failed runs resume from the last successful point.

## Advanced Checkpointing

Runs persist to `data/runs/{runId}/`:

```
data/runs/my-run/
├── checkpoint.json # Run state and progress
├── results/ # Search results per question
└── report.json # Final report
```

Re-running with the same run ID resumes from the checkpoint. Use `--force` to restart from scratch.
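The resume behavior described above can be sketched with a simplified checkpoint loader. This is a hypothetical illustration, not the actual orchestrator code: the `Checkpoint` shape and the `loadCheckpoint`/`phasesToRun` helpers are invented here, and the real schema in `src/orchestrator/` may differ.

```typescript
// Hypothetical sketch of phase-level checkpointing; MemoryBench's real
// checkpoint schema and helper names may differ.
import * as fs from "fs";
import * as path from "path";

type Phase = "ingest" | "index" | "search" | "answer" | "evaluate" | "report";
const PHASES: Phase[] = ["ingest", "index", "search", "answer", "evaluate", "report"];

interface Checkpoint {
  runId: string;
  completed: Phase[]; // phases that finished successfully
}

function loadCheckpoint(runsDir: string, runId: string, force: boolean): Checkpoint {
  const file = path.join(runsDir, runId, "checkpoint.json");
  if (!force && fs.existsSync(file)) {
    // Same run ID and no --force: resume from the saved state.
    return JSON.parse(fs.readFileSync(file, "utf8")) as Checkpoint;
  }
  // --force (or a fresh run ID): start from a clean slate.
  return { runId, completed: [] };
}

function phasesToRun(cp: Checkpoint): Phase[] {
  // Skip every phase already marked complete; run the rest in order.
  return PHASES.filter((p) => !cp.completed.includes(p));
}
```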

## File Structure

```
src/
├── cli/commands/ # run, compare, test, serve, status...
├── orchestrator/phases/ # ingest, search, answer, evaluate, report
├── benchmarks/
│ └── <name>/index.ts # e.g. locomo/, longmemeval/, convomem/
├── providers/
│ └── <name>/
│ ├── index.ts # Provider implementation
│ └── prompts.ts # Custom prompts (optional)
├── judges/ # openai.ts, anthropic.ts, google.ts
└── types/ # provider.ts, benchmark.ts, unified.ts
```
117 changes: 117 additions & 0 deletions apps/docs/memorybench/cli.mdx
@@ -0,0 +1,117 @@
---
title: "CLI Reference"
description: "Command-line interface for running MemoryBench evaluations"
sidebarTitle: "CLI"
---

## Commands

### run

Execute the full benchmark pipeline.

```bash
bun run src/index.ts run -p <provider> -b <benchmark> -j <judge> -r <run-id>
```

| Option | Description |
|--------|-------------|
| `-p, --provider` | Memory provider (`supermemory`, `mem0`, `zep`) |
| `-b, --benchmark` | Benchmark (`locomo`, `longmemeval`, `convomem`) |
| `-j, --judge` | Judge model (default: `gpt-4o`) |
| `-r, --run-id` | Run identifier (auto-generated if omitted) |
| `-m, --answering-model` | Model for answer generation (default: `gpt-4o`) |
| `-l, --limit` | Limit number of questions |
| `-s, --sample` | Sample N questions per category |
| `--sample-type` | Sampling strategy: `consecutive` (default), `random` |
| `--force` | Clear checkpoint and restart |

See [Integrations](/memorybench/integrations) for all available judge and answering models.

---

### compare

Run a benchmark across multiple providers in parallel.

```bash
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o
```

---

### test

Evaluate a single question for debugging.

```bash
bun run src/index.ts test -r <run-id> -q <question-id>
```

---

### status

Check progress of a run.

```bash
bun run src/index.ts status -r <run-id>
```

---

### show-failures

Debug failed questions with full context.

```bash
bun run src/index.ts show-failures -r <run-id>
```

---

### list-questions

Browse benchmark questions.

```bash
bun run src/index.ts list-questions -b <benchmark>
```

---

### Random Sampling

Not a separate command: the `run` command's `-s` option samples N questions per category, and `--sample-type random` randomizes the selection.

```bash
bun run src/index.ts run -p supermemory -b longmemeval -s 3 --sample-type random
```

---

### serve

Start the web UI.

```bash
bun run src/index.ts serve
```

Opens at [http://localhost:3000](http://localhost:3000).

---

### help

Get help on providers, models, or benchmarks.

```bash
bun run src/index.ts help providers
bun run src/index.ts help models
bun run src/index.ts help benchmarks
```

## Checkpointing

Runs are saved to `data/runs/{runId}/` and automatically resume from the last successful phase. Use `--force` to restart.
89 changes: 89 additions & 0 deletions apps/docs/memorybench/contributing.mdx
@@ -0,0 +1,89 @@
---
title: "Contributing"
description: "Guidelines for contributing to MemoryBench"
sidebarTitle: "Contributing"
---

## Getting Started

1. Fork the repository
2. Clone your fork:
```bash
git clone https://github.com/YOUR_USERNAME/memorybench
cd memorybench
bun install
```
3. Create a branch:
```bash
git checkout -b feature/your-feature
```

## Development Workflow

### Running Tests

```bash
bun test
```

### Running the CLI

```bash
bun run src/index.ts <command>
```

### Running the Web UI

```bash
cd ui
bun run dev
```

## Code Structure

| Directory | Purpose |
|-----------|---------|
| `src/cli/` | CLI commands |
| `src/orchestrator/` | Pipeline execution |
| `src/benchmarks/` | Benchmark adapters |
| `src/providers/` | Provider integrations |
| `src/judges/` | LLM judge implementations |
| `src/types/` | TypeScript interfaces |
| `ui/` | Next.js web interface |

## Contribution Types

### Adding a Provider

See [Extending MemoryBench](/memorybench/extend-provider) for the full guide.

1. Create `src/providers/yourprovider/index.ts`
2. Implement the `Provider` interface
3. Register in `src/providers/index.ts`
4. Add config in `src/utils/config.ts`
5. Submit PR with tests
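As a rough sketch of step 2, a minimal provider could look like the following. The `Provider` shape shown here is hypothetical — check `src/types/provider.ts` for the actual contract — and the naive substring search stands in for a real memory backend.

```typescript
// Hypothetical Provider shape for illustration; the real interface lives
// in src/types/provider.ts and may differ.
interface Provider {
  name: string;
  ingest(sessionId: string, messages: string[]): Promise<void>;
  search(query: string, limit?: number): Promise<string[]>;
}

class InMemoryProvider implements Provider {
  name = "inmemory";
  private store = new Map<string, string[]>();

  async ingest(sessionId: string, messages: string[]): Promise<void> {
    // Store each session's messages verbatim.
    this.store.set(sessionId, messages);
  }

  async search(query: string, limit = 5): Promise<string[]> {
    // Naive case-insensitive substring match as a stand-in for retrieval.
    const hits: string[] = [];
    for (const messages of this.store.values()) {
      for (const m of messages) {
        if (m.toLowerCase().includes(query.toLowerCase())) hits.push(m);
      }
    }
    return hits.slice(0, limit);
  }
}
```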

### Adding a Benchmark

1. Create `src/benchmarks/yourbenchmark/index.ts`
2. Implement the `Benchmark` interface
3. Register in `src/benchmarks/index.ts`
4. Document question types
5. Submit PR with sample data
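A benchmark adapter for step 2 might be shaped roughly like this. The `Benchmark` and `Question` types here are hypothetical — see `src/types/benchmark.ts` for the real interface — and the tiny fixture merely shows the idea of pairing sessions with ground-truth questions.

```typescript
// Hypothetical Benchmark shape for illustration; the real interface lives
// in src/types/benchmark.ts and may differ.
interface Question {
  id: string;
  question: string;
  answer: string;   // ground truth the judge scores against
  category: string;
}

interface Benchmark {
  name: string;
  loadSessions(): Promise<Record<string, string[]>>;
  loadQuestions(): Promise<Question[]>;
}

class TinyBenchmark implements Benchmark {
  name = "tiny";

  async loadSessions(): Promise<Record<string, string[]>> {
    // Sessions are conversations the provider will ingest.
    return {
      "session-1": ["User: I adopted a dog named Rex.", "Assistant: Congrats!"],
    };
  }

  async loadQuestions(): Promise<Question[]> {
    // Each question references facts recoverable from the sessions.
    return [
      { id: "q1", question: "What is the dog's name?", answer: "Rex", category: "single-hop" },
    ];
  }
}
```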

### Bug Fixes

1. Create an issue describing the bug
2. Reference the issue in your PR
3. Include test cases that reproduce the bug

## Pull Request Guidelines

- Keep PRs focused on a single change
- Update documentation if needed
- Ensure all tests pass
- Follow existing code style

## Questions?

Open an issue on [GitHub](https://github.com/supermemoryai/memorybench/issues).