Run full-scale LongMemEval and LoCoMo benchmarks

## Context

db0's current benchmark scores are based on limited samples:
- LongMemEval: 80.0% on 50 questions (of 500 total)
- LoCoMo: 76.9% on 199 queries from 1 sample (of 10 samples, ~1986 queries total)

These results are promising, but because they come from small subsets they are not directly comparable to published scores from other systems, which are run on the full datasets.
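
To make the sample-size concern concrete, here is a minimal sketch that computes a 95% Wilson score interval for each reported score. The counts are assumptions derived from the percentages above: 40/50 follows exactly from 80.0%, while 153/199 is inferred from 76.9% by rounding. The helper name `wilson_interval` is illustrative and not part of db0.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (low, high) for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Counts are assumptions: 40/50 matches 80.0% exactly, 153/199 is inferred from 76.9%.
for name, correct, total in [("LongMemEval", 40, 50), ("LoCoMo", 153, 199)]:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} -> 95% CI {low:.1%} to {high:.1%}")
```

For the 50-question LongMemEval subset the interval spans roughly 67% to 89%, which is wide enough to overlap several of the published scores. The LoCoMo interval additionally assumes the 199 queries behave like independent draws, which is optimistic given that they all come from a single sample.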

## What's needed

1. **Full LongMemEval run** — all 500 questions, multiple profiles (conversational, high-recall, knowledge-base), with Gemini and OpenAI embeddings
2. **Full LoCoMo run** — all 10 samples (~1986 queries), multiple ingestion modes
3. **Category breakdown** — especially temporal reasoning, knowledge-update, and multi-session categories where architecture differences matter most
4. **Profile comparison** — demonstrate that the right profile matters: run the same benchmark with conversational vs. high-recall vs. knowledge-base profiles
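
Items 3 and 4 both come down to grouping per-question results by profile and category. A minimal sketch of that aggregation follows, assuming each result is a record with `profile`, `category`, and `correct` fields. That record shape, the function name `accuracy_table`, and the sample rows are hypothetical; they are not db0's actual output format or real benchmark data.

```python
from collections import defaultdict


def accuracy_table(results):
    """Map (profile, category) -> (correct, total, accuracy)."""
    counts = defaultdict(lambda: [0, 0])  # [correct, total]
    for r in results:
        key = (r["profile"], r["category"])
        if r["correct"]:
            counts[key][0] += 1
        counts[key][1] += 1
    return {key: (c, n, c / n) for key, (c, n) in counts.items()}


# Made-up rows for illustration only; these are not benchmark results.
sample = [
    {"profile": "conversational", "category": "temporal-reasoning", "correct": True},
    {"profile": "conversational", "category": "temporal-reasoning", "correct": False},
    {"profile": "high-recall", "category": "temporal-reasoning", "correct": True},
    {"profile": "high-recall", "category": "knowledge-update", "correct": True},
]

for (profile, category), (c, n, acc) in sorted(accuracy_table(sample).items()):
    print(f"{profile:15s} {category:20s} {c}/{n} = {acc:.1%}")
```

Keeping the raw counts next to each accuracy makes it easy to see when a category has too few questions for its percentage to be meaningful.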

## Expected outcome

- Validated scores that can be compared directly with published results from Hindsight (91.4%), Zep (71.2%), Mem0 (49-68.5%)
- Data showing profile impact on different query categories
- Identification of specific weak spots (likely temporal reasoning) to guide #10, #11, #12

## Practical notes

- Full LongMemEval with Gemini embeddings + judge takes ~4-8 hours
- Full LoCoMo with 10 samples takes ~2-3 hours
- Both require GEMINI_API_KEY or OPENAI_API_KEY
- Results should be updated in packages/benchmark/README.md
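
Since a full run takes hours, it may be worth checking credentials before starting. The sketch below is a hypothetical pre-flight check: the environment variable names come from the notes above, but the provider labels and the function `available_providers` are assumptions, not part of the existing benchmark package.

```python
import os

# Env var names are from the issue; the provider labels are illustrative.
KEY_TO_PROVIDER = {"GEMINI_API_KEY": "gemini", "OPENAI_API_KEY": "openai"}


def available_providers(env=None):
    """Return providers whose API key is set and non-empty; raise if none are."""
    env = os.environ if env is None else env
    found = [
        provider
        for key, provider in KEY_TO_PROVIDER.items()
        if (env.get(key) or "").strip()
    ]
    if not found:
        raise RuntimeError(
            "Set GEMINI_API_KEY or OPENAI_API_KEY before starting a full run"
        )
    return found


# Example with an explicit mapping; pass nothing to read the real environment.
print(available_providers({"GEMINI_API_KEY": "example-key"}))
```

Failing fast here avoids discovering a missing key partway through a multi-hour run.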
