Context
db0's current benchmark scores are based on limited samples:
- LongMemEval: 80.0% on 50 questions (of 500 total)
- LoCoMo: 76.9% on 199 queries from 1 sample (of 10 samples, ~1986 queries total)
These results are promising, but they are not directly comparable to published scores from other systems, which are reported on the full datasets.
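To make the sample-size caveat concrete, here is a minimal sketch that puts a 95% Wilson score interval around each reported score. The correct-answer counts (40 of 50 and 153 of 199) are inferred from the percentages above, not read from run logs, and `wilson_interval` is an illustrative helper rather than part of db0.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts inferred from the reported percentages (assumption, not from run logs).
samples = {"LongMemEval": (40, 50), "LoCoMo": (153, 199)}

for name, (correct, total) in samples.items():
    lo, hi = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

With only 50 questions, the LongMemEval interval is on the order of twenty percentage points wide, which is why a full run is needed before comparing against published numbers.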
What's needed
- Full LongMemEval run: all 500 questions, multiple profiles (conversational, high-recall, knowledge-base), with both Gemini and OpenAI embeddings
- Full LoCoMo run: all 10 samples (~1986 queries), multiple ingestion modes
- Category breakdown: especially the temporal reasoning, knowledge-update, and multi-session categories, where architecture differences matter most
- Profile comparison: run the same benchmark with the conversational, high-recall, and knowledge-base profiles to show how much the choice of profile affects the score
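The LongMemEval part of this list amounts to a small run matrix. A rough sketch of enumerating it follows; the profile names come from the list above, while the provider labels and the dictionary layout are hypothetical and do not reflect db0's actual configuration format. LoCoMo is left out because the ingestion modes are not named here.

```python
from itertools import product

# Profile names are taken from the issue text; the provider labels and the
# dict layout are hypothetical, not db0's real configuration format.
PROFILES = ["conversational", "high-recall", "knowledge-base"]
EMBEDDINGS = ["gemini", "openai"]


def longmemeval_runs() -> list[dict[str, str]]:
    """Enumerate every profile/embedding combination for a full LongMemEval run."""
    return [
        {"benchmark": "LongMemEval", "profile": profile, "embeddings": provider}
        for profile, provider in product(PROFILES, EMBEDDINGS)
    ]


for run in longmemeval_runs():
    print(run)
```

Three profiles and two embedding providers give six full runs, so the total wall-clock time is several multiples of a single run.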
Expected outcome
Practical notes
- A full LongMemEval run with Gemini embeddings and judge takes ~4-8 hours
- A full LoCoMo run over all 10 samples takes ~2-3 hours
- Both runs require GEMINI_API_KEY or OPENAI_API_KEY to be set
- Results should be updated in packages/benchmark/README.md
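Because these runs take hours, it may be worth checking for an API key before starting. A minimal pre-flight sketch, assuming the keys are supplied as environment variables; `available_api_keys` and `require_api_key` are hypothetical helpers, not part of the benchmark package.

```python
import os
from collections.abc import Mapping

ACCEPTED_KEYS = ("GEMINI_API_KEY", "OPENAI_API_KEY")


def available_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of accepted key variables that are set and non-empty."""
    return [name for name in ACCEPTED_KEYS if env.get(name, "").strip()]


def require_api_key(env: Mapping[str, str] = os.environ) -> list[str]:
    """Raise before a long run starts if neither key is present."""
    found = available_api_keys(env)
    if not found:
        raise RuntimeError("Set GEMINI_API_KEY or OPENAI_API_KEY before running")
    return found


print("Keys found:", available_api_keys() or "none")
```

Calling `require_api_key()` at the top of a run script fails fast instead of partway through ingestion.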