Run full-scale LongMemEval and LoCoMo benchmarks #13

@lightcone0

Context

db0's current benchmark scores are based on limited samples:

  • LongMemEval: 80.0% on 50 questions (of 500 total)
  • LoCoMo: 76.9% on 199 queries from 1 sample (of 10 samples, ~1986 queries total)

These results are promising, but they are not directly comparable to published scores from other systems, which are measured on the full datasets.
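To make the sampling concern concrete, here is a minimal sketch that puts a 95% Wilson score interval around each reported accuracy. The correct-answer counts (40 of 50, 153 of 199) are inferred from the percentages above and are an assumption, and `wilson_interval` is a hypothetical helper, not part of db0.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Counts are inferred from the reported percentages (assumption):
# 80.0% of 50 -> 40 correct; 76.9% of 199 -> 153 correct.
for name, correct, total in [("LongMemEval", 40, 50), ("LoCoMo", 153, 199)]:
    lo, hi = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

For the 50-question LongMemEval subset the interval spans roughly 67% to 89%, which is wide enough to overlap most published scores. The LoCoMo interval is likely optimistic, since all 199 queries come from a single sample and are not independent draws from the full dataset.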

What's needed

  1. Full LongMemEval run — all 500 questions, multiple profiles (conversational, high-recall, knowledge-base), with Gemini and OpenAI embeddings
  2. Full LoCoMo run — all 10 samples (~1986 queries), multiple ingestion modes
  3. Category breakdown — especially temporal reasoning, knowledge-update, and multi-session categories where architecture differences matter most
  4. Profile comparison — demonstrate that the right profile matters: run the same benchmark with conversational vs. high-recall vs. knowledge-base profiles
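The LongMemEval part of this list implies a small run matrix. The sketch below enumerates it, assuming every profile is paired with every embedding provider. The dictionary keys and the lowercase provider labels are illustrative only and do not reflect db0's actual configuration format. LoCoMo ingestion modes are left out because the issue does not name them.

```python
from itertools import product

# Profile and provider names are taken from the list above.
PROFILES = ["conversational", "high-recall", "knowledge-base"]
EMBEDDINGS = ["gemini", "openai"]


def longmemeval_runs() -> list[dict]:
    """One entry per (profile, embedding provider) pair on all 500 questions."""
    return [
        {"benchmark": "LongMemEval", "profile": p, "embeddings": e, "questions": 500}
        for p, e in product(PROFILES, EMBEDDINGS)
    ]


runs = longmemeval_runs()
for run in runs:
    print(run)

# Rough sequential budget, ASSUMING each configuration costs about as much
# as the ~4-8 hour estimate quoted for a Gemini run in this issue.
low_hours, high_hours = len(runs) * 4, len(runs) * 8
print(f"{len(runs)} runs, roughly {low_hours}-{high_hours} hours if run back to back")
```

Under those assumptions the full LongMemEval matrix is six runs, so it may be worth deciding up front which combinations are needed for the comparison.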

Expected outcome

Practical notes

  • Full LongMemEval with Gemini embeddings + judge takes ~4-8 hours
  • Full LoCoMo with 10 samples takes ~2-3 hours
  • Both require GEMINI_API_KEY or OPENAI_API_KEY
  • Results should be updated in packages/benchmark/README.md
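Because these runs take hours, a preflight check for the API keys can save a wasted start. This is a minimal sketch: `available_keys` and `preflight` are hypothetical helpers, not part of the benchmark package, and only the two environment variable names come from this issue.

```python
import os

# At least one of these must be set, per the notes above.
REQUIRED_ANY = ("GEMINI_API_KEY", "OPENAI_API_KEY")


def available_keys(env=None) -> list[str]:
    """Return the names of the required keys that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ANY if env.get(name)]


def preflight(env=None) -> list[str]:
    """Exit early with a clear message if no usable key is configured."""
    keys = available_keys(env)
    if not keys:
        raise SystemExit("Set GEMINI_API_KEY or OPENAI_API_KEY before starting a run")
    return keys


# Example with an explicit mapping; call preflight() with no argument
# to check the real process environment instead.
print(preflight({"GEMINI_API_KEY": "example-value"}))
```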
