
benchmark: flat Recall@k curve suggests scoring granularity issue #47

@salishforge

Description


Observation

The LongMemEval benchmark reports R@1 = R@3 = R@5 = R@10 = 88.0%: Recall@k is identical at every k. A perfectly flat curve suggests the scoring methodology lacks granularity.

Root Cause

MemForge consolidation batches up to 50 hot-tier rows into one warm-tier row. Since each LongMemEval question has ~50 sessions on average, most questions consolidate into a single warm-tier row containing ALL sessions. The top-1 result already contains all session IDs, so R@1 = R@10.
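A minimal sketch of the failure mode (hypothetical data shapes and helper names, not the actual MemForge benchmark code): when consolidation packs all ~50 sessions of a question into one warm-tier row, the top-1 row already covers the full gold set, so Recall@k is the same for every k.

```python
def recall_at_k(ranked_rows, relevant_sessions, k):
    """Session-level Recall@k: fraction of relevant session IDs
    covered by the union of the top-k rows' session sets."""
    covered = set()
    for row in ranked_rows[:k]:
        covered |= row["session_ids"]
    return len(covered & relevant_sessions) / len(relevant_sessions)

# With batch_size=50, consolidation packs ~50 sessions into one warm
# row, so the top-1 result already contains every session ID.
consolidated = [{"session_ids": {f"s{i}" for i in range(50)}}]
relevant = {"s7", "s23"}
print([recall_at_k(consolidated, relevant, k) for k in (1, 3, 5, 10)])
# → [1.0, 1.0, 1.0, 1.0]  (flat by construction)
```

Because the metric is a union over the top-k rows, any single row containing the whole gold set makes k irrelevant.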

Impact

The current benchmark measures whether MemForge can find the right content at all, not whether it can rank relevant content above irrelevant content. The 88% score reflects FTS match rate, not ranking quality.

Proposed Fix

  1. Per-session granularity: Consolidate with batch_size=1 (one warm row per session) so Recall@k reflects actual ranking quality
  2. Alternative scoring: Use turn-level scoring with the has_answer field instead of session-level
  3. Add NDCG metric: Measures ranking quality within the result set
  4. Support configurable CONSOLIDATION_BATCH_SIZE via the benchmark config
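For fix 3, a binary-relevance NDCG@k could be sketched as follows (hypothetical function and field names; actual integration into the benchmark harness would differ):

```python
import math

def ndcg_at_k(ranked_rows, relevant_sessions, k):
    """Binary-relevance NDCG@k: a row counts as relevant if it
    contains at least one session from the gold set."""
    gains = [1.0 if row["session_ids"] & relevant_sessions else 0.0
             for row in ranked_rows[:k]]
    # Discounted cumulative gain: later ranks contribute less.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: all relevant rows at the top of the ranking.
    ideal_hits = min(len(relevant_sessions), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# With per-session rows (batch_size=1), rank position now matters:
rows = [{"session_ids": {"s1"}}, {"session_ids": {"s7"}},
        {"session_ids": {"s3"}}]
print(round(ndcg_at_k(rows, {"s7"}, 3), 3))
# → 0.631  (relevant row at rank 2, not rank 1)
```

Unlike Recall@k, NDCG penalizes the relevant row sitting at rank 2 instead of rank 1, so it stays informative even when every k-cutoff covers the gold set.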


Labels: enhancement (New feature or request), testing (Test coverage and quality)
