Observation
The LongMemEval benchmark shows identical recall at every cutoff: R@1 = R@3 = R@5 = R@10 = 88.0%. A curve this flat suggests the scoring methodology lacks granularity.
Root Cause
MemForge consolidation batches up to 50 hot-tier rows into one warm-tier row. Because each LongMemEval question spans ~50 sessions on average, most questions consolidate into a single warm-tier row containing ALL of their sessions. The top-ranked result therefore already contains every session ID, so R@1 necessarily equals R@10.
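The effect can be illustrated with a small sketch. `consolidate` and `recall_at_k` are hypothetical stand-ins, not MemForge APIs; the list order of warm rows stands in for a retrieval ranking:

```python
def consolidate(session_ids, batch_size):
    """Group hot-tier session IDs into warm-tier rows of up to batch_size each."""
    return [session_ids[i:i + batch_size]
            for i in range(0, len(session_ids), batch_size)]

def recall_at_k(ranked_rows, relevant_id, k):
    """Session-level hit: does any of the top-k rows contain the relevant session?"""
    return any(relevant_id in row for row in ranked_rows[:k])

sessions = [f"s{i}" for i in range(50)]   # ~50 sessions per question

# batch_size=50: everything lands in one warm row, so k never matters
rows = consolidate(sessions, batch_size=50)
print([recall_at_k(rows, "s42", k) for k in (1, 3, 5, 10)])
# -> [True, True, True, True]

# batch_size=1: one warm row per session; recall@k now depends on where
# the relevant row actually ranks (here s42 sits at rank 43, so all miss)
rows = consolidate(sessions, batch_size=1)
print([recall_at_k(rows, "s42", k) for k in (1, 3, 5, 10)])
# -> [False, False, False, False]
```

With batch_size=50 a single row absorbs every session ID, so any k from 1 to 10 returns the same hit, reproducing the flat 88% curve.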
Impact
The current benchmark measures "can MemForge find the right content at all" but not "can it rank relevant content higher than irrelevant content." The 88% score reflects the FTS (full-text search) match rate, not ranking quality.
Proposed Fix
- Per-session granularity: Consolidate with batch_size=1 (one warm row per session) so Recall@k reflects actual ranking quality
- Alternative scoring: Use turn-level scoring with the has_answer field instead of session-level
- Add NDCG metric: Measures ranking quality within the result set
- Support configurable CONSOLIDATION_BATCH_SIZE via the benchmark config
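The proposed NDCG metric can be sketched with the standard log2 discount. Relevance here is binary per row, and the function names are illustrative rather than existing benchmark code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (sorted-descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevant row ranked first vs. buried at rank 4: recall@5 is 1.0 for
# both orderings, but NDCG@5 distinguishes them.
print(ndcg_at_k([1, 0, 0, 0, 0], 5))   # -> 1.0
print(ndcg_at_k([0, 0, 0, 1, 0], 5))   # -> ~0.431
```

Unlike Recall@k, this rewards placing the relevant warm row near the top, which is exactly the signal the flat curve currently hides.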