Observation
The LongMemEval benchmark shows identical recall at every cutoff: R@1 = R@3 = R@5 = R@10 = 88.0%. A curve this flat suggests the scoring methodology lacks granularity.
Root Cause
MemForge consolidation batches up to 50 hot-tier rows into one warm-tier row. Because each LongMemEval question spans ~50 sessions on average, most questions consolidate into a single warm-tier row containing ALL of their sessions. The top-ranked result therefore already contains every session ID, so R@1 necessarily equals R@10.
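The effect can be illustrated with a small sketch. `consolidate` and `recall_at_k` are hypothetical stand-ins, not MemForge APIs; the list order of warm rows stands in for a retrieval ranking:

```python
def consolidate(session_ids, batch_size):
    """Group hot-tier session IDs into warm-tier rows of up to batch_size each."""
    return [session_ids[i:i + batch_size]
            for i in range(0, len(session_ids), batch_size)]

def recall_at_k(ranked_rows, relevant_id, k):
    """Session-level hit: does any of the top-k rows contain the relevant session?"""
    return any(relevant_id in row for row in ranked_rows[:k])

sessions = [f"s{i}" for i in range(50)]   # ~50 sessions per question

# batch_size=50: everything lands in one warm row, so k never matters
rows = consolidate(sessions, batch_size=50)
print([recall_at_k(rows, "s42", k) for k in (1, 3, 5, 10)])
# -> [True, True, True, True]

# batch_size=1: one warm row per session; recall@k now depends on where
# the relevant row actually ranks (here s42 sits at rank 43, so all miss)
rows = consolidate(sessions, batch_size=1)
print([recall_at_k(rows, "s42", k) for k in (1, 3, 5, 10)])
# -> [False, False, False, False]
```

With batch_size=50 a single row absorbs every session ID, so any k from 1 to 10 returns the same hit, reproducing the flat 88% curve.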
Impact
The current benchmark measures "can MemForge find the right content at all" but not "can it rank relevant content higher than irrelevant content." The 88% score reflects the FTS (full-text search) match rate, not ranking quality.
Proposed Fix
- Per-session granularity: Consolidate with batch_size=1 (one warm row per session) so Recall@k reflects actual ranking quality
- Alternative scoring: Use turn-level scoring with the has_answer field instead of session-level
- Add NDCG metric: Measures ranking quality within the result set
- Support configurable CONSOLIDATION_BATCH_SIZE via the benchmark config
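The proposed NDCG metric can be sketched with the standard log2 discount. Relevance here is binary per row, and the function names are illustrative rather than existing benchmark code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (sorted-descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevant row ranked first vs. buried at rank 4: recall@5 is 1.0 for
# both orderings, but NDCG@5 distinguishes them.
print(ndcg_at_k([1, 0, 0, 0, 0], 5))   # -> 1.0
print(ndcg_at_k([0, 0, 0, 1, 0], 5))   # -> ~0.431
```

Unlike Recall@k, this rewards placing the relevant warm row near the top, which is exactly the signal the flat curve currently hides.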