Observation
LongMemEval single-session-preference questions score 66.7% R@5, the weakest category by 19 percentage points (the next lowest is single-session-assistant at 85.7%).
Analysis Needed
- Examine the 10 failing preference questions: what patterns do they share?
- Preference questions likely use implicit or indirect language that full-text search (FTS) struggles to match
- Semantic search (with embeddings) may significantly improve this category
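
To check the implicit-language hypothesis on the failing questions, one option is to measure how many content words each question shares with its evidence text. The sketch below is only illustrative: the helper names, the stopword list, and the example sentences are all hypothetical and are not taken from LongMemEval or from this project's code.

```python
import re

# Small, hand-picked stopword list (an assumption; tune for real data).
STOPWORDS = {
    "a", "an", "the", "for", "you", "can", "i", "and", "to", "of", "in",
    "me", "my", "is", "it", "what", "t", "ve", "s",
}

def content_tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def lexical_overlap(query, passage):
    """Fraction of the query's content tokens that also occur in the passage."""
    query_tokens = content_tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & content_tokens(passage)) / len(query_tokens)

# Hypothetical examples: an implicit preference question vs. a direct one.
implicit = lexical_overlap(
    "Can you recommend a restaurant for dinner tonight?",
    "I've been vegetarian for years and I can't stand spicy food.",
)
direct = lexical_overlap(
    "What is my favorite restaurant?",
    "My favorite restaurant is Luigi's.",
)
```

If the failing questions cluster near zero overlap, as the implicit example does here, that would be consistent with keyword matching being the bottleneck rather than ranking or indexing.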
Action Items
- Run the benchmark in semantic and hybrid modes and compare R@5 against the FTS results
- Analyze failing questions for common patterns
- Consider adding preference-specific indexing or boosting
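
A minimal sketch of how the mode comparison could be scored, assuming R@5 counts a question as a hit when any of its evidence sessions appears in the top 5 results. The result-record fields (`category`, `ranked_ids`, `relevant_ids`) and function names are hypothetical, and reciprocal rank fusion is shown as one common way to build a hybrid ranking, not necessarily the one this project uses.

```python
from collections import defaultdict

def hit_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1 if any relevant id appears in the top k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def recall_by_category(results, k=5):
    """Average hit rate per question category.

    `results` is a list of dicts with hypothetical keys:
    'category', 'ranked_ids' (list), 'relevant_ids' (set).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += hit_at_k(r["ranked_ids"], r["relevant_ids"], k)
    return {cat: hits[cat] / totals[cat] for cat in totals}

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over the input rankings."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical usage: fuse an FTS ranking with a semantic ranking.
fused = rrf_fuse([["a", "b", "c"], ["c", "a", "d"]])
```

Running `recall_by_category` once per mode on the same question set gives a per-category table, which makes it easy to see whether semantic or hybrid retrieval moves the preference category specifically.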