benchmark: single-session-preference category underperforms at 66.7% #48

## Observation

LongMemEval single-session-preference questions score 66.7% R@5, making this the weakest category by 19 percentage points (the next lowest, single-session-assistant, scores 85.7%).
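
For reference, a minimal sketch of how a per-category R@5 figure like the one above could be computed. The record fields (`category`, `ranked_session_ids`, `gold_session_ids`) and the hit rule (a question counts as a hit when any gold session appears in the top k) are assumptions for illustration, not the benchmark harness's actual schema or metric definition.

```python
from collections import defaultdict


def recall_at_k_by_category(results, k=5):
    """Return {category: fraction of questions with a gold session in the top k}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        top_k = r["ranked_session_ids"][:k]
        # Assumption: one retrieved gold session is enough to count as a hit.
        if any(sid in r["gold_session_ids"] for sid in top_k):
            hits[r["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

If the 10 failing questions mentioned in this issue are all of the category's misses, 66.7% corresponds to 20 hits out of 30 questions.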

## Analysis Needed

1. Examine the 10 failing preference questions: what patterns do they share?
2. Hypothesis: preference questions use implicit or indirect language that full-text search (FTS) struggles to match.
3. Hypothesis: semantic search (with embeddings) may significantly improve this category.
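
One way to test the FTS hypothesis above is to measure how many of a question's content words actually appear in its evidence session. The sketch below is a rough diagnostic under stated assumptions: the tokenizer, the small stopword list, and the function names are all illustrative and not part of the project.

```python
import re

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "i", "my", "me", "to", "for", "of", "and", "or", "in",
    "on", "is", "are", "do", "you", "can", "what", "some", "any", "with",
    "that", "this", "it",
}


def content_tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def lexical_overlap(question, evidence):
    """Fraction of the question's content tokens that also occur in the evidence."""
    q = content_tokens(question)
    if not q:
        return 0.0
    return len(q & content_tokens(evidence)) / len(q)
```

With invented inputs, `lexical_overlap("Can you recommend a show for tonight?", "I really love slow-burn sci-fi series.")` is 0.0, since no content word is shared. Consistently low overlap across the failing questions would support the idea that keyword matching has little to work with.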

## Action Items

- Run the benchmark in semantic and hybrid modes and compare against the current results
- Analyze failing questions for common patterns
- Consider adding preference-specific indexing or boosting
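
For the hybrid comparison, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with an embedding ranking. This is an illustration only; it is not necessarily how the project's hybrid mode works, and the function name and the `k=60` default (a commonly used value) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each id scores the sum of 1 / (k + rank) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from an FTS query and a semantic query.
fts_ranking = ["s1", "s2", "s3"]
semantic_ranking = ["s3", "s1", "s4"]
fused = reciprocal_rank_fusion([fts_ranking, semantic_ranking])
```

Because RRF uses only ranks, it avoids having to normalize FTS scores against embedding similarities, which sit on different scales.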
