A reproducible benchmark for comparing memory search modes in AI agent systems.
Compares two retrieval strategies for AI agent long-term memory:
- FTS-only: SQLite FTS5 keyword matching. No ML model required.
- Hybrid: FTS + semantic vector search (e.g., Ollama bge-m3). Requires embedding model.
Tested against 303 memory files (~14K lines) accumulated over 6+ weeks of daily AI agent operation.
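The README does not specify how SoulClaw fuses the two result lists in hybrid mode. A common, model-agnostic way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF); the sketch below is illustrative, not SoulClaw's actual implementation, and the function name and `k` parameter are assumptions:

```python
def rrf_fuse(fts_ranked, vec_ranked, k=60):
    """Merge two ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant commonly used with RRF.
    """
    scores = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "a" tops the FTS list, "b" tops the vector list, "c" is mid-ranked in both.
print(rrf_fuse(["a", "c", "b"], ["b", "c", "a"]))
```

Documents that rank highly in either list float to the top, which is why hybrid mode can recover paraphrase queries that keyword matching alone misses.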
| Category | Questions | FTS Score | Hybrid Score | FTS % | Hybrid % | Delta |
|---|---|---|---|---|---|---|
| Exact | 10 | 17/20 | 17/20 | 85% | 85% | 0% |
| Paraphrase | 10 | 6/20 | 11/20 | 30% | 55% | +25% |
| Contextual | 10 | 6/20 | 11/20 | 30% | 55% | +25% |
| Total | 30 | 29/60 | 39/60 | 48% | 65% | +17% |
Full analysis: blog.clawsouls.ai/posts/fts-vs-hybrid-memory-benchmark
- Exact: Query uses terms that appear verbatim in source documents
- Paraphrase: Query uses synonyms or indirect references
- Contextual: Abstract questions requiring contextual understanding
The included questions.json contains sanitized example questions demonstrating the format. Replace them with questions specific to your own agent's memory corpus.
Each question needs:
- `id`: Unique identifier (E01, P01, C01, etc.)
- `category`: `exact`, `paraphrase`, or `contextual`
- `question`: The query to search for
- `answer`: Expected answer (for human evaluation)
- `ground_truth_files`: Which memory files contain the answer
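For illustration, one entry in that format might look like the following (the question text, answer, and file path are made-up examples, not entries from the shipped questions.json):

```json
{
  "id": "P01",
  "category": "paraphrase",
  "question": "When did we move the deploy pipeline to the new runner?",
  "answer": "After the CI migration in mid-January",
  "ground_truth_files": ["memory/2026-01-12-ci-migration.md"]
}
```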
- Write your questions — Edit `questions.json` with questions specific to your agent's memory corpus. Include ground truth file paths.
- Run the FTS benchmark:

  ```sh
  chmod +x benchmark.sh
  ./benchmark.sh
  ```

- Run hybrid search — Configure SoulClaw with Ollama, then use the memory search API.
- Score results — Human evaluation using the rubric below.
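The internals of benchmark.sh are not shown here, but the FTS side reduces to an SQLite FTS5 `MATCH` query ranked by `bm25()`. A minimal self-contained sketch (the table and column names are assumptions, and the real benchmark would index the agent's memory files rather than inline strings):

```python
import sqlite3

# Build a throwaway in-memory FTS5 index over two fake memory files.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE memory USING fts5(path, body)")
con.executemany(
    "INSERT INTO memory VALUES (?, ?)",
    [
        ("memory/deploy.md", "moved the deploy pipeline to the new runner"),
        ("memory/lunch.md", "team lunch notes"),
    ],
)

# bm25() returns lower values for better matches, so ORDER BY ascending.
rows = con.execute(
    "SELECT path FROM memory WHERE memory MATCH ? ORDER BY bm25(memory) LIMIT 5",
    ("deploy pipeline",),
).fetchall()
print(rows)
```

The top-5 limit here mirrors the rubric below, which judges whether the answer is retrievable from the top five results.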
| Score | Meaning |
|---|---|
| 0 | Irrelevant — retrieved results don't contain the answer |
| 1 | Partially relevant — related content found but answer incomplete |
| 2 | Correct — answer is directly retrievable from top-5 results |
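Once each question has a 0–2 score, the percentages in the results table are just summed points over the category maximum (2 points × number of questions). A small aggregation sketch (helper name is illustrative):

```python
from collections import defaultdict

def summarize(scores):
    """scores: list of (category, score) pairs, score in {0, 1, 2}.

    Returns {category: (points, max_points, percent)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for category, score in scores:
        totals[category][0] += score
        totals[category][1] += 2  # each question is worth at most 2 points
    return {
        cat: (pts, mx, round(100 * pts / mx))
        for cat, (pts, mx) in totals.items()
    }

# Ten exact-match questions scoring 17/20 gives 85%, as in the results table.
demo = [("exact", 2)] * 7 + [("exact", 1)] * 3
print(summarize(demo))  # {'exact': (17, 20, 85)}
```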
```
questions.json   # 30 benchmark questions with ground truth
benchmark.sh     # FTS search script
results/         # Output directory (created on run)
README.md        # This file
```
The benchmark is designed to run against any SoulClaw workspace. To adapt:
- Replace `questions.json` with questions about your agent's memory
- Set `ground_truth_files` to the files where answers live in your corpus
- Run the benchmark and score results
This is intentionally not an automated benchmark — human evaluation avoids the circular reasoning of using an LLM to judge LLM retrieval quality.
- Single evaluator: Results reflect one human's judgment
- Small sample: 30 questions (statistically limited but deeply evaluated)
- Bilingual corpus: Korean + English mixed; results may differ for monolingual corpora
- No semantic-only mode: Compared FTS vs Hybrid, not pure semantic
If you use this benchmark in research:
@misc{memory-bench-2026,
title={FTS vs Hybrid Memory Search: A Real-World Benchmark},
author={ClawSouls},
year={2026},
url={https://github.com/clawsouls/memory-bench}
}
MIT