Upgrade keyword search to BM25 via pg_textsearch #23

@salishforge

Problem

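MemForge ranks keyword search results with PostgreSQL's built-in ts_rank_cd. ts_rank_cd is a cover-density ranking: it scores a document from the frequency and proximity of matching lexemes, uses no corpus-level IDF statistics, and accounts for document length only through optional normalization flags rather than BM25's normalization against the corpus average length. As a result, short, focused memories don't rank higher than long ones with the same match count.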
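
To illustrate the length-normalization point, here is a minimal sketch of the textbook Okapi BM25 score for a single query term. It uses a common IDF variant and the widely quoted defaults k1 = 1.2 and b = 0.75; these are assumptions for illustration, not MemForge settings, and pg_textsearch's exact formula may differ. The corpus numbers in the example are made up.

```typescript
// Textbook Okapi BM25 contribution of one query term to one document.
// k1 and b are common defaults, assumed here for illustration only.
function bm25TermScore(
  tf: number,        // occurrences of the term in the document
  docLen: number,    // document length in tokens
  avgDocLen: number, // mean document length across the corpus
  docCount: number,  // number of documents in the corpus
  docFreq: number,   // number of documents containing the term
  k1 = 1.2,
  b = 0.75,
): number {
  // Corpus-level IDF: terms that appear in few documents weigh more.
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  // Length normalization: documents longer than average are penalized.
  const norm = 1 - b + b * (docLen / avgDocLen);
  return (idf * (tf * (k1 + 1))) / (tf + k1 * norm);
}

// Two hypothetical memories that each match the term twice.
const shortMemory = bm25TermScore(2, 20, 100, 1000, 50);
const longMemory = bm25TermScore(2, 400, 100, 1000, 50);
console.log(shortMemory > longMemory);
```

With the same match count, the 20-token memory outscores the 400-token one because its normalization factor is smaller.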

Proposed Solution

Adopt Timescale's pg_textsearch extension for true BM25 ranking.

What changes

  • Replace the content_tsv tsvector column and GIN index with a BM25 index on content
  • Replace ts_rank_cd(content_tsv, plainto_tsquery(...)) with the <@> BM25 operator
  • Feed the BM25 ranking into the existing RRF fusion alongside the pgvector semantic ranking
  • Remove manual tsvector/tsquery construction — BM25 indexes raw text directly
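
A rough sketch of how the keyword query text might change. The id column, the 'english' configuration, and the helper name keywordQuery are assumptions for illustration. The BM25 variant assumes pg_textsearch's to_bm25query(text, index_name) form and assumes that <@> yields negated scores, so ascending order returns the best match first; both should be checked against the installed pg_textsearch version.

```typescript
// Current approach as described in this issue: ts_rank_cd over content_tsv.
const keywordQueryBefore = `
  SELECT id, content,
         ts_rank_cd(content_tsv, plainto_tsquery('english', $1)) AS score
  FROM warm_tier
  WHERE content_tsv @@ plainto_tsquery('english', $1)
  ORDER BY score DESC
  LIMIT $2`;

// Proposed approach (assumed pg_textsearch syntax): BM25 over raw text.
// Ascending order is used on the assumption that <@> returns negated scores.
const keywordQueryAfter = `
  SELECT id, content
  FROM warm_tier
  ORDER BY content <@> to_bm25query($1, 'warm_tier_bm25_idx')
  LIMIT $2`;

// Hypothetical helper: lets both paths coexist while benchmarking.
function keywordQuery(useBm25: boolean): string {
  return useBm25 ? keywordQueryAfter : keywordQueryBefore;
}

console.log(keywordQuery(true));
```

Keeping both query strings behind one switch makes it possible to compare the two rankings on the same data before the tsvector column is dropped.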

What stays the same

  • pgvector for semantic search (unchanged)
  • RRF for hybrid fusion (unchanged)
  • pg_trgm for entity deduplication (different use case)
  • Trigram fallback for typo-tolerant search
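
For context on why the fusion step can stay unchanged: Reciprocal Rank Fusion consumes rank positions only, so swapping the keyword scorer changes the order fed in but not the fusion logic. The sketch below is a generic RRF, not MemForge's actual code; the function name rrfFuse, the constant k = 60, and the memory ids are assumptions.

```typescript
// Generic Reciprocal Rank Fusion: each list adds 1 / (k + rank) per item.
function rrfFuse(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

const keywordIds = ["m3", "m1", "m7"];  // hypothetical BM25 order
const semanticIds = ["m1", "m9", "m3"]; // hypothetical pgvector order
const fused = rrfFuse([keywordIds, semanticIds]);
console.log(fused.map(([id]) => id));
```

Items that appear near the top of both lists (here m1 and m3) end up ahead of items that appear in only one.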

Benefits for memory retrieval

  • BM25 length normalization: short, focused memories rank higher than long rambling ones
  • Tunable k1/b parameters per index for memory-specific relevance profiles
  • Block-Max WAND optimization: fast top-k without scoring every match
  • Simpler code: no tsvector column maintenance
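
The idea behind block-level skipping can be shown with a heavily simplified sketch. This is not Block-Max WAND itself and not pg_textsearch's implementation; it only demonstrates the core trick of storing a per-block maximum score and skipping any block that cannot beat the current k-th best result. The Block type and the sample scores are hypothetical.

```typescript
interface Block {
  maxScore: number; // precomputed upper bound for every score in the block
  scores: number[]; // per-document scores, computed only when needed
}

// Returns the top-k scores and how many blocks had to be fully scored.
function topKWithBlockSkipping(blocks: Block[], k: number) {
  const top: number[] = []; // kept sorted descending, length <= k
  let scoredBlocks = 0;
  for (const block of blocks) {
    const threshold = top.length === k ? top[k - 1] : -Infinity;
    // No document in this block can beat the current k-th best: skip it.
    if (block.maxScore <= threshold) continue;
    scoredBlocks++;
    for (const s of block.scores) {
      top.push(s);
      top.sort((a, b) => b - a);
      if (top.length > k) top.pop();
    }
  }
  return { top, scoredBlocks };
}

const result = topKWithBlockSkipping(
  [
    { maxScore: 9, scores: [9, 4, 7] },
    { maxScore: 3, scores: [3, 1, 2] }, // skipped: 3 cannot beat 7
    { maxScore: 8, scores: [8, 5, 6] },
  ],
  2,
);
console.log(result);
```

Only two of the three blocks are scored, yet the top two results are the same as a full scan would produce.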

Prerequisites

  • PostgreSQL 17+ (MemForge currently targets 16)
  • shared_preload_libraries configuration (Dockerfile change)
  • Linux/macOS only for pre-built binaries (no Windows builds yet)
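
A startup guard for the version prerequisite might look like the sketch below. It relies on PostgreSQL's server_version_num setting, which for version 10 and later encodes the version as major * 10000 + minor. The function name meetsMinimumMajor is hypothetical, and how the value is fetched (for example via SHOW server_version_num) is left to the caller.

```typescript
// server_version_num is major * 10000 + minor on PostgreSQL 10 and later,
// e.g. 160004 for 16.4 and 170002 for 17.2.
function meetsMinimumMajor(serverVersionNum: number, minMajor = 17): boolean {
  return Math.floor(serverVersionNum / 10000) >= minMajor;
}

// SHOW server_version_num returns text, so parse it before checking.
const reported = "160004"; // hypothetical value from a PostgreSQL 16 server
console.log(meetsMinimumMajor(Number(reported)));
```

Failing fast on an older server gives a clearer error than a failed CREATE EXTENSION later in the migration.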

Migration path

  1. Upgrade Docker image from postgres:16-alpine to postgres:17-alpine
  2. Add pg_textsearch to shared_preload_libraries
  3. CREATE EXTENSION pg_textsearch
  4. CREATE INDEX warm_tier_bm25_idx ON warm_tier USING bm25 (content)
  5. Update queryKeyword() in memory-manager.ts to use <@> operator
  6. Benchmark against ts_rank_cd on a realistic dataset while content_tsv still exists
  7. Drop the content_tsv column and GIN index (migration)
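
The SQL steps of the migration path could be captured as ordered statements, split into two phases so the destructive part can run separately once the benchmark is done. This is a sketch only: the bm25Migration name is hypothetical, and the WITH (text_config = 'english') option is an assumption about pg_textsearch's index options that should be confirmed against its documentation. Dropping the column relies on PostgreSQL automatically dropping indexes that involve it.

```typescript
// Hypothetical two-phase migration; statement syntax for pg_textsearch
// is assumed and should be verified before use.
const bm25Migration = {
  // Phase 1: additive, safe to run while ts_rank_cd is still in use.
  createBm25: [
    "CREATE EXTENSION IF NOT EXISTS pg_textsearch",
    "CREATE INDEX warm_tier_bm25_idx ON warm_tier USING bm25 (content) " +
      "WITH (text_config = 'english')",
  ],
  // Phase 2: destructive. PostgreSQL drops the GIN index on content_tsv
  // together with the column.
  dropTsvector: [
    "ALTER TABLE warm_tier DROP COLUMN IF EXISTS content_tsv",
  ],
};

for (const sql of bm25Migration.createBm25) {
  console.log(sql);
}
```

Running the phases separately keeps the old ranking path available for comparison and rollback until the BM25 results have been validated.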

Labels

enhancement, performance