feat(index): implement BM25 relevance scoring by poyrazK · Pull Request #94 · poyrazK/cloudSearch

poyrazK · 2026-05-07T16:28:00Z

Summary

Replace binary token-overlap scoring with BM25 probabilistic relevance scoring for match/term/phrase queries
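Add bm25_idf() for inverse document frequency: ln((N - df + 0.5) / (df + 0.5)). This is the classic Robertson/Spärck Jones form; Lucene's BM25Similarity uses ln(1 + (N - df + 0.5) / (df + 0.5)), which stays non-negative when a term appears in more than half of the documents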
Add compute_avg_field_length() to compute average field length for field length normalization (b=0.75)
score_match_query now uses the full BM25 formula with term frequency from postings and document frequency from PositionsReader across all segments
search() precomputes IDF map and average field length before scoring, then threads them through the scoring chain
k1=1.2 (TF saturation parameter) and b=0.75 (field length normalization) are the standard Lucene defaults
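
For readers who want to see how the pieces in the summary fit together, here is a minimal sketch of the textbook BM25 term score. The function names (`idf_classic`, `idf_lucene`, `bm25_term_score`) and their signatures are assumptions made for illustration; they are not the PR's actual `bm25_idf()` or `score_match_query` code. Whether the PR keeps the `(k1 + 1)` numerator factor is not stated, so treat that as part of the assumption.

```rust
/// BM25 constants quoted in the PR summary.
const K1: f64 = 1.2; // term-frequency saturation
const B: f64 = 0.75; // field length normalization strength

/// IDF as written in the summary: ln((N - df + 0.5) / (df + 0.5)).
/// This goes negative once a term occurs in more than half of the documents.
fn idf_classic(doc_count: u64, doc_freq: u64) -> f64 {
    let n = doc_count as f64;
    let df = doc_freq as f64;
    ((n - df + 0.5) / (df + 0.5)).ln()
}

/// Variant with 1 added inside the logarithm, which keeps the value non-negative.
fn idf_lucene(doc_count: u64, doc_freq: u64) -> f64 {
    let n = doc_count as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Contribution of a single query term to one document's score.
/// `tf` is the term frequency in the field, `field_len` the field's token count.
fn bm25_term_score(idf: f64, tf: f64, field_len: f64, avg_field_len: f64) -> f64 {
    if tf <= 0.0 || avg_field_len <= 0.0 {
        return 0.0;
    }
    // Fields longer than average are penalized, shorter ones are boosted.
    let norm = 1.0 - B + B * (field_len / avg_field_len);
    idf * (tf * (K1 + 1.0)) / (tf + K1 * norm)
}
```

With these definitions, a term that occurs once in a field of exactly average length scores its IDF unchanged, and the score can never exceed `idf * (K1 + 1.0)` no matter how often the term repeats.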

Test plan

cargo test --workspace — 333 tests pass
cargo clippy --workspace --all-targets -- -D warnings — clean

Replace binary token-overlap scoring with BM25 (Best Matching 25) probabilistic scoring for match/term/phrase queries.

- Add bm25_idf() for inverse document frequency computation
- Add compute_avg_field_length() for field length normalization
- Add extract_query_terms() to collect query terms for IDF precomputation
- score_match_query now uses full BM25 formula with TF from postings and document frequency from PositionsReader across all segments
- search() precomputes IDF map and avg field length before scoring
- Update match_queries_find_tokens_in_text_fields test to accept BM25 scores (non-exact since scoring now considers term frequency and document frequency, not just token overlap)
- Fix clippy collapsible_if in compute_avg_field_length
- Add #[allow(clippy::too_many_arguments)] to scoring functions
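
The precompute-then-score flow described in the commit message can be sketched as follows. This is an illustration only, under stated assumptions: documents are plain in-memory token lists rather than segments read through PositionsReader, and `Doc`, `avg_field_length`, `idf_map`, and `score_doc` are hypothetical names, not the PR's functions or signatures.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for one indexed text field: its token list.
type Doc = Vec<&'static str>;

/// Average token count per document; 0.0 for an empty collection.
fn avg_field_length(docs: &[Doc]) -> f64 {
    if docs.is_empty() {
        return 0.0;
    }
    let total: usize = docs.iter().map(|d| d.len()).sum();
    total as f64 / docs.len() as f64
}

/// Compute IDF once per distinct query term, before any document is scored.
fn idf_map(query_terms: &[&'static str], docs: &[Doc]) -> HashMap<&'static str, f64> {
    let n = docs.len() as f64;
    let unique: HashSet<&'static str> = query_terms.iter().copied().collect();
    unique
        .into_iter()
        .map(|term| {
            // Document frequency: number of documents containing the term.
            let df = docs.iter().filter(|d| d.contains(&term)).count() as f64;
            (term, ((n - df + 0.5) / (df + 0.5)).ln())
        })
        .collect()
}

/// Sum the BM25 contributions of all query terms for one document.
fn score_doc(
    doc: &Doc,
    query_terms: &[&'static str],
    idf: &HashMap<&'static str, f64>,
    avg_len: f64,
) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    if avg_len <= 0.0 {
        return 0.0;
    }
    let norm = 1.0 - B + B * (doc.len() as f64 / avg_len);
    query_terms
        .iter()
        .map(|term| {
            let tf = doc.iter().filter(|t| *t == term).count() as f64;
            let weight = idf.get(term).copied().unwrap_or(0.0);
            weight * tf * (K1 + 1.0) / (tf + K1 * norm)
        })
        .sum()
}
```

Because the resulting scores depend on term frequency, document frequency, and field length, a test written against this kind of scorer would more naturally assert ordering and positivity (a document repeating the term ranks above one mentioning it once) than exact values, which matches the reason given for relaxing match_queries_find_tokens_in_text_fields.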

coderabbitai · 2026-05-07T16:28:07Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 17 seconds before requesting another review.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c6a52a91-a64b-445e-a384-6f85514b81be

📥 Commits

Reviewing files that changed from the base of the PR and between 189579c and 58ad9f4.

📒 Files selected for processing (1)

rust/crates/cloudsearch-index/src/lib.rs


poyrazK wants to merge 1 commit into main from release/bm25-scoring
