feat(index): implement BM25 relevance scoring by poyrazK · Pull Request #95 · poyrazK/cloudSearch

poyrazK · 2026-05-07T16:33:49Z

Summary

Replace binary token-overlap scoring with BM25 probabilistic relevance scoring for match/term/phrase queries
Add bm25_idf() for inverse document frequency using the standard Lucene formula: ln((N - df + 0.5) / (df + 0.5))
Add compute_avg_field_length() to compute average field length for field length normalization (b=0.75)
score_match_query now uses the full BM25 formula with term frequency from postings and document frequency from PositionsReader across all segments
search() precomputes IDF map and average field length before scoring, then threads them through the scoring chain
k1=1.2 (TF saturation parameter) and b=0.75 (field length normalization) are the standard Lucene defaults

Test plan

cargo test --workspace — 333 tests pass
cargo clippy --workspace --all-targets -- -D warnings — clean

Summary by CodeRabbit

Improvements
- Upgraded search result ranking to use an advanced scoring algorithm that delivers more relevant and accurate results. The new mechanism better prioritizes matching documents based on query term frequency, overall document importance, and field-level characteristics, resulting in higher-quality search results.
Tests
- Updated test expectations to reflect the improved search scoring behavior.

Replace binary token-overlap scoring with BM25 (Best Matching 25) probabilistic scoring for match/term/phrase queries. - Add bm25_idf() for inverse document frequency computation - Add compute_avg_field_length() for field length normalization - Add extract_query_terms() to collect query terms for IDF precomputation - score_match_query now uses full BM25 formula with TF from postings and document frequency from PositionsReader across all segments - search() precomputes IDF map and avg field length before scoring - Update match_queries_find_tokens_in_text_fields test to accept BM25 scores (non-exact since scoring now considers term frequency and document frequency, not just token overlap) - Fix clippy collapsible_if in compute_avg_field_length - Add #[allow(clippy::too_many_arguments)] to scoring functions

coderabbitai · 2026-05-07T16:34:11Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 7 minutes and 39 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f8db4da-9bc6-4cc9-9abf-fdc04bf35a49

📥 Commits

Reviewing files that changed from the base of the PR and between 0672d7e and a9df6af.

📒 Files selected for processing (1)

rust/crates/cloudsearch-index/src/lib.rs

📝 Walkthrough

Walkthrough

This change enhances search ranking with BM25-style term weighting. The search entry point now precomputes BM25 parameters (IDF per term, average field length) and supplies them to scoring functions, which apply normalized term frequency calculations for improved relevance ranking over the simpler fixed-score approach.

Changes

BM25 Search Ranking Enhancement

Layer / File(s)	Summary
BM25 Helper Functions `rust/crates/cloudsearch-index/src/lib.rs`	Added `bm25_idf(df, n_docs)` formula, `extract_query_terms(query, target_field)` for term enumeration, `compute_avg_field_length(documents, field)` for field normalization, and extended `score_query` signature to accept BM25 inputs.
Search Entry Point Orchestration `rust/crates/cloudsearch-index/src/lib.rs`	`IndexHandle::search` precomputes BM25 parameters: extracts query terms, aggregates document frequency across `positions_readers` to build IDF map, computes average field length, and passes all BM25 inputs into scoring.
Query Scoring Dispatch `rust/crates/cloudsearch-index/src/lib.rs`	`score_query` routes `Bool` queries to `score_bool_query` and `Match` queries to `score_match_query`, updating function signatures to accept BM25 parameters.
Match Query BM25 Scoring `rust/crates/cloudsearch-index/src/lib.rs`	Replaced legacy match scoring with BM25 implementation: computes per-token TF from posting lists (with field-counting fallback), applies BM25 formula using precomputed IDF and field-length normalization with `k1` and `b` parameters, returns averaged score.
Boolean Query Integration `rust/crates/cloudsearch-index/src/lib.rs`	Updated `score_bool_query` to propagate BM25 parameters through `score_query` calls for all clause types (`must`, `filter`, `must_not`, `should`) while preserving boolean matching logic.
Test Adjustments `rust/crates/cloudsearch-index/src/lib.rs`	Modified `match_queries_find_tokens_in_text_fields` test: removed fixed `score == Some(1.0)` assertions, asserted stable doc ID ordering and non-negative scores, updated expectations for BM25 field-length normalization effects.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 With terms now weighted by their rarity,
And field lengths tuned with great clarity,
BM25 scoring brings ranking flair—
Each search result finds its rightful lair! 🔍✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(index): implement BM25 relevance scoring' directly and clearly summarizes the main change: implementing BM25 relevance scoring in the search index, which is the primary objective of this PR.
Docstring Coverage	✅ Passed	Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/bm25-scoring

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 805-806: The BM25 normalization currently computes avg_field_len
using a hardcoded "content" which misnormalizes scores for other fields; in
score_match_query replace the literal "content" with the actual field from the
query (query.field) when calling compute_avg_field_length (and keep the existing
.max(1.0) fallback), so avg_field_len is computed per the field being scored and
normalization is correct; locate compute_avg_field_length and score_match_query
references to update the argument to use query.field (with a safe fallback if
query.field can be empty).
- Around line 794-804: The IDF precompute loop uses extract_query_terms(query,
"") which strips field names and thus fails to extract terms for fielded
Match/Phrase queries, leaving idf_map empty; change the extraction to preserve
the query's field context so term lookup against self.positions_readers works:
call extract_query_terms with the actual field (or a variant that returns
fielded terms) for the incoming Query used by BM25, then compute total_df by
iterating readers and summing pl.docs.len(), compute idf via bm25_idf and insert
into idf_map as before (ensure the functions extract_query_terms, idf_map,
bm25_idf, and positions_readers are used consistently so Match/Phrase terms are
included).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5171ac10-b9ef-484c-9ce3-0775dcfc7673

📥 Commits

Reviewing files that changed from the base of the PR and between 189579c and 0672d7e.

📒 Files selected for processing (1)

rust/crates/cloudsearch-index/src/lib.rs

- Add extract_target_field() helper to get field name from query - Use actual query field (not hardcoded "content") for avg_field_len - Deduplicate query_terms via BTreeSet before IDF loop

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread rust/crates/cloudsearch-index/src/lib.rs Outdated

Comment thread rust/crates/cloudsearch-index/src/lib.rs Outdated

fix(index): compute BM25 avg_field_len for actual query field

a9df6af

- Add extract_target_field() helper to get field name from query - Use actual query field (not hardcoded "content") for avg_field_len - Deduplicate query_terms via BTreeSet before IDF loop

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(index): implement BM25 relevance scoring#95

feat(index): implement BM25 relevance scoring#95
poyrazK wants to merge 2 commits intomainfrom
feature/bm25-scoring

poyrazK commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poyrazK commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading