Skip to content

fix: resolve search timeout by using HNSW index#54

Open
rajivsinclair wants to merge 2 commits intomainfrom
fix/search-timeout-vector-index
Open

fix: resolve search timeout by using HNSW index#54
rajivsinclair wants to merge 2 commits intomainfrom
fix/search-timeout-vector-index

Conversation

@rajivsinclair
Copy link
Copy Markdown
Contributor

@rajivsinclair rajivsinclair commented Jan 15, 2026

Summary

Fixed two functions that were causing "statement timeout" errors:

  1. search_related_snippets - Optimized to use two-stage vector search with HNSW index
  2. get_snippets - Optimized full-text search and count query

Problem

The production database was experiencing frequent statement timeouts:

ERROR: canceling statement due to statement timeout

Root Causes

1. search_related_snippets (121K+ embeddings)

  • Computed 1 - (se.embedding <=> source_embedding) > match_threshold in WHERE clause
  • Prevented use of any index, forcing sequential scan of 3072-dimensional vectors

2. get_snippets (144K+ snippets)

  • Used COUNT(*) OVER() window function which computed count for every row
  • Full-text search used concatenated fields which could not use PGroonga indexes
  • Redundant like_summary CTE aggregated user_like_snippets when snippets already has like_count

Solution

search_related_snippets - Two-Stage Search

  1. Stage 1: Use sub_vector(embedding, 512) with HNSW index to quickly find candidates
  2. Stage 2: Re-rank candidates using full 3072-dim embedding similarity

get_snippets - Query Optimization

  1. Split COUNT(*) OVER() into separate count query - Avoids window function overhead
  2. Changed full-text search to use individual columns - Allows PGroonga index use
  3. Removed redundant like_summary CTE - Uses existing like_count column

Performance Improvement

Function Before After
search_related_snippets >8s (timeout) ~350ms
get_snippets >2min (timeout) <1s

Changes

  • supabase/database/sql/search_related_snippets.sql - Two-stage HNSW search
  • supabase/database/sql/fix_search_timeout.sql - Migration for search_related_snippets
  • supabase/database/sql/get_snippets_function.sql - Optimized count and full-text search

Migrations Applied

Both migrations have been applied to production:

  • fix_search_timeout - search_related_snippets optimization
  • fix_get_snippets_timeout - get_snippets optimization

Test plan

  • Verified search_related_snippets uses HNSW index via EXPLAIN ANALYZE
  • Verified get_snippets returns correct results with pagination
  • Confirmed both functions execute without timeout in production

Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Prevented search timeouts with an optimized two-stage embedding search that narrows candidates then re-ranks for relevance.
    • Fixed snippet listing pagination and total counts; ensured consistent default ordering.
  • Improvements

    • Faster, index-friendly text filtering across fields for more responsive searches.
    • More accurate language-specific summaries and corrected like-count exposure in results.

✏️ Tip: You can customize this high-level summary in your review settings.

…ppets

The search_related_snippets function was causing statement timeouts by doing
sequential scans across 121K+ embeddings. This rewrites the function to use
a two-stage search pattern:

1. Stage 1: Fast approximate search using sub_vector(512) with existing HNSW index
2. Stage 2: Re-rank candidates using full 3072-dim embedding similarity

This change reduces query time from timeout (>8s) to ~350ms by leveraging
the existing snippet_embeddings_sub_vector_idx index.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 15, 2026 18:19
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 15, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Replaces the single-pass related-snippet search with a two-stage embedding retrieval (512-dim sub-vector coarse filter then 3072-dim full re-rank), updates function signature (adds candidate_multiplier, renames snippet_idp_snippet_id), and adjusts snippets pagination/search query logic.

Changes

Cohort / File(s) Summary
Embedding search (two-stage)
supabase/database/sql/fix_search_timeout.sql, supabase/database/sql/search_related_snippets.sql
Replaces previous single-pass embedding lookup with a two-stage flow: Stage 1 coarse candidate retrieval using 512-dim sub-vectors (limit = match_count * candidate_multiplier), Stage 2 re-rank using full 3072-dim embeddings, filter by match_threshold, return top match_count. Signature updated (rename to p_snippet_id, add candidate_multiplier default 8). Result JSON now uses precomputed fs.similarity.
Snippets listing / pagination
supabase/database/sql/get_snippets_function.sql
Reworked pagination/total-count approach (separate COUNT then aggregated results), replaced concatenated full-text approach with per-field searches to better leverage indexes, adjusted like/label aggregations and ordering logic, removed previous num_of_snippets aggregation pattern and updated result assembly.

Sequence Diagram

sequenceDiagram
    actor Client
    participant Function as search_related_snippets()
    participant Embeddings as Snippet Embeddings Store
    participant Stage1 as Stage 1: Sub-vector Filter
    participant Stage2 as Stage 2: Full Embedding Re-rank
    participant Result as Result Assembly

    Client->>Function: call(p_snippet_id, p_language, ...)
    Function->>Embeddings: SELECT embedding, sub_embedding FOR p_snippet_id
    Embeddings-->>Function: return source_embedding + source_sub_embedding
    Function->>Stage1: query candidates using source_sub_embedding (limit = match_count * candidate_multiplier)
    Stage1-->>Function: return candidate_ids + sub_similarities
    Function->>Stage2: compute full similarities using candidate full embeddings (filter by match_threshold, limit = match_count)
    Stage2-->>Function: return ranked candidates with fs.similarity
    Function->>Result: join metadata, aggregate labels, build JSON entries
    Result-->>Function: jsonb_agg results (or [] if none)
    Function-->>Client: return JSON results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • nhphong

Poem

🐰 I hopped through vectors, short then long,
First nibble quick, then sing the full song.
Candidates picked, then ranked with delight,
JSON packed snug, returned by night.
A little rabbit cheers the search made right. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main fix: resolving search timeouts using an HNSW index, which aligns with the primary objective of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/search-timeout-vector-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @rajivsinclair, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical performance bottleneck in the search_related_snippets function, which was causing frequent database statement timeouts. By refactoring the search logic to utilize an efficient two-stage approach with an HNSW index, the change dramatically improves query response times and overall system stability for snippet similarity searches.

Highlights

  • Performance Fix: Resolved critical "statement timeout" errors in the search_related_snippets function, which previously performed inefficient sequential scans.
  • Optimized Search Strategy: Replaced the old implementation's sequential scans across 121K+ 3072-dimensional embeddings with a highly optimized two-stage search approach.
  • HNSW Index Utilization: Implemented a two-stage search leveraging an existing HNSW index on 512-dimensional sub-vectors for rapid candidate selection, followed by precise re-ranking using full 3072-dimensional embeddings.
  • Significant Speed Improvement: Reduced search execution time from over 8 seconds (causing timeouts) to approximately 350 milliseconds, dramatically improving query response times.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to c576220 in 50 seconds. Click for details.
  • Reviewed 216 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/fix_search_timeout.sql:8
  • Draft comment:
    Ensure the DROP FUNCTION statement covers all previous overloads; if earlier versions used a different signature (e.g. with candidate_multiplier) they might not be dropped.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. supabase/database/sql/fix_search_timeout.sql:25
  • Draft comment:
    Review the explicit cast to vector(512) in the sub_vector call – if sub_vector already returns vector(512), the cast might be redundant or intended for index use.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
3. supabase/database/sql/fix_search_timeout.sql:44
  • Draft comment:
    The ORDER BY using the <#> operator on the sub_vector appears correct for leveraging the HNSW index; verify that the index is indeed used in production EXPLAIN plans.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
4. supabase/database/sql/fix_search_timeout.sql:50
  • Draft comment:
    The similarity calculation using '1 - (embedding <=> source_embedding)' is clear; consider if candidates with similarity exactly equal to the threshold should be included (using >= instead of >).
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
5. supabase/database/sql/search_related_snippets.sql:1
  • Draft comment:
    Renaming the parameter to 'p_snippet_id' improves clarity and maintains consistency. This is a good refactor.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
6. supabase/database/sql/search_related_snippets.sql:16
  • Draft comment:
    The SELECT now includes extraction of the sub-vector for HNSW search. Ensure this casting logic is consistent with the index definition.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
7. supabase/database/sql/search_related_snippets.sql:29
  • Draft comment:
    The two-stage search using 'sub_similar' and 'full_similar' CTEs is well-structured. Confirm that the candidate_multiplier parameter is tuned to balance performance and match quality.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_Gn1kPicczhtbOd08

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c576220e35

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +41 to +45
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
LIMIT match_count
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply match_count after filtering Processed rows

The LIMIT match_count is applied inside full_similar before joining snippets and filtering s.status = 'Processed', so any top candidates that are unprocessed get dropped later and the function can return fewer than match_count results. Previously, the status filter happened before the limit, ensuring up to match_count processed snippets were returned. If there are unprocessed embeddings among the nearest neighbors, this change will consistently under-deliver results relative to the caller’s expectation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical performance issue by refactoring the search_related_snippets function to use a two-stage search leveraging an HNSW index. The approach is sound and the code is well-commented. My review identified a critical issue where the SQL query uses a distance operator that mismatches the index's operator class, which would prevent the index from being used and negate the performance benefits. I also found a medium-severity issue concerning the aggregation of JSON data that could lead to undesirable [null] values instead of empty arrays. Addressing these points will ensure the implementation is both correct and robust.

se.embedding
FROM snippet_embeddings se
WHERE se.snippet != p_snippet_id
ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.

Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.

        ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC

se.embedding
FROM snippet_embeddings se
WHERE se.snippet != p_snippet_id
ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.

Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.

        ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC

ELSE s.summary ->> 'english'
END AS summary,
fs.similarity,
jsonb_agg(l) as labels
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).

            COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labels

END AS summary,
1 - (se.embedding <=> source_embedding) as similarity,
fs.similarity,
jsonb_agg(l) as labels
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).

            COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labels

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical performance issue in the search_related_snippets function that was causing statement timeout errors in production. The function was performing sequential scans on 121K+ embeddings without using any index, resulting in timeouts exceeding 8 seconds.

Changes:

  • Rewrote search_related_snippets to use a two-stage search pattern leveraging the existing HNSW index
  • Added candidate_multiplier parameter to control the first-stage candidate pool size
  • Renamed first parameter from snippet_id to p_snippet_id for consistency

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
supabase/database/sql/search_related_snippets.sql Implements two-stage search: HNSW index lookup for candidates, then full embedding re-ranking
supabase/database/sql/fix_search_timeout.sql Migration file containing the same function definition with DROP statement

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +41 to +44
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.

Suggested change
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
-(embedding <#> source_embedding) as similarity
FROM sub_similar
WHERE -(embedding <#> source_embedding) > match_threshold
ORDER BY embedding <#> source_embedding ASC

Copilot uses AI. Check for mistakes.
Comment on lines +43 to +44
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold comparison and ordering should be consistent with the similarity calculation on line 41. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 1), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +53
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.

Suggested change
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
-(embedding <#> source_embedding) as similarity
FROM sub_similar
WHERE -(embedding <#> source_embedding) > match_threshold
ORDER BY similarity DESC

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +53
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold comparison and ordering should be consistent with the similarity calculation on line 50. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 3), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.

Suggested change
1 - (embedding <=> source_embedding) as similarity
FROM sub_similar
WHERE 1 - (embedding <=> source_embedding) > match_threshold
ORDER BY embedding <=> source_embedding ASC
-(embedding <#> source_embedding) as similarity
FROM sub_similar
WHERE -(embedding <#> source_embedding) > match_threshold
ORDER BY embedding <#> source_embedding ASC

Copilot uses AI. Check for mistakes.
The actual timeout issue was in get_snippets, not search_related_snippets.
This commit applies the same optimizations that were deployed to production:

1. Split COUNT(*) OVER() into separate count query
   - Window functions can be slow on large result sets
   - Separate count query allows better query planning

2. Changed full-text search from concatenated fields to individual columns
   - Old: (s.title ->> 'english') || ' ' || (s.title ->> 'spanish') &@ term
   - New: (s.title ->> 'english') &@ term OR (s.title ->> 'spanish') &@ term
   - This allows PostgreSQL to use PGroonga indexes on each column

3. Removed redundant like_summary CTE
   - Snippets table already has like_count column
   - Eliminates unnecessary aggregation over user_like_snippets

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed b5f8bef in 2 minutes and 12 seconds. Click for details.
  • Reviewed 313 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/get_snippets_function.sql:32
  • Draft comment:
    The COUNT(*) query duplicate filters the same conditions as the main query. Duplicating these filters risks inconsistent updates in future. Consider refactoring shared filtering logic.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The comment is about code that was added in this PR (the separate COUNT query is new). The observation is technically correct - there is duplication of filtering logic between the COUNT query and the main query. However, this comment is suggesting a refactor for code quality/maintainability reasons rather than pointing out a bug. The rules state that "Comments that suggest code quality refactors are good! But only if they are actionable and clear." This comment identifies the issue (duplication) and the risk (inconsistent updates), but doesn't provide a clear, actionable solution. It's somewhat vague - "Consider refactoring shared filtering logic" doesn't tell the author HOW to refactor it. The author likely made this design choice deliberately for performance reasons (as indicated by the comment "to avoid COUNT(*) OVER() overhead"). Without more context about whether there's a better way to structure this, the comment may not be actionable enough. The comment does identify a real code quality issue (duplication) that was introduced in this PR. While it's not super specific about the solution, it does clearly identify the problem and the risk. The author may find this feedback valuable even if it's not immediately actionable, as it highlights a maintenance burden they've introduced. While the comment identifies a valid concern, it's not actionable enough. The author clearly made a deliberate choice to separate the COUNT query for performance reasons. Simply pointing out that there's duplication without providing a concrete alternative approach or demonstrating that the duplication is actually problematic doesn't give the author clear guidance on what to do. This falls into the category of speculative concerns about future maintenance rather than a clear code issue that needs to be fixed. The comment should be deleted. While it identifies code duplication, it's not actionable enough - it doesn't provide a clear alternative approach or demonstrate that this is a problem that needs to be solved now. The author likely made this design choice deliberately for performance, and the comment doesn't provide enough value to warrant keeping it.
2. supabase/database/sql/get_snippets_function.sql:247
  • Draft comment:
    The removal of the like_summary join and using s.like_count directly assumes that the snippet record always holds an up‐to‐date count. Please ensure this aligns with business logic and that dynamic counts aren’t needed.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to ensure that the change aligns with business logic and that dynamic counts aren't needed. This falls under asking the author to confirm their intention or ensure behavior, which is against the rules.
3. supabase/database/sql/get_snippets_function.sql:459
  • Draft comment:
    The ORDER BY CASE expression handles specific p_order_by values, but for unexpected values it falls back to recording date. Consider explicitly handling unrecognized values to ensure predictable ordering.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
4. supabase/database/sql/get_snippets_function.sql:541
  • Draft comment:
    CEIL(total_count::FLOAT / page_size) may lead to division-by-zero if page_size is 0. Consider adding a safeguard or validating page_size to avoid runtime errors.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_ir5CVFw4wtSBiIv8

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@supabase/database/sql/get_snippets_function.sql`:
- Line 18: Assign trimmed_search_term using COALESCE to convert NULL to an empty
string (e.g., set trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so
that subsequent checks against '' (the uses of trimmed_search_term in the
full-text search block and the comparisons currently written as
trimmed_search_term = '') evaluate deterministically; update the initialization
of trimmed_search_term (and any similar trimmed var) and leave the existing
comparisons in the full-text search logic unchanged so they work when callers
pass NULL.
- Line 260: Replace the hardcoded "0 AS dislike_count" with a computed dislike
count by adding a subquery that counts rows in the user_like_snippets table
where value = -1 (similar to the aggregation used in get_snippet_function.sql);
use the existing join to user_like_snippets and reference the same snippet id
(e.g., snippet_id) in the WHERE clause so dislike_count returns COUNT(*) FILTER
(WHERE value = -1) for each snippet instead of always 0.
🧹 Nitpick comments (1)
supabase/database/sql/get_snippets_function.sql (1)

32-217: Significant DRY violation: filter logic duplicated between COUNT query and filtered_snippets CTE.

The ~180 lines of filter conditions are duplicated verbatim between the COUNT query (lines 32-217) and the filtered_snippets CTE (lines 266-444). If a filter is modified in one location but not the other, total_count will become inconsistent with the actual paginated results.

Consider consolidating by counting from the CTE:

Alternative approach using a single filter source
-- Remove the separate COUNT query (lines 32-217) and instead:
WITH
filtered_snippets AS (
    -- ... existing filter logic ...
),
count_cte AS (
    SELECT COUNT(*) AS cnt FROM filtered_snippets
),
paginated_snippets AS (
    SELECT fs.*
    FROM filtered_snippets fs
    ORDER BY ...
    LIMIT page_size
    OFFSET page * page_size
)
-- Then retrieve total_count from count_cte:
SELECT cnt INTO total_count FROM count_cte;

This ensures the count and the paginated results always use identical filters. The planner can still optimize this efficiently.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c576220 and b5f8bef.

📒 Files selected for processing (1)
  • supabase/database/sql/get_snippets_function.sql
🔇 Additional comments (2)
supabase/database/sql/get_snippets_function.sql (2)

204-217: Good optimization: per-column PGroonga search.

Switching from a concatenated multi-field text search to individual column checks allows PostgreSQL to leverage PGroonga indexes on each column. This aligns well with the PR's objective to reduce query execution time.


539-547: LGTM: Final result assembly.

The pagination calculation and JSON response structure are correctly implemented. The total_pages computation using CEIL handles edge cases properly.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

user_roles TEXT[];
user_is_admin BOOLEAN;
trimmed_search_term TEXT := TRIM(p_search_term); -- Trim the search term here
trimmed_search_term TEXT := TRIM(p_search_term);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Handle NULL case for p_search_term defensively.

TRIM(NULL) returns NULL, not an empty string. If a caller explicitly passes NULL for p_search_term, the check trimmed_search_term = '' (lines 207, 434) will evaluate to NULL (not TRUE), causing all the OR conditions in the full-text search block to be evaluated unnecessarily.

Suggested fix
-    trimmed_search_term TEXT := TRIM(p_search_term);
+    trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
trimmed_search_term TEXT := TRIM(p_search_term);
trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');
🤖 Prompt for AI Agents
In `@supabase/database/sql/get_snippets_function.sql` at line 18, Assign
trimmed_search_term using COALESCE to convert NULL to an empty string (e.g., set
trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so that subsequent
checks against '' (the uses of trimmed_search_term in the full-text search block
and the comparisons currently written as trimmed_search_term = '') evaluate
deterministically; update the initialization of trimmed_search_term (and any
similar trimmed var) and leave the existing comparisons in the full-text search
logic unchanged so they work when callers pass NULL.

uhs.snippet IS NOT NULL AS hidden,
COALESCE(lk.likes, 0) AS like_count,
COALESCE(lk.dislikes, 0) AS dislike_count
0 AS dislike_count
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for dislike references in SQL files
rg -n "dislike" --type sql -C 2

echo "---"

# Check the user_like_snippets table definition
rg -n "CREATE TABLE user_like_snippets" --type sql -A 30

Repository: PublicDataWorks/verdad

Length of output: 3470


🏁 Script executed:

#!/bin/bash
# Search for the get_snippets_function.sql file to examine full context
fd "get_snippets_function.sql" --type f

Repository: PublicDataWorks/verdad

Length of output: 113


🏁 Script executed:

#!/bin/bash
# Get the full context around line 260 and look for like_summary CTE
sed -n '200,300p' supabase/database/sql/get_snippets_function.sql | cat -n

Repository: PublicDataWorks/verdad

Length of output: 5142


🏁 Script executed:

#!/bin/bash
# Search for like_summary CTE in the file
rg -n "like_summary" supabase/database/sql/get_snippets_function.sql -B5 -A10

Repository: PublicDataWorks/verdad

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Search for dislike_count column in snippets table definition
rg -n "dislike_count" --type sql -B5 -A5 | head -100

Repository: PublicDataWorks/verdad

Length of output: 6134


🏁 Script executed:

#!/bin/bash
# Check the snippets table schema
rg -n "CREATE TABLE snippets" --type sql -A 50

Repository: PublicDataWorks/verdad

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Search for migrations or schema files that define snippets table
find . -type f -name "*.sql" | xargs grep -l "CREATE TABLE snippets" 2>/dev/null

Repository: PublicDataWorks/verdad

Length of output: 50


Fix hardcoded dislike_count to compute from user_like_snippets table.

The dislike_count is hardcoded to 0, but other functions (get_snippet_function.sql and like_snippet_function.sql) correctly compute it by counting records in user_like_snippets where value = -1. This inconsistency causes the snippets list endpoint to always return 0 dislikes, while the single snippet endpoint returns the correct count.

The user_like_snippets table is already joined in the query (line 264); add a subquery aggregation similar to get_snippet_function.sql line 74 to compute the actual dislike count.

🤖 Prompt for AI Agents
In `@supabase/database/sql/get_snippets_function.sql` at line 260, Replace the
hardcoded "0 AS dislike_count" with a computed dislike count by adding a
subquery that counts rows in the user_like_snippets table where value = -1
(similar to the aggregation used in get_snippet_function.sql); use the existing
join to user_like_snippets and reference the same snippet id (e.g., snippet_id)
in the WHERE clause so dislike_count returns COUNT(*) FILTER (WHERE value = -1)
for each snippet instead of always 0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants