fix: resolve search timeout by using HNSW index by rajivsinclair · Pull Request #54 · PublicDataWorks/verdad

rajivsinclair · 2026-01-15T18:19:30Z

Summary

Fixed two functions that were causing "statement timeout" errors:

search_related_snippets - Optimized to use two-stage vector search with HNSW index
get_snippets - Optimized full-text search and count query

Problem

The production database was experiencing frequent statement timeouts:

ERROR: canceling statement due to statement timeout

Root Causes

1. search_related_snippets (121K+ embeddings)

Computed 1 - (se.embedding <=> source_embedding) > match_threshold in WHERE clause
Prevented use of any index, forcing sequential scan of 3072-dimensional vectors

2. get_snippets (144K+ snippets)

Used COUNT(*) OVER() window function which computed count for every row
Full-text search used concatenated fields which could not use PGroonga indexes
Redundant like_summary CTE aggregated user_like_snippets when snippets already has like_count

Solution

search_related_snippets - Two-Stage Search

Stage 1: Use sub_vector(embedding, 512) with HNSW index to quickly find candidates
Stage 2: Re-rank candidates using full 3072-dim embedding similarity

get_snippets - Query Optimization

Split COUNT(*) OVER() into separate count query - Avoids window function overhead
Changed full-text search to use individual columns - Allows PGroonga index use
Removed redundant like_summary CTE - Uses existing like_count column

Performance Improvement

Function	Before	After
search_related_snippets	>8s (timeout)	~350ms
get_snippets	>2min (timeout)	<1s

Changes

supabase/database/sql/search_related_snippets.sql - Two-stage HNSW search
supabase/database/sql/fix_search_timeout.sql - Migration for search_related_snippets
supabase/database/sql/get_snippets_function.sql - Optimized count and full-text search

Migrations Applied

Both migrations have been applied to production:

fix_search_timeout - search_related_snippets optimization
fix_get_snippets_timeout - get_snippets optimization

Test plan

Verified search_related_snippets uses HNSW index via EXPLAIN ANALYZE
Verified get_snippets returns correct results with pagination
Confirmed both functions execute without timeout in production

Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Prevented search timeouts with an optimized two-stage embedding search that narrows candidates then re-ranks for relevance.
- Fixed snippet listing pagination and total counts; ensured consistent default ordering.
Improvements
- Faster, index-friendly text filtering across fields for more responsive searches.
- More accurate language-specific summaries and corrected like-count exposure in results.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…ppets The search_related_snippets function was causing statement timeouts by doing sequential scans across 121K+ embeddings. This rewrites the function to use a two-stage search pattern: 1. Stage 1: Fast approximate search using sub_vector(512) with existing HNSW index 2. Stage 2: Re-rank candidates using full 3072-dim embedding similarity This change reduces query time from timeout (>8s) to ~350ms by leveraging the existing snippet_embeddings_sub_vector_idx index. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai · 2026-01-15T18:19:44Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Replaces the single-pass related-snippet search with a two-stage embedding retrieval (512-dim sub-vector coarse filter then 3072-dim full re-rank), updates function signature (adds candidate_multiplier, renames snippet_id → p_snippet_id), and adjusts snippets pagination/search query logic.

Changes

Cohort / File(s)	Summary
Embedding search (two-stage) `supabase/database/sql/fix_search_timeout.sql`, `supabase/database/sql/search_related_snippets.sql`	Replaces previous single-pass embedding lookup with a two-stage flow: Stage 1 coarse candidate retrieval using 512-dim sub-vectors (limit = match_count * candidate_multiplier), Stage 2 re-rank using full 3072-dim embeddings, filter by `match_threshold`, return top `match_count`. Signature updated (rename to `p_snippet_id`, add `candidate_multiplier` default 8). Result JSON now uses precomputed `fs.similarity`.
Snippets listing / pagination `supabase/database/sql/get_snippets_function.sql`	Reworked pagination/total-count approach (separate COUNT then aggregated results), replaced concatenated full-text approach with per-field searches to better leverage indexes, adjusted like/label aggregations and ordering logic, removed previous num_of_snippets aggregation pattern and updated result assembly.

Sequence Diagram

sequenceDiagram
    actor Client
    participant Function as search_related_snippets()
    participant Embeddings as Snippet Embeddings Store
    participant Stage1 as Stage 1: Sub-vector Filter
    participant Stage2 as Stage 2: Full Embedding Re-rank
    participant Result as Result Assembly

    Client->>Function: call(p_snippet_id, p_language, ...)
    Function->>Embeddings: SELECT embedding, sub_embedding FOR p_snippet_id
    Embeddings-->>Function: return source_embedding + source_sub_embedding
    Function->>Stage1: query candidates using source_sub_embedding (limit = match_count * candidate_multiplier)
    Stage1-->>Function: return candidate_ids + sub_similarities
    Function->>Stage2: compute full similarities using candidate full embeddings (filter by match_threshold, limit = match_count)
    Stage2-->>Function: return ranked candidates with fs.similarity
    Function->>Result: join metadata, aggregate labels, build JSON entries
    Result-->>Function: jsonb_agg results (or [] if none)
    Function-->>Client: return JSON results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

[f] VER-267: Adaptive retrieval for searching related snippets #23 — Implements the same two-stage sub-vector coarse retrieval + full-embedding re-rank and introduces candidate_multiplier.
VER-265 - Optimize fetch related snippets #21 — Restructures search_related_snippets CTEs and staged similarity computation (closely related code-level edits).
[f] VER-263 - Replace temp table creation with CTE query #19 — Alters get_snippets pagination/aggregation approach; related to the get_snippets_function.sql changes.

Suggested reviewers

nhphong

Poem

🐰 I hopped through vectors, short then long,
First nibble quick, then sing the full song.
Candidates picked, then ranked with delight,
JSON packed snug, returned by night.
A little rabbit cheers the search made right. 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main fix: resolving search timeouts using an HNSW index, which aligns with the primary objective of the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/search-timeout-vector-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-01-15T18:19:49Z

Summary of Changes

Hello @rajivsinclair, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical performance bottleneck in the search_related_snippets function, which was causing frequent database statement timeouts. By refactoring the search logic to utilize an efficient two-stage approach with an HNSW index, the change dramatically improves query response times and overall system stability for snippet similarity searches.

Highlights

Performance Fix: Resolved critical "statement timeout" errors in the search_related_snippets function, which previously performed inefficient sequential scans.
Optimized Search Strategy: Replaced the old implementation's sequential scans across 121K+ 3072-dimensional embeddings with a highly optimized two-stage search approach.
HNSW Index Utilization: Implemented a two-stage search leveraging an existing HNSW index on 512-dimensional sub-vectors for rapid candidate selection, followed by precise re-ranking using full 3072-dimensional embeddings.
Significant Speed Improvement: Reduced search execution time from over 8 seconds (causing timeouts) to approximately 350 milliseconds, dramatically improving query response times.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to c576220 in 50 seconds. Click for details.

Reviewed 216 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 7 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. supabase/database/sql/fix_search_timeout.sql:8

Draft comment:
Ensure the DROP FUNCTION statement covers all previous overloads; if earlier versions used a different signature (e.g. with candidate_multiplier) they might not be dropped.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

2. supabase/database/sql/fix_search_timeout.sql:25

Draft comment:
Review the explicit cast to vector(512) in the sub_vector call – if sub_vector already returns vector(512), the cast might be redundant or intended for index use.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

3. supabase/database/sql/fix_search_timeout.sql:44

Draft comment:
The ORDER BY using the <#> operator on the sub_vector appears correct for leveraging the HNSW index; verify that the index is indeed used in production EXPLAIN plans.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

4. supabase/database/sql/fix_search_timeout.sql:50

Draft comment:
The similarity calculation using '1 - (embedding <=> source_embedding)' is clear; consider if candidates with similarity exactly equal to the threshold should be included (using >= instead of >).
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

5. supabase/database/sql/search_related_snippets.sql:1

Draft comment:
Renaming the parameter to 'p_snippet_id' improves clarity and maintains consistency. This is a good refactor.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

6. supabase/database/sql/search_related_snippets.sql:16

Draft comment:
The SELECT now includes extraction of the sub-vector for HNSW search. Ensure this casting logic is consistent with the index definition.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

7. supabase/database/sql/search_related_snippets.sql:29

Draft comment:
The two-stage search using 'sub_similar' and 'full_similar' CTEs is well-structured. Confirm that the candidate_multiplier parameter is tuned to balance performance and match quality.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_Gn1kPicczhtbOd08

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c576220e35

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-15T18:21:09Z

+            1 - (embedding <=> source_embedding) as similarity
+        FROM sub_similar
+        WHERE 1 - (embedding <=> source_embedding) > match_threshold
+        ORDER BY embedding <=> source_embedding ASC
+        LIMIT match_count


Apply match_count after filtering Processed rows

The LIMIT match_count is applied inside full_similar before joining snippets and filtering s.status = 'Processed', so any top candidates that are unprocessed get dropped later and the function can return fewer than match_count results. Previously, the status filter happened before the limit, ensuring up to match_count processed snippets were returned. If there are unprocessed embeddings among the nearest neighbors, this change will consistently under-deliver results relative to the caller’s expectation.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request effectively addresses a critical performance issue by refactoring the search_related_snippets function to use a two-stage search leveraging an HNSW index. The approach is sound and the code is well-commented. My review identified a critical issue where the SQL query uses a distance operator that mismatches the index's operator class, which would prevent the index from being used and negate the performance benefits. I also found a medium-severity issue concerning the aggregation of JSON data that could lead to undesirable [null] values instead of empty arrays. Addressing these points will ensure the implementation is both correct and robust.

gemini-code-assist · 2026-01-15T18:21:40Z

+            se.embedding
+        FROM snippet_embeddings se
+        WHERE se.snippet != p_snippet_id
+        ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC


There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.

Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.

ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC

gemini-code-assist · 2026-01-15T18:21:40Z

+            se.embedding
+        FROM snippet_embeddings se
+        WHERE se.snippet != p_snippet_id
+        ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC


There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.

Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.

ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC

gemini-code-assist · 2026-01-15T18:21:40Z

+                ELSE s.summary ->> 'english'
+            END AS summary,
+            fs.similarity,
+            jsonb_agg(l) as labels


The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).

COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labels

gemini-code-assist · 2026-01-15T18:21:40Z

            END AS summary,
-            1 - (se.embedding <=> source_embedding) as similarity,
+            fs.similarity,
            jsonb_agg(l) as labels


The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).

COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labels

Copilot

Pull request overview

This PR fixes a critical performance issue in the search_related_snippets function that was causing statement timeout errors in production. The function was performing sequential scans on 121K+ embeddings without using any index, resulting in timeouts exceeding 8 seconds.

Changes:

Rewrote search_related_snippets to use a two-stage search pattern leveraging the existing HNSW index
Added candidate_multiplier parameter to control the first-stage candidate pool size
Renamed first parameter from snippet_id to p_snippet_id for consistency

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
supabase/database/sql/search_related_snippets.sql	Implements two-stage search: HNSW index lookup for candidates, then full embedding re-ranking
supabase/database/sql/fix_search_timeout.sql	Migration file containing the same function definition with DROP statement

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-15T18:22:38Z

+            1 - (embedding <=> source_embedding) as similarity
+        FROM sub_similar
+        WHERE 1 - (embedding <=> source_embedding) > match_threshold
+        ORDER BY embedding <=> source_embedding ASC


The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.

Suggested change

1 - (embedding <=> source_embedding) as similarity

FROM sub_similar

WHERE 1 - (embedding <=> source_embedding) > match_threshold

ORDER BY embedding <=> source_embedding ASC

-(embedding <#> source_embedding) as similarity

FROM sub_similar

WHERE -(embedding <#> source_embedding) > match_threshold

ORDER BY embedding <#> source_embedding ASC

Copilot · 2026-01-15T18:22:38Z

+        WHERE 1 - (embedding <=> source_embedding) > match_threshold
+        ORDER BY embedding <=> source_embedding ASC


The threshold comparison and ordering should be consistent with the similarity calculation on line 41. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 1), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.

Copilot · 2026-01-15T18:22:38Z

+            1 - (embedding <=> source_embedding) as similarity
+        FROM sub_similar
+        WHERE 1 - (embedding <=> source_embedding) > match_threshold
+        ORDER BY embedding <=> source_embedding ASC


The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.

Suggested change

1 - (embedding <=> source_embedding) as similarity

FROM sub_similar

WHERE 1 - (embedding <=> source_embedding) > match_threshold

ORDER BY embedding <=> source_embedding ASC

-(embedding <#> source_embedding) as similarity

FROM sub_similar

WHERE -(embedding <#> source_embedding) > match_threshold

ORDER BY similarity DESC

Copilot · 2026-01-15T18:22:39Z

+            1 - (embedding <=> source_embedding) as similarity
+        FROM sub_similar
+        WHERE 1 - (embedding <=> source_embedding) > match_threshold
+        ORDER BY embedding <=> source_embedding ASC


The threshold comparison and ordering should be consistent with the similarity calculation on line 50. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 3), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.

Suggested change

1 - (embedding <=> source_embedding) as similarity

FROM sub_similar

WHERE 1 - (embedding <=> source_embedding) > match_threshold

ORDER BY embedding <=> source_embedding ASC

-(embedding <#> source_embedding) as similarity

FROM sub_similar

WHERE -(embedding <#> source_embedding) > match_threshold

ORDER BY embedding <#> source_embedding ASC

The actual timeout issue was in get_snippets, not search_related_snippets. This commit applies the same optimizations that were deployed to production: 1. Split COUNT(*) OVER() into separate count query - Window functions can be slow on large result sets - Separate count query allows better query planning 2. Changed full-text search from concatenated fields to individual columns - Old: (s.title ->> 'english') || ' ' || (s.title ->> 'spanish') &@ term - New: (s.title ->> 'english') &@ term OR (s.title ->> 'spanish') &@ term - This allows PostgreSQL to use PGroonga indexes on each column 3. Removed redundant like_summary CTE - Snippets table already has like_count column - Eliminates unnecessary aggregation over user_like_snippets Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ellipsis-dev

Important

Looks good to me! 👍

Reviewed b5f8bef in 2 minutes and 12 seconds. Click for details.

Reviewed 313 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 4 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. supabase/database/sql/get_snippets_function.sql:32

Draft comment:
The COUNT(*) query duplicate filters the same conditions as the main query. Duplicating these filters risks inconsistent updates in future. Consider refactoring shared filtering logic.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The comment is about code that was added in this PR (the separate COUNT query is new). The observation is technically correct - there is duplication of filtering logic between the COUNT query and the main query. However, this comment is suggesting a refactor for code quality/maintainability reasons rather than pointing out a bug. The rules state that "Comments that suggest code quality refactors are good! But only if they are actionable and clear." This comment identifies the issue (duplication) and the risk (inconsistent updates), but doesn't provide a clear, actionable solution. It's somewhat vague - "Consider refactoring shared filtering logic" doesn't tell the author HOW to refactor it. The author likely made this design choice deliberately for performance reasons (as indicated by the comment "to avoid COUNT(*) OVER() overhead"). Without more context about whether there's a better way to structure this, the comment may not be actionable enough. The comment does identify a real code quality issue (duplication) that was introduced in this PR. While it's not super specific about the solution, it does clearly identify the problem and the risk. The author may find this feedback valuable even if it's not immediately actionable, as it highlights a maintenance burden they've introduced. While the comment identifies a valid concern, it's not actionable enough. The author clearly made a deliberate choice to separate the COUNT query for performance reasons. Simply pointing out that there's duplication without providing a concrete alternative approach or demonstrating that the duplication is actually problematic doesn't give the author clear guidance on what to do. This falls into the category of speculative concerns about future maintenance rather than a clear code issue that needs to be fixed. The comment should be deleted. While it identifies code duplication, it's not actionable enough - it doesn't provide a clear alternative approach or demonstrate that this is a problem that needs to be solved now. The author likely made this design choice deliberately for performance, and the comment doesn't provide enough value to warrant keeping it.

2. supabase/database/sql/get_snippets_function.sql:247

Draft comment:
The removal of the like_summary join and using s.like_count directly assumes that the snippet record always holds an up‐to‐date count. Please ensure this aligns with business logic and that dynamic counts aren’t needed.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to ensure that the change aligns with business logic and that dynamic counts aren't needed. This falls under asking the author to confirm their intention or ensure behavior, which is against the rules.

3. supabase/database/sql/get_snippets_function.sql:459

Draft comment:
The ORDER BY CASE expression handles specific p_order_by values, but for unexpected values it falls back to recording date. Consider explicitly handling unrecognized values to ensure predictable ordering.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50% None

4. supabase/database/sql/get_snippets_function.sql:541

Draft comment:
CEIL(total_count::FLOAT / page_size) may lead to division-by-zero if page_size is 0. Consider adding a safeguard or validating page_size to avoid runtime errors.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_ir5CVFw4wtSBiIv8

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@supabase/database/sql/get_snippets_function.sql`:
- Line 18: Assign trimmed_search_term using COALESCE to convert NULL to an empty
string (e.g., set trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so
that subsequent checks against '' (the uses of trimmed_search_term in the
full-text search block and the comparisons currently written as
trimmed_search_term = '') evaluate deterministically; update the initialization
of trimmed_search_term (and any similar trimmed var) and leave the existing
comparisons in the full-text search logic unchanged so they work when callers
pass NULL.
- Line 260: Replace the hardcoded "0 AS dislike_count" with a computed dislike
count by adding a subquery that counts rows in the user_like_snippets table
where value = -1 (similar to the aggregation used in get_snippet_function.sql);
use the existing join to user_like_snippets and reference the same snippet id
(e.g., snippet_id) in the WHERE clause so dislike_count returns COUNT(*) FILTER
(WHERE value = -1) for each snippet instead of always 0.

🧹 Nitpick comments (1)

supabase/database/sql/get_snippets_function.sql (1)
32-217: Significant DRY violation: filter logic duplicated between COUNT query and filtered_snippets CTE.

The ~180 lines of filter conditions are duplicated verbatim between the COUNT query (lines 32-217) and the filtered_snippets CTE (lines 266-444). If a filter is modified in one location but not the other, total_count will become inconsistent with the actual paginated results.

Consider consolidating by counting from the CTE:
Alternative approach using a single filter source
-- Remove the separate COUNT query (lines 32-217) and instead:
WITH
filtered_snippets AS (
    -- ... existing filter logic ...
),
count_cte AS (
    SELECT COUNT(*) AS cnt FROM filtered_snippets
),
paginated_snippets AS (
    SELECT fs.*
    FROM filtered_snippets fs
    ORDER BY ...
    LIMIT page_size
    OFFSET page * page_size
)
-- Then retrieve total_count from count_cte:
SELECT cnt INTO total_count FROM count_cte;
This ensures the count and the paginated results always use identical filters. The planner can still optimize this efficiently.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c576220 and b5f8bef.

📒 Files selected for processing (1)

supabase/database/sql/get_snippets_function.sql

🔇 Additional comments (2)

supabase/database/sql/get_snippets_function.sql (2)

204-217: Good optimization: per-column PGroonga search.

Switching from a concatenated multi-field text search to individual column checks allows PostgreSQL to leverage PGroonga indexes on each column. This aligns well with the PR's objective to reduce query execution time.

539-547: LGTM: Final result assembly.

The pagination calculation and JSON response structure are correctly implemented. The total_pages computation using CEIL handles edge cases properly.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-01-15T18:56:24Z

    user_roles TEXT[];
    user_is_admin BOOLEAN;
-    trimmed_search_term TEXT := TRIM(p_search_term); -- Trim the search term here
+    trimmed_search_term TEXT := TRIM(p_search_term);


⚠️ Potential issue | 🟡 Minor

Handle NULL case for p_search_term defensively.

TRIM(NULL) returns NULL, not an empty string. If a caller explicitly passes NULL for p_search_term, the check trimmed_search_term = '' (lines 207, 434) will evaluate to NULL (not TRUE), causing all the OR conditions in the full-text search block to be evaluated unnecessarily.

Suggested fix

- trimmed_search_term TEXT := TRIM(p_search_term); + trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

trimmed_search_term TEXT := TRIM(p_search_term);

trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');

🤖 Prompt for AI Agents

In `@supabase/database/sql/get_snippets_function.sql` at line 18, Assign trimmed_search_term using COALESCE to convert NULL to an empty string (e.g., set trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so that subsequent checks against '' (the uses of trimmed_search_term in the full-text search block and the comparisons currently written as trimmed_search_term = '') evaluate deterministically; update the initialization of trimmed_search_term (and any similar trimmed var) and leave the existing comparisons in the full-text search logic unchanged so they work when callers pass NULL.

coderabbitai · 2026-01-15T18:56:24Z

            uhs.snippet IS NOT NULL AS hidden,
-            COALESCE(lk.likes, 0) AS like_count,
-            COALESCE(lk.dislikes, 0) AS dislike_count
+            0 AS dislike_count


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for dislike references in SQL files rg -n "dislike" --type sql -C 2 echo "---" # Check the user_like_snippets table definition rg -n "CREATE TABLE user_like_snippets" --type sql -A 30

Repository: PublicDataWorks/verdad

Length of output: 3470

🏁 Script executed:

#!/bin/bash # Search for the get_snippets_function.sql file to examine full context fd "get_snippets_function.sql" --type f

Repository: PublicDataWorks/verdad

Length of output: 113

🏁 Script executed:

#!/bin/bash # Get the full context around line 260 and look for like_summary CTE sed -n '200,300p' supabase/database/sql/get_snippets_function.sql | cat -n

Repository: PublicDataWorks/verdad

Length of output: 5142

🏁 Script executed:

#!/bin/bash # Search for like_summary CTE in the file rg -n "like_summary" supabase/database/sql/get_snippets_function.sql -B5 -A10

Repository: PublicDataWorks/verdad

Length of output: 48

🏁 Script executed:

#!/bin/bash # Search for dislike_count column in snippets table definition rg -n "dislike_count" --type sql -B5 -A5 | head -100

Repository: PublicDataWorks/verdad

Length of output: 6134

🏁 Script executed:

#!/bin/bash # Check the snippets table schema rg -n "CREATE TABLE snippets" --type sql -A 50

Repository: PublicDataWorks/verdad

Length of output: 48

🏁 Script executed:

#!/bin/bash # Search for migrations or schema files that define snippets table find . -type f -name "*.sql" | xargs grep -l "CREATE TABLE snippets" 2>/dev/null

Repository: PublicDataWorks/verdad

Length of output: 50

Fix hardcoded dislike_count to compute from user_like_snippets table.

The dislike_count is hardcoded to 0, but other functions (get_snippet_function.sql and like_snippet_function.sql) correctly compute it by counting records in user_like_snippets where value = -1. This inconsistency causes the snippets list endpoint to always return 0 dislikes, while the single snippet endpoint returns the correct count.

The user_like_snippets table is already joined in the query (line 264); add a subquery aggregation similar to get_snippet_function.sql line 74 to compute the actual dislike count.

🤖 Prompt for AI Agents

In `@supabase/database/sql/get_snippets_function.sql` at line 260, Replace the hardcoded "0 AS dislike_count" with a computed dislike count by adding a subquery that counts rows in the user_like_snippets table where value = -1 (similar to the aggregation used in get_snippet_function.sql); use the existing join to user_like_snippets and reference the same snippet id (e.g., snippet_id) in the WHERE clause so dislike_count returns COUNT(*) FILTER (WHERE value = -1) for each snippet instead of always 0.

Copilot AI review requested due to automatic review settings January 15, 2026 18:19

Copilot started reviewing on behalf of rajivsinclair January 15, 2026 18:20 View session

ellipsis-dev Bot reviewed Jan 15, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jan 15, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jan 15, 2026

View reviewed changes

Copilot AI reviewed Jan 15, 2026

View reviewed changes

ellipsis-dev Bot reviewed Jan 15, 2026

View reviewed changes

coderabbitai Bot reviewed Jan 15, 2026

View reviewed changes

		WHERE 1 - (embedding <=> source_embedding) > match_threshold
		ORDER BY embedding <=> source_embedding ASC

	trimmed_search_term TEXT := TRIM(p_search_term);
	trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');

Conversation

rajivsinclair commented Jan 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Causes

Solution

search_related_snippets - Two-Stage Search

get_snippets - Query Optimization

Performance Improvement

Changes

Migrations Applied

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

gemini-code-assist Bot commented Jan 15, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

rajivsinclair commented Jan 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jan 15, 2026 •

edited

Loading