fix(qdrant): add pandas dep, fix cross-repo ID collisions, fix comment indexing#2323
fix(qdrant): add pandas dep, fix cross-repo ID collisions, fix comment indexing#2323
Conversation
…t indexing 1. Added pandas to the qdrant extras in requirements.txt - the Qdrant indexing path (via _update_qdrant_with_issues) imports and uses pandas but it was missing from the dependency list. 2. Fixed cross-repo ID collisions in Qdrant point IDs by including the repo name in the UUID seed. Previously uuid5(NAMESPACE_DNS, row["id"]) would generate identical point IDs for the same issue number across different repos, causing silent overwrites when indexing multiple repos into the same collection. 3. Fixed off-by-one error in comment URL indexing. Comment IDs were constructed with 1-based indices (j+1) but retrieved as 0-based list indices, causing the wrong comment URL to be returned. Changed all three vector DB paths (pinecone, lancedb, qdrant) to use 0-based indexing (j) to match the retrieval code.
Review Summary by QodoFix Qdrant pandas dependency, cross-repo collisions, and comment indexing
WalkthroughsDescription• Fixed cross-repo ID collisions in Qdrant by including repo name in UUID seed • Corrected off-by-one error in comment indexing across all vector DB paths • Added missing pandas dependency to qdrant extras in requirements.txt Diagramflowchart LR
A["Comment Indexing<br/>j+1 to j"] --> B["0-based Index<br/>Matches Retrieval"]
C["UUID Generation<br/>row[id] only"] --> D["UUID with Repo<br/>repo_name:row[id]"]
E["Missing pandas<br/>in requirements.txt"] --> F["Added pandas<br/>to qdrant extras"]
B --> G["Correct Comment URLs"]
D --> G
F --> G
File Changes1. pr_agent/tools/pr_similar_issue.py
|
Code Review by Qodo
|
| # pinecone-datasets @ git+https://github.com/mrT23/pinecone-datasets.git@main | ||
| # lancedb==0.5.1 | ||
| # qdrant-client==1.15.1 | ||
| # pandas # required by qdrant indexing path |
There was a problem hiding this comment.
1. requirements.txt adds commented pandas 📘 Rule violation ⚙ Maintainability
The PR introduces a new commented-out dependency line for pandas, which is inactive code and violates the no-commented-out-code requirement. This can also cause runtime failures for users who enable Qdrant without actually installing pandas.
Agent Prompt
## Issue description
A new commented-out dependency line (`# pandas ...`) was added to `requirements.txt`, violating the requirement to avoid commented-out code and leaving the dependency inactive.
## Issue Context
The code path `_update_qdrant_with_issues` imports/uses `pandas`, so leaving it commented relies on manual user action and can still lead to runtime `ImportError`.
## Fix Focus Areas
- requirements.txt[42-42]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| for row in df.to_dict(orient="records"): | ||
| points.append( | ||
| PointStruct(id=uuid.uuid5(uuid.NAMESPACE_DNS, row["id"]).hex, vector=row["vector"], payload={"id": row["id"], "text": row["text"], "metadata": row["metadata"]}) | ||
| PointStruct(id=uuid.uuid5(uuid.NAMESPACE_DNS, f"{repo_name_for_index}:{row['id']}").hex, vector=row["vector"], payload={"id": row["id"], "text": row["text"], "metadata": row["metadata"]}) |
There was a problem hiding this comment.
3. Qdrant upsert creates duplicates 🐞 Bug ≡ Correctness
Changing Qdrant PointStruct.id generation to include repo_name_for_index alters point IDs for existing repos, so re-ingesting into an existing collection will add a second copy of each point instead of overwriting the old ones. This can reduce result diversity (duplicates consume top_k) because querying/parsing uses payload["id"], not the Qdrant point ID.
Agent Prompt
### Issue description
Qdrant point IDs changed (UUID seed now includes `repo_name_for_index`), so re-ingesting an already-indexed repo into an existing collection will create duplicate points rather than overwrite the existing ones.
### Issue Context
- `_update_qdrant_with_issues()` always calls `qdrant.upsert(...)` with newly generated point IDs.
- When re-indexing (e.g., `force_update_dataset`), the code does not delete existing points for that repo first.
- Querying later reads `payload['id']`, so duplicates remain queryable and can consume the top_k.
### Fix Focus Areas
- pr_agent/tools/pr_similar_issue.py[212-257]
- pr_agent/tools/pr_similar_issue.py[587-679]
### What to change
1. Before calling `_update_qdrant_with_issues(..., ingest=True)` for an already-existing collection, delete existing points for that repo using a filter on `metadata.repo == repo_name_for_index` (and optionally `level in {issue,comment}` if needed).
2. Alternatively (or additionally), set `PointStruct.id` to a stable, human-readable, repo-scoped string like `f"{repo_name_for_index}:{row['id']}"` (Qdrant supports string point IDs) to avoid needing UUIDs and make overwrite semantics explicit.
3. Add a small log/metric indicating how many points were deleted before reindexing so operators can validate migration behavior.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
…andas comment - Break PointStruct(...) call into multiple lines to stay within 120-char limit - Update pandas comment to note it is needed by both qdrant and pinecone paths Note: the ID generation change in the parent commit is intentional — existing points with old IDs (without repo prefix) will become orphans and be overwritten on the next full reindex.
|
Persistent review updated to latest commit d1b7a85 |
Summary
_update_qdrant_with_issues) imports and usespandas, but it was not listed inrequirements.txt. Addedpandasto the commented qdrant extras section so users who uncommentqdrant-clientknow to also uncommentpandas.uuid5(NAMESPACE_DNS, row["id"])whererow["id"]is justissue_123.issue-- no repo context. When indexing multiple repos into the same collection, different repos with the same issue number silently overwrite each other. Fixed by including the repo identifier in the UUID seed:uuid5(NAMESPACE_DNS, f"{repo_name_for_index}:{row['id']}").j+1) during indexing, but the retrieval path at line 383 uses the parsed number as a 0-based list index intoissue.get_comments(). This caused the wrong comment URL to be linked. Changed all three vector DB paths (pinecone, lancedb, qdrant) to use 0-based indexing (j) to match retrieval.Test plan
pandasimport succeeds when qdrant extras are installed