# Fix KB docx duplicate and add reupload prompt (#88)

sqhyz55 wants to merge 5 commits into xorbitsai:main
Conversation
Made-with: Cursor
rogercloud left a comment
Review Summary
Thanks for working on this feature to prevent duplicate document uploads! The implementation adds a nice UX improvement with the re-upload confirmation dialog, and the deterministic doc_id generation is a solid approach.
✅ What's Good
- Test coverage added for deterministic doc_id idempotency
- Proper i18n translations for both English and Chinese
- Code follows existing patterns, ruff checks pass
- For regular users, the check API and doc_id generation work correctly together
Issues Found (7 total)
Higher Priority:
- #1 (Admin edge case): Admin users see duplicate warnings for all users' files, but uploading creates a new record (inconsistent behavior)
- #2 (Silent fallback): No user notification when check API fails
- #4 (Breaking change): Old documents with random UUID doc_ids will create duplicates on re-upload
Lower Priority:
- #3 (Performance): Loading 10K records on every check could be slow for large collections
- #5 (Code quality): Dialog close logic is duplicated
- #6 (Race condition): Time-of-check-to-time-of-use gap (mitigated by backend idempotency)
- #7 (Validation): Implicit type conversion for filenames array elements
Recommendation
Request changes for issues #1, #2, and #4. The other issues can be addressed as follow-ups.
```python
for record in records:
    sp = record.get("source_path")
    if sp:
        existing_basenames.add(os.path.basename(str(sp)))
```
Issue #1: Admin edge case - inconsistent duplicate detection
The check API uses os.path.basename(source_path) to compare filenames, but generate_deterministic_doc_id() uses the full source_path. For admin users:
- Admin gets `user_filter = None`, so they see ALL users' documents in this check
- If any user has `report.docx`, admin sees a "file already exists" warning
- But admin's file is stored at `/uploads/user_{admin_id}/report.docx`
- The doc_id would be different (different path), so it creates a NEW record instead of overwriting
For regular users this works correctly (they only see their own docs), but for admins the warning and the actual behavior are inconsistent.
Recommendation: Consider either:
- Apply the user filter even for admins (admins can still see all documents, but the duplicate check only checks the admin's own files)
- Add a comment explaining this edge case behavior
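To make the mismatch concrete, here is a minimal sketch. The hashing details and paths are illustrative assumptions, not the real `generate_deterministic_doc_id`; only the basename-vs-full-path contrast matters:

```python
import hashlib
import os

def doc_id_from_path(source_path: str) -> str:
    # Hypothetical stand-in for generate_deterministic_doc_id:
    # the id is derived from the FULL source_path, not just the basename.
    return hashlib.sha256(source_path.encode("utf-8")).hexdigest()[:8]

admin_path = "/uploads/user_1/report.docx"   # admin's upload location (illustrative)
other_path = "/uploads/user_42/report.docx"  # another user's existing file

# The check API compares basenames, so the admin sees a duplicate warning...
assert os.path.basename(admin_path) == os.path.basename(other_path)
# ...but the full paths differ, so the doc_ids differ and a NEW record is created.
assert doc_id_from_path(admin_path) != doc_id_from_path(other_path)
```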
```ts
    return
  }
  await doUpload()
} catch {
```
Issue #2: Silent error handling - no user feedback
When the check API fails (network error, server error), the code silently falls back to uploading without any notification. This masks potential issues that users should know about.
```ts
} catch {
  await doUpload() // Empty catch - silently continues
}
```

Recommendation: Add at least a console warning for debugging:

```ts
} catch (error) {
  console.warn('Check API failed, proceeding with upload:', error)
  await doUpload()
}
```

```python
    if user_filter and base_filter
    else (user_filter or base_filter)
)
MAX_SEARCH_RESULTS = 10000
```
Issue #3: Performance concern for large collections
Loading up to 10,000 records on every duplicate check could be slow for collections with many documents:
```python
MAX_SEARCH_RESULTS = 10000
records = query_to_list(
    table.search().where(combined_filter).limit(MAX_SEARCH_RESULTS)
)
```

Each upload triggers this scan. For large KBs this could add noticeable latency.
Recommendation:
- Consider adding a filename index column for faster lookups
- Or add a comment acknowledging this trade-off for simplicity
We can address this after the file management refactoring is finished.
```python
from ..utils.string_utils import (
    build_lancedb_filter_expression,
    generate_doc_id_from_filename,
    generate_deterministic_doc_id,
```
Issue #4: Breaking change for existing documents
This changes from generate_doc_id_from_filename (random UUID) to generate_deterministic_doc_id (hash-based).
Impact: Existing documents registered with the old method will get a NEW doc_id when re-uploaded, potentially creating duplicates instead of updating existing records.
Example:
- Old doc_id: `report_a1b2c3d4` (random)
- New doc_id: `report_e91df8a2` (hash-based)
When the old document is re-registered, it creates a new row instead of updating.
Recommendation: Document this behavior change or consider a migration strategy for existing documents.
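The behavioral difference can be sketched as follows. Both functions are illustrative assumptions (the real `generate_deterministic_doc_id` may use a different hash or id format); the point is only that a hash-based id is idempotent while a UUID-based one is not:

```python
import hashlib
import uuid
from pathlib import Path

def deterministic_doc_id(source_path: str) -> str:
    # Hash-based sketch: same path -> same id, so a re-upload can
    # overwrite the existing record instead of duplicating it.
    digest = hashlib.sha256(source_path.encode("utf-8")).hexdigest()[:8]
    return f"{Path(source_path).stem}_{digest}"

def random_doc_id(source_path: str) -> str:
    # Old-style sketch: a fresh UUID suffix on every call, so a
    # re-registered document never matches its previous row.
    return f"{Path(source_path).stem}_{uuid.uuid4().hex[:8]}"

path = "/uploads/user_1/report.docx"
assert deterministic_doc_id(path) == deterministic_doc_id(path)  # idempotent
assert random_doc_id(path) != random_doc_id(path)                # almost surely differs
```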
```tsx
</TabsContent>
</Tabs>

{/* Re-upload confirm: file(s) already exist */}
```
Issue #5: Code duplication in dialog close handlers
The dialog close logic is duplicated in multiple places:
```ts
// Line 924-930
if (!open) {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}

// Line 946-952 (similar logic)
onClick={() => {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}}
```

Recommendation: Extract to a helper function:

```ts
const closeReuploadDialog = () => {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}
```

```ts
  }
}

const handleUpload = async () => {
```
Issue #6: Race condition between check and upload
Between checking if files exist (line 365-385) and actually uploading (line 395), another user could upload the same file. This is mitigated by the deterministic doc_id on the backend (duplicates would update same record), but worth noting.
Recommendation: Add a comment explaining this is an acceptable race condition since the backend handles idempotency correctly.
src/xagent/web/api/kb.py (Outdated)

```python
    status_code=422,
    detail="Request body must contain 'filenames' as a list of strings",
)
requested = {
```
Issue #7: Implicit type validation for filenames
The code validates that filenames is a list, but doesn't explicitly validate that each element is a string. The comprehension on line 705-707 handles this gracefully:
```python
requested = {
    str(f).strip() for f in filenames if f is not None and str(f).strip()
}
```

However, this silently converts non-string values. It would be clearer to explicitly validate and reject invalid input with a 422 error.
Recommendation: Add explicit validation:

```python
if not all(isinstance(f, str) for f in filenames):
    raise HTTPException(
        status_code=422,
        detail="All filenames must be strings",
    )
```

Address review issues xorbitsai#2, xorbitsai#5, xorbitsai#6, xorbitsai#7 for KB docx duplicate fix. Performance-related issue xorbitsai#3 will be revisited after file management refactor.
Force-pushed from f3bac10 to 332b750
Run mocked sync wrapper coroutines in the test to avoid unawaited coroutine warnings in CI, and allow root CHANGELOG.md to be committed by updating .gitignore exceptions. Made-with: Cursor
Force-pushed from 48cbbac to 74e5aa9