Skip to content

Fix KB docx duplicate and add reupload prompt#88

Open
sqhyz55 wants to merge 5 commits intoxorbitsai:mainfrom
sqhyz55:fix/kb-docx-duplicate-files
Open

Fix KB docx duplicate and add reupload prompt#88
sqhyz55 wants to merge 5 commits intoxorbitsai:mainfrom
sqhyz55:fix/kb-docx-duplicate-files

Conversation

@sqhyz55
Copy link
Collaborator

@sqhyz55 sqhyz55 commented Mar 3, 2026

No description provided.

@sqhyz55 sqhyz55 requested a review from rogercloud March 3, 2026 09:48
Copy link
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Thanks for working on this feature to prevent duplicate document uploads! The implementation adds a nice UX improvement with the re-upload confirmation dialog, and the deterministic doc_id generation is a solid approach.

✅ What's Good

  • Test coverage added for deterministic doc_id idempotency
  • Proper i18n translations for both English and Chinese
  • Code follows existing patterns, ruff checks pass
  • For regular users, the check API and doc_id generation work correctly together

Issues Found (7 total)

Higher Priority:

  1. #1 (Admin edge case): Admin users see duplicate warnings for all users' files, but uploading creates a new record (inconsistent behavior)
  2. #2 (Silent fallback): No user notification when check API fails
  3. #4 (Breaking change): Old documents with random UUID doc_ids will create duplicates on re-upload

Lower Priority:
3. #3 (Performance): Loading 10K records on every check could be slow for large collections
5. #5 (Code quality): Dialog close logic is duplicated
6. #6 (Race condition): Time-of-check-to-time-of-use gap (mitigated by backend idempotency)
7. #7 (Validation): Implicit type conversion for filenames array elements

Recommendation

Request changes for issues #1, #2, and #4. The other issues can be addressed as follow-ups.

for record in records:
sp = record.get("source_path")
if sp:
existing_basenames.add(os.path.basename(str(sp)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #1: Admin edge case - inconsistent duplicate detection

The check API uses os.path.basename(source_path) to compare filenames, but generate_deterministic_doc_id() uses the full source_path. For admin users:

  1. Admin gets user_filter = None, so they see ALL users' documents in this check
  2. If any user has report.docx, admin sees a "file already exists" warning
  3. But admin's file is stored at /uploads/user_{admin_id}/report.docx
  4. The doc_id would be different (different path), so it creates a NEW record instead of overwriting

For regular users this works correctly (they only see their own docs), but for admins the warning and actual behavior are inconsistent.

Recommendation: Consider either:

  • Apply user filter even for admins ( admins see all, but duplicate check only checks admin's own files)
  • Add a comment explaining this edge case behavior

return
}
await doUpload()
} catch {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #2: Silent error handling - no user feedback

When the check API fails (network error, server error), the code silently falls back to uploading without any notification. This masks potential issues that users should know about.

} catch {
  await doUpload()  // Empty catch - silently continues
}

Recommendation: Add at least a console warning for debugging:

} catch (error) {
  console.warn('Check API failed, proceeding with upload:', error)
  await doUpload()
}

if user_filter and base_filter
else (user_filter or base_filter)
)
MAX_SEARCH_RESULTS = 10000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #3: Performance concern for large collections

Loading up to 10,000 records on every duplicate check could be slow for collections with many documents:

MAX_SEARCH_RESULTS = 10000
records = query_to_list(
    table.search().where(combined_filter).limit(MAX_SEARCH_RESULTS)
)

Each upload triggers this scan. For large KBs this could add noticeable latency.

Recommendation:

  • Consider adding a filename index column for faster lookups
  • Or add a comment acknowledging this trade-off for simplicity

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this after the file management refactoring is finished.

from ..utils.string_utils import (
build_lancedb_filter_expression,
generate_doc_id_from_filename,
generate_deterministic_doc_id,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #4: Breaking change for existing documents

This changes from generate_doc_id_from_filename (random UUID) to generate_deterministic_doc_id (hash-based).

Impact: Existing documents registered with the old method will get a NEW doc_id when re-uploaded, potentially creating duplicates instead of updating existing records.

Example:

  • Old doc_id: report_a1b2c3d4 (random)
  • New doc_id: report_e91df8a2 (hash-based)

When the old document is re-registered, it creates a new row instead of updating.

Recommendation: Document this behavior change or consider a migration strategy for existing documents.

</TabsContent>
</Tabs>

{/* Re-upload confirm: file(s) already exist */}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #5: Code duplication in dialog close handlers

The dialog close logic is duplicated in multiple places:

// Line 924-930
if (!open) {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}

// Line 946-952 (similar logic)
onClick={() => {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}}

Recommendation: Extract to a helper function:

const closeReuploadDialog = () => {
  setReuploadDialogOpen(false)
  setExistingFilenamesForReupload([])
}

}
}

const handleUpload = async () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #6: Race condition between check and upload

Between checking if files exist (line 365-385) and actually uploading (line 395), another user could upload the same file. This is mitigated by the deterministic doc_id on the backend (duplicates would update same record), but worth noting.

Recommendation: Add a comment explaining this is an acceptable race condition since the backend handles idempotency correctly.

status_code=422,
detail="Request body must contain 'filenames' as a list of strings",
)
requested = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #7: Implicit type validation for filenames

The code validates that filenames is a list, but doesn't explicitly validate that each element is a string. The comprehension on line 705-707 handles this gracefully:

requested = {
    str(f).strip() for f in filenames if f is not None and str(f).strip()
}

However, this silently converts non-string values. It would be clearer to explicitly validate and reject invalid input with a 422 error.

Recommendation: Add explicit validation:

if not all(isinstance(f, str) for f in filenames):
    raise HTTPException(
        status_code=422,
        detail="All filenames must be strings"
    )

sqhyz55 added 2 commits March 4, 2026 15:18
Address review issues xorbitsai#2, xorbitsai#5, xorbitsai#6, xorbitsai#7 for KB docx duplicate fix. Performance-related issue xorbitsai#3 will be revisited after file management refactor.
@sqhyz55 sqhyz55 force-pushed the fix/kb-docx-duplicate-files branch from f3bac10 to 332b750 Compare March 5, 2026 03:01
Run mocked sync wrapper coroutines in the test to avoid unawaited coroutine warnings in CI, and allow root CHANGELOG.md to be committed by updating .gitignore exceptions.

Made-with: Cursor
@sqhyz55 sqhyz55 force-pushed the fix/kb-docx-duplicate-files branch from 48cbbac to 74e5aa9 Compare March 5, 2026 06:29
@qinxuye qinxuye requested a review from rogercloud March 10, 2026 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants