
Conversation

@ocervell (Contributor) commented Dec 19, 2025

Summary by CodeRabbit

Release Notes

  • Chores
    • Streamlined duplicate detection and resolution for workspace findings, significantly improving performance and accuracy on large datasets via indexed grouping.
    • Batched database updates to reduce round-trips and improve response times.
    • Added verbose logging for better visibility into duplicate identification and handling.



coderabbitai bot commented Dec 19, 2025

Walkthrough

Refactors duplicate detection in MongoDB hooks from an O(n²) approach to grouped processing using a hashable equality key derived from dataclass fields. Introduces a new helper function to generate equality keys, uses defaultdict to build indexed groupings of findings, enabling batched bulk database updates, and adds verbose logging.
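
For readers outside the diff, here is a minimal sketch of the technique described above. The `make_key` name matches the helper mentioned in the summary, but the `Finding` dataclass and its fields are hypothetical stand-ins for secator's actual output types:

```python
from collections import defaultdict
from dataclasses import dataclass, field, fields

@dataclass
class Finding:
    """Hypothetical stand-in for a secator output type."""
    name: str
    target: str
    _uuid: str = field(compare=False, default='')
    _tags: list = field(compare=False, default_factory=list)

def make_key(item):
    """Build a hashable equality key from the dataclass fields that
    participate in comparison (compare=True)."""
    return tuple(getattr(item, f.name) for f in fields(item) if f.compare)

findings = [
    Finding('open-port', 'example.com', _uuid='a'),
    Finding('open-port', 'example.com', _uuid='b'),
    Finding('xss', 'example.com', _uuid='c'),
]

# One O(n) pass groups findings by equality key, replacing the old
# pairwise comparison of every finding against every other finding.
groups = defaultdict(list)
for f in findings:
    groups[make_key(f)].append(f)

for items in groups.values():
    canonical, duplicates = items[0], items[1:]
    print(canonical._uuid, '->', [d._uuid for d in duplicates])
```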

Changes

MongoDB duplicate detection refactoring — secator/hooks/mongodb.py
Replaces O(n²) duplicate-detection logic with grouped processing using hashable equality keys; introduces a make_key helper function to generate keys from dataclass fields; builds indexed groupings via defaultdict; adds verbose logging for group processing; reworks duplicate flagging and tagging to mark canonical items and their related duplicates; switches to batched bulk updates instead of per-item updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • Verification that the new grouped approach produces identical duplicate-detection results to the previous O(n²) logic (see the sketch after this list for one way to spot-check this)
  • Correctness of the hashable key generation from dataclass fields with compare=True flag
  • Proper handling of canonical item selection and related duplicate aggregation
  • Validation that _context.workspace_duplicate and _tagged flags are applied consistently to all duplicates and related items
  • Confirmation that batched bulk updates maintain data integrity and don't introduce race conditions
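
On the first point, a quick property-style check can compare the old pairwise grouping against the keyed grouping on synthetic data. This reuses the hypothetical `Finding` and `make_key` from the sketch above, and is only a sanity check, not a substitute for reviewing the actual diff:

```python
import random
from collections import defaultdict

def naive_groups(findings):
    """O(n²) reference: compare each finding against existing groups using
    dataclass __eq__, which also only considers compare=True fields."""
    groups = []
    for f in findings:
        for g in groups:
            if g[0] == f:
                g.append(f)
                break
        else:
            groups.append([f])
    return groups

def keyed_groups(findings):
    """O(n) grouping via the hashable equality key."""
    grouped = defaultdict(list)
    for f in findings:
        grouped[make_key(f)].append(f)
    return list(grouped.values())

def as_partition(groups):
    """Order-insensitive view of a grouping, identified by finding UUIDs."""
    return {frozenset(x._uuid for x in g) for g in groups}

random.seed(0)
findings = [
    Finding(
        name=random.choice(['open-port', 'xss', 'sqli']),
        target=random.choice(['a.com', 'b.com']),
        _uuid=str(i),
    )
    for i in range(200)
]
assert as_partition(naive_groups(findings)) == as_partition(keyed_groups(findings))
```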

Poem

🐰 A hashable key unlocks the path,
No more O(n²) aftermath!
Findings grouped, duplicates found,
Batched updates spinning 'round,
MongoDB now hops more sound! 🌿

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'fix: tag duplicates perf improv' clearly relates to the main change: optimizing duplicate tagging performance by replacing O(n²) duplicate detection with grouped batch processing.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required threshold of 80.00%.


coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
secator/hooks/mongodb.py (1)

259-268: Bug: Untagged findings without duplicates are never marked as tagged.

When an untagged finding has no duplicates (neither in the untagged batch nor in the workspace), it is skipped and never receives _tagged: True. The finding is then re-queried and re-processed on every subsequent tag_duplicates call, defeating the performance optimization.

The canonical item should always be marked as tagged, even when it has no duplicates.

🔎 Suggested fix
-if not duplicate_ids and not related_ids:
-    # Nothing to update for this group
-    continue
-
-# Canonical item for this equality group
-db_updates[item._uuid] = {
-    '_related': duplicate_ids + related_ids,
-    '_context.workspace_duplicate': False,
-    '_tagged': True
-}
+# Canonical item for this equality group - always mark as tagged
+db_updates[item._uuid] = {
+    '_related': duplicate_ids + related_ids,
+    '_context.workspace_duplicate': False,
+    '_tagged': True
+}
+
+if not duplicate_ids:
+    # No duplicates to update for this group
+    continue
🧹 Nitpick comments (2)
secator/hooks/mongodb.py (2)

220-224: Use f-string conversion flag for repr.

Per static analysis hint (RUF010), prefer {item!r} over {repr(item)} in f-strings.

🔎 Suggested fix
 debug(
-    f'Processing group: {repr(item)} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
+    f'Processing group: {item!r} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
     sub='hooks.mongodb',
     verbose=True
 )

227-228: Redundant list comprehension.

items[1:] already returns a new list; the comprehension is unnecessary.

🔎 Suggested fix
-duplicate_untagged = [f for f in items[1:]]
+duplicate_untagged = items[1:]
📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2e406b and 3a9c4a2.

📒 Files selected for processing (1)
  • secator/hooks/mongodb.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
secator/hooks/mongodb.py

221-221: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: integration (3.11, ubuntu-latest)
🔇 Additional comments (4)
secator/hooks/mongodb.py (4)

171-172: LGTM!

In-function imports for dataclasses.fields and defaultdict are acceptable here, especially for a Celery task that may not always execute.


205-212: Good use of defaultdict for O(n) grouping.

The refactored approach using defaultdict(list) to group findings by equality key is a solid improvement over O(n²) comparison. The indexed lookup enables efficient duplicate detection.


285-290: Good use of bulk write for database efficiency.

Batching updates with bulk_write and UpdateOne is the right approach for reducing MongoDB round-trips.
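
For reference, the batched pattern looks roughly like this with pymongo; the database, collection, and field names below are illustrative rather than secator's actual schema:

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient('mongodb://localhost:27017')
collection = client['secator']['findings']  # illustrative database/collection names

# db_updates maps each finding's _uuid to the fields to set on it,
# mirroring the shape described in the review above.
db_updates = {
    'uuid-1': {'_context.workspace_duplicate': False, '_tagged': True},
    'uuid-2': {'_context.workspace_duplicate': True, '_tagged': True},
}

# One round-trip instead of one update per finding. ordered=False lets
# independent updates proceed even if one fails (an assumption here;
# the PR may well use the default ordered batch).
ops = [UpdateOne({'_uuid': uuid}, {'$set': fields}) for uuid, fields in db_updates.items()]
if ops:
    result = collection.bulk_write(ops, ordered=False)
    print(f'{result.modified_count} findings updated')
```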


195-200: The make_key function is correctly implemented and requires no changes. Analysis of all OUTPUT_TYPES definitions confirms that every unhashable type (list, dict) is explicitly marked with compare=False, ensuring only hashable primitives (str, int, bool) participate in the comparison key. This deliberate design pattern throughout the codebase eliminates the risk described in the review comment.

Likely an incorrect or invalid review comment.
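
The pattern being verified here, in miniature — the `Url` shape below is illustrative, not the actual OUTPUT_TYPES definition:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Url:
    """Illustrative output type: unhashable containers opt out of comparison."""
    url: str
    status_code: int
    tech: list = field(default_factory=list, compare=False)    # excluded from key
    headers: dict = field(default_factory=dict, compare=False)  # excluded from key

item = Url('https://example.com', 200, tech=['nginx'], headers={'server': 'nginx'})
key = tuple(getattr(item, f.name) for f in fields(item) if f.compare)
assert key == ('https://example.com', 200)
hash(key)  # safe: only hashable primitives remain in the key
```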
