
Conversation

@ocervell (Contributor) commented Dec 19, 2025

Summary by CodeRabbit

Release Notes

  • Chores
    • Streamlined duplicate detection and resolution for workspace findings, significantly improving performance and accuracy on large datasets via indexed grouping.
    • Batched database updates to reduce round-trips and improve response times.
    • Added verbose logging for better visibility into duplicate identification and handling.



coderabbitai bot commented Dec 19, 2025

Walkthrough

Refactors duplicate detection in MongoDB hooks from an O(n²) approach to grouped processing using a hashable equality key derived from dataclass fields. Introduces a new helper function to generate equality keys, uses defaultdict to build indexed groupings of findings, enabling batched bulk database updates, and adds verbose logging.
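
For readers outside the diff, here is a minimal sketch of the technique described above. The `make_key` name matches the helper mentioned in the summary, but the `Finding` dataclass and its fields are hypothetical stand-ins for secator's actual output types:

```python
from collections import defaultdict
from dataclasses import dataclass, field, fields

@dataclass
class Finding:
    """Hypothetical stand-in for a secator output type."""
    name: str
    target: str
    _uuid: str = field(compare=False, default='')
    _tags: list = field(compare=False, default_factory=list)

def make_key(item):
    """Build a hashable equality key from the dataclass fields that
    participate in comparison (compare=True)."""
    return tuple(getattr(item, f.name) for f in fields(item) if f.compare)

findings = [
    Finding('open-port', 'example.com', _uuid='a'),
    Finding('open-port', 'example.com', _uuid='b'),
    Finding('xss', 'example.com', _uuid='c'),
]

# One O(n) pass groups findings by equality key, replacing the old
# pairwise comparison of every finding against every other finding.
groups = defaultdict(list)
for f in findings:
    groups[make_key(f)].append(f)

for items in groups.values():
    canonical, duplicates = items[0], items[1:]
    print(canonical._uuid, '->', [d._uuid for d in duplicates])
```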

Changes

MongoDB duplicate detection refactoring — secator/hooks/mongodb.py
Replaces O(n²) duplicate-detection logic with grouped processing using hashable equality keys; introduces a make_key helper function to generate keys from dataclass fields; builds indexed groupings via defaultdict; adds verbose logging for group processing; reworks duplicate flagging and tagging to mark canonical items and their related duplicates; switches to batched bulk updates instead of per-item updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • Verification that the new grouped approach produces identical duplicate-detection results to the previous O(n²) logic (see the sketch after this list for one way to spot-check this)
  • Correctness of the hashable key generation from dataclass fields with compare=True flag
  • Proper handling of canonical item selection and related duplicate aggregation
  • Validation that _context.workspace_duplicate and _tagged flags are applied consistently to all duplicates and related items
  • Confirmation that batched bulk updates maintain data integrity and don't introduce race conditions
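
On the first point, a quick property-style check can compare the old pairwise grouping against the keyed grouping on synthetic data. This reuses the hypothetical `Finding` and `make_key` from the sketch above, and is only a sanity check, not a substitute for reviewing the actual diff:

```python
import random
from collections import defaultdict

def naive_groups(findings):
    """O(n²) reference: compare each finding against existing groups using
    dataclass __eq__, which also only considers compare=True fields."""
    groups = []
    for f in findings:
        for g in groups:
            if g[0] == f:
                g.append(f)
                break
        else:
            groups.append([f])
    return groups

def keyed_groups(findings):
    """O(n) grouping via the hashable equality key."""
    grouped = defaultdict(list)
    for f in findings:
        grouped[make_key(f)].append(f)
    return list(grouped.values())

def as_partition(groups):
    """Order-insensitive view of a grouping, identified by finding UUIDs."""
    return {frozenset(x._uuid for x in g) for g in groups}

random.seed(0)
findings = [
    Finding(
        name=random.choice(['open-port', 'xss', 'sqli']),
        target=random.choice(['a.com', 'b.com']),
        _uuid=str(i),
    )
    for i in range(200)
]
assert as_partition(naive_groups(findings)) == as_partition(keyed_groups(findings))
```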

Poem

🐰 A hashable key unlocks the path,
No more O(n²) aftermath!
Findings grouped, duplicates found,
Batched updates spinning 'round,
MongoDB now hops more sound! 🌿

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'fix: tag duplicates perf improv' clearly relates to the main change: optimizing duplicate tagging performance by replacing O(n²) duplicate detection with grouped batch processing.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, above the required threshold of 80.00%.


coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
secator/hooks/mongodb.py (1)

259-268: Bug: Untagged findings without duplicates are never marked as tagged.

When an untagged finding has no duplicates (neither in the untagged batch nor in the workspace), it is skipped and never receives _tagged: True. The finding is then re-queried and re-processed on every subsequent tag_duplicates call, defeating the performance optimization.

The canonical item should always be marked as tagged, even when it has no duplicates.

🔎 Suggested fix
-if not duplicate_ids and not related_ids:
-    # Nothing to update for this group
-    continue
-
-# Canonical item for this equality group
-db_updates[item._uuid] = {
-    '_related': duplicate_ids + related_ids,
-    '_context.workspace_duplicate': False,
-    '_tagged': True
-}
+# Canonical item for this equality group - always mark as tagged
+db_updates[item._uuid] = {
+    '_related': duplicate_ids + related_ids,
+    '_context.workspace_duplicate': False,
+    '_tagged': True
+}
+
+if not duplicate_ids:
+    # No duplicates to update for this group
+    continue
🧹 Nitpick comments (2)
secator/hooks/mongodb.py (2)

220-224: Use f-string conversion flag for repr.

Per static analysis hint (RUF010), prefer {item!r} over {repr(item)} in f-strings.

🔎 Suggested fix
 debug(
-    f'Processing group: {repr(item)} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
+    f'Processing group: {item!r} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
     sub='hooks.mongodb',
     verbose=True
 )

227-228: Redundant list comprehension.

items[1:] already returns a new list; the comprehension is unnecessary.

🔎 Suggested fix
-duplicate_untagged = [f for f in items[1:]]
+duplicate_untagged = items[1:]
📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2e406b and 3a9c4a2.

📒 Files selected for processing (1)
  • secator/hooks/mongodb.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
secator/hooks/mongodb.py

221-221: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: integration (3.11, ubuntu-latest)
🔇 Additional comments (4)
secator/hooks/mongodb.py (4)

171-172: LGTM!

In-function imports for dataclasses.fields and defaultdict are acceptable here, especially for a Celery task that may not always execute.


205-212: Good use of defaultdict for O(n) grouping.

The refactored approach using defaultdict(list) to group findings by equality key is a solid improvement over O(n²) comparison. The indexed lookup enables efficient duplicate detection.


285-290: Good use of bulk write for database efficiency.

Batching updates with bulk_write and UpdateOne is the right approach for reducing MongoDB round-trips.
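
For reference, the batched pattern looks roughly like this with pymongo; the database, collection, and field names below are illustrative rather than secator's actual schema:

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient('mongodb://localhost:27017')
collection = client['secator']['findings']  # illustrative database/collection names

# db_updates maps each finding's _uuid to the fields to set on it,
# mirroring the shape described in the review above.
db_updates = {
    'uuid-1': {'_context.workspace_duplicate': False, '_tagged': True},
    'uuid-2': {'_context.workspace_duplicate': True, '_tagged': True},
}

# One round-trip instead of one update per finding. ordered=False lets
# independent updates proceed even if one fails (an assumption here;
# the PR may well use the default ordered batch).
ops = [UpdateOne({'_uuid': uuid}, {'$set': fields}) for uuid, fields in db_updates.items()]
if ops:
    result = collection.bulk_write(ops, ordered=False)
    print(f'{result.modified_count} findings updated')
```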


195-200: The make_key function is correctly implemented and requires no changes. Analysis of all OUTPUT_TYPES definitions confirms that every unhashable type (list, dict) is explicitly marked with compare=False, ensuring only hashable primitives (str, int, bool) participate in the comparison key. This deliberate design pattern throughout the codebase eliminates the risk described in the review comment.

Likely an incorrect or invalid review comment.
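
The pattern being verified here, in miniature — the `Url` shape below is illustrative, not the actual OUTPUT_TYPES definition:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Url:
    """Illustrative output type: unhashable containers opt out of comparison."""
    url: str
    status_code: int
    tech: list = field(default_factory=list, compare=False)    # excluded from key
    headers: dict = field(default_factory=dict, compare=False)  # excluded from key

item = Url('https://example.com', 200, tech=['nginx'], headers={'server': 'nginx'})
key = tuple(getattr(item, f.name) for f in fields(item) if f.compare)
assert key == ('https://example.com', 200)
hash(key)  # safe: only hashable primitives remain in the key
```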
