fix: tag duplicates perf improv #865
Conversation
Walkthrough

Refactors duplicate detection in MongoDB hooks from an O(n²) approach to grouped processing using a hashable equality key derived from dataclass fields. Introduces a new helper function to generate equality keys.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
secator/hooks/mongodb.py (1)
259-268: Bug: Untagged findings without duplicates are never marked as tagged.

When an untagged finding has no duplicates (neither in the untagged batch nor in the workspace), it is skipped and never receives `_tagged: True`. This causes the finding to be re-queried and re-processed on every subsequent `tag_duplicates` call, defeating the performance optimization. The canonical item should always be marked as tagged, even when it has no duplicates.
🔎 Suggested fix
```diff
-if not duplicate_ids and not related_ids:
-	# Nothing to update for this group
-	continue
-
-# Canonical item for this equality group
-db_updates[item._uuid] = {
-	'_related': duplicate_ids + related_ids,
-	'_context.workspace_duplicate': False,
-	'_tagged': True
-}
+# Canonical item for this equality group - always mark as tagged
+db_updates[item._uuid] = {
+	'_related': duplicate_ids + related_ids,
+	'_context.workspace_duplicate': False,
+	'_tagged': True
+}
+
+if not duplicate_ids:
+	# No duplicates to update for this group
+	continue
```
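The corrected ordering can be sketched as a small standalone example. The names (`Finding`, `plan_updates`, the group tuples) are illustrative stand-ins for the PR's actual structures, not its real code:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    _uuid: str


def plan_updates(groups):
    """For each equality group, always tag the canonical item first,
    then record duplicate updates only when duplicates exist."""
    db_updates = {}
    for canonical, duplicate_ids, related_ids in groups:
        # Canonical item is always marked as tagged, even with no duplicates,
        # so it is not re-queried on the next tag_duplicates run
        db_updates[canonical._uuid] = {
            '_related': duplicate_ids + related_ids,
            '_context.workspace_duplicate': False,
            '_tagged': True,
        }
        if not duplicate_ids:
            continue  # no duplicates to update for this group
        for dup_id in duplicate_ids:
            db_updates[dup_id] = {
                '_tagged': True,
                '_context.workspace_duplicate': True,
            }
    return db_updates


# A group with no duplicates still gets its canonical item tagged
updates = plan_updates([(Finding('u1'), [], [])])
```

With the original ordering, the `continue` would fire before the canonical item was recorded, leaving `u1` untagged forever.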
🧹 Nitpick comments (2)
secator/hooks/mongodb.py (2)
220-224: Use f-string conversion flag for `repr`.

Per static analysis hint (RUF010), prefer `{item!r}` over `{repr(item)}` in f-strings.

🔎 Suggested fix
```diff
 debug(
-	f'Processing group: {repr(item)} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
+	f'Processing group: {item!r} ({item._timestamp}) [{item._uuid}] with {len(items) - 1} local duplicates',
 	sub='hooks.mongodb',
 	verbose=True
 )
```
227-228: Redundant list comprehension.

`items[1:]` already returns a new list; the comprehension is unnecessary.

🔎 Suggested fix
```diff
-duplicate_untagged = [f for f in items[1:]]
+duplicate_untagged = items[1:]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
secator/hooks/mongodb.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
secator/hooks/mongodb.py
221-221: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: integration (3.11, ubuntu-latest)
🔇 Additional comments (4)
secator/hooks/mongodb.py (4)
171-172: LGTM!

In-function imports for `dataclasses.fields` and `defaultdict` are acceptable here, especially for a Celery task that may not always execute.
205-212: Good use of `defaultdict` for O(n) grouping.

The refactored approach using `defaultdict(list)` to group findings by equality key is a solid improvement over the O(n²) pairwise comparison. The indexed lookup enables efficient duplicate detection.
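A minimal sketch of that grouping pattern, with a stand-in `make_key` and hypothetical finding dicts (the PR's actual key function and data types differ):

```python
from collections import defaultdict


def make_key(finding):
    # Stand-in equality key: a hashable tuple of the comparable fields
    return (finding['type'], finding['name'])


findings = [
    {'type': 'port', 'name': '80/tcp', 'host': 'a'},
    {'type': 'port', 'name': '80/tcp', 'host': 'b'},
    {'type': 'url', 'name': '/login', 'host': 'a'},
]

# O(n) grouping: one pass and one dict lookup per finding,
# instead of comparing every pair of findings (O(n^2))
groups = defaultdict(list)
for f in findings:
    groups[make_key(f)].append(f)
```

Each bucket then holds one canonical item plus its local duplicates, so duplicate handling becomes a per-group operation rather than an all-pairs scan.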
285-290: Good use of bulk write for database efficiency.

Batching updates with `bulk_write` and `UpdateOne` is the right approach for reducing MongoDB round-trips.
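As a rough illustration of the batching idea, using plain dicts in place of `pymongo.UpdateOne` so the sketch needs no live MongoDB (the `db_updates` data is hypothetical):

```python
# Accumulated per-item updates, keyed by finding UUID (hypothetical data)
db_updates = {
    'uuid-1': {'_tagged': True, '_context.workspace_duplicate': False},
    'uuid-2': {'_tagged': True, '_context.workspace_duplicate': True},
}

# Build one list of operations instead of issuing one update per item.
# In real pymongo code each entry would be UpdateOne(filter, update) and
# the list would go to collection.bulk_write(ops) in a single round-trip.
ops = [
    {'filter': {'_uuid': uuid}, 'update': {'$set': fields}}
    for uuid, fields in db_updates.items()
]
```

The win is in the network layer: n individual `update_one` calls mean n round-trips, while one `bulk_write` sends the whole batch at once.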
195-200: The `make_key` function is correctly implemented and requires no changes.

Analysis of all OUTPUT_TYPES definitions confirms that every unhashable type (list, dict) is explicitly marked with `compare=False`, ensuring only hashable primitives (str, int, bool) participate in the comparison key. This deliberate design pattern throughout the codebase eliminates the risk described in the review comment.

Likely an incorrect or invalid review comment.
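A sketch of how such an equality key can be derived from dataclass fields while skipping `compare=False` fields. This is illustrative only; the `Port` type and this `make_key` body are assumptions, not the PR's actual implementation:

```python
from dataclasses import dataclass, field, fields


@dataclass
class Port:
    host: str
    port: int
    # Unhashable extra data is excluded from equality, mirroring how
    # OUTPUT_TYPES mark list/dict fields with compare=False
    extra: dict = field(default_factory=dict, compare=False)


def make_key(item):
    # Build a hashable tuple from only the fields that take part in equality
    cls = type(item)
    return (cls.__name__,) + tuple(
        getattr(item, f.name) for f in fields(cls) if f.compare
    )


a = Port('example.com', 80, extra={'banner': 'nginx'})
b = Port('example.com', 80, extra={'banner': 'apache'})
```

Because only `compare=True` fields feed the tuple, two findings that differ solely in excluded metadata hash to the same key and land in the same duplicate group.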