Feature/np core 20241215 #55
base: main
Conversation
Walkthrough

This pull request updates the GFFquant library by changing the version number from 2.18.0 to 2.19.0 and significantly refactoring the counting and annotation mechanisms. Key modifications include the removal of several classes related to counting, the introduction of new classes for gene and region counting, and the replacement of …
Sequence Diagram

```mermaid
sequenceDiagram
    participant Profiler
    participant AlignmentCounter
    participant Annotator
    participant CountWriter
    Profiler->>AlignmentCounter: Update counts
    AlignmentCounter-->>Profiler: Counts processed
    Profiler->>Annotator: Annotate counts
    Annotator->>AlignmentCounter: Transform counts
    Annotator-->>Profiler: Annotated counts
    Profiler->>CountWriter: Write counts
    CountWriter-->>Profiler: Counts written
```
Actionable comments posted: 5
🧹 Nitpick comments (3)
gffquant/counters/alignment_counter2.py (1)
60-65: Improve readability by replacing Boolean indexing with explicit control flow.

Using Boolean values as indices in tuples can be confusing. Consider using `if`-`else` statements for better clarity in the `update_counts` method.

Refactor for `inc`:

```diff
-        inc = (
-            (
-                self.get_increment(aln_count, increment),
-                increment,
-            )
-        )[aln_count == 1]
+        if aln_count == 1:
+            inc = increment
+        else:
+            inc = self.get_increment(aln_count, increment)
```

Refactor for `key`:

```diff
-        key = (
-            (
-                (hit.rid, hit.rev_strand),
-                hit.rid
-            )
-        )[self.strand_specific]
+        if self.strand_specific:
+            key = (hit.rid, hit.rev_strand)
+        else:
+            key = hit.rid
```

Also applies to: 66-71
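As an editorial aside, the pattern flagged here can be seen in isolation in the following minimal sketch (hypothetical values, not the project's code): indexing a tuple with a boolean works because `True == 1` and `False == 0`, but the explicit branch reads more clearly.

```python
# Minimal sketch: selecting one of two values with a boolean index vs. an explicit branch.
strand_specific = True
rid, rev_strand = 7, False

# Boolean-indexed tuple: False picks element 0, True picks element 1.
key_terse = (rid, (rid, rev_strand))[strand_specific]

# Equivalent explicit control flow, as the review suggests.
if strand_specific:
    key_explicit = (rid, rev_strand)
else:
    key_explicit = rid

assert key_terse == key_explicit
```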
gffquant/counters/count_manager.py (1)
167-171: Enhance readability by avoiding Boolean indexing in `get_all_regions`.

Using a Boolean value to index tuples can reduce code readability. Consider refactoring to use an explicit `if`-`else` statement.

Apply this refactor:

```diff
-        uniq_counts, ambig_counts = (
-            (self.uniq_seqcounts, self.ambig_seqcounts,),
-            (self.uniq_regioncounts, self.ambig_regioncounts,),
-        )[region_counts]
+        if region_counts:
+            uniq_counts = self.uniq_regioncounts
+            ambig_counts = self.ambig_regioncounts
+        else:
+            uniq_counts = self.uniq_seqcounts
+            ambig_counts = self.ambig_seqcounts
```

gffquant/annotation/count_annotator.py (1)
202-213: Remove unused parameter `gene_group_db` from the `RegionCountAnnotator.annotate` method.

In the `annotate` method of `RegionCountAnnotator`, the parameter `gene_group_db` is provided but not used within the method. Consider removing this parameter to simplify the method signature.

Apply this diff:

```diff
-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr, db, count_manager: CountManager):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- Dockerfile (1 hunks)
- gffquant/__init__.py (1 hunks)
- gffquant/annotation/count_annotator.py (4 hunks)
- gffquant/counters/__init__.py (1 hunks)
- gffquant/counters/alignment_counter2.py (1 hunks)
- gffquant/counters/count_manager.py (2 hunks)
- gffquant/counters/region_counter.py (1 hunks)
- gffquant/counters/seq_counter.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- Dockerfile
- gffquant/__init__.py
- gffquant/counters/__init__.py
- gffquant/counters/seq_counter.py
🔇 Additional comments (4)
gffquant/counters/region_counter.py (1)
8-8: Updated import statement for AlignmentCounter.
The import has been updated to import AlignmentCounter from alignment_counter2. Ensure that all references to AlignmentCounter in this file and elsewhere are updated accordingly.
gffquant/counters/count_manager.py (1)
163-164: Verify the change in default values from set() to Counter() in get_regions.
The method get_regions now uses Counter() as the default value instead of set(). This change may affect how the union of regions is computed. Ensure that this change is intentional and that it doesn't impact the expected behavior of returning a set of regions.
gffquant/annotation/count_annotator.py (2)
210-213: Ensure get_all_regions returns the correct regions for annotation.
The loop now iterates over count_manager.get_all_regions(region_counts=True). Verify that this method provides the expected set of regions, especially after changes in CountManager. This ensures that all regions are appropriately annotated.
Line range hint 278-294: Confirm correct usage of get_all_regions in GeneCountAnnotator.annotate.
In the annotate method of GeneCountAnnotator, the code now iterates over count_manager.get_all_regions(). Ensure that this change aligns with the intended functionality and that all gene regions are processed as expected.
```python
    def setdefault(self, key, default_val):
        ...
```
Implement the setdefault method or remove it if unnecessary.
The setdefault method currently contains ..., indicating it's unimplemented. If this method is intended to provide functionality similar to dict.setdefault, please implement it. If it's not needed, consider removing it to avoid confusion.
```python
        return counts, normalised, scaled

    def get_increment(self, n_aln, increment):
        # 1overN = lavern. Maya <3
```
Remove non-descriptive comment in get_increment method.
The comment # 1overN = lavern. Maya <3 does not provide meaningful information about the code and includes personal content. It's best to remove it or replace it with a descriptive comment explaining the purpose of the method.
Apply this diff to remove the comment:
```diff
-        # 1overN = lavern. Maya <3
```
please ignore
@cschu, understood, I'll ignore it.
(_ _/)
( •_•)
✏️ Learnings added
Learnt from: cschu
PR: cschu/gff_quantifier#55
File: gffquant/counters/alignment_counter2.py:20-20
Timestamp: 2024-12-16T20:17:48.980Z
Learning: In `gffquant/counters/alignment_counter2.py`, comments like `# 1overN = lavern. Maya <3` are acceptable and should not be flagged as non-descriptive or requiring removal.
```python
    def __getitem__(self, key):
        key_index = self.index.get(key)
        if key_index is None:
            return 0.0
        return self.counts[self.index.get(key)]
```
🛠️ Refactor suggestion
Optimize __getitem__ by reusing key_index.
In the __getitem__ method, after retrieving key_index, the code redundantly calls self.index.get(key) again. This can be optimized by using the already retrieved key_index variable.
Apply this diff to use key_index:
```diff
         if key_index is None:
             return 0.0
-        return self.counts[self.index.get(key)]
+        return self.counts[key_index]
```
thanks
```python
    def __setitem__(self, key, value):
        key_index = self.index.get(key)
        if key_index is not None:
            self.counts[key_index] = value
        raise KeyError(f"{key=} not found.")
```
Correct the logic in __setitem__ to handle missing keys appropriately.
The __setitem__ method raises a KeyError regardless of whether the key exists or not. The raise KeyError statement should be placed inside an else block to ensure it's only executed when the key is not found.
Apply this diff to fix the logic:
```diff
     def __setitem__(self, key, value):
         key_index = self.index.get(key)
         if key_index is not None:
             self.counts[key_index] = value
+        else:
             raise KeyError(f"{key=} not found.")
```
thanks
```python
        key_index = self.index.get(key)
        if key_index is None:
            return Counter()
        return Counter({key: self.counts[key_index]})
```
Utilize the default_val parameter in the get method.
The default_val parameter in get(self, key, default_val) is not used. Instead, the method returns a new empty Counter() when the key is not found. To adhere to the expected behavior of the get method, consider returning default_val when the key is missing.
Apply this diff to use default_val:
```diff
         if key_index is None:
-            return Counter()
+            return default_val
         return Counter({key: self.counts[key_index]})
```

Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 1
🧹 Nitpick comments (2)
gffquant/counters/count_manager.py (2)
69-72: Remove commented-out code to improve readability.

The lines 69-72 contain code that has been commented out. If this code is no longer needed, consider removing it to keep the codebase clean and maintainable.

Apply this diff to remove the commented-out code:

```diff
-        # self.seqcounts = AlignmentCounter(
-        #     strand_specific=strand_specific,
-        #     distribution_mode=distribution_mode,
-        # )
```
171-175: Refactor selection logic in `get_all_regions` for clarity.

Using tuple indexing with a boolean can be confusing. Consider using an `if`-`else` statement for better readability.

Apply this diff to refactor the selection logic:

```diff
     def get_all_regions(self, region_counts=False):
-        uniq_counts, ambig_counts = (
-            (self.uniq_seqcounts, self.ambig_seqcounts,),
-            (self.uniq_regioncounts, self.ambig_regioncounts,),
-        )[region_counts]
+        if region_counts:
+            uniq_counts = self.uniq_regioncounts
+            ambig_counts = self.ambig_regioncounts
+        else:
+            uniq_counts = self.uniq_seqcounts
+            ambig_counts = self.ambig_seqcounts
         yield from set(uniq_counts).union(ambig_counts)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/counters/alignment_counter2.py (1 hunks)
- gffquant/counters/count_manager.py (3 hunks)
🔇 Additional comments (5)
gffquant/counters/alignment_counter2.py (5)
20-20: Remove non-descriptive comment in get_increment method.
The comment # 1overN = lavern. Maya <3 does not provide meaningful information about the code and includes personal content. Please remove it or replace it with a descriptive comment explaining the purpose of the method.
Apply this diff to remove the comment:

```diff
-        # 1overN = lavern. Maya <3
```

35-39: Utilize the default_val parameter in the get method.

The default_val parameter in get(self, key, default_val) is not used. Instead, the method returns a new empty Counter() when the key is not found. To adhere to the expected behavior of the get method, consider returning default_val when the key is missing.

Apply this diff to use default_val:

```diff
         if key_index is None:
-            return Counter()
+            return default_val
         return Counter({key: self.counts[key_index]})
```

40-42: Implement or remove the setdefault method.

The setdefault method currently contains ..., indicating it's unimplemented. If this method is intended to provide functionality similar to dict.setdefault, please implement it. If it's not needed, consider removing it to avoid confusion.
45-49: Optimize __getitem__ by reusing key_index.
In the __getitem__ method, after retrieving key_index, the code redundantly calls self.index.get(key) again. This can be optimized by using the already retrieved key_index variable.
Apply this diff to use key_index:

```diff
         if key_index is None:
             return 0.0
-        return self.counts[self.index.get(key)]
+        return self.counts[key_index]
```

50-55: Correct the logic in __setitem__ to handle missing keys appropriately.

The raise KeyError statement is outside the if block, so it will always raise an exception even when the key exists. The raise KeyError should be placed inside an else block to ensure it's only executed when the key is not found.

Apply this diff to fix the logic:

```diff
         if key_index is not None:
             self.counts[key_index] = value
+        else:
             raise KeyError(f"{key=} not found.")
```
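The comments above all target the same idea: a dict-like wrapper that maps keys through an index dict into a counts array. The following self-contained sketch (assumed structure and hypothetical class name, not the actual `AlignmentCounter`) shows how `__getitem__`, `__setitem__`, `get` and `setdefault` could behave consistently with these suggestions:

```python
from collections import Counter

import numpy as np


class IndexedCounts:
    """Toy counter keyed through an index dict into a numpy array (illustrative only)."""

    def __init__(self, keys):
        self.index = {key: i for i, key in enumerate(keys)}
        self.counts = np.zeros(len(keys), dtype="float64")

    def __getitem__(self, key):
        key_index = self.index.get(key)
        if key_index is None:
            return 0.0
        return self.counts[key_index]  # reuse key_index instead of a second lookup

    def __setitem__(self, key, value):
        key_index = self.index.get(key)
        if key_index is not None:
            self.counts[key_index] = value
        else:
            # only raise when the key really is missing
            raise KeyError(f"{key=} not found.")

    def get(self, key, default_val=None):
        key_index = self.index.get(key)
        if key_index is None:
            return default_val  # honour the caller-supplied default
        return Counter({key: self.counts[key_index]})

    def setdefault(self, key, default_val=0.0):
        # dict.setdefault semantics: insert the default for unknown keys, then return the value
        key_index = self.index.get(key)
        if key_index is None:
            self.index[key] = len(self.counts)
            self.counts = np.append(self.counts, default_val)
            return default_val
        return self.counts[key_index]


counts = IndexedCounts(["gene_a", "gene_b"])
counts["gene_a"] = 3.0
assert counts["gene_a"] == 3.0
assert counts["missing"] == 0.0
assert counts.get("missing", "fallback") == "fallback"
assert counts.setdefault("gene_c", 1.0) == 1.0
```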
gffquant/counters/count_manager.py
Outdated
```python
            (self.uniq_seqcounts, self.ambig_seqcounts,),
            (self.uniq_regioncounts, self.ambig_regioncounts,),
        )[region_counts]
        yield from set(uniq_counts).union(ambig_counts)
```
Handle potential None values in get_all_regions to prevent exceptions.
If ambig_counts is None, calling set(uniq_counts).union(ambig_counts) will raise a TypeError. Consider checking for None before performing the union to ensure robustness.
Apply this diff to handle None values:
```diff
         yield from set(uniq_counts).union(
-            ambig_counts
+            ambig_counts if ambig_counts is not None else set()
         )
```
Actionable comments posted: 4
🧹 Nitpick comments (1)
gffquant/counters/count_manager.py (1)
58-77: Clean up commented code and fix indentation.

The counter initialization is cleaner with the consolidated approach, but there are some housekeeping items:

- Remove the commented-out code blocks
- Fix the inconsistent indentation in the else block

```diff
         if region_counts:
-            # self.uniq_regioncounts = RegionCounter(strand_specific=strand_specific)
-            # self.ambig_regioncounts = RegionCounter(
-            #     strand_specific=strand_specific,
-            #     distribution_mode=distribution_mode,
-            # )
             self.regioncounts = RegionCounter(
                 strand_specific=strand_specific,
                 distribution_mode=distribution_mode,
             )
         else:
-            # self.uniq_seqcounts = AlignmentCounter(strand_specific=strand_specific)
-            # self.ambig_seqcounts = AlignmentCounter(
-            #     strand_specific=strand_specific,
-            #     distribution_mode=distribution_mode
-            # )
             self.seqcounts = AlignmentCounter(
                 strand_specific=strand_specific,
                 distribution_mode=distribution_mode,
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- gffquant/annotation/count_annotator.py (4 hunks)
- gffquant/counters/alignment_counter2.py (1 hunks)
- gffquant/counters/count_manager.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- gffquant/counters/alignment_counter2.py
- gffquant/annotation/count_annotator.py
🔇 Additional comments (1)
gffquant/counters/count_manager.py (1)
190-191:
Handle potential None references in get_regions.
The method assumes self.uniq_regioncounts and self.ambig_regioncounts exist, but they might be None.
```diff
-        return set(self.uniq_regioncounts.get(rid, Counter())).union(
-            self.ambig_regioncounts.get(rid, Counter())
+        if self.regioncounts is None:
+            return set()
+        return set(self.regioncounts.get(rid, Counter()))
```

Likely invalid or redundant comment.
gffquant/counters/count_manager.py
Outdated
```python
        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
        self.seqcounts, self.regioncounts = None, None
```
🛠️ Refactor suggestion
Remove commented out code and document the architectural change.
The consolidation of separate counters into single instances is a good architectural improvement. However:
- The commented-out code should be removed as it can be retrieved from version control if needed.
- Consider adding a docstring explaining the new counter structure and its advantages.
```diff
-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
+        """
+        Initialize counters that handle both unique and ambiguous counts.
+        This consolidation improves maintainability and reduces code duplication.
+        """
         self.seqcounts, self.regioncounts = None, None
```
gffquant/counters/count_manager.py
Outdated
```diff
 from .. import DistributionMode
-from .alignment_counter import AlignmentCounter
+from .alignment_counter2 import AlignmentCounter
```
💡 Codebase verification
Migration from alignment_counter.py to alignment_counter2.py requires attention
The verification reveals that both the old alignment_counter.py and new alignment_counter2.py modules coexist in the codebase. While imports have been updated to use the new module, the presence of both files could lead to confusion and maintenance issues:
- All imports now correctly reference alignment_counter2.py
- The old AlignmentCounter class in alignment_counter.py still exists and inherits from Counter
- The new AlignmentCounter class in alignment_counter2.py is a standalone class
- Several classes inherit from the new AlignmentCounter: UniqueSeqCounter, AmbiguousSeqCounter, and RegionCounter
Consider:
- Removing the deprecated alignment_counter.py if it's no longer needed
- Adding a deprecation notice if backward compatibility is required
- Documenting the migration in the codebase
🔗 Analysis chain
Verify the impact of the AlignmentCounter import change.
The change from alignment_counter to alignment_counter2 suggests a significant refactor. Let's verify that all dependent code has been updated accordingly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for any remaining imports of the old module
rg "from.*alignment_counter\s+import" --type py

# Search for any direct usage of alignment_counter
rg "alignment_counter\." --type py
```
Length of output: 86
Script:
```bash
#!/bin/bash
# Let's try a broader search for alignment_counter (case-insensitive)
rg -i "alignmentcounter" --type py

# Also search for the specific class name
rg "class AlignmentCounter" --type py

# Check if the old module file still exists
fd "alignment_counter.py"
```
Length of output: 1706
gffquant/counters/count_manager.py
Outdated
```python
        if self.seqcounts is not None:
            unannotated_reads += self.seqcounts
```
Fix missing attribute access in get_unannotated_reads.
The method is missing the .unannotated_reads attribute access which could lead to runtime errors.
```diff
         if self.regioncounts is not None:
-            unannotated_reads += self.regioncounts
+            unannotated_reads += self.regioncounts.unannotated_reads
         if self.seqcounts is not None:
-            unannotated_reads += self.seqcounts
+            unannotated_reads += self.seqcounts.unannotated_reads
```
+ unannotated_reads += self.seqcounts.unannotated_reads📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if self.seqcounts is not None: | |
| unannotated_reads += self.seqcounts | |
| if self.seqcounts is not None: | |
| unannotated_reads += self.seqcounts.unannotated_reads |
gffquant/counters/count_manager.py
Outdated
```python
    def get_all_regions(self, region_counts=False):
        # uniq_counts, ambig_counts = (
        #     (self.uniq_seqcounts, self.ambig_seqcounts,),
        #     (self.uniq_regioncounts, self.ambig_regioncounts,),
        # )[region_counts]
        # yield from set(uniq_counts).union(ambig_counts)
        counts = (
            self.seqcounts,
            self.regioncounts,
        )[region_counts]

        yield from counts
```
Handle potential None reference in get_all_regions.
The method should handle the case where the selected counter is None.
```diff
         counts = (
             self.seqcounts,
             self.regioncounts,
         )[region_counts]
+        if counts is None:
+            return
         yield from counts
```
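To illustrate why the early `return` matters, here is a standalone sketch (hypothetical function, not the project's code): `yield from None` raises a `TypeError`, whereas returning early from a generator simply produces an empty iteration.

```python
def yield_selected(seqcounts, regioncounts, region_counts=False):
    # pick one of the two containers; either may be None
    counts = (seqcounts, regioncounts)[region_counts]
    if counts is None:
        return  # ends the generator: iteration yields nothing instead of raising
    yield from counts


# With the guard, a missing counter yields an empty sequence rather than a TypeError.
assert list(yield_selected(None, {"region_1": 2}, region_counts=True)) == ["region_1"]
assert list(yield_selected(None, None, region_counts=False)) == []
```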
Actionable comments posted: 0
🧹 Nitpick comments (1)
gffquant/counters/count_manager.py (1)
151-151: Document or implement NotImplementedError cases.

Several methods raise NotImplementedError:

- get_counts() for region_counts
- get_counts() for strand_specific

Please either:
- Implement these methods
- Add documentation explaining why they're not implemented
- Create tracking issues for future implementation
Would you like me to help create GitHub issues to track these implementations?
Also applies to: 168-168
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/counters/alignment_counter2.py (1 hunks)
- gffquant/counters/count_manager.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- gffquant/counters/alignment_counter2.py
🔇 Additional comments (5)
gffquant/counters/count_manager.py (5)
6-6: Verify the complete migration to alignment_counter2.
The codebase appears to have both alignment_counter.py and alignment_counter2.py. While imports have been updated, having both modules could cause confusion.
53-77: Clean up commented code after counter consolidation.
The consolidation of separate counters into single instances is a good improvement. However, the commented-out code should be removed as it's now in version control.
142-145: Fix missing attribute access in get_unannotated_reads.
The method is missing the .unannotated_reads attribute access which could lead to runtime errors.
196-208: Handle potential None reference in get_all_regions.
The method should handle the case where the selected counter is None.
189-193:
Breaking change: get_regions now returns Counter instead of set.
The return type of get_regions() has changed from set to Counter. This is a breaking change that could affect dependent code.
Actionable comments posted: 0
🧹 Nitpick comments (5)
gffquant/annotation/genecount_annotator.py (1)
3-3: Remove unused import 'numpy' to clean up dependencies.
The code does not appear to use numpy, so consider removing it to reduce clutter.

```diff
- import numpy as np
```

🧰 Tools
🪛 Ruff (0.8.2)
3-3: numpy imported but unused

Remove unused import: numpy

(F401)
gffquant/counters/alignment_counter2.py (2)
35-39: Use 'default_val' in the 'get' method if intended to provide fallback.
Currently, the code returns a new Counter() instead of using 'default_val'. If the fallback should be 'default_val', consider updating the implementation. Otherwise, remove the unused parameter from the method signature.

Possible adjustment for the fallback:

```diff
-    def get(self, key, default_val):
+    def get(self, key, default_val=Counter()):
         key_index = self.index.get(key)
         if key_index is None:
-            return Counter()
+            return default_val
         return Counter({key: self.counts[key_index]})
```
86-86: Reinstate removed 'setdefault(key, ...)' usage or remove the commented-out line.
Line 86 references a commented-out usage of 'setdefault'. If you no longer use or plan to use that code, consider removing it to avoid confusion.

gffquant/counters/count_manager.py (2)
55-57: Remove commented-out code or document its purpose.
Commented code can cause confusion. If it’s no longer needed, remove it. Otherwise, add a comment explaining why it’s retained.

```diff
-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
         self.seqcounts, self.regioncounts = None, None
```
193-199: Remove or document commented code in 'get_regions'.
If the old approach is deprecated, remove it for clarity. Otherwise, document why it’s commented out.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- gffquant/annotation/count_annotator.py (6 hunks)
- gffquant/annotation/genecount_annotator.py (1 hunks)
- gffquant/counters/alignment_counter2.py (1 hunks)
- gffquant/counters/count_manager.py (5 hunks)
🧰 Additional context used
📓 Learnings (1)
gffquant/counters/alignment_counter2.py (1)
Learnt from: cschu
PR: cschu/gff_quantifier#55
File: gffquant/counters/alignment_counter2.py:20-20
Timestamp: 2024-12-16T20:17:49.179Z
Learning: In `gffquant/counters/alignment_counter2.py`, comments like `# 1overN = lavern. Maya <3` are acceptable and should not be flagged as non-descriptive or requiring removal.
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py
3-3: numpy imported but unused
Remove unused import: numpy
(F401)
🔇 Additional comments (11)
gffquant/annotation/genecount_annotator.py (1)
20-20: Clarify intention of setting 'self.total_counts' to 'self.total_gene_counts'.
It’s unclear whether this is meant to unify the counters or if it’s a placeholder for future logic. Confirm that the assignment meets your design goal.
gffquant/counters/alignment_counter2.py (1)
20-20: Comment is acceptable per user learnings; no removal needed.
Noting the user-provided preference to keep comments like "# 1overN = lavern. Maya <3" intact.
gffquant/counters/count_manager.py (4)
82-87: 'has_ambig_counts' logic looks clear; no issues identified.
It succinctly checks for ambiguous counts across sequence and region counters.
153-153: Method not implemented for region_counts=True.
This raises NotImplementedError. Confirm if that matches your design or if you plan to add region-counts logic later.
190-190: Clarify the return shape for 'get_counts' in unstranded mode.
The code returns np.array(...) with four columns: (counts[0], counts[2], counts[1], counts[3]). Ensure calling code expects that ordering.
213-218: Good integration of 'transform_counts' for both seqcounts and regioncounts.
The approach is consistent and extensible.
gffquant/annotation/count_annotator.py (5)
98-99: Be mindful of the variable order in 'total_counts'.
Swapping indices from (uniq_raw, uniq_norm, ambig_raw, ambig_norm) to (uniq_raw, ambig_raw, uniq_norm, ambig_norm) might cause confusion. Confirm that this reordering aligns with the rest of the code.
113-114: Same index-swapping question applies to 'total_gene_counts'.
Ensure that these array indices match usage in distribution logic, especially if referencing them by position.
204-215: Implementation of 'annotate' in 'RegionCountAnnotator' is straightforward.
It iterates over the new unified regioncounts approach correctly. The flow to query DB, compute counts, and distribute feature counts is logically consistent.
280-280: GeneCountAnnotator: annotate method signature with 'CountManager' is well-defined.
The additional type hint improves clarity for code using this method.
293-293: Check consistency of gene-group ID parsing with other modules.
Splitting reference strings by '.' might differ from how other classes or utilities parse gene IDs. Verify uniform usage across the codebase to avoid mismatched references.
Actionable comments posted: 0
🧹 Nitpick comments (1)
gffquant/annotation/__init__.py (1)
5-5: Remove commented import statement.

Instead of keeping the commented import line, it should be removed as it's no longer needed and adds unnecessary noise to the codebase.

```diff
-# from .count_annotator import GeneCountAnnotator, RegionCountAnnotator
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- gffquant/annotation/__init__.py (1 hunks)
🔇 Additional comments (1)
gffquant/annotation/__init__.py (1)
6-6: Verify the new module structure.
The GeneCountAnnotator has been moved to a new dedicated module. Let's verify that the new module exists and the imports are working correctly.
Also applies to: 8-8
✅ Verification successful
Let me verify the imports in the new module and check for any potential issues.
Module structure and imports have been successfully verified
The verification confirms that:
- The new genecount_annotator.py module exists and contains the GeneCountAnnotator class
- The imports in __init__.py have been correctly updated to import GeneCountAnnotator from the new module
- The old import is properly commented out
- All references to GeneCountAnnotator in the codebase are valid
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the new module structure and imports

# Check if the new module exists
echo "Checking for new module file..."
fd "genecount_annotator.py"

# Verify the class definition in the new location
echo "Checking GeneCountAnnotator class definition..."
ast-grep --pattern 'class GeneCountAnnotator'

# Check for any remaining references to the old import path
echo "Checking for any missed import updates..."
rg "from .count_annotator import GeneCountAnnotator"
```
Length of output: 548
Script:
```bash
#!/bin/bash
# Check the content of the new module and its imports
echo "Content of genecount_annotator.py:"
cat gffquant/annotation/genecount_annotator.py

# Check the actual imports in __init__.py
echo -e "\nContent of __init__.py:"
cat gffquant/annotation/__init__.py

# Check for any references to GeneCountAnnotator in the codebase
echo -e "\nReferences to GeneCountAnnotator:"
rg "GeneCountAnnotator" -l
```
Length of output: 2224
Actionable comments posted: 2
🧹 Nitpick comments (4)
gffquant/annotation/genecount_annotator.py (2)
18-46: Add type hints and documentation

The method would benefit from:

- Type hints for refmgr and db parameters
- A docstring explaining the method's purpose and parameters
- Proper error handling for malformed gene IDs

```diff
-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr: ReferenceManager, db: Database, count_manager: CountManager, gene_group_db: bool = False) -> None:
+        """
+        Annotate gene counts using the provided reference manager and database.
+
+        Args:
+            refmgr: Reference manager containing gene information
+            db: Database containing gene annotations
+            count_manager: Manager handling count operations
+            gene_group_db: If True, splits references into gene and group IDs
+        """
```
21-21: Clarify or remove the questionable comment

The comment `# ?` suggests uncertainty about the assignment. Either document the reason for this assignment or remove the comment if the assignment is confirmed to be correct.

gffquant/counters/count_manager.py (2)
55-78: Document the counter consolidation architecture

While the consolidation of counters is a good improvement, the changes should be properly documented.

```diff
-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
+        """
+        Initialize consolidated counters that handle both unique and ambiguous counts.
+        This architectural change improves maintainability by:
+        - Reducing code duplication
+        - Centralizing count management
+        - Simplifying the counter interface
+        """
         self.seqcounts, self.regioncounts = None, None
```
153-153: Add error message to NotImplementedError

Provide a descriptive message explaining what functionality is not implemented.

```diff
-        raise NotImplementedError()
+        raise NotImplementedError("Region-specific count retrieval with strand specificity is not yet implemented")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/annotation/genecount_annotator.py (1 hunks)
- gffquant/counters/count_manager.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py
3-3: numpy imported but unused
Remove unused import: numpy
(F401)
🔇 Additional comments (3)
gffquant/annotation/genecount_annotator.py (1)
12-17: LGTM! Clean class implementation
The class properly extends CountAnnotator and correctly initializes the parent class.
gffquant/counters/count_manager.py (2)
200-212: Handle potential None values in get_all_regions
If counts is None, the method will raise an exception.
```diff
         counts = (
             self.seqcounts,
             self.regioncounts,
         )[region_counts]
+        if counts is None:
+            return
         yield from counts
```

213-218: LGTM! Well-structured transform_counts method
The method properly handles both seqcounts and regioncounts cases with clear control flow.
```python
            if gene_group_db:
                ref_tokens = ref.split(".")
                gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
            else:
                gene_id, ggroup_id = ref, ref
```
🛠️ Refactor suggestion
Add error handling for malformed gene IDs
The gene ID splitting logic should handle potential malformed IDs gracefully.
```diff
         if gene_group_db:
-            ref_tokens = ref.split(".")
-            gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+            try:
+                ref_tokens = ref.split(".")
+                if len(ref_tokens) < 2:
+                    raise ValueError(f"Malformed gene ID: {ref}")
+                gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+            except Exception as e:
+                logger.error("Failed to parse gene ID %s: %s", ref, str(e))
+                gene_id = ggroup_id = ref
         else:
             gene_id, ggroup_id = ref, ref
```
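For context on the split being guarded here, a hedged sketch with made-up identifiers (not the project's code): a reference such as `geneX.cluster7` is divided into gene and group parts at the last dot, which `str.rsplit` expresses directly.

```python
def split_gene_ref(ref: str, gene_group_db: bool = True) -> tuple[str, str]:
    """Split 'gene_id.group_id' at the last dot; fall back to the full ref if no dot is present."""
    if gene_group_db and "." in ref:
        gene_id, ggroup_id = ref.rsplit(".", 1)
        return gene_id, ggroup_id
    return ref, ref


assert split_gene_ref("geneX.ninety.cluster7") == ("geneX.ninety", "cluster7")
assert split_gene_ref("plain_gene") == ("plain_gene", "plain_gene")
```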
Actionable comments posted: 0
🧹 Nitpick comments (2)
gffquant/annotation/genecount_annotator.py (2)
21-21: Remove or clarify commented code

The commented line `# self.total_counts = self.total_gene_counts  # ?` suggests uncertainty in the implementation. Either remove it or document why it might be needed in the future.
18-18: Add type hints for all parameters

Consider adding type hints for all parameters in the annotate method for better code maintainability and IDE support.

```diff
-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr: Any, db: Any, count_manager: CountManager, gene_group_db: bool = False):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- gffquant/annotation/count_annotator.py (7 hunks)
- gffquant/annotation/count_writer.py (1 hunks)
- gffquant/annotation/genecount_annotator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- gffquant/annotation/count_annotator.py
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py
3-3: numpy imported but unused
Remove unused import: numpy
(F401)
🔇 Additional comments (4)
gffquant/annotation/genecount_annotator.py (3)
3-3: Remove unused import
The numpy module is imported but never used in this file.
```diff
-import numpy as np
```

🧰 Tools
🪛 Ruff (0.8.2)
3-3: numpy imported but unused
Remove unused import: numpy
(F401)
12-17: LGTM: Clean class implementation
The class definition and initialization are well-structured with proper inheritance pattern.
27-31: Add error handling for gene ID splitting
The gene ID splitting logic should handle malformed IDs gracefully.
```diff
         if gene_group_db:
-            ref_tokens = ref.split(".")
-            gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+            try:
+                ref_tokens = ref.split(".")
+                if len(ref_tokens) < 2:
+                    raise ValueError(f"Malformed gene ID: {ref}")
+                gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+            except Exception as e:
+                logger.error("Failed to parse gene ID %s: %s", ref, str(e))
+                gene_id = ggroup_id = ref
         else:
             gene_id, ggroup_id = ref, ref
```

gffquant/annotation/count_writer.py (1)
145-145: LGTM: Appropriate logging enhancement
The added logging statement follows the existing logging pattern and provides useful debugging information for category counts.
Actionable comments posted: 0
🧹 Nitpick comments (1)
gffquant/annotation/count_writer.py (1)
172-178: Remove commented-out code

Instead of keeping commented-out code, it should be removed as it can be retrieved from version control if needed.

```diff
-        # for gene, g_counts in sorted(gene_counts.items()):
-        #     out_row = self.compile_output_row(
-        #         g_counts,
-        #         scaling_factor=uniq_scaling_factor,
-        #         ambig_scaling_factor=ambig_scaling_factor
-        #     )
-        #     CountWriter.write_row(gene, out_row, stream=gene_out)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- gffquant/annotation/count_writer.py (3 hunks)
🔇 Additional comments (4)
gffquant/annotation/count_writer.py (4)
11-11: LGTM: Import addition is appropriate
The addition of the CountManager import aligns with its usage in type hinting and supports the refactoring of the write_gene_counts method.
147-147: LGTM: Helpful logging enhancement
The addition of category logging provides valuable debugging context for category processing.
166-166: LGTM: Type hint addition improves code clarity
The addition of type hints for the gene_counts parameter improves code maintainability and IDE support.
179-186: Verify the impact of switching from gene names to region IDs
The refactoring changes the output identifier from gene names to region IDs (rid). This could impact downstream processes that expect gene names in the output file.
✅ Verification successful
Let me gather more information about how the gene counts data is being used in tests and other parts of the codebase.
Let me try a different approach to search for tests and usages of the gene counts functionality.
Based on the search results, I can now generate a final response as we have sufficient context about how gene counts are used throughout the codebase.
The switch from gene names to region IDs is part of the intended design
The change from gene names to region IDs (rid) is consistent with the codebase's design. Looking at the evidence:
- The db_collate.py script explicitly handles both "feature" and "gene" identifiers in the count files
- The GeneCountAnnotator class shows that region IDs are properly mapped to gene IDs internally
- The commented-out code in count_writer.py shows this was an intentional refactoring to use region IDs consistently
The change maintains compatibility with downstream processing as the collation scripts are already designed to handle both identifier types.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for any downstream code that might be processing the gene_counts.txt.gz file
rg -l "gene_counts\.txt\.gz" | while read -r file; do
    echo "=== $file ==="
    # Show the context of how the file is being used
    rg -C 5 "gene_counts\.txt\.gz" "$file"
done
```
Length of output: 870
Script:
```bash
#!/bin/bash
# Search for test files that might be validating the gene counts output format
fd test_.*\.py$ | while read -r file; do
    echo "=== $file ==="
    rg -C 5 "gene_counts" "$file"
done

# Also search for any code that reads/processes the gene counts file
ast-grep --pattern 'open($_, "r")' | rg -C 5 "gene_counts"
```
Length of output: 92
Script:
```bash
#!/bin/bash
# Search for test files and any code that processes gene counts with broader patterns
rg -l "gene_counts" | while read -r file; do
    echo "=== $file ==="
    rg -C 5 "gene_counts" "$file"
done

# Also look for any test data files that might contain gene counts
fd -e txt -e tsv -e csv | xargs rg -l "gene"
```
Length of output: 6974
Actionable comments posted: 0
🧹 Nitpick comments (1)
gffquant/profilers/feature_quantifier.py (1)
158-197: Add documentation for category processing logic.

The category processing logic would benefit from documentation explaining:
- The purpose of category_sum and its 6-element structure
- The scaling factor calculation methodology
- The relationship between features and categories
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
🔇 Additional comments (5)
gffquant/profilers/feature_quantifier.py (5)
13-20: LGTM! Import changes align with the architectural updates.

The imports reflect the transition from CountManager to AlignmentCounter and the introduction of ReferenceHit.
65-69: LGTM! Proper initialization of AlignmentCounter.The initialization maintains the necessary parameters (distribution_mode, strand_specific, paired_end_count) for feature quantification.
202-249: Remove commented-out code block.The large block of commented code should be removed if it's no longer needed. If it's kept for reference, consider moving it to documentation or creating a ticket to clean it up later.
286-286: LGTM! Alignment processing changes are well-integrated.

The changes properly integrate with the new AlignmentCounter API:

- Appropriate handling of single reads with toggle_single_read_handling
- Updated counter update logic with correct parameters for ambiguous counts and paired-end handling
Also applies to: 507-524
133-133: 🛠️ Refactor suggestion

Remove unused variable.

The count_annotator variable is assigned but never used.

```diff
-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)
```

Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)
Actionable comments posted: 0
♻️ Duplicate comments (2)
gffquant/counters/alignment_counter.py (2)
73-82: ⚠️ Potential issue

Implement the dump method.

The method raises NotImplementedError but has a commented implementation. This will cause runtime errors when called.

Restore the implementation:

```diff
     def dump(self, prefix, refmgr):
-        raise NotImplementedError()
         with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
             for key, key_index in self.index.items():
                 ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
                 print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
```
143-148: 🛠️ Refactor suggestion

Add error handling for gene length retrieval.

The gene length retrieval from refmgr lacks validation. Invalid lengths could cause issues in downstream calculations.

```diff
         gene_lengths = np.array(
             tuple(
-                (refmgr.get(key[0] if isinstance(key, tuple) else key))[1]
+                next(
+                    (length for _, length in [refmgr.get(key[0] if isinstance(key, tuple) else key)] if length > 0),
+                    1  # Default length if invalid
+                )
                 for key, _ in self.counts
             )
         )
```
🧹 Nitpick comments (4)
gffquant/counters/alignment_counter.py (2)
20-23: Enhance documentation for strand constants.

While the comment explains the origin of the counter-intuitive strand values from SAM flags, consider adding more context about how these values map to biological strand orientation.

```diff
-    # this may be counter-intuitive
-    # but originates from the samflags 0x10, 0x20,
-    # which explicitly identify the reverse-strandness of the read
+    # These values may seem counter-intuitive but are derived from SAM flags:
+    # - PLUS_STRAND = False: Forward strand (SAM flag 0x10 not set)
+    # - MINUS_STRAND = True: Reverse strand (SAM flag 0x10 set)
+    # This mapping maintains consistency with the SAM specification.
```

111-114: Define unannotated reads key as a class constant.

The hardcoded hash value used for tracking unannotated reads should be defined as a class constant for better maintainability.

```diff
 class AlignmentCounter:
+    # Special key for tracking reads that don't map to any annotated features
+    UNANNOTATED_READS_KEY = "c591b65a0f4cd"
+
     COUNT_HEADER_ELEMENTS = ("raw", "lnorm", "scaled")

     def get_unannotated_reads(self):
-        return self.counts["c591b65a0f4cd"][0]
+        return self.counts[self.UNANNOTATED_READS_KEY][0]
```
133-133: Remove unused variable.The
count_annotatorvariable is assigned but never used.- count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable
count_annotatoris assigned to but never usedRemove assignment to unused variable
count_annotator(F841)
Line range hint
507-512: Enhance logging for counter updates.The debug message could be more informative by including the contributed counts in a more readable format.
msg = f"{ambig_hit_counts=} {contributed_counts=})" + logger.debug( + "Counter update: ambiguous=%s, paired=%s, contributed=%s", + is_ambiguous_group, + aln_group.is_paired(), + contributed_counts + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/counters/alignment_counter.py (2 hunks)
- gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
🔇 Additional comments (1)
gffquant/profilers/feature_quantifier.py (1)
202-249: Remove commented-out code block.This large block of commented code appears to be an old implementation. If it's no longer needed, it should be removed to improve code maintainability.
Would you like me to help create a GitHub issue to track this cleanup task?
Actionable comments posted: 3
🧹 Nitpick comments (7)
gffquant/counters/alignment_counter.py (7)
32-35: Remove unprofessional comment.The comment "1overN = lavern. Maya <3" should be removed or replaced with a professional explanation of the one-over-N distribution mode.
- # 1overN = lavern. Maya <3 + # Return increment/n_aln when using one-over-N distribution mode
37-56: Remove commented-out code.The old code should be removed as it's tracked in version control.
- # old code: - # increment = 1 if (not pair or self.paired_end_count == 2) else 0.5 - - # if pair: - # increment = 1 if self.paired_end_count == 2 else 0.5 - # else: - # increment = 0.5 if self.unmarked_orphans else 1
58-70: Add type hints and docstring.The constructor should have type hints and a docstring explaining its parameters.
def __init__( self, - distribution_mode=DistributionMode.ONE_OVER_N, - strand_specific=False, - paired_end_count=1, + distribution_mode: DistributionMode = DistributionMode.ONE_OVER_N, + strand_specific: bool = False, + paired_end_count: int = 1, - ): + ) -> None: + """Initialize the AlignmentCounter. + + Args: + distribution_mode: Mode for distributing counts (default: ONE_OVER_N) + strand_specific: Whether to track strand-specific counts (default: False) + paired_end_count: Count for paired-end reads (default: 1) + """
83-85: Remove commented-out code.The old implementation should be removed as it's tracked in version control.
- # return bool(self.counts[:, 1].sum() != 0)
111-114: Make unannotated reads key a class constant.The hardcoded key "00000000" should be a class constant with a descriptive name. Also, remove the commented-out code.
+ UNANNOTATED_READS_KEY = "00000000" # Key for tracking unannotated reads + def get_unannotated_reads(self): - # return self.counts["c591b65a0f4cd46d5125745a40c8c056"][0] - # return self.counts["c591b65a0f4cd"][0] - return self.counts["00000000"][0] + return self.counts[self.UNANNOTATED_READS_KEY][0]
116-136: Add type hints and docstring.The method should have type hints and a docstring explaining its parameters and return value.
- def update_counts(self, count_stream, increment=1, ambiguous_counts=False): + def update_counts( + self, + count_stream: Iterator[Tuple[List[ReferenceHit], int]], + increment: float = 1, + ambiguous_counts: bool = False, + ) -> float: + """Update counts for alignments in the stream. + + Args: + count_stream: Iterator of (hits, alignment_count) tuples + increment: Count increment value (default: 1) + ambiguous_counts: Whether hits are ambiguous (default: False) + + Returns: + float: Total contributed counts + """
155-159: Add type hints and docstring.The method should have type hints and a docstring explaining its parameters and return value.
@staticmethod - def calculate_scaling_factor(raw, norm): + def calculate_scaling_factor(raw: float, norm: float) -> float: + """Calculate scaling factor from raw and normalized values. + + Args: + raw: Raw count value + norm: Normalized count value + + Returns: + float: Scaling factor (1.0 if norm is 0, otherwise raw/norm) + """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/annotation/count_writer.py (4 hunks)
- gffquant/counters/alignment_counter.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/count_writer.py
164-167: Use ternary operator row = counts[key] if counts.has_record(key) else empty_row instead of if-else-block
Replace if-else-block with row = counts[key] if counts.has_record(key) else empty_row
(SIM108)
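The SIM108 hint amounts to collapsing a two-branch assignment into a single expression. A hedged sketch with placeholder objects (a plain dict stands in for the reviewed `counts` object and its `has_record` check):

```python
# Placeholder stand-ins for the reviewed objects (illustrative only).
counts = {"gene_a": (3, 1.5, 6.0)}
empty_row = (0, 0.0, 0.0)
key = "gene_b"

# The if-else block flagged by SIM108 ...
if key in counts:
    row = counts[key]
else:
    row = empty_row

# ... and the equivalent ternary expression Ruff suggests.
row = counts[key] if key in counts else empty_row
assert row == empty_row
```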
🔇 Additional comments (3)
gffquant/counters/alignment_counter.py (3)
Line range hint 25-30: LGTM!

The function correctly normalizes and scales feature counts.
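As a rough sketch of what "normalizes and scales" typically means in this context (length-normalise raw counts, then rescale so the totals match the raw totals — an assumption about the intent, not the exact implementation):

```python
import numpy as np


def normalise_and_scale(raw_counts: np.ndarray, gene_lengths: np.ndarray):
    """Length-normalise counts, then scale them back to the magnitude of the raw counts."""
    normalised = raw_counts / gene_lengths
    norm_total = normalised.sum()
    scaling_factor = raw_counts.sum() / norm_total if norm_total else 1.0
    scaled = normalised * scaling_factor
    return raw_counts, normalised, scaled


raw = np.array([10.0, 30.0])
lengths = np.array([1000.0, 1500.0])
counts, normalised, scaled = normalise_and_scale(raw, lengths)
assert np.isclose(scaled.sum(), raw.sum())
```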
138-153: 🛠️ Refactor suggestion

Add error handling for gene lengths.

The method should handle cases where gene lengths might be missing or invalid.

```diff
         gene_lengths = np.array(
             tuple(
-                (refmgr.get(key[0] if isinstance(key, tuple) else key))[1]
+                next(
+                    (length for _, length in [refmgr.get(key[0] if isinstance(key, tuple) else key)]
+                     if length and length > 0),
+                    1  # Default length if missing or invalid
+                )
                 for key, _ in self.counts
             )
         )
+        if (gene_lengths <= 0).any():
+            logger.warning("Some genes have invalid lengths (<=0). Using default length of 1.")
```

Likely invalid or redundant comment.
161-168: 🛠️ Refactor suggestion: Add error handling for malformed gene IDs.
The method assumes gene IDs always contain a dot separator. Add error handling for malformed IDs.
```diff
         ggroups = (
-            (refmgr.get(key[0] if isinstance(key, tuple) else key))[0].split(".")[0]
+            next(
+                (ref.split(".")[0] for ref, _ in [refmgr.get(key[0] if isinstance(key, tuple) else key)]
+                 if "." in ref),
+                (refmgr.get(key[0] if isinstance(key, tuple) else key))[0]  # Use full ID if no dot
+            )
             for key, _ in self.counts
         )
+        logger.info("Grouped %d gene IDs", len(set(ggroups)))
```
Likely invalid or redundant comment.
Code under review (new implementation; lines removed by the PR are omitted):
```python
    def dump(self, prefix, refmgr):
        raise NotImplementedError()
        with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
            for key, key_index in self.index.items():
                ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
                print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
            # for k, v in self.items():
            #     ref, reflen = refmgr.get(k[0] if isinstance(k, tuple) else k)
            #     print(k, ref, reflen, v, sep="\t", file=_out)
```
Implement the dump method.
The method raises NotImplementedError but contains commented-out implementation. Either implement the method or remove it if not needed.
def dump(self, prefix, refmgr):
- raise NotImplementedError()
- with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
- for key, key_index in self.index.items():
- ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
- print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
- # for k, v in self.items():
- # ref, reflen = refmgr.get(k[0] if isinstance(k, tuple) else k)
- # print(k, ref, reflen, v, sep="\t", file=_out)
+ """Dump counts to a file.
+
+ Args:
+ prefix: Output file prefix
+ refmgr: Reference manager for looking up references
+ """
+ with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
+ for key, key_index in self.index.items():
+ ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
+ print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
Code under review (new implementation; lines removed by the PR are omitted):
```python
    def write_gene_counts(
        self,
        gene_counts: AlignmentCounter,
        refmgr,
        gene_group_db=False,
    ):
        with gzip.open(f"{self.out_prefix}.gene_counts.txt.gz", "wt") as gene_out:
            print("gene", *self.get_header(), sep="\t", file=gene_out, flush=True)

            ref_stream = (
                (
                    refmgr.get(rid[0] if isinstance(rid, tuple) else rid)[0],
                    rid,
                )
                for rid, _ in gene_counts
            )

            for ref, rid in sorted(ref_stream):
                counts = gene_counts[rid]
                # if gene_group_db:
                #     ref_tokens = ref.split(".")
                #     gene_id, _ = ".".join(ref_tokens[:-1]), ref_tokens[-1]
                # else:
                #     gene_id = ref
                gene_id = ref

                CountWriter.write_row(gene_id, counts, stream=gene_out,)
```
🛠️ Refactor suggestion
Multiple improvements needed.
- Remove commented-out code
- Add error handling for file operations and reference lookups
- Add type hints and docstring
def write_gene_counts(
self,
- gene_counts: AlignmentCounter,
- refmgr,
- gene_group_db=False,
+ gene_counts: AlignmentCounter,
+ refmgr: 'ReferenceManager',
+ gene_group_db: bool = False,
- ):
+ ) -> None:
+ """Write gene counts to a gzipped output file.
+
+ Args:
+ gene_counts: Counter containing gene alignment data
+ refmgr: Reference manager for gene lookups
+ gene_group_db: Whether to parse gene IDs as group.version format
+
+ Raises:
+ IOError: If file operations fail
+ ValueError: If reference lookup fails
+ """
+ try:
with gzip.open(f"{self.out_prefix}.gene_counts.txt.gz", "wt") as gene_out:
print("gene", *self.get_header(), sep="\t", file=gene_out, flush=True)
ref_stream = (
(
- refmgr.get(rid[0] if isinstance(rid, tuple) else rid)[0],
+ ref_result := refmgr.get(rid[0] if isinstance(rid, tuple) else rid),
rid,
)
for rid, _ in gene_counts
+ if ref_result is not None
)
for ref, rid in sorted(ref_stream):
+ if not ref:
+ logger.error("Reference not found for rid: %s", rid)
+ continue
counts = gene_counts[rid]
- # if gene_group_db:
- # ref_tokens = ref.split(".")
- # gene_id, _ = ".".join(ref_tokens[:-1]), ref_tokens[-1]
- # else:
- # gene_id = ref
gene_id = ref
CountWriter.write_row(gene_id, counts, stream=gene_out,)
+ except IOError as e:
+ logger.error("Failed to write gene counts: %s", str(e))
+ raise
Code under review (new implementation; lines removed by the PR are omitted):
```python
    def write_category(
        self,
        category_id,
        category_name,
        category_sum,
        counts,
        # feature_names,
        features,
        unannotated_reads=None,
        report_unseen=True,
    ):
        with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
            header = self.get_header()
            print("feature", *header, sep="\t", file=feat_out)

            if unannotated_reads is not None:
                print("unannotated", unannotated_reads, sep="\t", file=feat_out)

            if "total_readcount" in self.publish_reports:
                CountWriter.write_row(
                    "total_reads",
                    np.zeros(len(header)) + self.total_readcount,
                    stream=feat_out,
                )

            if "filtered_readcount" in self.publish_reports:
                CountWriter.write_row(
                    "filtered_reads",
                    np.zeros(len(header)) + self.filtered_readcount,
                    stream=feat_out,
                )

            if "category" in self.publish_reports:
                # cat_counts = counts[0]
                cat_counts = category_sum
                logger.info("CAT %s: %s", category_name, str(cat_counts))
                if cat_counts is not None:
                    CountWriter.write_row("category", category_sum, stream=feat_out)

            # for item in counts:
            #     if not isinstance(item[0], tuple):
            #         logger.info("ITEM: %s", str(item))
            #         raise TypeError(f"Weird key: {str(item)}")
            #     (cid, fid), fcounts = item
            #     if (report_unseen or fcounts.sum()) and cid == category_id:
            #         CountWriter.write_row(feature_names[fid], fcounts, stream=feat_out,)

            empty_row = np.zeros(6, dtype=CountMatrix.NUMPY_DTYPE)
            for feature in features:
                key = (category_id, feature.id)
                if counts.has_record(key):
                    row = counts[key]
                else:
                    row = empty_row
                if (report_unseen or row.sum()):
                    CountWriter.write_row(feature.name, row, stream=feat_out,)

            # for (cid, fid), fcounts in counts:
            #     if (report_unseen or fcounts.sum()) and cid == category_id:
            #         CountWriter.write_row(feature_names[fid], fcounts, stream=feat_out,)
```
🛠️ Refactor suggestion
Multiple improvements needed.
- Remove commented-out code and parameters
- Add error handling for file operations
- Simplify the row assignment logic
- Add type hints and docstring
def write_category(
self,
- category_id,
- category_name,
- category_sum,
- counts,
- # feature_names,
- features,
- unannotated_reads=None,
- report_unseen=True,
+ category_id: int,
+ category_name: str,
+ category_sum: np.ndarray,
+ counts: CountMatrix,
+ features: List[Feature],
+ unannotated_reads: Optional[float] = None,
+ report_unseen: bool = True,
- ):
+ ) -> None:
+ """Write category counts to a gzipped output file.
+
+ Args:
+ category_id: Category identifier
+ category_name: Category name
+ category_sum: Sum of counts for the category
+ counts: Matrix of counts
+ features: List of features
+ unannotated_reads: Count of unannotated reads
+ report_unseen: Whether to report features with zero counts
+
+ Raises:
+ IOError: If file operations fail
+ """
- with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
+ try:
+ with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
# ... rest of the implementation ...
- if counts.has_record(key):
- row = counts[key]
- else:
- row = empty_row
+ row = counts[key] if counts.has_record(key) else empty_row
+ except IOError as e:
+ logger.error("Failed to write category %s: %s", category_name, str(e))
+ raise
Complete suggested change:
```python
    def write_category(
        self,
        category_id: int,
        category_name: str,
        category_sum: np.ndarray,
        counts: CountMatrix,
        features: List[Feature],
        unannotated_reads: Optional[float] = None,
        report_unseen: bool = True,
    ) -> None:
        """Write category counts to a gzipped output file.

        Args:
            category_id: Category identifier
            category_name: Category name
            category_sum: Sum of counts for the category
            counts: Matrix of counts
            features: List of features
            unannotated_reads: Count of unannotated reads
            report_unseen: Whether to report features with zero counts

        Raises:
            IOError: If file operations fail
        """
        try:
            with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
                header = self.get_header()
                print("feature", *header, sep="\t", file=feat_out)
                if unannotated_reads is not None:
                    print("unannotated", unannotated_reads, sep="\t", file=feat_out)
                if "total_readcount" in self.publish_reports:
                    CountWriter.write_row(
                        "total_reads",
                        np.zeros(len(header)) + self.total_readcount,
                        stream=feat_out,
                    )
                if "filtered_readcount" in self.publish_reports:
                    CountWriter.write_row(
                        "filtered_reads",
                        np.zeros(len(header)) + self.filtered_readcount,
                        stream=feat_out,
                    )
                if "category" in self.publish_reports:
                    cat_counts = category_sum
                    logger.info("CAT %s: %s", category_name, str(cat_counts))
                    if cat_counts is not None:
                        CountWriter.write_row("category", category_sum, stream=feat_out)
                empty_row = np.zeros(6, dtype=CountMatrix.NUMPY_DTYPE)
                for feature in features:
                    key = (category_id, feature.id)
                    row = counts[key] if counts.has_record(key) else empty_row
                    if (report_unseen or row.sum()):
                        CountWriter.write_row(feature.name, row, stream=feat_out,)
        except IOError as e:
            logger.error("Failed to write category %s: %s", category_name, str(e))
            raise
```
🧰 Tools
🪛 Ruff (0.8.2)
164-167: Use ternary operator row = counts[key] if counts.has_record(key) else empty_row instead of if-else-block
Replace if-else-block with row = counts[key] if counts.has_record(key) else empty_row
(SIM108)
Actionable comments posted: 1
🧹 Nitpick comments (3)
gffquant/counters/count_matrix.py (3)
25-29: Remove commented-out code. The commented-out code block appears to be an alternative implementation. If it's no longer needed, please remove it to improve code readability.
89-129: Improve memory efficiency in generate_gene_counts. The method creates multiple temporary arrays that could consume significant memory with large datasets:
- Initial copy of counts array
- Column stacking operation
- Multiple array operations
Consider using in-place operations where possible:
```diff
 def generate_gene_counts(self, lengths):
+    # Pre-allocate final array size to avoid multiple allocations
+    final_shape = (len(self.index), 6)
+    if self.counts.shape != final_shape:
+        self.counts = np.zeros(final_shape, dtype=self.NUMPY_DTYPE)
     # calculate combined_raw
     counts[:, 1:2] += counts[:, 0:1]
-    counts = np.column_stack(...)  # Avoid creating new array
+    # Reshape existing array and perform operations in-place
```
32-37: Add docstring to calculate_scaling_factor. The static method needs documentation explaining its purpose and edge cases.
Add this docstring:
```diff
 @staticmethod
 def calculate_scaling_factor(raw, norm):
+    """Calculate scaling factor between raw and normalized counts.
+
+    Args:
+        raw: Raw count value
+        norm: Normalized count value
+
+    Returns:
+        float: Scaling factor (1.0 if norm is zero)
+    """
     if norm == 0.0:
         return 1.0
     return raw / norm
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gffquant/counters/count_matrix.py(1 hunks)
🔇 Additional comments (2)
gffquant/counters/count_matrix.py (2)
1-2: Add comprehensive module documentation. The module needs proper documentation explaining its purpose, usage, and key components.
11-12: Document overflow scenarios with float16. Please document the specific scenarios where float16 caused overflow issues. This will help future maintainers understand why float64 was chosen and when it might be safe to optimize by using float16.
Run this script to find potential overflow scenarios in tests:
✅ Verification successful
Overflow scenarios with float16 documented:
The float16 precision causes overflow in several critical numerical operations:
- Length normalization (counts/length) overflows when normalized values exceed 65504
- RPKM calculations using 1e9 factor exceed float16 range
- Scaling factors from raw/normalized count ratios can overflow with large datasets
- Small values from 1/N distribution of ambiguous reads risk precision loss when accumulated
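A minimal sketch (illustrative values only, not taken from the test suite) of the first two failure modes listed above, plus the precision loss from small 1/N increments:

```python
import numpy as np

print(np.finfo(np.float16).max)          # 65504.0 -- the float16 ceiling

rpkm_like = np.float16(70000.0)          # e.g. a 1e9-scaled, RPKM-style value
print(rpkm_like)                         # inf -> overflow

acc16 = np.float16(2048.0)
acc64 = np.float64(2048.0)
for _ in range(100):
    acc16 += np.float16(1.0 / 3.0)       # 1/N share of an ambiguous read
    acc64 += 1.0 / 3.0
print(acc16, acc64)                      # 2048.0 vs ~2081.33 -> increments are lost
```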
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for test cases that might trigger float16 overflow rg -A 5 "test.*overflow|test.*float|assert.*overflow" tests/Length of output: 109
Script:
#!/bin/bash # Find test files and search for float-related patterns echo "=== Python test files ===" fd -e py "test" --type f echo -e "\n=== Float-related patterns in Python files ===" rg -l "float|numpy.*(float|dtype)|np.*(float|dtype)" echo -e "\n=== Documentation about float precision ===" fd "README|CONTRIBUTING" --type f --exec cat {}Length of output: 8321
Script:
#!/bin/bash # Look for numerical operations in key files echo "=== Count Matrix Implementation ===" cat gffquant/counters/count_matrix.py echo -e "\n=== Feature Quantifier Implementation ===" cat gffquant/profilers/feature_quantifier.py echo -e "\n=== Count Writer Implementation ===" cat gffquant/annotation/count_writer.pyLength of output: 32070
Code under review:
```python
    def group_gene_counts(self, ggroups):
        ggroup_index = {}
        # for (key, _), ggroup_id in zip(self.index.items(), ggroups):
        #     g_key_index = ggroup_index.get(ggroup_id)
        #     gene_counts = self.counts[self.index[key]]
        for gene_id, gene_counts in self:
            ggroup_id = gene_id.split(".")[-1]
            g_key_index = ggroup_index.get(ggroup_id)
            if g_key_index is None:
                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
                self.counts[g_key_index] = gene_counts
            else:
                self.counts[g_key_index] += gene_counts
```
🛠️ Refactor suggestion
Improve gene ID parsing robustness.
The current implementation has several issues:
- Gene ID parsing assumes a specific format (split by ".") which might not be universal
- No validation of ggroup_id after splitting
- Potential memory inefficiency when building the group index
Consider this safer implementation:
def group_gene_counts(self, ggroups):
ggroup_index = {}
- for gene_id, gene_counts in self:
- ggroup_id = gene_id.split(".")[-1]
+ for gene_id, gene_counts in self:
+ try:
+ ggroup_id = gene_id.split(".")[-1]
+ if not ggroup_id:
+ logger.warning(f"Empty group ID found for gene {gene_id}")
+ continue
+ except IndexError:
+ logger.warning(f"Invalid gene ID format: {gene_id}")
+ continue
g_key_index = ggroup_index.get(ggroup_id)
if g_key_index is None:
g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
Complete suggested change:
```python
    def group_gene_counts(self, ggroups):
        ggroup_index = {}
        for gene_id, gene_counts in self:
            try:
                ggroup_id = gene_id.split(".")[-1]
                if not ggroup_id:
                    logger.warning(f"Empty group ID found for gene {gene_id}")
                    continue
            except IndexError:
                logger.warning(f"Invalid gene ID format: {gene_id}")
                continue
            g_key_index = ggroup_index.get(ggroup_id)
            if g_key_index is None:
                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
                self.counts[g_key_index] = gene_counts
            else:
                self.counts[g_key_index] += gene_counts
```
Actionable comments posted: 0
🧹 Nitpick comments (4)
gffquant/counters/count_matrix.py (4)
12-12: Document the overflow threshold. The comment indicates that 'float16' caused overflow issues during testing. It would be helpful to document the specific threshold or scenario that triggered the overflow to justify using 'float64'.
```diff
-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'  # float16 overflows for counts > 65504 during testing
```
25-29: Remove commented-out code. The commented-out dictionary comprehension is redundant as the current implementation using a loop is more readable and serves the same purpose.
52-59: Extract magic number into a class constant. The padding size of 1000 should be defined as a class constant for better maintainability and clarity.
class CountMatrix: NUMPY_DTYPE = 'float64' + RESIZE_CHUNK = 1000 # Number of rows to add when resizing def _resize(self): nrows = self.counts.shape[0] if len(self.index) == nrows: self.counts = np.pad( self.counts, - ((0, nrows + 1000), (0, 0),), + ((0, nrows + self.RESIZE_CHUNK), (0, 0),), )
131-151: Simplify group_gene_counts implementation. The method has commented-out code and could be simplified for better readability.
def group_gene_counts(self, ggroups): + """Group gene counts by the provided group IDs. + + Args: + ggroups: Sequence of group IDs corresponding to each gene + + Returns: + self for method chaining + """ ggroup_index = {} - # for gene_id, gene_counts in self: - # ggroup_id = gene_id.split(".")[-1] - # g_key_index = ggroup_index.get(ggroup_id) for (_, gene_counts), ggroup_id in zip(self, ggroups): g_key_index = ggroup_index.get(ggroup_id) - # gene_counts = self.counts[self.index[key]] if g_key_index is None: g_key_index = ggroup_index[ggroup_id] = len(ggroup_index) self.counts[g_key_index] = gene_counts else: self.counts[g_key_index] += gene_counts
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gffquant/counters/count_matrix.py(1 hunks)
🔇 Additional comments (1)
gffquant/counters/count_matrix.py (1)
89-129: 🛠️ Refactor suggestion: Improve readability of array operations.
The array operations use magic numbers for indexing, making the code hard to understand. Define constants for these indices and add comments explaining the array structure.
def generate_gene_counts(self, lengths): + # Column indices for unique counts + UNIQUE_RAW = 0 + UNIQUE_NORM = 1 + UNIQUE_SCALED = 2 + # Column indices for combined counts + COMBINED_RAW = 3 + COMBINED_NORM = 4 + COMBINED_SCALED = 5 + logger.info("LENGTHS ARRAY = %s", lengths.shape) logger.info("INDEX SIZE = %s", len(self.index)) # remove the un-indexed rows counts = self.counts[0:len(self.index), :] # calculate combined_raw by adding unique counts - counts[:, 1:2] += counts[:, 0:1] + counts[:, 1:2] += counts[:, UNIQUE_RAW:UNIQUE_RAW+1] counts = np.column_stack( ( - counts[:, 0], counts[:, 0], counts[:, 0], # 0, 1, 2 - counts[:, 1], counts[:, 1], counts[:, 1], # 3, 4, 5 + counts[:, UNIQUE_RAW], counts[:, UNIQUE_RAW], counts[:, UNIQUE_RAW], + counts[:, COMBINED_RAW], counts[:, COMBINED_RAW], counts[:, COMBINED_RAW], ), )Likely invalid or redundant comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
gffquant/counters/count_matrix.py (5)
11-12: Document the rationale for using float64.The comment about overflow issues should be expanded to explain:
- What kind of overflow issues were encountered
- Why float64 was chosen over other alternatives
- The memory implications of using float64
- NUMPY_DTYPE = 'float64' # float16 causes some overflow issue during testing + NUMPY_DTYPE = 'float64' + """Data type for count matrices. + + Using float64 instead of float16 to prevent overflow issues during: + - Large count accumulation + - Scaling operations + - Length normalization + + Note: This increases memory usage but ensures numerical stability. + """
14-30: Clean up commented code and add proper documentation.The method needs cleanup and documentation improvements:
- Remove the commented-out dictionary comprehension
- Add type hints and docstring
@classmethod - def from_count_matrix(cls, cmatrix, rows=None): + def from_count_matrix(cls, cmatrix: 'CountMatrix', rows: np.ndarray = None) -> 'CountMatrix': + """Create a new CountMatrix from an existing one. + + Args: + cmatrix: Source count matrix + rows: Boolean mask for row selection + + Returns: + A new CountMatrix instance with selected data + """ if rows is None: counts = np.array(cmatrix.counts) index = dict(cmatrix.index.items()) else: counts = cmatrix.counts[rows, :] index = {} for (key, _), keep in zip(cmatrix.index.items(), rows): if keep: index[key] = len(index) - # index = { - # key: value - # for (key, value), keep in zip(cmatrix.index.items(), rows) - # if keep - # } return cls(index=index, counts=counts)
32-36: Add input validation and improve documentation.The method should validate inputs and document its purpose clearly.
@staticmethod - def calculate_scaling_factor(raw, norm): + def calculate_scaling_factor(raw: float, norm: float) -> float: + """Calculate scaling factor for count normalization. + + Args: + raw: Raw count value + norm: Normalized count value + + Returns: + Scaling factor (raw/norm) or 1.0 if norm is zero + + Raises: + ValueError: If raw or norm is negative + """ + if raw < 0 or norm < 0: + raise ValueError("Raw and normalized counts must be non-negative") if norm == 0.0: return 1.0 return raw / norm
86-88: Document the drop_unindexed method.The method needs documentation explaining its purpose and side effects.
- def drop_unindexed(self): + def drop_unindexed(self) -> None: + """Remove rows that aren't in the index. + + This method modifies the matrix in-place by truncating it to + only include rows that have corresponding entries in the index. + """ self.counts = self.counts[0:len(self.index), :]
131-154: Clean up commented code and improve documentation.The method needs cleanup and better documentation:
- Remove commented-out code
- Document the purpose of using
ggroupsinstead of splitting gene IDs- Add type hints and return value documentation
- def group_gene_counts(self, ggroups): + def group_gene_counts(self, ggroups: list[str]) -> 'CountMatrix': + """Group gene counts by provided group IDs. + + Args: + ggroups: List of group IDs corresponding to each gene + + Returns: + self for method chaining + """ ggroup_index = {} - # for gene_id, gene_counts in self: - # ggroup_id = gene_id.split(".")[-1] - # g_key_index = ggroup_index.get(ggroup_id) for (_, gene_counts), ggroup_id in zip(self, ggroups): g_key_index = ggroup_index.get(ggroup_id) - # gene_counts = self.counts[self.index[key]] if g_key_index is None: g_key_index = ggroup_index[ggroup_id] = len(ggroup_index) self.counts[g_key_index] = gene_counts - # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index) else: self.counts[g_key_index] += gene_counts - # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)gffquant/profilers/feature_quantifier.py (1)
158-198: Consider simplifying category processing logic.The category processing logic could be simplified by:
- Using list comprehension for feature processing
- Extracting the category sum calculation into a separate method
Example refactor:
for category in categories: logger.info("PROCESSING CATEGORY=%s", category.name) category_sum = np.zeros(6, dtype='float64') category_counts = CountMatrix(ncols=6) - for rid, counts in self.counter: - if gene_group_db: - ggroup_id = rid - logger.info("GGROUP %s: %s", ggroup_id, str(counts)) - else: - ref, _ = self.reference_manager.get(rid[0] if isinstance(rid, tuple) else rid) - ggroup_id = ref + def get_ggroup_id(rid): + return rid if gene_group_db else self.reference_manager.get(rid[0] if isinstance(rid, tuple) else rid)[0] + + for rid, counts in self.counter: + ggroup_id = get_ggroup_id(rid) + if gene_group_db: + logger.info("GGROUP %s: %s", ggroup_id, str(counts))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
gffquant/counters/count_matrix.py(1 hunks)gffquant/profilers/feature_quantifier.py(5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
🔇 Additional comments (6)
gffquant/counters/count_matrix.py (3)
1-1: Add comprehensive module documentation.The module needs proper documentation explaining its purpose, usage, and key components.
79-85: Add bounds checking and documentation for in-place modification.The method needs bounds checking and should document that it modifies data in-place.
89-129: Improve readability and add validation in generate_gene_counts.The method has complex array operations that need better documentation and validation.
gffquant/profilers/feature_quantifier.py (3)
203-250: Remove commented-out code block.This large block of commented code appears to be obsolete after the refactoring. If it's no longer needed, it should be removed to improve code maintainability.
133-133: 🛠️ Refactor suggestionRemove unused variable.
The variable
count_annotatoris assigned but never used.-count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable
count_annotatoris assigned to but never usedRemove assignment to unused variable
count_annotator(F841)
Line range hint
508-516: Verify AlignmentCounter.update return type.Please ensure that the
contributed_countsvariable returned bycounter.updatematches the expected type used in the debug message.✅ Verification successful
Return type verification completed successfully
The
contributed_countsreturned bycounter.update()is a numeric value that matches the expected type used in the debug message. This is verified by examining both the implementation inAlignmentCounterand its usage in the debug output.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check the return type of AlignmentCounter.update ast-grep --pattern 'class AlignmentCounter { $$$ update($$$) { $$$ return $$$ } $$$ }'Length of output: 104
Script:
#!/bin/bash # Find the AlignmentCounter class definition rg "class AlignmentCounter" -l # If found, examine the update method implementation rg "class AlignmentCounter" -A 20Length of output: 1785
Script:
#!/bin/bash # Find the update method in AlignmentCounter rg "def update" gffquant/counters/alignment_counter.py -A 10 # Check how contributed_counts is used in feature_quantifier rg "contributed_counts" gffquant/profilers/feature_quantifier.py -A 5Length of output: 1636
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (5)
gffquant/profilers/feature_quantifier.py (2)
163-204: Add documentation for category processing logic.The category processing logic is complex and would benefit from documentation explaining:
- The purpose of category_sum and category_counts
- The scaling factor calculations
- The relationship between gene groups and categories
Line range hint
514-520: Add error handling for counter.update.Consider adding error handling around
counter.updateto gracefully handle potential exceptions during alignment processing.+ try: contributed_counts = self.counter.update( count_stream, ambiguous_counts=is_ambiguous_group, pair=aln_group.is_paired(), pe_library=aln_group.pe_library, ) + except Exception as e: + logger.error(f"Error updating counts for alignment group {aln_group.qname}: {e}") + contributed_counts = Nonegffquant/counters/count_matrix.py (3)
12-12: Document the rationale for using float64.Add a more detailed comment explaining the specific overflow issues encountered with float16 to help future maintainers understand this decision.
- NUMPY_DTYPE = 'float64' # float16 causes some overflow issue during testing + NUMPY_DTYPE = 'float64' # Using float64 instead of float16 to prevent overflow + # issues during scaling and normalization operations, particularly + # when dealing with large count values or small normalization factors
25-29: Remove commented-out code.The commented-out dictionary comprehension should be removed as it's no longer needed and can confuse maintainers.
131-134: Make output file path configurable.The dump method writes to a hardcoded file path. Consider making this configurable to allow for different output locations and to facilitate testing.
- def dump(self, state="genes"): - with open(f"CountMatrix.{state}.txt", "wt") as _out: + def dump(self, state="genes", output_dir="."): + """Dump the count matrix to a text file. + + Args: + state: String identifier for the current state + output_dir: Directory where the output file should be written + """ + output_path = f"{output_dir}/CountMatrix.{state}.txt" + with open(output_path, "wt") as _out:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
gffquant/counters/count_matrix.py(1 hunks)gffquant/profilers/feature_quantifier.py(5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
🔇 Additional comments (5)
gffquant/profilers/feature_quantifier.py (3)
Line range hint
91-103: LGTM! Good use of ReferenceHit class.The structured representation of alignment hits using the
ReferenceHitclass improves code organization and type safety.
293-293: LGTM! Correct usage of AlignmentCounter.The change correctly uses the new
AlignmentCounterAPI.
133-133: 🛠️ Refactor suggestionRemove unused variable.
The
count_annotatorvariable is assigned but never used.- count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable
count_annotatoris assigned to but never usedRemove assignment to unused variable
count_annotator(F841)
gffquant/counters/count_matrix.py (2)
1-1: Add comprehensive module documentation.The module needs proper documentation explaining its purpose, usage, and key components.
137-158: Verify group_gene_counts changes.The method has been significantly refactored from using gene_id splitting to using a provided ggroups parameter. Please verify:
- That all callers have been updated to provide the ggroups parameter
- That the ggroups parameter matches the expected length and format
✅ Verification successful
Changes to group_gene_counts are properly implemented
The refactoring is safe. The single caller in
alignment_counter.pycorrectly provides the ggroups parameter, and the parameter format matches the expected usage. The use ofzip()in the implementation ensures length consistency between counts and group identifiers.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for calls to group_gene_counts to verify they provide ggroups parameter rg -A 5 "group_gene_counts" --type pythonLength of output: 72
Script:
#!/bin/bash # Search for files containing group_gene_counts rg "group_gene_counts" -l "*.py" # Find method calls using ast-grep ast-grep --pattern 'group_gene_counts($$$)' # Get more context around the usage rg "group_gene_counts" -B 3 -A 3 "*.py"Length of output: 208
Script:
#!/bin/bash # Find all Python files containing group_gene_counts fd -e py | xargs rg "group_gene_counts" -l # Get context around usage fd -e py | xargs rg "group_gene_counts" -B 3 -A 3 # Find method calls in Python files ast-grep -l python --pattern 'group_gene_counts($$$)'Length of output: 1864
Script:
#!/bin/bash # Get more context around the caller in alignment_counter.py rg -B 10 -A 5 "group_gene_counts" gffquant/counters/alignment_counter.pyLength of output: 385
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
gffquant/profilers/feature_quantifier.py (1)
215-262: 🛠️ Refactor suggestionRemove commented-out code block.
This large block of commented code should be removed as it's been replaced by the new implementation above. If this code contains valuable information, consider:
- Moving it to documentation
- Creating a ticket to track this technical debt
🧹 Nitpick comments (4)
gffquant/profilers/feature_quantifier.py (1)
169-190: Optimize category processing performance.The nested loops for category processing could be optimized:
- Consider using a dictionary to cache region annotations
- Use list comprehension for better performance
- Consider vectorizing the operations using numpy
Would you like me to provide an optimized implementation?
gffquant/counters/count_matrix.py (3)
25-29: Remove commented-out code.The commented-out dictionary comprehension should be removed as it's no longer needed.
32-37: Add documentation for calculate_scaling_factor.The method handles division by zero but lacks docstring explaining its purpose and edge cases.
@staticmethod def calculate_scaling_factor(raw, norm): + """Calculate scaling factor while handling zero normalization values. + + Args: + raw: Raw count value + norm: Normalized count value + + Returns: + float: Scaling factor (1.0 if norm is zero, raw/norm otherwise) + """ if norm == 0.0: return 1.0 return raw / norm
52-59: Extract magic number and improve resize logic.The resize method uses a magic number (1000) for padding. This should be a class constant or configuration parameter.
+ RESIZE_INCREMENT = 1000 # Number of rows to add when resizing + def _resize(self): nrows = self.counts.shape[0] if len(self.index) == nrows: self.counts = np.pad( self.counts, - ((0, nrows + 1000), (0, 0),), + ((0, nrows + self.RESIZE_INCREMENT), (0, 0),), ) return len(self.index)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
gffquant/counters/count_matrix.py(1 hunks)gffquant/profilers/feature_quantifier.py(5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
gffquant/counters/count_matrix.py
136-136: Loop control variable index not used within loop body
(B007)
🔇 Additional comments (7)
gffquant/profilers/feature_quantifier.py (3)
13-20: LGTM! Clean import organization.The new imports for
AlignmentCounterandReferenceHitare well-organized and align with the refactoring changes.
Line range hint
520-524: LGTM! Clean counter update implementation.The refactored counter update logic is well-implemented with:
- Proper handling of ambiguous counts
- Clear pair and PE library handling
- Good debug logging
133-133: 🛠️ Refactor suggestionRemove unused variable.
The variable
count_annotatoris assigned but never used.- count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable
count_annotatoris assigned to but never usedRemove assignment to unused variable
count_annotator(F841)
gffquant/counters/count_matrix.py (4)
1-2: Add comprehensive module documentation.The module still lacks proper documentation explaining its purpose, usage, and key components.
38-48: Add input validation to constructor.The constructor still lacks validation for input parameters.
89-129: Improve readability of generate_gene_counts.The method uses magic indices and lacks clear documentation of the array structure.
12-12: Verify and document the float16 overflow issue.The comment suggests there was an overflow issue with float16, but it lacks details about the specific test case or scenario.
✅ Verification successful
Comment provides valuable historical context about dtype choice
The comment about float16 overflow serves as important documentation for future maintainers, explaining why float64 was chosen. The codebase consistently uses float64 across multiple components that handle numerical operations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for test files and overflow-related patterns rg -l "overflow|float16|dtype" --type pythonLength of output: 77
Script:
#!/bin/bash # Search for test files and overflow-related patterns with correct extension rg -l "overflow|float16|dtype" -g "*.py" # Search specifically in test files fd "test.*\.py$" --exec rg -l "float|dtype|numpy" {} # Look for the actual test implementations rg -A 5 "class.*Test.*Matrix" -g "test_*.py"Length of output: 462
Script:
#!/bin/bash # Check content of count_matrix.py for context rg -B 2 -A 2 "float|dtype|numpy" gffquant/counters/count_matrix.py # Find test files with correct pattern fd "test.*matrix.*\.py$" --exec cat {} # Check other files for dtype usage context rg -B 2 -A 2 "dtype" gffquant/profilers/feature_quantifier.py gffquant/alignment/bamreader.py gffquant/annotation/count_writer.py gffquant/db/__init__.pyLength of output: 1443
Code under review:
```python
    def dump(self, state="genes", labels=None,):
        with open(f"CountMatrix.{state}.txt", "wt") as _out:
            if labels is None:
                for index, counts in self:
                    print(index, *counts, sep="\t", file=_out)
            for (index, counts), label in zip(self, labels):
                print(label, *counts, sep="\t", file=_out)
```
Fix logic error in dump method.
The method has several issues:
The labels condition is incorrectly implemented (both loops will execute; see the sketch below)
The index variable in the second loop is unused
No error handling for file operations
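For illustration, a standalone sketch (hypothetical data and names, not from the repository) of the fall-through: without an else, the second loop runs even when labels is None, and zip() then fails. The suggested fix follows below.

```python
rows = [("gene_a", [1.0, 2.0]), ("gene_b", [0.0, 3.0])]

def dump_rows(rows, labels=None):
    lines = []
    if labels is None:
        for index, counts in rows:
            lines.append("\t".join([index] + [str(c) for c in counts]))
    for (_, counts), label in zip(rows, labels):   # executes even when labels is None
        lines.append("\t".join([label] + [str(c) for c in counts]))
    return lines

try:
    dump_rows(rows)
except TypeError as err:
    print("fall-through:", err)   # labels (None) is not iterable
```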
def dump(self, state="genes", labels=None,):
+ """Dump matrix contents to a file.
+
+ Args:
+ state: String identifier for the output file
+ labels: Optional labels for each row
+ """
with open(f"CountMatrix.{state}.txt", "wt") as _out:
- if labels is None:
+ try:
+ if labels is None:
for index, counts in self:
print(index, *counts, sep="\t", file=_out)
- for (index, counts), label in zip(self, labels):
- print(label, *counts, sep="\t", file=_out)
+ else:
+ for (_, counts), label in zip(self, labels):
+ print(label, *counts, sep="\t", file=_out)
+ except IOError as e:
+ logger.error("Failed to write matrix: %s", e)
+ raise
🧰 Tools
🪛 Ruff (0.8.2)
136-136: Loop control variable index not used within loop body
(B007)
Actionable comments posted: 0
♻️ Duplicate comments (1)
gffquant/profilers/feature_quantifier.py (1)
65-69:⚠️ Potential issueVerify AlignmentCounter.dump implementation.
The
dumpmethod inAlignmentCounteris present but deliberately raisesNotImplementedError. There's an actual implementation after the raise statement, but it's unreachable.To fix this:
- Remove the
raise NotImplementedError()line ingffquant/counters/alignment_counter.py:dump
🧹 Nitpick comments (1)
gffquant/counters/count_matrix.py (1)
11-12: Document the rationale for using float64.The comment indicates that float16 causes overflow issues, but it would be helpful to document:
- The specific overflow scenarios
- The performance implications of using float64
- Whether float32 was considered as a middle ground
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
gffquant/counters/count_matrix.py(1 hunks)gffquant/profilers/feature_quantifier.py(5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
gffquant/counters/count_matrix.py
137-137: Loop control variable index not used within loop body
(B007)
🔇 Additional comments (8)
gffquant/counters/count_matrix.py (6)
1-2: Add comprehensive module documentation.The module lacks proper documentation explaining its purpose, usage, and key components.
Add a detailed docstring that includes:
- Module purpose and overview
- Key classes and their relationships
- Usage examples
- Any important notes about memory usage or performance considerations
-""" module docstring """ +"""Count matrix management for gene expression quantification. + +This module provides the CountMatrix class for efficient storage and manipulation +of gene expression count data. It supports operations like: +- Raw and normalized count storage +- Length normalization +- Count scaling and grouping +- Memory-efficient storage and operations + +Example: + matrix = CountMatrix(ncols=2, nrows=1000) + matrix[gene_id] = [raw_count, 0] + matrix.generate_gene_counts(lengths) +"""
38-48: Add input validation to constructor.The constructor should validate input parameters and document the expected shapes.
- def __init__(self, ncols=2, nrows=1000, index=None, counts=None,): + def __init__(self, ncols: int = 2, nrows: int = 1000, + index: dict = None, counts: np.ndarray = None) -> None: + """Initialize CountMatrix. + + Args: + ncols: Number of columns for the count matrix + nrows: Initial number of rows + index: Optional pre-existing index dictionary + counts: Optional pre-existing count matrix + + Raises: + ValueError: If counts is provided without matching index + ValueError: If ncols or nrows are not positive integers + """ + if ncols <= 0 or nrows <= 0: + raise ValueError("ncols and nrows must be positive integers") + if counts is not None and index is None: + raise ValueError("counts provided without matching index") if index is not None and counts is not None: + if len(index) > counts.shape[0]: + raise ValueError("index size exceeds counts dimensions") self.index = dict(index.items()) self.counts = counts
61-71: Add bounds checking to item access methods.Both getitem and setitem should verify array bounds before access.
def __getitem__(self, key): key_index = self.index.get(key) if key_index is None: + if len(self.index) >= self.counts.shape[0]: + raise IndexError("Matrix capacity exceeded") key_index = self.index[key] = self._resize() return self.counts[key_index] def __setitem__(self, key, value): key_index = self.index.get(key) if key_index is None: + if len(self.index) >= self.counts.shape[0]: + raise IndexError("Matrix capacity exceeded") key_index = self.index[key] = self._resize() + if not isinstance(value, (list, np.ndarray)) or len(value) != self.counts.shape[1]: + raise ValueError(f"Value must be array-like with length {self.counts.shape[1]}") self.counts[key_index] = value
79-85: Add bounds checking and documentation for in-place modification.The
scale_columnmethod needs bounds checking and should document that it modifies data in-place.- def scale_column(self, col_index, factor, rows=None): + def scale_column(self, col_index: int, factor: float, rows=None) -> None: + """Scale values in the specified column by the given factor. + + Args: + col_index: Index of the column to scale + factor: Scaling factor to apply + rows: Optional mask for row selection + + Note: + This operation modifies the matrix in-place. + + Raises: + IndexError: If col_index is out of bounds + """ + if not 0 <= col_index < self.counts.shape[1] - 1: + raise IndexError(f"Column index {col_index} out of bounds") # apply scaling factors if rows is None: self.counts[:, col_index + 1] = self.counts[:, col_index] * factor else: self.counts[rows, col_index + 1] = self.counts[rows, col_index] * factor
89-130: Improve readability and add validation in generate_gene_counts. The `generate_gene_counts` method has complex array operations that need:
- Input validation
- Better documentation of the operations
- Extraction of magic numbers into named constants
-    def generate_gene_counts(self, lengths):
+    def generate_gene_counts(self, lengths: np.ndarray) -> 'CountMatrix':
+        """Generate normalized and scaled gene counts.
+
+        Args:
+            lengths: Array of gene lengths for normalization
+
+        Returns:
+            self for method chaining
+
+        Raises:
+            ValueError: If lengths shape doesn't match count matrix
+        """
+        if lengths.shape[0] != len(self.index):
+            raise ValueError(f"Lengths array shape {lengths.shape} doesn't match matrix shape {len(self.index)}")
+
+        # Constants for clarity
+        RAW_COL = 0
+        NORM_COL = 1
+        SCALED_COL = 2
+
         logger.info("LENGTHS ARRAY = %s", lengths.shape)
         logger.info("INDEX SIZE = %s", len(self.index))

         # remove the un-indexed rows
         counts = self.counts[0:len(self.index), :]

         # calculate combined_raw
         counts[:, 1:2] += counts[:, 0:1]

+        # Create columns: [raw, length_norm, scaled] for both unique and combined counts
         counts = np.column_stack(
             (
-                counts[:, 0], counts[:, 0], counts[:, 0],  # 0, 1, 2
-                counts[:, 1], counts[:, 1], counts[:, 1],  # 3, 4, 5
+                counts[:, RAW_COL], counts[:, RAW_COL], counts[:, RAW_COL],  # unique counts
+                counts[:, NORM_COL], counts[:, NORM_COL], counts[:, NORM_COL],  # combined counts
             ),
         )
131-139: Fix logic error in dump method. The method has several issues:
- The labels condition is incorrectly implemented (both loops will execute)
- The `index` variable in the second loop is unused
- No error handling for file operations
-    def dump(self, state="genes", labels=None,):
+    def dump(self, state: str = "genes", labels: list = None) -> None:
+        """Dump matrix contents to a file.
+
+        Args:
+            state: String identifier for the output file
+            labels: Optional labels for each row
+        """
         with open(f"CountMatrix.{state}.txt", "wt") as _out:
-            if labels is None:
+            try:
+                if labels is None:
                     for index, counts in self:
                         print(index, *counts, sep="\t", file=_out)
-            for (index, counts), label in zip(self, labels):
-                print(label, *counts, sep="\t", file=_out)
+                else:
+                    for (_, counts), label in zip(self, labels):
+                        print(label, *counts, sep="\t", file=_out)
+            except IOError as e:
+                logger.error("Failed to write matrix: %s", e)
+                raise
🧰 Tools
🪛 Ruff (0.8.2)
137-137: Loop control variable `index` not used within loop body (B007)
gffquant/profilers/feature_quantifier.py (2)
215-262: Remove commented-out code block. This large block of commented code should be removed if it's no longer needed.
If it contains valuable information, consider:
- Moving it to documentation
- Creating a ticket to track this technical debt
133-133: 🛠️ Refactor suggestion: Remove unused variable.
The variable `count_annotator` is assigned but never used.
-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable `count_annotator` is assigned to but never used
Remove assignment to unused variable `count_annotator`
(F841)
Actionable comments posted: 3
🧹 Nitpick comments (4)
gffquant/counters/count_matrix.py (4)
11-12: Document the overflow issue in more detail. The comment about float16 overflow should be expanded to help future maintainers understand the specific scenarios where overflow occurred and why float64 was chosen.
-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'  # Using float64 instead of float16 to prevent overflow
+    # during large count accumulation and scaling operations in testing.
+    # float16 max value (65504) is too small for typical read counts.
14-30: Add type hints and documentation to the classmethod. The method needs proper type hints and documentation for better maintainability.
     @classmethod
-    def from_count_matrix(cls, cmatrix, rows=None):
+    def from_count_matrix(
+        cls,
+        cmatrix: 'CountMatrix',
+        rows: np.ndarray | None = None
+    ) -> 'CountMatrix':
+        """Create a new CountMatrix from an existing one.
+
+        Args:
+            cmatrix: Source CountMatrix instance
+            rows: Optional boolean mask for row selection
+
+        Returns:
+            A new CountMatrix instance with selected data
+        """
25-29: Remove commented-out code. The commented-out dictionary comprehension should be removed as it's no longer needed.
32-36: Add type hints and improve documentation. The method needs proper type hints and clearer documentation about the zero-value handling.
     @staticmethod
-    def calculate_scaling_factor(raw, norm):
+    def calculate_scaling_factor(raw: float, norm: float) -> float:
+        """Calculate scaling factor between raw and normalized values.
+
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+
+        Returns:
+            float: Scaling factor (raw/norm) or 1.0 if norm is zero
+        """
         if norm == 0.0:
             return 1.0
         return raw / norm
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/counters/count_matrix.py (1 hunks)
- gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
gffquant/counters/count_matrix.py
137-137: Loop control variable index not used within loop body
(B007)
🔇 Additional comments (10)
gffquant/counters/count_matrix.py (7)
1-2: Add comprehensive module documentation. The module lacks proper documentation explaining its purpose, usage, and key components.
38-48: Add input validation to constructor. The constructor should validate input parameters and document the expected shapes.
61-71: Add bounds checking to item access methods. Both `__getitem__` and `__setitem__` should verify array bounds before access.
79-85: Add bounds checking and documentation for in-place modification. The `scale_column` method needs bounds checking and should document that it modifies data in-place.
89-130: Improve readability and add validation in generate_gene_counts. The method needs input validation and better documentation of the operations.
174-175: Add input validation and documentation to colsum method. The method needs bounds checking and proper documentation; a hedged sketch follows after this comment group.
131-138: ⚠️ Potential issue: Fix unused variable and add error handling.
The method has two issues:
- Unused `index` variable in the loop with labels
- Missing error handling for file operations
     def dump(self, state="genes", labels=None,):
+        """Dump matrix contents to a file.
+
+        Args:
+            state: String identifier for the output file
+            labels: Optional labels for each row
+
+        Raises:
+            IOError: If file operations fail
+        """
-        with open(f"CountMatrix.{state}.txt", "wt") as _out:
-            if labels is None:
-                for index, counts in self:
-                    print(index, *counts, sep="\t", file=_out)
-            else:
-                for (index, counts), label in zip(self, labels):
-                    print(label, *counts, sep="\t", file=_out)
+        try:
+            with open(f"CountMatrix.{state}.txt", "wt") as _out:
+                if labels is None:
+                    for index, counts in self:
+                        print(index, *counts, sep="\t", file=_out)
+                else:
+                    for (_, counts), label in zip(self, labels):
+                        print(label, *counts, sep="\t", file=_out)
+        except IOError as e:
+            logger.error("Failed to write matrix: %s", e)
+            raise
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
137-137: Loop control variable `index` not used within loop body (B007)
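To make the colsum point above concrete, here is a minimal bounds-checked sketch. It is a standalone stand-in, not the repository's `colsum`: the real signature is not shown in this review, so the argument names (`counts`, `n_indexed`, `col_index`) are assumptions.

```python
import numpy as np

def colsum(counts: np.ndarray, n_indexed: int, col_index: int) -> float:
    """Bounds-checked column sum over the indexed rows of a count matrix."""
    if not 0 <= col_index < counts.shape[1]:
        raise IndexError(f"column index {col_index} out of bounds for {counts.shape[1]} columns")
    # only the first n_indexed rows hold real data; the rest is preallocated padding
    return float(counts[:n_indexed, col_index].sum())

# toy usage: sum the raw-count column (column 0) of a partially filled 5x2 matrix
m = np.zeros((5, 2), dtype="float64")
m[0] = [3, 1]
m[1] = [2, 2]
m[2] = [0, 4]
print(colsum(m, n_indexed=3, col_index=0))  # 5.0
```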
gffquant/profilers/feature_quantifier.py (3)
215-262: Remove commented-out code block. This large block of commented code should be removed if it's no longer needed.
Line range hint 520-524: Well-structured update to use new counter interface. The changes to `process_alignment_group` are well-implemented, making the code more concise and maintainable.
133-133: ⚠️ Potential issue: Remove unused variable.
The `count_annotator` variable is assigned but never used.
-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable `count_annotator` is assigned to but never used
Remove assignment to unused variable `count_annotator`
(F841)
    def group_gene_counts(self, ggroups):

        ggroup_counts = CountMatrix(ncols=6)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            ggroup_counts[ggroup_id] += gene_counts

        return ggroup_counts

        ggroup_index = {}
        # for gene_id, gene_counts in self:
        #     ggroup_id = gene_id.split(".")[-1]
        #     g_key_index = ggroup_index.get(ggroup_id)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            g_key_index = ggroup_index.get(ggroup_id)
            # gene_counts = self.counts[self.index[key]]
            if g_key_index is None:
                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
                self.counts[g_key_index] = gene_counts
                # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
            else:
                self.counts[g_key_index] += gene_counts
                # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)

        # replace index with grouped index
        self.index = ggroup_index

        # remove the un-indexed (ungrouped) rows
        self.counts = self.counts[0:len(self.index), :]

        return self
Fix duplicate implementation and remove commented code.
The method has two critical issues:
- Duplicate implementation (lines 141-148 and 151-172)
- Commented-out code that should be removed
Keep the first implementation (lines 141-148) as it's more concise and remove the duplicate implementation and commented code.
def group_gene_counts(self, ggroups):
+ """Group gene counts by provided group IDs.
+
+ Args:
+ ggroups: Iterable of group IDs corresponding to each gene
+
+ Returns:
+ CountMatrix: New matrix with grouped counts
+ """
ggroup_counts = CountMatrix(ncols=6)
for (_, gene_counts), ggroup_id in zip(self, ggroups):
ggroup_counts[ggroup_id] +=gene_counts
return ggroup_counts
-
-
-
- ggroup_index = {}
- # for gene_id, gene_counts in self:
- # ggroup_id = gene_id.split(".")[-1]
- # g_key_index = ggroup_index.get(ggroup_id)
- for (_, gene_counts), ggroup_id in zip(self, ggroups):
- g_key_index = ggroup_index.get(ggroup_id)
- # gene_counts = self.counts[self.index[key]]
- if g_key_index is None:
- g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
- self.counts[g_key_index] = gene_counts
- # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
- else:
- self.counts[g_key_index] += gene_counts
- # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-
- # replace index with grouped index
- self.index = ggroup_index
-
- # remove the un-indexed (ungrouped) rows
- self.counts = self.counts[0:len(self.index), :]
-
- return self
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:
    def group_gene_counts(self, ggroups):
        """Group gene counts by provided group IDs.

        Args:
            ggroups: Iterable of group IDs corresponding to each gene

        Returns:
            CountMatrix: New matrix with grouped counts
        """
        ggroup_counts = CountMatrix(ncols=6)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            ggroup_counts[ggroup_id] += gene_counts
        return ggroup_counts
        total_gene_counts = self.counter.generate_gene_count_matrix(self.reference_manager)
        logger.info("TOTAL_GENE_COUNTS = %s", total_gene_counts)

        count_writer.write_feature_counts(
            self.adm,
            count_annotator,
            (None, unannotated_reads)[report_unannotated],
        count_writer.write_gene_counts(
            self.counter,
            self.reference_manager,
            gene_group_db=gene_group_db,
        )

        count_writer.write_gene_counts(
            count_annotator.gene_counts,
            count_annotator.scaling_factors["total_gene_uniq"],
            count_annotator.scaling_factors["total_gene_ambi"]
        ggroups = tuple(
            (self.reference_manager.get(key[0] if isinstance(key, tuple) else key))[0]  # .split(".")[0]
            for key, _ in self.counter
        )

        self.counter.counts.dump(labels=ggroups)

        self.counter.group_gene_count_matrix(self.reference_manager)
        unannotated_reads = self.counter.get_unannotated_reads() + self.aln_counter["unannotated_ambig"]

        self.counter.counts.dump(state="ggroup")
🛠️ Refactor suggestion
Add error handling for gene count operations.
The gene count operations lack proper error handling and logging of potential failures.
+ try:
total_gene_counts = self.counter.generate_gene_count_matrix(self.reference_manager)
logger.info("TOTAL_GENE_COUNTS = %s", total_gene_counts)
count_writer.write_gene_counts(
self.counter,
self.reference_manager,
gene_group_db=gene_group_db,
)
ggroups = tuple(
(self.reference_manager.get(key[0] if isinstance(key, tuple) else key))[0]
for key, _ in self.counter
)
self.counter.counts.dump(labels=ggroups)
self.counter.group_gene_count_matrix(self.reference_manager)
self.counter.counts.dump(state="ggroup")
+ except Exception as e:
+ logger.error("Failed to process gene counts: %s", e)
+ raise
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:
        try:
            total_gene_counts = self.counter.generate_gene_count_matrix(self.reference_manager)
            logger.info("TOTAL_GENE_COUNTS = %s", total_gene_counts)

            count_writer.write_gene_counts(
                self.counter,
                self.reference_manager,
                gene_group_db=gene_group_db,
            )

            ggroups = tuple(
                (self.reference_manager.get(key[0] if isinstance(key, tuple) else key))[0]  # .split(".")[0]
                for key, _ in self.counter
            )

            self.counter.counts.dump(labels=ggroups)

            self.counter.group_gene_count_matrix(self.reference_manager)
            unannotated_reads = self.counter.get_unannotated_reads() + self.aln_counter["unannotated_ambig"]

            self.counter.counts.dump(state="ggroup")
        except Exception as e:
            logger.error("Failed to process gene counts: %s", e)
            raise
Actionable comments posted: 5
🧹 Nitpick comments (2)
gffquant/counters/count_matrix.py (2)
25-29: Remove commented-out code. The commented-out dictionary comprehension should be removed as it's no longer needed and adds noise to the codebase.
-        # index = {
-        #     key: value
-        #     for (key, value), keep in zip(cmatrix.index.items(), rows)
-        #     if keep
-        # }
55-58: Consider exponential growth for array resizing. The current implementation adds a fixed 1000 rows when resizing. Consider using exponential growth (e.g., doubling the size) for better amortized performance with large datasets.
-            ((0, nrows + 1000), (0, 0),),
+            ((0, nrows * 2), (0, 0),),
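For illustration, a minimal sketch of what doubling-based growth could look like. This is a toy stand-in, not the project's `_resize`; only `index`, `counts`, and `NUMPY_DTYPE` are taken from the snippets quoted in this review, everything else is assumed.

```python
import numpy as np

class GrowableMatrix:
    """Toy CountMatrix stand-in demonstrating amortised O(1) row growth by doubling."""

    NUMPY_DTYPE = "float64"

    def __init__(self, ncols=2, nrows=1000):
        self.index = {}
        self.counts = np.zeros((nrows, ncols), dtype=self.NUMPY_DTYPE)

    def _resize(self):
        """Return the next free row index, doubling capacity when the matrix is full."""
        next_row = len(self.index)
        if next_row >= self.counts.shape[0]:
            # double the row count instead of adding a fixed 1000 rows
            self.counts = np.pad(self.counts, ((0, self.counts.shape[0]), (0, 0)), mode="constant")
        return next_row

# usage: start tiny so the doubling is visible
m = GrowableMatrix(ncols=2, nrows=2)
for key in ("geneA", "geneB", "geneC"):
    m.index[key] = m._resize()
    m.counts[m.index[key]] = [1, 0]
print(m.counts.shape)  # (4, 2) after one doubling
```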
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- gffquant/counters/count_matrix.py (1 hunks)
- gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py
133-133: Local variable count_annotator is assigned to but never used
Remove assignment to unused variable count_annotator
(F841)
gffquant/counters/count_matrix.py
137-137: Loop control variable index not used within loop body
(B007)
🔇 Additional comments (6)
gffquant/profilers/feature_quantifier.py (6)
13-20: LGTM! Clean import organization. The new imports for `AlignmentCounter` and `ReferenceHit` are well-organized and properly aligned with the code changes.
Line range hint 89-102: LGTM! Well-structured ReferenceHit implementation. The `ReferenceHit` usage properly encapsulates alignment hit information with all necessary parameters.
300-300: LGTM! Clean method call. The change to use `counter.toggle_single_read_handling` is clean and consistent with the AlignmentCounter refactoring.
Line range hint 521-535: LGTM! Well-structured counter update. The counter update implementation properly handles ambiguous counts, pairing information, and includes helpful debug logging.
133-133: 🛠️ Refactor suggestion: Remove unused variable.
The variable `count_annotator` is assigned but never used.
-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Local variable `count_annotator` is assigned to but never used
Remove assignment to unused variable `count_annotator`
(F841)
148-152: 🛠️ Refactor suggestion: Add error handling for write_gene_counts.
The gene count operations should include error handling to gracefully handle potential failures.
+        try:
             count_writer.write_gene_counts(
                 self.counter,
                 self.reference_manager,
                 gene_group_db=gene_group_db,
             )
+        except Exception as e:
+            logger.error("Failed to write gene counts: %s", e)
+            raise
Likely invalid or redundant comment.
    def group_gene_counts(self, ggroups):

        ggroup_counts = CountMatrix(ncols=6)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            ggroup_counts[ggroup_id] += gene_counts

        return ggroup_counts
Remove duplicate implementation in group_gene_counts.
The method has two implementations:
- Lines 141-148: A clean implementation using a new CountMatrix
- Lines 151-172: A more complex in-place implementation with commented-out code
The first implementation is cleaner and more maintainable. Remove the second implementation.
def group_gene_counts(self, ggroups):
ggroup_counts = CountMatrix(ncols=6)
for (_, gene_counts), ggroup_id in zip(self, ggroups):
ggroup_counts[ggroup_id] += gene_counts
return ggroup_counts
-
-
-
- ggroup_index = {}
- # for gene_id, gene_counts in self:
- # ggroup_id = gene_id.split(".")[-1]
- # g_key_index = ggroup_index.get(ggroup_id)
- for (_, gene_counts), ggroup_id in zip(self, ggroups):
- g_key_index = ggroup_index.get(ggroup_id)
- # gene_counts = self.counts[self.index[key]]
- if g_key_index is None:
- g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
- self.counts[g_key_index] = gene_counts
- # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
- else:
- self.counts[g_key_index] += gene_counts
- # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-
- # replace index with grouped index
- self.index = ggroup_index
-
- # remove the un-indexed (ungrouped) rows
- self.counts = self.counts[0:len(self.index), :]
-
- return self
Also applies to: 151-172
                for (index, counts), label in zip(self, labels):
                    print(label, *counts, sep="\t", file=_out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix unused variable in dump method.
The loop variable index is not used in the loop body. Remove it from the tuple unpacking.
- for (index, counts), label in zip(self, labels):
+ for (_, counts), label in zip(self, labels):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:
            for (_, counts), label in zip(self, labels):
                print(label, *counts, sep="\t", file=_out)
🧰 Tools
🪛 Ruff (0.8.2)
137-137: Loop control variable index not used within loop body
(B007)
class CountMatrix:
    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
💡 Codebase verification
Document the float16 overflow issue with a more detailed comment
The current comment about float16 overflow is too terse. Since this is a critical numerical precision decision that affects the entire counting system, it should be documented more thoroughly:
- Add a detailed comment explaining what operations caused the overflow
- Reference the specific test case or scenario that demonstrated the issue
- Consider adding this information to the class docstring instead
🔗 Analysis chain
Verify and document the overflow issue with float16.
The comment suggests that float16 causes overflow issues during testing. This should be documented more thoroughly with specific examples or test cases that demonstrate the problem.
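As a concrete, codebase-independent illustration of the concern (plain numpy behaviour, not taken from gffquant's tests):

```python
import numpy as np

print(np.finfo(np.float16).max)              # 65504.0 -- the hard ceiling for float16

raw = np.array([60000.0], dtype=np.float16)  # a plausible per-gene read count
print(raw * np.float16(2.0))                 # [inf] -- a scaling factor silently overflows

acc = np.float16(2048.0)
print(acc + np.float16(1.0))                 # 2048.0 -- +1 increments are silently lost above 2048

print(raw.astype(np.float64) * 2.0)          # [120000.] -- float64 handles the same operation
```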
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files and test cases related to float16 overflow
rg -l "float16|overflow" --type py
Length of output: 126
Script:
#!/bin/bash
# Show context around float16/overflow mentions
echo "=== Context in count_matrix.py ==="
rg "float16|overflow" -B 2 -A 2 gffquant/counters/count_matrix.py
echo -e "\n=== Context in other files ==="
rg "float16|overflow" -B 2 -A 2 gffquant/db/__init__.py gffquant/alignment/bamreader.py
echo -e "\n=== Related test files ==="
fd "test.*\.py" | xargs rg "float|dtype" -B 2 -A 2
Length of output: 7315
        # functional_counts, category_sums = count_annotator.annotate_gene_counts(
        #     self.reference_manager,
        #     self.adm,
        #     self.counter,
        #     gene_group_db=gene_group_db,
        # )

        # logger.info("FC-index: %s", str(list(functional_counts.index.keys())[:10]))
        # logger.info("FC-counts: %s", str(functional_counts.counts[0:10, :]))

        # categories = self.adm.get_categories()
        # for category, category_sum in zip(categories, category_sums):
        #     features = tuple(self.adm.get_features(category.id))
        #     feature_names = {
        #         feature.id: feature.name
        #         for feature in features
        #     }
        #     rows = tuple(
        #         key[0] == category.id
        #         for key, _ in functional_counts
        #     )

        #     cat_counts = CountMatrix.from_count_matrix(functional_counts, rows=rows)
        #     # cat_counts = CountMatrix(ncols=6, nrows=len(feature_names))
        #     # for feature in features:
        #     #     key = (category.id, feature.id)
        #     #     if functional_counts.has_record(key):
        #     #         cat_counts[key] += functional_counts[key]
        #     #     else:
        #     #         _ = cat_counts[key]

        #     # for category in categories:
        #     #     features = ((feature.name, feature) for feature in db.get_features(category.id))
        #     #     for _, feature in sorted(features, key=lambda x: x[0]):
        #     #         _ = functional_counts[(category.id, feature.id)]

        #     logger.info("PROCESSING CATEGORY=%s", category.name)
        #     count_writer.write_category(
        #         category.id,
        #         category.name,
        #         category_sum,
        #         # functional_counts,
        #         cat_counts,
        #         # feature_names,
        #         features,
        #         unannotated_reads=(None, unannotated_reads)[report_unannotated],
        #     )
🛠️ Refactor suggestion
Remove commented-out code block.
This large block of commented code should be removed to improve maintainability. If it contains valuable information, consider moving it to documentation.
        self.counter = AlignmentCounter(
            distribution_mode=distribution_mode,
            region_counts=run_mode.overlap_required,
-            strand_specific=strand_specific and not run_mode.overlap_required,
+            strand_specific=strand_specific,
            paired_end_count=paired_end_count,
        )
💡 Codebase verification
The dump method in AlignmentCounter is not implemented
The dump method in AlignmentCounter class (gffquant/counters/alignment_counter.py) raises NotImplementedError and contains commented-out implementation code. This needs to be properly implemented to:
- Remove the `raise NotImplementedError()`
- Implement the method using the `CountMatrix` interface instead of the commented-out code that references the non-existent `self.index`
🔗 Analysis chain
Verify AlignmentCounter.dump implementation.
The initialization of AlignmentCounter looks good, but the dump method needs to be properly implemented to avoid runtime failures.
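A minimal sketch of the kind of delegation this finding seems to call for; it assumes the counter keeps its data in a CountMatrix exposed as `counts` (as the profiler code above suggests via `self.counter.counts.dump(...)`) and is not the repository's actual implementation:

```python
class AlignmentCounterSketch:
    """Illustrative only -- not gffquant's AlignmentCounter."""

    def __init__(self, count_matrix):
        # assumption: the counter stores its per-reference counts in a CountMatrix
        self.counts = count_matrix

    def dump(self, state="alignment_counter", labels=None):
        """Delegate dumping to the CountMatrix instead of the removed self.index-based code."""
        if self.counts is None:
            raise ValueError("no counts collected yet, nothing to dump")
        self.counts.dump(state=state, labels=labels)
```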
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify if dump method is properly implemented
ast-grep --pattern 'class AlignmentCounter {
$$$
dump($$$) {
$$$
}
$$$
}'
Length of output: 87
Script:
#!/bin/bash
# Find the AlignmentCounter class definition
rg "class AlignmentCounter" -l
# If found, show the class implementation
rg "class AlignmentCounter" -A 50
Length of output: 4212
Script:
#!/bin/bash
# Show the complete AlignmentCounter implementation
cat gffquant/counters/alignment_counter.py
Length of output: 6124
Summary by CodeRabbit
Release Notes
New Features
- `GeneCountAnnotator` and `RegionCountAnnotator` for enhanced counting based on genes and regions.
- `ReferenceHit` data class to manage reference hits more effectively.
- `CountMatrix` class for managing and manipulating genomic count matrices.

Improvements

- `FeatureQuantifier` class to utilize `AlignmentCounter` for improved counting mechanisms.
- `PandaProfiler`.
- `write_category` method in `CountWriter` for handling category counts.

Bug Fixes

- `CountAnnotator` subclasses.
- `CountWriter` to improve handling of gene counts.
- `CountManager` class and associated functionalities.
- `UniqueRegionCounter`, `AmbiguousRegionCounter`, `UniqueSeqCounter`, and `AmbiguousSeqCounter` classes.