Conversation

@cschu (Owner) commented Dec 16, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new classes: GeneCountAnnotator and RegionCountAnnotator for enhanced counting based on genes and regions.
    • Added ReferenceHit data class to manage reference hits more effectively.
    • Added CountMatrix class for managing and manipulating genomic count matrices.
  • Improvements

    • Updated version numbers to "2.19.0" across relevant modules.
    • Enhanced FeatureQuantifier class to utilize AlignmentCounter for improved counting mechanisms.
    • Refined gene annotation and data processing in PandaProfiler.
    • Added new write_category method in CountWriter for handling category counts.
  • Bug Fixes

    • Removed deprecated methods and classes to streamline functionality and improve code clarity.
    • Eliminated methods related to gene and region counting in CountAnnotator subclasses.
    • Updated CountWriter to improve handling of gene counts.
    • Removed the CountManager class and associated functionalities.
    • Removed UniqueRegionCounter, AmbiguousRegionCounter, UniqueSeqCounter, and AmbiguousSeqCounter classes.

coderabbitai bot commented Dec 16, 2024

Warning

Rate limit exceeded

@cschu has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 38 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between c4a8fad and 1f295bc.

📒 Files selected for processing (1)
  • gffquant/profilers/feature_quantifier.py (5 hunks)

Walkthrough

This pull request updates the GFFquant library by changing the version number from 2.18.0 to 2.19.0 and significantly refactoring the counting and annotation mechanisms. Key modifications include the removal of several classes related to counting, the introduction of new classes for gene and region counting, and the replacement of CountManager with AlignmentCounter. Additionally, a new ReferenceHit data class is introduced, and the import structure has been reorganized to reflect these changes, while the overall functionality of existing components has been preserved or enhanced.
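
For orientation, the counting path accesses the new ReferenceHit through attributes such as hit.rid and hit.rev_strand (see the update_counts discussion further down). Below is a minimal, illustrative sketch of such a data class; the two field names come from those snippets, while the types, comments, and the absence of further fields are assumptions rather than the actual implementation.

from dataclasses import dataclass

@dataclass
class ReferenceHit:
    # Field names follow the review snippets (hit.rid, hit.rev_strand);
    # everything else here is assumed for illustration only.
    rid: int          # reference/region identifier
    rev_strand: bool  # True if the alignment maps to the reverse strand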

Changes

File: Change Summary
Dockerfile: Version label updated from 2.18.0 to 2.19.0
gffquant/__init__.py: Version number updated to 2.19.0
gffquant/annotation/count_annotator.py: Removed annotate methods from RegionCountAnnotator and GeneCountAnnotator
gffquant/counters/__init__.py: Removed imports for UniqueSeqCounter, AmbiguousSeqCounter, and CountManager; added imports for AlignmentCounter and RegionCounter
gffquant/counters/count_manager.py: Entire CountManager class removed
gffquant/counters/region_counter.py: Removed UniqueRegionCounter and AmbiguousRegionCounter classes
gffquant/counters/seq_counter.py: Entire file removed
gffquant/annotation/genecount_annotator.py: New GeneCountAnnotator class added with annotate_gene_counts method
gffquant/annotation/regioncount_annotator.py: New RegionCountAnnotator class added with annotate method
gffquant/alignment/reference_hit.py: New ReferenceHit data class added
gffquant/counters/alignment_counter.py: Comprehensive refactoring of the AlignmentCounter class
gffquant/profilers/feature_quantifier.py: Updated to use AlignmentCounter instead of CountManager
gffquant/alignment/__init__.py: New import for ReferenceHit added
gffquant/annotation/__init__.py: Updated imports for GeneCountAnnotator and RegionCountAnnotator
gffquant/annotation/count_writer.py: Updated write_gene_counts method signature and logic
gffquant/counters/count_matrix.py: New methods added to the CountMatrix class for managing count matrices

Sequence Diagram

sequenceDiagram
    participant Profiler
    participant AlignmentCounter
    participant Annotator
    participant CountWriter

    Profiler->>AlignmentCounter: Update counts
    AlignmentCounter-->>Profiler: Counts processed
    Profiler->>Annotator: Annotate counts
    Annotator->>AlignmentCounter: Transform counts
    Annotator-->>Profiler: Annotated counts
    Profiler->>CountWriter: Write counts
    CountWriter-->>Profiler: Counts written

Poem

🐰 Counting rabbits, code so bright,
Refactored counters take their flight!
AlignmentCounter, sleek and new,
Annotations dancing, crisp and true
Version bumped with coding might! 🧮


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
gffquant/counters/alignment_counter2.py (1)

60-65: Improve readability by replacing Boolean indexing with explicit control flow.

Using Boolean values as indices in tuples can be confusing. Consider using if-else statements for better clarity in the update_counts method.

Refactor for inc:

-            inc = (
-                (
-                    self.get_increment(aln_count, increment),
-                    increment,
-                )
-            )[aln_count == 1]
+            if aln_count == 1:
+                inc = increment
+            else:
+                inc = self.get_increment(aln_count, increment)

Refactor for key:

-            key = (
-                (
-                    (hit.rid, hit.rev_strand),
-                    hit.rid
-                )
-            )[self.strand_specific]
+            if self.strand_specific:
+                key = (hit.rid, hit.rev_strand)
+            else:
+                key = hit.rid

Also applies to: 66-71

gffquant/counters/count_manager.py (1)

167-171: Enhance readability by avoiding Boolean indexing in get_all_regions.

Using a Boolean value to index tuples can reduce code readability. Consider refactoring to use an explicit if-else statement.

Apply this refactor:

-            uniq_counts, ambig_counts = (
-                (self.uniq_seqcounts, self.ambig_seqcounts,),
-                (self.uniq_regioncounts, self.ambig_regioncounts,),
-            )[region_counts]
+            if region_counts:
+                uniq_counts = self.uniq_regioncounts
+                ambig_counts = self.ambig_regioncounts
+            else:
+                uniq_counts = self.uniq_seqcounts
+                ambig_counts = self.ambig_seqcounts
gffquant/annotation/count_annotator.py (1)

202-213: Remove unused parameter gene_group_db from RegionCountAnnotator.annotate method.

In the annotate method of RegionCountAnnotator, the parameter gene_group_db is provided but not used within the method. Consider removing this parameter to simplify the method signature.

Apply this diff:

-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr, db, count_manager: CountManager):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74b68d1 and 2be12b4.

📒 Files selected for processing (8)
  • Dockerfile (1 hunks)
  • gffquant/__init__.py (1 hunks)
  • gffquant/annotation/count_annotator.py (4 hunks)
  • gffquant/counters/__init__.py (1 hunks)
  • gffquant/counters/alignment_counter2.py (1 hunks)
  • gffquant/counters/count_manager.py (2 hunks)
  • gffquant/counters/region_counter.py (1 hunks)
  • gffquant/counters/seq_counter.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • Dockerfile
  • gffquant/__init__.py
  • gffquant/counters/__init__.py
  • gffquant/counters/seq_counter.py
🔇 Additional comments (4)
gffquant/counters/region_counter.py (1)

8-8: Updated import statement for AlignmentCounter.

The import has been updated to import AlignmentCounter from alignment_counter2. Ensure that all references to AlignmentCounter in this file and elsewhere are updated accordingly.

gffquant/counters/count_manager.py (1)

163-164: Verify the change in default values from set() to Counter() in get_regions.

The method get_regions now uses Counter() as the default value instead of set(). This change may affect how the union of regions is computed. Ensure that this change is intentional and that it doesn't impact the expected behavior of returning a set of regions.

gffquant/annotation/count_annotator.py (2)

210-213: Ensure get_all_regions returns the correct regions for annotation.

The loop now iterates over count_manager.get_all_regions(region_counts=True). Verify that this method provides the expected set of regions, especially after changes in CountManager. This ensures that all regions are appropriately annotated.


Line range hint 278-294: Confirm correct usage of get_all_regions in GeneCountAnnotator.annotate.

In the annotate method of GeneCountAnnotator, the code now iterates over count_manager.get_all_regions(). Ensure that this change aligns with the intended functionality and that all gene regions are processed as expected.

Comment on lines 40 to 42
def setdefault(self, key, default_val):
...


⚠️ Potential issue

Implement the setdefault method or remove it if unnecessary.

The setdefault method currently contains ..., indicating it's unimplemented. If this method is intended to provide functionality similar to dict.setdefault, please implement it. If it's not needed, consider removing it to avoid confusion.
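
If dict-like behaviour is the goal, the following self-contained sketch shows setdefault semantics over an index-plus-array layout mirroring the self.index / self.counts attributes visible in the neighbouring snippets; it is an illustration (the class name TinyCounts and the storage details are invented), not the project's code.

import numpy as np

class TinyCounts:
    """Toy stand-in used only to illustrate dict.setdefault-style semantics."""

    def __init__(self):
        self.index = {}                        # key -> row index
        self.counts = np.zeros(0, dtype=float)

    def setdefault(self, key, default_val=0.0):
        key_index = self.index.get(key)
        if key_index is None:                  # unseen key: store the default
            self.index[key] = len(self.counts)
            self.counts = np.append(self.counts, default_val)
            return default_val
        return self.counts[key_index]          # existing key: default is ignored

tc = TinyCounts()
print(tc.setdefault("geneA", 1.0))  # 1.0 (inserted)
print(tc.setdefault("geneA", 5.0))  # 1.0 (already present, default ignored)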

return counts, normalised, scaled

def get_increment(self, n_aln, increment):
# 1overN = lavern. Maya <3

@coderabbitai coderabbitai bot Dec 16, 2024

⚠️ Potential issue

Remove non-descriptive comment in get_increment method.

The comment # 1overN = lavern. Maya <3 does not provide meaningful information about the code and includes personal content. It's best to remove it or replace it with a descriptive comment explaining the purpose of the method.

Apply this diff to remove the comment:

-        # 1overN = lavern. Maya <3

Owner Author

please ignore


@cschu, understood, I'll ignore it.

(_ _/)

( •_•)


✏️ Learnings added
Learnt from: cschu
PR: cschu/gff_quantifier#55
File: gffquant/counters/alignment_counter2.py:20-20
Timestamp: 2024-12-16T20:17:48.980Z
Learning: In `gffquant/counters/alignment_counter2.py`, comments like `# 1overN = lavern. Maya <3` are acceptable and should not be flagged as non-descriptive or requiring removal.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines 45 to 49
def __getitem__(self, key):
key_index = self.index.get(key)
if key_index is None:
return 0.0
return self.counts[self.index.get(key)]

@coderabbitai coderabbitai bot Dec 16, 2024

🛠️ Refactor suggestion

Optimize __getitem__ by reusing key_index.

In the __getitem__ method, after retrieving key_index, the code redundantly calls self.index.get(key) again. This can be optimized by using the already retrieved key_index variable.

Apply this diff to use key_index:

         if key_index is None:
             return 0.0
-        return self.counts[self.index.get(key)]
+        return self.counts[key_index]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __getitem__(self, key):
key_index = self.index.get(key)
if key_index is None:
return 0.0
return self.counts[self.index.get(key)]
def __getitem__(self, key):
key_index = self.index.get(key)
if key_index is None:
return 0.0
return self.counts[key_index]

Owner Author

thanks


@cschu, you're welcome!

 (\_/)
 (^.^)
(")_(")


Comment on lines 50 to 55
def __setitem__(self, key, value):
key_index = self.index.get(key)
if key_index is not None:
self.counts[key_index] = value
raise KeyError(f"{key=} not found.")


@coderabbitai coderabbitai bot Dec 16, 2024

⚠️ Potential issue

Correct the logic in __setitem__ to handle missing keys appropriately.

The __setitem__ method raises a KeyError regardless of whether the key exists or not. The raise KeyError statement should be placed inside an else block to ensure it's only executed when the key is not found.

Apply this diff to fix the logic:

     def __setitem__(self, key, value):
         key_index = self.index.get(key)
         if key_index is not None:
             self.counts[key_index] = value
+        else:
             raise KeyError(f"{key=} not found.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __setitem__(self, key, value):
key_index = self.index.get(key)
if key_index is not None:
self.counts[key_index] = value
raise KeyError(f"{key=} not found.")
def __setitem__(self, key, value):
key_index = self.index.get(key)
if key_index is not None:
self.counts[key_index] = value
else:
raise KeyError(f"{key=} not found.")

Owner Author

thanks


@cschu, you're welcome! Glad I could help.

(\ (
( -.-)
o_(")(")



Comment on lines 35 to 39
key_index = self.index.get(key)
if key_index is None:
return Counter()
return Counter({key: self.counts[key_index]})


⚠️ Potential issue

Utilize the default_val parameter in the get method.

The default_val parameter in get(self, key, default_val) is not used. Instead, the method returns a new empty Counter() when the key is not found. To adhere to the expected behavior of the get method, consider returning default_val when the key is missing.

Apply this diff to use default_val:

         if key_index is None:
-            return Counter()
+            return default_val
         return Counter({key: self.counts[key_index]})

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
gffquant/counters/count_manager.py (2)

69-72: Remove commented-out code to improve readability.

The lines 69-72 contain code that has been commented out. If this code is no longer needed, consider removing it to keep the codebase clean and maintainable.

Apply this diff to remove the commented-out code:

-                # self.seqcounts = AlignmentCounter(
-                #     strand_specific=strand_specific,
-                #     distribution_mode=distribution_mode,
-                # )

171-175: Refactor selection logic in get_all_regions for clarity.

Using tuple indexing with a boolean can be confusing. Consider using an if-else statement for better readability.

Apply this diff to refactor the selection logic:

         def get_all_regions(self, region_counts=False):
-            uniq_counts, ambig_counts = (
-                (self.uniq_seqcounts, self.ambig_seqcounts,),
-                (self.uniq_regioncounts, self.ambig_regioncounts,),
-            )[region_counts]
+            if region_counts:
+                uniq_counts = self.uniq_regioncounts
+                ambig_counts = self.ambig_regioncounts
+            else:
+                uniq_counts = self.uniq_seqcounts
+                ambig_counts = self.ambig_seqcounts
             yield from set(uniq_counts).union(ambig_counts)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2be12b4 and e143eb3.

📒 Files selected for processing (2)
  • gffquant/counters/alignment_counter2.py (1 hunks)
  • gffquant/counters/count_manager.py (3 hunks)
🔇 Additional comments (5)
gffquant/counters/alignment_counter2.py (5)

20-20: Remove non-descriptive comment in get_increment method.

The comment # 1overN = lavern. Maya <3 does not provide meaningful information about the code and includes personal content. Please remove it or replace it with a descriptive comment explaining the purpose of the method.

Apply this diff to remove the comment:

-        # 1overN = lavern. Maya <3

35-39: Utilize the default_val parameter in the get method.

The default_val parameter in get(self, key, default_val) is not used. Instead, the method returns a new empty Counter() when the key is not found. To adhere to the expected behavior of the get method, consider returning default_val when the key is missing.

Apply this diff to use default_val:

         if key_index is None:
-            return Counter()
+            return default_val
         return Counter({key: self.counts[key_index]})

40-42: Implement or remove the setdefault method.

The setdefault method currently contains ..., indicating it's unimplemented. If this method is intended to provide functionality similar to dict.setdefault, please implement it. If it's not needed, consider removing it to avoid confusion.


45-49: Optimize __getitem__ by reusing key_index.

In the __getitem__ method, after retrieving key_index, the code redundantly calls self.index.get(key) again. This can be optimized by using the already retrieved key_index variable.

Apply this diff to use key_index:

         if key_index is None:
             return 0.0
-        return self.counts[self.index.get(key)]
+        return self.counts[key_index]

50-55: Correct the logic in __setitem__ to handle missing keys appropriately.

The raise KeyError statement is outside the if block, so it will always raise an exception even when the key exists. The raise KeyError should be placed inside an else block to ensure it's only executed when the key is not found.

Apply this diff to fix the logic:

         if key_index is not None:
             self.counts[key_index] = value
+        else:
             raise KeyError(f"{key=} not found.")

(self.uniq_seqcounts, self.ambig_seqcounts,),
(self.uniq_regioncounts, self.ambig_regioncounts,),
)[region_counts]
yield from set(uniq_counts).union(ambig_counts)

⚠️ Potential issue

Handle potential None values in get_all_regions to prevent exceptions.

If ambig_counts is None, calling set(uniq_counts).union(ambig_counts) will raise a TypeError. Consider checking for None before performing the union to ensure robustness.

Apply this diff to handle None values:

             yield from set(uniq_counts).union(
-                ambig_counts
+                ambig_counts if ambig_counts is not None else set()
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
yield from set(uniq_counts).union(ambig_counts)
yield from set(uniq_counts).union(ambig_counts if ambig_counts is not None else set())

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
gffquant/counters/count_manager.py (1)

58-77: Clean up commented code and fix indentation.

The counter initialization is cleaner with the consolidated approach, but there are some housekeeping items:

  1. Remove the commented-out code blocks
  2. Fix the inconsistent indentation in the else block
         if region_counts:
-            # self.uniq_regioncounts = RegionCounter(strand_specific=strand_specific)
-            # self.ambig_regioncounts = RegionCounter(
-            #     strand_specific=strand_specific,
-            #     distribution_mode=distribution_mode,
-            # )
             self.regioncounts = RegionCounter(
                 strand_specific=strand_specific,
                 distribution_mode=distribution_mode,
             )
         else:
-            # self.uniq_seqcounts = AlignmentCounter(strand_specific=strand_specific)
-            # self.ambig_seqcounts = AlignmentCounter(
-            #     strand_specific=strand_specific,
-            #     distribution_mode=distribution_mode
-            # )
             self.seqcounts = AlignmentCounter(
                 strand_specific=strand_specific,
                 distribution_mode=distribution_mode,
             )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32ce2bd and ed0eaab.

📒 Files selected for processing (3)
  • gffquant/annotation/count_annotator.py (4 hunks)
  • gffquant/counters/alignment_counter2.py (1 hunks)
  • gffquant/counters/count_manager.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • gffquant/counters/alignment_counter2.py
  • gffquant/annotation/count_annotator.py
🔇 Additional comments (1)
gffquant/counters/count_manager.py (1)

190-191: ⚠️ Potential issue

Handle potential None references in get_regions.

The method assumes self.uniq_regioncounts and self.ambig_regioncounts exist, but they might be None.

-        return set(self.uniq_regioncounts.get(rid, Counter())).union(
-            self.ambig_regioncounts.get(rid, Counter())
+        if self.regioncounts is None:
+            return set()
+        return set(self.regioncounts.get(rid, Counter()))

Likely invalid or redundant comment.

Comment on lines 53 to 55
# self.uniq_seqcounts, self.ambig_seqcounts = None, None
# self.uniq_regioncounts, self.ambig_regioncounts = None, None
self.seqcounts, self.regioncounts = None, None

🛠️ Refactor suggestion

Remove commented out code and document the architectural change.

The consolidation of separate counters into single instances is a good architectural improvement. However:

  1. The commented-out code should be removed as it can be retrieved from version control if needed.
  2. Consider adding a docstring explaining the new counter structure and its advantages.
-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
+        """
+        Initialize counters that handle both unique and ambiguous counts.
+        This consolidation improves maintainability and reduces code duplication.
+        """
         self.seqcounts, self.regioncounts = None, None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# self.uniq_seqcounts, self.ambig_seqcounts = None, None
# self.uniq_regioncounts, self.ambig_regioncounts = None, None
self.seqcounts, self.regioncounts = None, None
"""
Initialize counters that handle both unique and ambiguous counts.
This consolidation improves maintainability and reduces code duplication.
"""
self.seqcounts, self.regioncounts = None, None


from .. import DistributionMode
from .alignment_counter import AlignmentCounter
from .alignment_counter2 import AlignmentCounter

💡 Codebase verification

Migration from alignment_counter.py to alignment_counter2.py requires attention

The verification reveals that both the old alignment_counter.py and new alignment_counter2.py modules coexist in the codebase. While imports have been updated to use the new module, the presence of both files could lead to confusion and maintenance issues:

  • All imports now correctly reference alignment_counter2.py
  • The old AlignmentCounter class in alignment_counter.py still exists and inherits from Counter
  • The new AlignmentCounter class in alignment_counter2.py is a standalone class
  • Several classes inherit from the new AlignmentCounter: UniqueSeqCounter, AmbiguousSeqCounter, and RegionCounter

Consider:

  • Removing the deprecated alignment_counter.py if it's no longer needed
  • Adding a deprecation notice if backward compatibility is required (a minimal shim is sketched below)
  • Documenting the migration in the codebase
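
For the deprecation-notice option above, one possible shim (purely illustrative; the module and class names are taken from this review, the warning text is invented) would keep alignment_counter.py as a thin re-export of the new module:

# gffquant/counters/alignment_counter.py -- hypothetical compatibility shim
import warnings

from .alignment_counter2 import AlignmentCounter  # re-export the new class

warnings.warn(
    "gffquant.counters.alignment_counter is deprecated; "
    "import AlignmentCounter from alignment_counter2 instead.",
    DeprecationWarning,
    stacklevel=2,
)

__all__ = ["AlignmentCounter"]
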
🔗 Analysis chain

Verify the impact of the AlignmentCounter import change.

The change from alignment_counter to alignment_counter2 suggests a significant refactor. Let's verify that all dependent code has been updated accordingly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining imports of the old module
rg "from.*alignment_counter\s+import" --type py

# Search for any direct usage of alignment_counter
rg "alignment_counter\." --type py

Length of output: 86


Script:

#!/bin/bash
# Let's try a broader search for alignment_counter (case-insensitive)
rg -i "alignmentcounter" --type py

# Also search for the specific class name
rg "class AlignmentCounter" --type py

# Check if the old module file still exists
fd "alignment_counter.py"

Length of output: 1706

Comment on lines 142 to 143
if self.seqcounts is not None:
unannotated_reads += self.seqcounts

⚠️ Potential issue

Fix missing attribute access in get_unannotated_reads.

The method is missing the .unannotated_reads attribute access which could lead to runtime errors.

         if self.regioncounts is not None:
-            unannotated_reads += self.regioncounts
+            unannotated_reads += self.regioncounts.unannotated_reads
         if self.seqcounts is not None:
-            unannotated_reads += self.seqcounts
+            unannotated_reads += self.seqcounts.unannotated_reads
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self.seqcounts is not None:
unannotated_reads += self.seqcounts
if self.seqcounts is not None:
unannotated_reads += self.seqcounts.unannotated_reads

Comment on lines 194 to 206
def get_all_regions(self, region_counts=False):
# uniq_counts, ambig_counts = (
# (self.uniq_seqcounts, self.ambig_seqcounts,),
# (self.uniq_regioncounts, self.ambig_regioncounts,),
# )[region_counts]
# yield from set(uniq_counts).union(ambig_counts)
counts = (
self.seqcounts,
self.regioncounts,
)[region_counts]

yield from counts


⚠️ Potential issue

Handle potential None reference in get_all_regions.

The method should handle the case where the selected counter is None.

         counts = (
             self.seqcounts,
             self.regioncounts,
         )[region_counts]
 
+        if counts is None:
+            return
         yield from counts
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_all_regions(self, region_counts=False):
# uniq_counts, ambig_counts = (
# (self.uniq_seqcounts, self.ambig_seqcounts,),
# (self.uniq_regioncounts, self.ambig_regioncounts,),
# )[region_counts]
# yield from set(uniq_counts).union(ambig_counts)
counts = (
self.seqcounts,
self.regioncounts,
)[region_counts]
yield from counts
def get_all_regions(self, region_counts=False):
# uniq_counts, ambig_counts = (
# (self.uniq_seqcounts, self.ambig_seqcounts,),
# (self.uniq_regioncounts, self.ambig_regioncounts,),
# )[region_counts]
# yield from set(uniq_counts).union(ambig_counts)
counts = (
self.seqcounts,
self.regioncounts,
)[region_counts]
if counts is None:
return
yield from counts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
gffquant/counters/count_manager.py (1)

151-151: Document or implement NotImplementedError cases.

Several methods raise NotImplementedError:

  • get_counts() for region_counts
  • get_counts() for strand_specific

Please either:

  1. Implement these methods
  2. Add documentation explaining why they're not implemented
  3. Create tracking issues for future implementation

Would you like me to help create GitHub issues to track these implementations?

Also applies to: 168-168

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed0eaab and 6c32072.

📒 Files selected for processing (2)
  • gffquant/counters/alignment_counter2.py (1 hunks)
  • gffquant/counters/count_manager.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • gffquant/counters/alignment_counter2.py
🔇 Additional comments (5)
gffquant/counters/count_manager.py (5)

6-6: Verify the complete migration to alignment_counter2.

The codebase appears to have both alignment_counter.py and alignment_counter2.py. While imports have been updated, having both modules could cause confusion.


53-77: Clean up commented code after counter consolidation.

The consolidation of separate counters into single instances is a good improvement. However, the commented-out code should be removed as it's now in version control.


142-145: Fix missing attribute access in get_unannotated_reads.

The method is missing the .unannotated_reads attribute access which could lead to runtime errors.


196-208: Handle potential None reference in get_all_regions.

The method should handle the case where the selected counter is None.


189-193: ⚠️ Potential issue

Breaking change: get_regions now returns Counter instead of set.

The return type of get_regions() has changed from set to Counter. This is a breaking change that could affect dependent code.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
gffquant/annotation/genecount_annotator.py (1)

3-3: Remove unused import 'numpy' to clean up dependencies.
The code does not appear to use numpy, so consider removing it to reduce clutter.

- import numpy as np
🧰 Tools
🪛 Ruff (0.8.2)

3-3: numpy imported but unused

Remove unused import: numpy

(F401)

gffquant/counters/alignment_counter2.py (2)

35-39: Use 'default_val' in the 'get' method if intended to provide fallback.
Currently, the code returns a new Counter() instead of using 'default_val'. If the fallback should be 'default_val', consider updating the implementation. Otherwise, remove the unused parameter from the method signature.

Possible adjustment for the fallback:

-def get(self, key, default_val):
+def get(self, key, default_val=Counter()):
     key_index = self.index.get(key)
     if key_index is None:
-        return Counter()
+        return default_val
     return Counter({key: self.counts[key_index]})

86-86: Reinstate removed 'setdefault(key, ...)' usage or remove the commented-out line.
Line 86 references a commented-out usage of 'setdefault'. If you no longer use or plan to use that code, consider removing it to avoid confusion.

gffquant/counters/count_manager.py (2)

55-57: Remove commented-out code or document its purpose.
Commented code can cause confusion. If it’s no longer needed, remove it. Otherwise, add a comment explaining why it’s retained.

-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
         self.seqcounts, self.regioncounts = None, None

193-199: Remove or document commented code in 'get_regions'.
If the old approach is deprecated, remove it for clarity. Otherwise, document why it’s commented out.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c32072 and b344e9f.

📒 Files selected for processing (4)
  • gffquant/annotation/count_annotator.py (6 hunks)
  • gffquant/annotation/genecount_annotator.py (1 hunks)
  • gffquant/counters/alignment_counter2.py (1 hunks)
  • gffquant/counters/count_manager.py (5 hunks)
🧰 Additional context used
📓 Learnings (1)
gffquant/counters/alignment_counter2.py (1)
Learnt from: cschu
PR: cschu/gff_quantifier#55
File: gffquant/counters/alignment_counter2.py:20-20
Timestamp: 2024-12-16T20:17:49.179Z
Learning: In `gffquant/counters/alignment_counter2.py`, comments like `# 1overN = lavern. Maya <3` are acceptable and should not be flagged as non-descriptive or requiring removal.
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)

🔇 Additional comments (11)
gffquant/annotation/genecount_annotator.py (1)

20-20: Clarify intention of setting 'self.total_counts' to 'self.total_gene_counts'.
It’s unclear whether this is meant to unify the counters or if it’s a placeholder for future logic. Confirm that the assignment meets your design goal.

gffquant/counters/alignment_counter2.py (1)

20-20: Comment is acceptable per user learnings; no removal needed.
Noting the user-provided preference to keep comments like "# 1overN = lavern. Maya <3" intact.

gffquant/counters/count_manager.py (4)

82-87: 'has_ambig_counts' logic looks clear; no issues identified.
It succinctly checks for ambiguous counts across sequence and region counters.


153-153: Method not implemented for region_counts=True.
This raises NotImplementedError. Confirm if that matches your design or if you plan to add region-counts logic later.


190-190: Clarify the return shape for 'get_counts' in unstranded mode.
The code returns np.array(...) with four columns: (counts[0], counts[2], counts[1], counts[3]). Ensure calling code expects that ordering.


213-218: Good integration of 'transform_counts' for both seqcounts and regioncounts.
The approach is consistent and extensible.

gffquant/annotation/count_annotator.py (5)

98-99: Be mindful of the variable order in 'total_counts'.
Swapping indices from (uniq_raw, uniq_norm, ambig_raw, ambig_norm) to (uniq_raw, ambig_raw, uniq_norm, ambig_norm) might cause confusion. Confirm that this reordering aligns with the rest of the code.
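
As a quick, self-contained illustration of the reordering in question (the numeric values are made up; the column labels follow the naming used in this comment and are otherwise an assumption about what the array holds):

import numpy as np

# old ordering: (uniq_raw, uniq_norm, ambig_raw, ambig_norm)
counts = np.array([10.0, 2.5, 40.0, 7.5])

# indexing as counts[0], counts[2], counts[1], counts[3] yields the
# new ordering: (uniq_raw, ambig_raw, uniq_norm, ambig_norm)
reordered = np.array((counts[0], counts[2], counts[1], counts[3]))
print(reordered.tolist())  # [10.0, 40.0, 2.5, 7.5]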


113-114: Same index-swapping question applies to 'total_gene_counts'.
Ensure that these array indices match usage in distribution logic, especially if referencing them by position.


204-215: Implementation of 'annotate' in 'RegionCountAnnotator' is straightforward.
It iterates over the new unified regioncounts approach correctly. The flow to query DB, compute counts, and distribute feature counts is logically consistent.


280-280: GeneCountAnnotator: annotate method signature with 'CountManager' is well-defined.
The additional type hint improves clarity for code using this method.


293-293: Check consistency of gene-group ID parsing with other modules.
Splitting reference strings by '.' might differ from how other classes or utilities parse gene IDs. Verify uniform usage across the codebase to avoid mismatched references.
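
As a point of reference, here is a worked example of the '.'-split convention used in the annotate snippet reviewed further down (the reference string itself is invented):

ref = "cluster_001.gene_17.42"   # hypothetical reference ID
ref_tokens = ref.split(".")
gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
print(gene_id)    # cluster_001.gene_17
print(ggroup_id)  # 42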

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
gffquant/annotation/__init__.py (1)

5-5: Remove commented import statement.

Instead of keeping the commented import line, it should be removed as it's no longer needed and adds unnecessary noise to the codebase.

-# from .count_annotator import GeneCountAnnotator, RegionCountAnnotator
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b344e9f and 5ed8cfc.

📒 Files selected for processing (1)
  • gffquant/annotation/__init__.py (1 hunks)
🔇 Additional comments (1)
gffquant/annotation/__init__.py (1)

6-6: Verify the new module structure.

The GeneCountAnnotator has been moved to a new dedicated module. Let's verify that the new module exists and the imports are working correctly.

Also applies to: 8-8

✅ Verification successful

Let me verify the imports in the new module and check for any potential issues.


Module structure and imports have been successfully verified

The verification confirms that:

  • The new genecount_annotator.py module exists and contains the GeneCountAnnotator class
  • The imports in __init__.py have been correctly updated to import GeneCountAnnotator from the new module
  • The old import is properly commented out
  • All references to GeneCountAnnotator in the codebase are valid
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the new module structure and imports

# Check if the new module exists
echo "Checking for new module file..."
fd "genecount_annotator.py" 

# Verify the class definition in the new location
echo "Checking GeneCountAnnotator class definition..."
ast-grep --pattern 'class GeneCountAnnotator'

# Check for any remaining references to the old import path
echo "Checking for any missed import updates..."
rg "from .count_annotator import GeneCountAnnotator"

Length of output: 548


Script:

#!/bin/bash
# Check the content of the new module and its imports
echo "Content of genecount_annotator.py:"
cat gffquant/annotation/genecount_annotator.py

# Check the actual imports in __init__.py
echo -e "\nContent of __init__.py:"
cat gffquant/annotation/__init__.py

# Check for any references to GeneCountAnnotator in the codebase
echo -e "\nReferences to GeneCountAnnotator:"
rg "GeneCountAnnotator" -l

Length of output: 2224

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
gffquant/annotation/genecount_annotator.py (2)

18-46: Add type hints and documentation

The method would benefit from:

  1. Type hints for refmgr and db parameters
  2. A docstring explaining the method's purpose and parameters
  3. Proper error handling for malformed gene IDs
-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr: ReferenceManager, db: Database, count_manager: CountManager, gene_group_db: bool = False) -> None:
+        """
+        Annotate gene counts using the provided reference manager and database.
+        
+        Args:
+            refmgr: Reference manager containing gene information
+            db: Database containing gene annotations
+            count_manager: Manager handling count operations
+            gene_group_db: If True, splits references into gene and group IDs
+        """

21-21: Clarify or remove the questionable comment

The comment # ? suggests uncertainty about the assignment. Either document the reason for this assignment or remove the comment if the assignment is confirmed to be correct.

gffquant/counters/count_manager.py (2)

55-78: Document the counter consolidation architecture

While the consolidation of counters is a good improvement, the changes should be properly documented.

-        # self.uniq_seqcounts, self.ambig_seqcounts = None, None
-        # self.uniq_regioncounts, self.ambig_regioncounts = None, None
+        """
+        Initialize consolidated counters that handle both unique and ambiguous counts.
+        This architectural change improves maintainability by:
+        - Reducing code duplication
+        - Centralizing count management
+        - Simplifying the counter interface
+        """
         self.seqcounts, self.regioncounts = None, None

153-153: Add error message to NotImplementedError

Provide a descriptive message explaining what functionality is not implemented.

-            raise NotImplementedError()
+            raise NotImplementedError("Region-specific count retrieval with strand specificity is not yet implemented")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef34cde and d3649c2.

📒 Files selected for processing (2)
  • gffquant/annotation/genecount_annotator.py (1 hunks)
  • gffquant/counters/count_manager.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)

🔇 Additional comments (3)
gffquant/annotation/genecount_annotator.py (1)

12-17: LGTM! Clean class implementation

The class properly extends CountAnnotator and correctly initializes the parent class.

gffquant/counters/count_manager.py (2)

200-212: Handle potential None values in get_all_regions

If counts is None, the method will raise an exception.

         counts = (
             self.seqcounts,
             self.regioncounts,
         )[region_counts]
 
+        if counts is None:
+            return
         yield from counts

213-218: LGTM! Well-structured transform_counts method

The method properly handles both seqcounts and regioncounts cases with clear control flow.

Comment on lines 27 to 31
if gene_group_db:
ref_tokens = ref.split(".")
gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
else:
gene_id, ggroup_id = ref, ref

🛠️ Refactor suggestion

Add error handling for malformed gene IDs

The gene ID splitting logic should handle potential malformed IDs gracefully.

     if gene_group_db:
-        ref_tokens = ref.split(".")
-        gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+        try:
+            ref_tokens = ref.split(".")
+            if len(ref_tokens) < 2:
+                raise ValueError(f"Malformed gene ID: {ref}")
+            gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+        except Exception as e:
+            logger.error("Failed to parse gene ID %s: %s", ref, str(e))
+            gene_id = ggroup_id = ref
     else:
         gene_id, ggroup_id = ref, ref

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
gffquant/annotation/genecount_annotator.py (2)

21-21: Remove or clarify commented code

The commented line # self.total_counts = self.total_gene_counts # ? suggests uncertainty in the implementation. Either remove it or document why it might be needed in the future.


18-18: Add type hints for all parameters

Consider adding type hints for all parameters in the annotate method for better code maintainability and IDE support.

-    def annotate(self, refmgr, db, count_manager: CountManager, gene_group_db=False):
+    def annotate(self, refmgr: Any, db: Any, count_manager: CountManager, gene_group_db: bool = False):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3649c2 and b466381.

📒 Files selected for processing (3)
  • gffquant/annotation/count_annotator.py (7 hunks)
  • gffquant/annotation/count_writer.py (1 hunks)
  • gffquant/annotation/genecount_annotator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • gffquant/annotation/count_annotator.py
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/genecount_annotator.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)

🔇 Additional comments (4)
gffquant/annotation/genecount_annotator.py (3)

3-3: Remove unused import

The numpy module is imported but never used in this file.

-import numpy as np
🧰 Tools
🪛 Ruff (0.8.2)

3-3: numpy imported but unused

Remove unused import: numpy

(F401)


12-17: LGTM: Clean class implementation

The class definition and initialization are well-structured with proper inheritance pattern.


27-31: Add error handling for gene ID splitting

The gene ID splitting logic should handle malformed IDs gracefully.

     if gene_group_db:
-        ref_tokens = ref.split(".")
-        gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+        try:
+            ref_tokens = ref.split(".")
+            if len(ref_tokens) < 2:
+                raise ValueError(f"Malformed gene ID: {ref}")
+            gene_id, ggroup_id = ".".join(ref_tokens[:-1]), ref_tokens[-1]
+        except Exception as e:
+            logger.error("Failed to parse gene ID %s: %s", ref, str(e))
+            gene_id = ggroup_id = ref
     else:
         gene_id, ggroup_id = ref, ref
gffquant/annotation/count_writer.py (1)

145-145: LGTM: Appropriate logging enhancement

The added logging statement follows the existing logging pattern and provides useful debugging information for category counts.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
gffquant/annotation/count_writer.py (1)

172-178: Remove commented-out code

Instead of keeping commented-out code, it should be removed as it can be retrieved from version control if needed.

-            # for gene, g_counts in sorted(gene_counts.items()):
-            #     out_row = self.compile_output_row(
-            #         g_counts,
-            #         scaling_factor=uniq_scaling_factor,
-            #         ambig_scaling_factor=ambig_scaling_factor
-            #     )
-            #     CountWriter.write_row(gene, out_row, stream=gene_out)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b466381 and 8249d9f.

📒 Files selected for processing (1)
  • gffquant/annotation/count_writer.py (3 hunks)
🔇 Additional comments (4)
gffquant/annotation/count_writer.py (4)

11-11: LGTM: Import addition is appropriate

The addition of the CountManager import aligns with its usage in type hinting and supports the refactoring of the write_gene_counts method.


147-147: LGTM: Helpful logging enhancement

The addition of category logging provides valuable debugging context for category processing.


166-166: LGTM: Type hint addition improves code clarity

The addition of type hints for the gene_counts parameter improves code maintainability and IDE support.


179-186: Verify the impact of switching from gene names to region IDs

The refactoring changes the output identifier from gene names to region IDs (rid). This could impact downstream processes that expect gene names in the output file.

✅ Verification successful

Let me gather more information about how the gene counts data is being used in tests and other parts of the codebase.


Let me try a different approach to search for tests and usages of the gene counts functionality.


Based on the search results, I can now generate a final response as we have sufficient context about how gene counts are used throughout the codebase.

The switch from gene names to region IDs is part of the intended design

The change from gene names to region IDs (rid) is consistent with the codebase's design. Looking at the evidence:

  1. The db_collate.py script explicitly handles both "feature" and "gene" identifiers in the count files
  2. The GeneCountAnnotator class shows that region IDs are properly mapped to gene IDs internally
  3. The commented-out code in count_writer.py shows this was an intentional refactoring to use region IDs consistently

The change maintains compatibility with downstream processing as the collation scripts are already designed to handle both identifier types.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any downstream code that might be processing the gene_counts.txt.gz file
rg -l "gene_counts\.txt\.gz" | while read -r file; do
    echo "=== $file ==="
    # Show the context of how the file is being used
    rg -C 5 "gene_counts\.txt\.gz" "$file"
done

Length of output: 870


Script:

#!/bin/bash
# Search for test files that might be validating the gene counts output format
fd test_.*\.py$ | while read -r file; do
    echo "=== $file ==="
    rg -C 5 "gene_counts" "$file"
done

# Also search for any code that reads/processes the gene counts file
ast-grep --pattern 'open($_, "r")' | rg -C 5 "gene_counts"

Length of output: 92


Script:

#!/bin/bash
# Search for test files and any code that processes gene counts with broader patterns
rg -l "gene_counts" | while read -r file; do
    echo "=== $file ==="
    rg -C 5 "gene_counts" "$file"
done

# Also look for any test data files that might contain gene counts
fd -e txt -e tsv -e csv | xargs rg -l "gene"

Length of output: 6974

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
gffquant/profilers/feature_quantifier.py (1)

158-197: Add documentation for category processing logic.

The category processing logic would benefit from documentation explaining:

  1. The purpose of category_sum and its 6-element structure
  2. The scaling factor calculation methodology
  3. The relationship between features and categories
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ffa0c5 and 688a081.

📒 Files selected for processing (1)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

🔇 Additional comments (5)
gffquant/profilers/feature_quantifier.py (5)

13-20: LGTM! Import changes align with the architectural updates.

The imports reflect the transition from CountManager to AlignmentCounter and the introduction of ReferenceHit.


65-69: LGTM! Proper initialization of AlignmentCounter.

The initialization maintains the necessary parameters (distribution_mode, strand_specific, paired_end_count) for feature quantification.


202-249: Remove commented-out code block.

The large block of commented code should be removed if it's no longer needed. If it's kept for reference, consider moving it to documentation or creating a ticket to clean it up later.


286-286: LGTM! Alignment processing changes are well-integrated.

The changes properly integrate with the new AlignmentCounter API:

  1. Appropriate handling of single reads with toggle_single_read_handling
  2. Updated counter update logic with correct parameters for ambiguous counts and paired-end handling

Also applies to: 507-524


133-133: 🛠️ Refactor suggestion

Remove unused variable.

The count_annotator variable is assigned but never used.

-count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
gffquant/counters/alignment_counter.py (2)

73-82: ⚠️ Potential issue

Implement the dump method.

The method raises NotImplementedError but has a commented implementation. This will cause runtime errors when called.

Restore the implementation:

     def dump(self, prefix, refmgr):
-        raise NotImplementedError()
         with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
             for key, key_index in self.index.items():
                 ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
                 print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)

143-148: 🛠️ Refactor suggestion

Add error handling for gene length retrieval.

The gene length retrieval from refmgr lacks validation. Invalid lengths could cause issues in downstream calculations.

         gene_lengths = np.array(
             tuple(
-                (refmgr.get(key[0] if isinstance(key, tuple) else key))[1]
+                next(
+                    (length for _, length in [refmgr.get(key[0] if isinstance(key, tuple) else key)] if length > 0),
+                    1  # Default length if invalid
+                )
                 for key, _ in self.counts
             )
         )
🧹 Nitpick comments (4)
gffquant/counters/alignment_counter.py (2)

20-23: Enhance documentation for strand constants.

While the comment explains the origin of the counter-intuitive strand values from SAM flags, consider adding more context about how these values map to biological strand orientation.

-    # this may be counter-intuitive
-    # but originates from the samflags 0x10, 0x20,
-    # which explicitly identify the reverse-strandness of the read
+    # These values may seem counter-intuitive but are derived from SAM flags:
+    # - PLUS_STRAND = False: Forward strand (SAM flag 0x10 not set)
+    # - MINUS_STRAND = True: Reverse strand (SAM flag 0x10 set)
+    # This mapping maintains consistency with the SAM specification.
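For context, a minimal self-contained sketch (not gffquant code) of how the SAM flag bit referenced above maps onto these booleans:

# SAM flag 0x10 marks the read as reverse-complemented (0x20 marks the mate);
# PLUS_STRAND = False / MINUS_STRAND = True simply mirror bit 0x10.
def read_is_minus_strand(flag: int) -> bool:
    return bool(flag & 0x10)

assert read_is_minus_strand(0x0) is False    # forward read -> plus strand
assert read_is_minus_strand(0x10) is True    # reverse read -> minus strand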

111-114: Define unannotated reads key as a class constant.

The hardcoded hash value used for tracking unannotated reads should be defined as a class constant for better maintainability.

 class AlignmentCounter:
+    # Special key for tracking reads that don't map to any annotated features
+    UNANNOTATED_READS_KEY = "c591b65a0f4cd"
+
     COUNT_HEADER_ELEMENTS = ("raw", "lnorm", "scaled")

     def get_unannotated_reads(self):
-        return self.counts["c591b65a0f4cd"][0] 
+        return self.counts[self.UNANNOTATED_READS_KEY][0]
gffquant/profilers/feature_quantifier.py (2)

133-133: Remove unused variable.

The count_annotator variable is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)
🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)


Line range hint 507-512: Enhance logging for counter updates.

The debug message could be more informative by including the contributed counts in a more readable format.

             msg = f"{ambig_hit_counts=} {contributed_counts=})"
+            logger.debug(
+                "Counter update: ambiguous=%s, paired=%s, contributed=%s",
+                is_ambiguous_group,
+                aln_group.is_paired(),
+                contributed_counts
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 688a081 and f7c16c3.

📒 Files selected for processing (2)
  • gffquant/counters/alignment_counter.py (2 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

🔇 Additional comments (1)
gffquant/profilers/feature_quantifier.py (1)

202-249: Remove commented-out code block.

This large block of commented code appears to be an old implementation. If it's no longer needed, it should be removed to improve code maintainability.

Would you like me to help create a GitHub issue to track this cleanup task?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
gffquant/counters/alignment_counter.py (7)

32-35: Remove unprofessional comment.

The comment "1overN = lavern. Maya <3" should be removed or replaced with a professional explanation of the one-over-N distribution mode.

-        # 1overN = lavern. Maya <3
+        # Return increment/n_aln when using one-over-N distribution mode
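A tiny illustration of the one-over-N behaviour the replacement comment describes (hypothetical values, not taken from AlignmentCounter):

# An ambiguous read hitting n_aln references contributes increment / n_aln to each,
# so the total contribution still sums to the original increment.
def one_over_n(increment: float, n_aln: int) -> float:
    return increment / n_aln

hits = ["geneA", "geneB", "geneC", "geneD"]
contribs = {gene: one_over_n(1.0, len(hits)) for gene in hits}   # 0.25 each
assert abs(sum(contribs.values()) - 1.0) < 1e-9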

37-56: Remove commented-out code.

The old code should be removed as it's tracked in version control.

-        # old code:
-        # increment = 1 if (not pair or self.paired_end_count == 2) else 0.5
-
-        # if pair:
-        #     increment = 1 if self.paired_end_count == 2 else 0.5
-        # else:
-        #     increment = 0.5 if self.unmarked_orphans else 1

58-70: Add type hints and docstring.

The constructor should have type hints and a docstring explaining its parameters.

     def __init__(
         self,
-        distribution_mode=DistributionMode.ONE_OVER_N,
-        strand_specific=False,
-        paired_end_count=1,
+        distribution_mode: DistributionMode = DistributionMode.ONE_OVER_N,
+        strand_specific: bool = False,
+        paired_end_count: int = 1,
-    ):
+    ) -> None:
+        """Initialize the AlignmentCounter.
+
+        Args:
+            distribution_mode: Mode for distributing counts (default: ONE_OVER_N)
+            strand_specific: Whether to track strand-specific counts (default: False)
+            paired_end_count: Count for paired-end reads (default: 1)
+        """

83-85: Remove commented-out code.

The old implementation should be removed as it's tracked in version control.

-        # return bool(self.counts[:, 1].sum() != 0)

111-114: Make unannotated reads key a class constant.

The hardcoded key "00000000" should be a class constant with a descriptive name. Also, remove the commented-out code.

+    UNANNOTATED_READS_KEY = "00000000"  # Key for tracking unannotated reads
+
     def get_unannotated_reads(self):
-        # return self.counts["c591b65a0f4cd46d5125745a40c8c056"][0]
-        # return self.counts["c591b65a0f4cd"][0] 
-        return self.counts["00000000"][0]
+        return self.counts[self.UNANNOTATED_READS_KEY][0]

116-136: Add type hints and docstring.

The method should have type hints and a docstring explaining its parameters and return value.

-    def update_counts(self, count_stream, increment=1, ambiguous_counts=False):
+    def update_counts(
+        self,
+        count_stream: Iterator[Tuple[List[ReferenceHit], int]],
+        increment: float = 1,
+        ambiguous_counts: bool = False,
+    ) -> float:
+        """Update counts for alignments in the stream.
+
+        Args:
+            count_stream: Iterator of (hits, alignment_count) tuples
+            increment: Count increment value (default: 1)
+            ambiguous_counts: Whether hits are ambiguous (default: False)
+
+        Returns:
+            float: Total contributed counts
+        """

155-159: Add type hints and docstring.

The method should have type hints and a docstring explaining its parameters and return value.

     @staticmethod
-    def calculate_scaling_factor(raw, norm):
+    def calculate_scaling_factor(raw: float, norm: float) -> float:
+        """Calculate scaling factor from raw and normalized values.
+
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+
+        Returns:
+            float: Scaling factor (1.0 if norm is 0, otherwise raw/norm)
+        """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7c16c3 and 2fe0bdc.

📒 Files selected for processing (2)
  • gffquant/annotation/count_writer.py (4 hunks)
  • gffquant/counters/alignment_counter.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/annotation/count_writer.py

164-167: Use ternary operator row = counts[key] if counts.has_record(key) else empty_row instead of if-else-block

Replace if-else-block with row = counts[key] if counts.has_record(key) else empty_row

(SIM108)

🔇 Additional comments (3)
gffquant/counters/alignment_counter.py (3)

Line range hint 25-30: LGTM!

The function correctly normalizes and scales feature counts.


138-153: 🛠️ Refactor suggestion

Add error handling for gene lengths.

The method should handle cases where gene lengths might be missing or invalid.

         gene_lengths = np.array(
             tuple(
-                (refmgr.get(key[0] if isinstance(key, tuple) else key))[1]
+                next(
+                    (length for _, length in [refmgr.get(key[0] if isinstance(key, tuple) else key)]
+                     if length and length > 0),
+                    1  # Default length if missing or invalid
+                )
                 for key, _ in self.counts
             )
         )
+        if (gene_lengths <= 0).any():
+            logger.warning("Some genes have invalid lengths (<=0). Using default length of 1.")

Likely invalid or redundant comment.


161-168: 🛠️ Refactor suggestion

Add error handling for malformed gene IDs.

The method assumes gene IDs always contain a dot separator. Add error handling for malformed IDs.

         ggroups = (
-            (refmgr.get(key[0] if isinstance(key, tuple) else key))[0].split(".")[0]
+            next(
+                (ref.split(".")[0] for ref, _ in [refmgr.get(key[0] if isinstance(key, tuple) else key)]
+                 if "." in ref),
+                (refmgr.get(key[0] if isinstance(key, tuple) else key))[0]  # Use full ID if no dot
+            )
             for key, _ in self.counts
         )
+        logger.info("Grouped %d gene IDs", len(set(ggroups)))

Likely invalid or redundant comment.

Comment on lines 73 to +82
def dump(self, prefix, refmgr):
raise NotImplementedError()
with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
for k, v in self.items():
ref, reflen = refmgr.get(k[0] if isinstance(k, tuple) else k)
print(k, ref, reflen, v, sep="\t", file=_out)
for key, key_index in self.index.items():
ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
# for k, v in self.items():
# ref, reflen = refmgr.get(k[0] if isinstance(k, tuple) else k)
# print(k, ref, reflen, v, sep="\t", file=_out)


⚠️ Potential issue

Implement the dump method.

The method raises NotImplementedError but contains commented-out implementation. Either implement the method or remove it if not needed.

     def dump(self, prefix, refmgr):
-        raise NotImplementedError()
-        with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
-            for key, key_index in self.index.items():
-                ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
-                print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
-            # for k, v in self.items():
-            # ref, reflen = refmgr.get(k[0] if isinstance(k, tuple) else k)
-            # print(k, ref, reflen, v, sep="\t", file=_out)
+        """Dump counts to a file.
+
+        Args:
+            prefix: Output file prefix
+            refmgr: Reference manager for looking up references
+        """
+        with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
+            for key, key_index in self.index.items():
+                ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
+                print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)
📝 Committable suggestion


Suggested change
def dump(self, prefix, refmgr):
"""Dump counts to a file.
Args:
prefix: Output file prefix
refmgr: Reference manager for looking up references
"""
with gzip.open(f"{prefix}.{self.__class__.__name__}.txt.gz", "wt") as _out:
for key, key_index in self.index.items():
ref, reflen = refmgr.get(key[0] if isinstance(key, tuple) else key)
print(key, ref, reflen, self.counts[key_index], sep="\t", file=_out)

Comment on lines +176 to +202
def write_gene_counts(
self,
gene_counts: AlignmentCounter,
refmgr,
gene_group_db=False,
):
with gzip.open(f"{self.out_prefix}.gene_counts.txt.gz", "wt") as gene_out:
print("gene", *self.get_header(), sep="\t", file=gene_out, flush=True)

for gene, g_counts in sorted(gene_counts.items()):
out_row = self.compile_output_row(
g_counts,
scaling_factor=uniq_scaling_factor,
ambig_scaling_factor=ambig_scaling_factor
ref_stream = (
(
refmgr.get(rid[0] if isinstance(rid, tuple) else rid)[0],
rid,
)
CountWriter.write_row(gene, out_row, stream=gene_out)
for rid, _ in gene_counts
)

for ref, rid in sorted(ref_stream):
counts = gene_counts[rid]
# if gene_group_db:
# ref_tokens = ref.split(".")
# gene_id, _ = ".".join(ref_tokens[:-1]), ref_tokens[-1]
# else:
# gene_id = ref
gene_id = ref

CountWriter.write_row(gene_id, counts, stream=gene_out,)

🛠️ Refactor suggestion

Multiple improvements needed.

  1. Remove commented-out code
  2. Add error handling for file operations and reference lookups
  3. Add type hints and docstring
     def write_gene_counts(
         self,
-        gene_counts: AlignmentCounter,
-        refmgr,
-        gene_group_db=False,
+        gene_counts: AlignmentCounter,
+        refmgr: 'ReferenceManager',
+        gene_group_db: bool = False,
-    ):
+    ) -> None:
+        """Write gene counts to a gzipped output file.
+
+        Args:
+            gene_counts: Counter containing gene alignment data
+            refmgr: Reference manager for gene lookups
+            gene_group_db: Whether to parse gene IDs as group.version format
+
+        Raises:
+            IOError: If file operations fail
+            ValueError: If reference lookup fails
+        """
+        try:
             with gzip.open(f"{self.out_prefix}.gene_counts.txt.gz", "wt") as gene_out:
                 print("gene", *self.get_header(), sep="\t", file=gene_out, flush=True)

                 ref_stream = (
                     (
-                        refmgr.get(rid[0] if isinstance(rid, tuple) else rid)[0],
+                        ref_result := refmgr.get(rid[0] if isinstance(rid, tuple) else rid),
                         rid,
                     )
                     for rid, _ in gene_counts
+                    if ref_result is not None
                 )

                 for ref, rid in sorted(ref_stream):
+                    if not ref:
+                        logger.error("Reference not found for rid: %s", rid)
+                        continue
                     counts = gene_counts[rid]
-                    # if gene_group_db:
-                    #     ref_tokens = ref.split(".")
-                    #     gene_id, _ = ".".join(ref_tokens[:-1]), ref_tokens[-1]
-                    # else:
-                    #     gene_id = ref
                     gene_id = ref

                     CountWriter.write_row(gene_id, counts, stream=gene_out,)
+        except IOError as e:
+            logger.error("Failed to write gene counts: %s", str(e))
+            raise
📝 Committable suggestion


Suggested change
def write_gene_counts(
self,
gene_counts: AlignmentCounter,
refmgr: 'ReferenceManager',
gene_group_db: bool = False,
) -> None:
"""Write gene counts to a gzipped output file.
Args:
gene_counts: Counter containing gene alignment data
refmgr: Reference manager for gene lookups
gene_group_db: Whether to parse gene IDs as group.version format
Raises:
IOError: If file operations fail
ValueError: If reference lookup fails
"""
try:
with gzip.open(f"{self.out_prefix}.gene_counts.txt.gz", "wt") as gene_out:
print("gene", *self.get_header(), sep="\t", file=gene_out, flush=True)
ref_stream = (
(
ref_result := refmgr.get(rid[0] if isinstance(rid, tuple) else rid),
rid,
)
for rid, _ in gene_counts
if ref_result is not None
)
for ref, rid in sorted(ref_stream):
if not ref:
logger.error("Reference not found for rid: %s", rid)
continue
counts = gene_counts[rid]
gene_id = ref
CountWriter.write_row(gene_id, counts, stream=gene_out,)
except IOError as e:
logger.error("Failed to write gene counts: %s", str(e))
raise

Comment on lines +114 to +175
def write_category(
self,
category_id,
category_name,
category_sum,
counts,
# feature_names,
features,
unannotated_reads=None,
report_unseen=True,
):
with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
header = self.get_header()
print("feature", *header, sep="\t", file=feat_out)

if unannotated_reads is not None:
print("unannotated", unannotated_reads, sep="\t", file=feat_out)

if "total_readcount" in self.publish_reports:
CountWriter.write_row(
"total_reads",
np.zeros(len(header)) + self.total_readcount,
stream=feat_out,
)
with gzip.open(f"{self.out_prefix}.{category}.txt.gz", "wt") as feat_out:
header = self.get_header()
print("feature", *header, sep="\t", file=feat_out)

if unannotated_reads is not None:
print("unannotated", unannotated_reads, sep="\t", file=feat_out)

if "total_readcount" in self.publish_reports:
CountWriter.write_row(
"total_reads",
np.zeros(len(header)) + self.total_readcount,
stream=feat_out,
)

if "filtered_readcount" in self.publish_reports:
CountWriter.write_row(
"filtered_reads",
np.zeros(len(header)) + self.filtered_readcount,
stream=feat_out,
)
if "filtered_readcount" in self.publish_reports:
CountWriter.write_row(
"filtered_reads",
np.zeros(len(header)) + self.filtered_readcount,
stream=feat_out,
)

if "category" in self.publish_reports:
cat_counts = counts.get(f"cat:::{category_id}")
if cat_counts is not None:
cat_row = self.compile_output_row(
cat_counts,
scaling_factor=featcounts.scaling_factors["total_uniq"],
ambig_scaling_factor=featcounts.scaling_factors["total_ambi"],
)
CountWriter.write_row("category", cat_row, stream=feat_out)

for feature in db.get_features(category_id):
f_counts = counts.get(str(feature.id), np.zeros(len(header)))
if report_unseen or f_counts.sum():
out_row = self.compile_output_row(
f_counts,
scaling_factor=scaling_factor,
ambig_scaling_factor=ambig_scaling_factor,
)
CountWriter.write_row(feature.name, out_row, stream=feat_out)

def write_gene_counts(self, gene_counts, uniq_scaling_factor, ambig_scaling_factor):
if "scaled" in self.publish_reports:
logger.info("SCALING_FACTORS %s %s", uniq_scaling_factor, ambig_scaling_factor)
if "category" in self.publish_reports:
# cat_counts = counts[0]
cat_counts = category_sum
logger.info("CAT %s: %s", category_name, str(cat_counts))
if cat_counts is not None:
CountWriter.write_row("category", category_sum, stream=feat_out)

# for item in counts:
# if not isinstance(item[0], tuple):
# logger.info("ITEM: %s", str(item))
# raise TypeError(f"Weird key: {str(item)}")
# (cid, fid), fcounts = item
# if (report_unseen or fcounts.sum()) and cid == category_id:
# CountWriter.write_row(feature_names[fid], fcounts, stream=feat_out,)

empty_row = np.zeros(6, dtype=CountMatrix.NUMPY_DTYPE)
for feature in features:
key = (category_id, feature.id)
if counts.has_record(key):
row = counts[key]
else:
row = empty_row
if (report_unseen or row.sum()):
CountWriter.write_row(feature.name, row, stream=feat_out,)


# for (cid, fid), fcounts in counts:
# if (report_unseen or fcounts.sum()) and cid == category_id:
# CountWriter.write_row(feature_names[fid], fcounts, stream=feat_out,)


🛠️ Refactor suggestion

Multiple improvements needed.

  1. Remove commented-out code and parameters
  2. Add error handling for file operations
  3. Simplify the row assignment logic
  4. Add type hints and docstring
     def write_category(
         self,
-        category_id,
-        category_name,
-        category_sum,
-        counts,
-        # feature_names,
-        features,
-        unannotated_reads=None,
-        report_unseen=True,
+        category_id: int,
+        category_name: str,
+        category_sum: np.ndarray,
+        counts: CountMatrix,
+        features: List[Feature],
+        unannotated_reads: Optional[float] = None,
+        report_unseen: bool = True,
-    ):
+    ) -> None:
+        """Write category counts to a gzipped output file.
+
+        Args:
+            category_id: Category identifier
+            category_name: Category name
+            category_sum: Sum of counts for the category
+            counts: Matrix of counts
+            features: List of features
+            unannotated_reads: Count of unannotated reads
+            report_unseen: Whether to report features with zero counts
+
+        Raises:
+            IOError: If file operations fail
+        """
-        with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
+        try:
+            with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
                 # ... rest of the implementation ...
-                if counts.has_record(key):
-                    row = counts[key]
-                else:
-                    row = empty_row
+                row = counts[key] if counts.has_record(key) else empty_row
+        except IOError as e:
+            logger.error("Failed to write category %s: %s", category_name, str(e))
+            raise
📝 Committable suggestion


Suggested change
def write_category(
self,
category_id: int,
category_name: str,
category_sum: np.ndarray,
counts: CountMatrix,
features: List[Feature],
unannotated_reads: Optional[float] = None,
report_unseen: bool = True,
) -> None:
"""Write category counts to a gzipped output file.
Args:
category_id: Category identifier
category_name: Category name
category_sum: Sum of counts for the category
counts: Matrix of counts
features: List of features
unannotated_reads: Count of unannotated reads
report_unseen: Whether to report features with zero counts
Raises:
IOError: If file operations fail
"""
try:
with gzip.open(f"{self.out_prefix}.{category_name}.txt.gz", "wt") as feat_out:
header = self.get_header()
print("feature", *header, sep="\t", file=feat_out)
if unannotated_reads is not None:
print("unannotated", unannotated_reads, sep="\t", file=feat_out)
if "total_readcount" in self.publish_reports:
CountWriter.write_row(
"total_reads",
np.zeros(len(header)) + self.total_readcount,
stream=feat_out,
)
if "filtered_readcount" in self.publish_reports:
CountWriter.write_row(
"filtered_reads",
np.zeros(len(header)) + self.filtered_readcount,
stream=feat_out,
)
if "category" in self.publish_reports:
cat_counts = category_sum
logger.info("CAT %s: %s", category_name, str(cat_counts))
if cat_counts is not None:
CountWriter.write_row("category", category_sum, stream=feat_out)
empty_row = np.zeros(6, dtype=CountMatrix.NUMPY_DTYPE)
for feature in features:
key = (category_id, feature.id)
row = counts[key] if counts.has_record(key) else empty_row
if (report_unseen or row.sum()):
CountWriter.write_row(feature.name, row, stream=feat_out,)
except IOError as e:
logger.error("Failed to write category %s: %s", category_name, str(e))
raise
🧰 Tools
🪛 Ruff (0.8.2)

164-167: Use ternary operator row = counts[key] if counts.has_record(key) else empty_row instead of if-else-block

Replace if-else-block with row = counts[key] if counts.has_record(key) else empty_row

(SIM108)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
gffquant/counters/count_matrix.py (3)

25-29: Remove commented-out code.

The commented-out code block appears to be an alternative implementation. If it's no longer needed, please remove it to improve code readability.


89-129: Improve memory efficiency in generate_gene_counts.

The method creates multiple temporary arrays that could consume significant memory with large datasets:

  1. Initial copy of counts array
  2. Column stacking operation
  3. Multiple array operations

Consider using in-place operations where possible:

     def generate_gene_counts(self, lengths):
+        # Pre-allocate final array size to avoid multiple allocations
+        final_shape = (len(self.index), 6)
+        if self.counts.shape != final_shape:
+            self.counts = np.zeros(final_shape, dtype=self.NUMPY_DTYPE)
+
         # calculate combined_raw
         counts[:, 1:2] += counts[:, 0:1]
-        counts = np.column_stack(...)  # Avoid creating new array
+        # Reshape existing array and perform operations in-place

32-37: Add docstring to calculate_scaling_factor.

The static method needs documentation explaining its purpose and edge cases.

Add this docstring:

     @staticmethod
     def calculate_scaling_factor(raw, norm):
+        """Calculate scaling factor between raw and normalized counts.
+        
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+            
+        Returns:
+            float: Scaling factor (1.0 if norm is zero)
+        """
         if norm == 0.0:
             return 1.0
         return raw / norm
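A quick usage sketch of the behaviour this docstring describes, as a standalone function rather than the class method itself:

def calculate_scaling_factor(raw, norm):
    # 1.0 is a safe no-op factor when nothing was normalized
    return 1.0 if norm == 0.0 else raw / norm

lnorm_counts = [0.02, 0.5, 1.3]                      # hypothetical length-normalized counts
raw_total, lnorm_total = 100.0, sum(lnorm_counts)
sf = calculate_scaling_factor(raw_total, lnorm_total)
scaled = [c * sf for c in lnorm_counts]              # sums back to ~raw_total
assert abs(sum(scaled) - raw_total) < 1e-9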
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fe0bdc and 9a0f888.

📒 Files selected for processing (1)
  • gffquant/counters/count_matrix.py (1 hunks)
🔇 Additional comments (2)
gffquant/counters/count_matrix.py (2)

1-2: Add comprehensive module documentation.

The module needs proper documentation explaining its purpose, usage, and key components.


11-12: Document overflow scenarios with float16.

Please document the specific scenarios where float16 caused overflow issues. This will help future maintainers understand why float64 was chosen and when it might be safe to optimize by using float16.

Run this script to find potential overflow scenarios in tests:

✅ Verification successful

Overflow scenarios with float16 documented:

The float16 precision causes overflow in several critical numerical operations:

  • Length normalization (counts/length) overflows when normalized values exceed 65504
  • RPKM calculations using 1e9 factor exceed float16 range
  • Scaling factors from raw/normalized count ratios can overflow with large datasets
  • Small values from 1/N distribution of ambiguous reads risk precision loss when accumulated
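A small numeric sketch of the first two failure modes listed above (illustrative values only):

import numpy as np

# np.float16 tops out at ~65504, so a large raw count and the 1e9 RPKM-style
# factor both overflow to inf; float64 represents them comfortably.
print(np.float16(70000))                     # inf
print(np.float16(42) * np.float16(1e9))      # inf
print(np.float64(42) * 1e9 / (1500 * 2e6))   # finite, ~14.0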
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases that might trigger float16 overflow
rg -A 5 "test.*overflow|test.*float|assert.*overflow" tests/

Length of output: 109


Script:

#!/bin/bash
# Find test files and search for float-related patterns
echo "=== Python test files ==="
fd -e py "test" --type f

echo -e "\n=== Float-related patterns in Python files ==="
rg -l "float|numpy.*(float|dtype)|np.*(float|dtype)" 

echo -e "\n=== Documentation about float precision ==="
fd "README|CONTRIBUTING" --type f --exec cat {}

Length of output: 8321


Script:

#!/bin/bash
# Look for numerical operations in key files
echo "=== Count Matrix Implementation ==="
cat gffquant/counters/count_matrix.py

echo -e "\n=== Feature Quantifier Implementation ==="
cat gffquant/profilers/feature_quantifier.py

echo -e "\n=== Count Writer Implementation ==="
cat gffquant/annotation/count_writer.py

Length of output: 32070

Comment on lines 131 to 144
def group_gene_counts(self, ggroups):
ggroup_index = {}
# for (key, _), ggroup_id in zip(self.index.items(), ggroups):
# g_key_index = ggroup_index.get(ggroup_id)
# gene_counts = self.counts[self.index[key]]
for gene_id, gene_counts in self:
ggroup_id = gene_id.split(".")[-1]
g_key_index = ggroup_index.get(ggroup_id)
if g_key_index is None:
g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
self.counts[g_key_index] = gene_counts
else:
self.counts[g_key_index] += gene_counts


🛠️ Refactor suggestion

Improve gene ID parsing robustness.

The current implementation has several issues:

  1. Gene ID parsing assumes a specific format (split by ".") which might not be universal
  2. No validation of ggroup_id after splitting
  3. Potential memory inefficiency when building the group index

Consider this safer implementation:

     def group_gene_counts(self, ggroups):
         ggroup_index = {}
-        for gene_id, gene_counts in self:
-            ggroup_id = gene_id.split(".")[-1]
+        for gene_id, gene_counts in self:
+            try:
+                ggroup_id = gene_id.split(".")[-1]
+                if not ggroup_id:
+                    logger.warning(f"Empty group ID found for gene {gene_id}")
+                    continue
+            except IndexError:
+                logger.warning(f"Invalid gene ID format: {gene_id}")
+                continue
             g_key_index = ggroup_index.get(ggroup_id)
             if g_key_index is None:
                 g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
📝 Committable suggestion


Suggested change
def group_gene_counts(self, ggroups):
ggroup_index = {}
for gene_id, gene_counts in self:
try:
ggroup_id = gene_id.split(".")[-1]
if not ggroup_id:
logger.warning(f"Empty group ID found for gene {gene_id}")
continue
except IndexError:
logger.warning(f"Invalid gene ID format: {gene_id}")
continue
g_key_index = ggroup_index.get(ggroup_id)
if g_key_index is None:
g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
self.counts[g_key_index] = gene_counts
else:
self.counts[g_key_index] += gene_counts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
gffquant/counters/count_matrix.py (4)

12-12: Document the overflow threshold.

The comment indicates that 'float16' caused overflow issues during testing. It would be helpful to document the specific threshold or scenario that triggered the overflow to justify using 'float64'.

-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'  # float16 overflows for counts > 65504 during testing

25-29: Remove commented-out code.

The commented-out dictionary comprehension is redundant as the current implementation using a loop is more readable and serves the same purpose.


52-59: Extract magic number into a class constant.

The padding size of 1000 should be defined as a class constant for better maintainability and clarity.

 class CountMatrix:
     NUMPY_DTYPE = 'float64'
+    RESIZE_CHUNK = 1000  # Number of rows to add when resizing
     
     def _resize(self):
         nrows = self.counts.shape[0]
         if len(self.index) == nrows:
             self.counts = np.pad(
                 self.counts,
-                ((0, nrows + 1000), (0, 0),),
+                ((0, nrows + self.RESIZE_CHUNK), (0, 0),),
             )

131-151: Simplify group_gene_counts implementation.

The method has commented-out code and could be simplified for better readability.

 def group_gene_counts(self, ggroups):
+    """Group gene counts by the provided group IDs.
+    
+    Args:
+        ggroups: Sequence of group IDs corresponding to each gene
+    
+    Returns:
+        self for method chaining
+    """
     ggroup_index = {}
-    # for gene_id, gene_counts in self:
-    #     ggroup_id = gene_id.split(".")[-1]
-    #     g_key_index = ggroup_index.get(ggroup_id)
     for (_, gene_counts), ggroup_id in zip(self, ggroups):
         g_key_index = ggroup_index.get(ggroup_id)
-        # gene_counts = self.counts[self.index[key]]
         if g_key_index is None:
             g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
             self.counts[g_key_index] = gene_counts
         else:
             self.counts[g_key_index] += gene_counts
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a0f888 and b51b48d.

📒 Files selected for processing (1)
  • gffquant/counters/count_matrix.py (1 hunks)
🔇 Additional comments (1)
gffquant/counters/count_matrix.py (1)

89-129: 🛠️ Refactor suggestion

Improve readability of array operations.

The array operations use magic numbers for indexing, making the code hard to understand. Define constants for these indices and add comments explaining the array structure.

 def generate_gene_counts(self, lengths):
+    # Column indices for unique counts
+    UNIQUE_RAW = 0
+    UNIQUE_NORM = 1
+    UNIQUE_SCALED = 2
+    # Column indices for combined counts
+    COMBINED_RAW = 3
+    COMBINED_NORM = 4
+    COMBINED_SCALED = 5
+
     logger.info("LENGTHS ARRAY = %s", lengths.shape)
     logger.info("INDEX SIZE = %s", len(self.index))

     # remove the un-indexed rows
     counts = self.counts[0:len(self.index), :]

     # calculate combined_raw by adding unique counts
-    counts[:, 1:2] += counts[:, 0:1]
+    counts[:, 1:2] += counts[:, UNIQUE_RAW:UNIQUE_RAW+1]

     counts = np.column_stack(
         (
-            counts[:, 0], counts[:, 0], counts[:, 0],  # 0, 1, 2
-            counts[:, 1], counts[:, 1], counts[:, 1],  # 3, 4, 5
+            counts[:, UNIQUE_RAW], counts[:, UNIQUE_RAW], counts[:, UNIQUE_RAW],
+            counts[:, COMBINED_RAW], counts[:, COMBINED_RAW], counts[:, COMBINED_RAW],
         ),
     )

Likely invalid or redundant comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
gffquant/counters/count_matrix.py (5)

11-12: Document the rationale for using float64.

The comment about overflow issues should be expanded to explain:

  1. What kind of overflow issues were encountered
  2. Why float64 was chosen over other alternatives
  3. The memory implications of using float64
-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'
+    """Data type for count matrices.
+    
+    Using float64 instead of float16 to prevent overflow issues during:
+    - Large count accumulation
+    - Scaling operations
+    - Length normalization
+    
+    Note: This increases memory usage but ensures numerical stability.
+    """

14-30: Clean up commented code and add proper documentation.

The method needs cleanup and documentation improvements:

  1. Remove the commented-out dictionary comprehension
  2. Add type hints and docstring
     @classmethod
-    def from_count_matrix(cls, cmatrix, rows=None):
+    def from_count_matrix(cls, cmatrix: 'CountMatrix', rows: np.ndarray = None) -> 'CountMatrix':
+        """Create a new CountMatrix from an existing one.
+
+        Args:
+            cmatrix: Source count matrix
+            rows: Boolean mask for row selection
+
+        Returns:
+            A new CountMatrix instance with selected data
+        """
         if rows is None:
             counts = np.array(cmatrix.counts)
             index = dict(cmatrix.index.items())
         else:
             counts = cmatrix.counts[rows, :]
             index = {}
             for (key, _), keep in zip(cmatrix.index.items(), rows):
                 if keep:
                     index[key] = len(index)
-            # index = {
-            #     key: value
-            #     for (key, value), keep in zip(cmatrix.index.items(), rows)
-            #     if keep
-            # }
         return cls(index=index, counts=counts)

32-36: Add input validation and improve documentation.

The method should validate inputs and document its purpose clearly.

     @staticmethod
-    def calculate_scaling_factor(raw, norm):
+    def calculate_scaling_factor(raw: float, norm: float) -> float:
+        """Calculate scaling factor for count normalization.
+
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+
+        Returns:
+            Scaling factor (raw/norm) or 1.0 if norm is zero
+
+        Raises:
+            ValueError: If raw or norm is negative
+        """
+        if raw < 0 or norm < 0:
+            raise ValueError("Raw and normalized counts must be non-negative")
         if norm == 0.0:
             return 1.0
         return raw / norm

86-88: Document the drop_unindexed method.

The method needs documentation explaining its purpose and side effects.

-    def drop_unindexed(self):
+    def drop_unindexed(self) -> None:
+        """Remove rows that aren't in the index.
+        
+        This method modifies the matrix in-place by truncating it to
+        only include rows that have corresponding entries in the index.
+        """
         self.counts = self.counts[0:len(self.index), :]

131-154: Clean up commented code and improve documentation.

The method needs cleanup and better documentation:

  1. Remove commented-out code
  2. Document the purpose of using ggroups instead of splitting gene IDs
  3. Add type hints and return value documentation
-    def group_gene_counts(self, ggroups):
+    def group_gene_counts(self, ggroups: list[str]) -> 'CountMatrix':
+        """Group gene counts by provided group IDs.
+        
+        Args:
+            ggroups: List of group IDs corresponding to each gene
+            
+        Returns:
+            self for method chaining
+        """
         ggroup_index = {}
-        # for gene_id, gene_counts in self:
-        #     ggroup_id = gene_id.split(".")[-1]
-        #     g_key_index = ggroup_index.get(ggroup_id)
         for (_, gene_counts), ggroup_id in zip(self, ggroups):
             g_key_index = ggroup_index.get(ggroup_id)
-            # gene_counts = self.counts[self.index[key]]
             if g_key_index is None:
                 g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
                 self.counts[g_key_index] = gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
             else:
                 self.counts[g_key_index] += gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
gffquant/profilers/feature_quantifier.py (1)

158-198: Consider simplifying category processing logic.

The category processing logic could be simplified by:

  1. Using list comprehension for feature processing
  2. Extracting the category sum calculation into a separate method

Example refactor:

 for category in categories:
     logger.info("PROCESSING CATEGORY=%s", category.name)
     category_sum = np.zeros(6, dtype='float64')
     category_counts = CountMatrix(ncols=6)
-    for rid, counts in self.counter:
-        if gene_group_db:
-            ggroup_id = rid
-            logger.info("GGROUP %s: %s", ggroup_id, str(counts))
-        else:
-            ref, _ = self.reference_manager.get(rid[0] if isinstance(rid, tuple) else rid)
-            ggroup_id = ref
+    def get_ggroup_id(rid):
+        return rid if gene_group_db else self.reference_manager.get(rid[0] if isinstance(rid, tuple) else rid)[0]
+
+    for rid, counts in self.counter:
+        ggroup_id = get_ggroup_id(rid)
+        if gene_group_db:
+            logger.info("GGROUP %s: %s", ggroup_id, str(counts))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b51b48d and c065163.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

🔇 Additional comments (6)
gffquant/counters/count_matrix.py (3)

1-1: Add comprehensive module documentation.

The module needs proper documentation explaining its purpose, usage, and key components.


79-85: Add bounds checking and documentation for in-place modification.

The method needs bounds checking and should document that it modifies data in-place.


89-129: Improve readability and add validation in generate_gene_counts.

The method has complex array operations that need better documentation and validation.

gffquant/profilers/feature_quantifier.py (3)

203-250: Remove commented-out code block.

This large block of commented code appears to be obsolete after the refactoring. If it's no longer needed, it should be removed to improve code maintainability.


133-133: 🛠️ Refactor suggestion

Remove unused variable.

The variable count_annotator is assigned but never used.

-count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)


Line range hint 508-516: Verify AlignmentCounter.update return type.

Please ensure that the contributed_counts variable returned by counter.update matches the expected type used in the debug message.

✅ Verification successful

Return type verification completed successfully

The contributed_counts returned by counter.update() is a numeric value that matches the expected type used in the debug message. This is verified by examining both the implementation in AlignmentCounter and its usage in the debug output.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the return type of AlignmentCounter.update
ast-grep --pattern 'class AlignmentCounter {
  $$$
  update($$$) {
    $$$
    return $$$
  }
  $$$
}'

Length of output: 104


Script:

#!/bin/bash
# Find the AlignmentCounter class definition
rg "class AlignmentCounter" -l

# If found, examine the update method implementation
rg "class AlignmentCounter" -A 20

Length of output: 1785


Script:

#!/bin/bash
# Find the update method in AlignmentCounter
rg "def update" gffquant/counters/alignment_counter.py -A 10

# Check how contributed_counts is used in feature_quantifier
rg "contributed_counts" gffquant/profilers/feature_quantifier.py -A 5

Length of output: 1636

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (5)
gffquant/profilers/feature_quantifier.py (2)

163-204: Add documentation for category processing logic.

The category processing logic is complex and would benefit from documentation explaining:

  • The purpose of category_sum and category_counts
  • The scaling factor calculations
  • The relationship between gene groups and categories

Line range hint 514-520: Add error handling for counter.update.

Consider adding error handling around counter.update to gracefully handle potential exceptions during alignment processing.

+            try:
                 contributed_counts = self.counter.update(
                     count_stream,
                     ambiguous_counts=is_ambiguous_group,
                     pair=aln_group.is_paired(),
                     pe_library=aln_group.pe_library,
                 )
+            except Exception as e:
+                logger.error(f"Error updating counts for alignment group {aln_group.qname}: {e}")
+                contributed_counts = None
gffquant/counters/count_matrix.py (3)

12-12: Document the rationale for using float64.

Add a more detailed comment explaining the specific overflow issues encountered with float16 to help future maintainers understand this decision.

-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'  # Using float64 instead of float16 to prevent overflow
+    # issues during scaling and normalization operations, particularly
+    # when dealing with large count values or small normalization factors

25-29: Remove commented-out code.

The commented-out dictionary comprehension should be removed as it's no longer needed and can confuse maintainers.


131-134: Make output file path configurable.

The dump method writes to a hardcoded file path. Consider making this configurable to allow for different output locations and to facilitate testing.

-    def dump(self, state="genes"):
-        with open(f"CountMatrix.{state}.txt", "wt") as _out:
+    def dump(self, state="genes", output_dir="."):
+        """Dump the count matrix to a text file.
+        
+        Args:
+            state: String identifier for the current state
+            output_dir: Directory where the output file should be written
+        """
+        output_path = f"{output_dir}/CountMatrix.{state}.txt"
+        with open(output_path, "wt") as _out:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c065163 and 21d4264.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

🔇 Additional comments (5)
gffquant/profilers/feature_quantifier.py (3)

Line range hint 91-103: LGTM! Good use of ReferenceHit class.

The structured representation of alignment hits using the ReferenceHit class improves code organization and type safety.


293-293: LGTM! Correct usage of AlignmentCounter.

The change correctly uses the new AlignmentCounter API.


133-133: 🛠️ Refactor suggestion

Remove unused variable.

The count_annotator variable is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py (2)

1-1: Add comprehensive module documentation.

The module needs proper documentation explaining its purpose, usage, and key components.


137-158: Verify group_gene_counts changes.

The method has been significantly refactored from using gene_id splitting to using a provided ggroups parameter. Please verify:

  1. That all callers have been updated to provide the ggroups parameter
  2. That the ggroups parameter matches the expected length and format
✅ Verification successful

Changes to group_gene_counts are properly implemented

The refactoring is safe. The single caller in alignment_counter.py correctly provides the ggroups parameter, and the parameter format matches the expected usage. The use of zip() in the implementation ensures length consistency between counts and group identifiers.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to group_gene_counts to verify they provide ggroups parameter
rg -A 5 "group_gene_counts" --type python

Length of output: 72


Script:

#!/bin/bash
# Search for files containing group_gene_counts
rg "group_gene_counts" -l "*.py"

# Find method calls using ast-grep
ast-grep --pattern 'group_gene_counts($$$)'

# Get more context around the usage
rg "group_gene_counts" -B 3 -A 3 "*.py"

Length of output: 208


Script:

#!/bin/bash
# Find all Python files containing group_gene_counts
fd -e py | xargs rg "group_gene_counts" -l

# Get context around usage
fd -e py | xargs rg "group_gene_counts" -B 3 -A 3

# Find method calls in Python files
ast-grep -l python --pattern 'group_gene_counts($$$)'

Length of output: 1864


Script:

#!/bin/bash
# Get more context around the caller in alignment_counter.py
rg -B 10 -A 5 "group_gene_counts" gffquant/counters/alignment_counter.py

Length of output: 385

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
gffquant/profilers/feature_quantifier.py (1)

215-262: 🛠️ Refactor suggestion

Remove commented-out code block.

This large block of commented code should be removed as it's been replaced by the new implementation above. If this code contains valuable information, consider:

  1. Moving it to documentation
  2. Creating a ticket to track this technical debt
🧹 Nitpick comments (4)
gffquant/profilers/feature_quantifier.py (1)

169-190: Optimize category processing performance.

The nested loops for category processing could be optimized:

  1. Consider using a dictionary to cache region annotations
  2. Use list comprehension for better performance
  3. Consider vectorizing the operations using numpy

Would you like me to provide an optimized implementation?
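
As a generic sketch of the first two ideas (this is not the project's actual loop; the function and variable names are illustrative), caching the per-region annotation lookup and aggregating with a boolean mask avoids re-querying inside nested loops:

import numpy as np
from functools import lru_cache


@lru_cache(maxsize=None)
def get_region_annotation(region_id):
    # stand-in for an expensive database/annotation-manager lookup
    return "cat_a" if region_id in ("r1", "r2") else "cat_b"


region_ids = ["r1", "r2", "r1", "r3", "r1"]
counts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# vectorised per-category aggregation instead of nested Python loops
categories = np.array([get_region_annotation(r) for r in region_ids])
for category in np.unique(categories):
    print(category, counts[categories == category].sum())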

gffquant/counters/count_matrix.py (3)

25-29: Remove commented-out code.

The commented-out dictionary comprehension should be removed as it's no longer needed.


32-37: Add documentation for calculate_scaling_factor.

The method handles division by zero but lacks a docstring explaining its purpose and edge cases.

     @staticmethod
     def calculate_scaling_factor(raw, norm):
+        """Calculate scaling factor while handling zero normalization values.
+        
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+            
+        Returns:
+            float: Scaling factor (1.0 if norm is zero, raw/norm otherwise)
+        """
         if norm == 0.0:
             return 1.0
         return raw / norm

52-59: Extract magic number and improve resize logic.

The resize method uses a magic number (1000) for padding. This should be a class constant or configuration parameter.

+    RESIZE_INCREMENT = 1000  # Number of rows to add when resizing
+
     def _resize(self):
         nrows = self.counts.shape[0]
         if len(self.index) == nrows:
             self.counts = np.pad(
                 self.counts,
-                ((0, nrows + 1000), (0, 0),),
+                ((0, nrows + self.RESIZE_INCREMENT), (0, 0),),
             )
         return len(self.index)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21d4264 and 929e211.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py

136-136: Loop control variable index not used within loop body

(B007)

🔇 Additional comments (7)
gffquant/profilers/feature_quantifier.py (3)

13-20: LGTM! Clean import organization.

The new imports for AlignmentCounter and ReferenceHit are well-organized and align with the refactoring changes.


Line range hint 520-524: LGTM! Clean counter update implementation.

The refactored counter update logic is well-implemented with:

  1. Proper handling of ambiguous counts
  2. Clear pair and PE library handling
  3. Good debug logging
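
For readers unfamiliar with the counting model, a toy sketch of what distributing an ambiguous read over its targets can look like (the actual AlignmentCounter logic is not shown here; uniform 1/n weighting is just one possible distribution mode):

from collections import Counter


def add_read(counts, targets, ambiguous):
    """Add one read: full weight if unique, 1/len(targets) per target if ambiguous."""
    weight = 1.0 / len(targets) if ambiguous else 1.0
    for ref in targets:
        counts[ref] += weight


counts = Counter()
add_read(counts, ["geneA"], ambiguous=False)
add_read(counts, ["geneA", "geneB"], ambiguous=True)
print(counts)  # Counter({'geneA': 1.5, 'geneB': 0.5})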

133-133: 🛠️ Refactor suggestion

Remove unused variable.

The variable count_annotator is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py (4)

1-2: Add comprehensive module documentation.

The module still lacks proper documentation explaining its purpose, usage, and key components.


38-48: Add input validation to constructor.

The constructor still lacks validation for input parameters.


89-129: Improve readability of generate_gene_counts.

The method uses magic indices and lacks clear documentation of the array structure.


12-12: Verify and document the float16 overflow issue.

The comment suggests there was an overflow issue with float16, but it lacks details about the specific test case or scenario.

✅ Verification successful

Comment provides valuable historical context about dtype choice

The comment about float16 overflow serves as important documentation for future maintainers, explaining why float64 was chosen. The codebase consistently uses float64 across multiple components that handle numerical operations.
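
As background, a quick self-contained illustration of why float16 is a poor fit for count accumulation (this demonstrates a generic numpy property and is not taken from the project's test suite):

import numpy as np

# float16 tops out around 6.5e4, so scaled totals overflow to inf ...
print(np.float16(60000) * np.float16(2))   # inf

# ... and long accumulations silently stall once the spacing between
# representable values exceeds the increment (here: at 2048).
acc = np.float16(0)
for _ in range(10_000):
    acc += np.float16(1)
print(acc)                                 # 2048.0

# the same workload is unproblematic in float64
print(np.float64(60000) * 2, float(10_000))  # 120000.0 10000.0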

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and overflow-related patterns
rg -l "overflow|float16|dtype" --type python

Length of output: 77


Script:

#!/bin/bash
# Search for test files and overflow-related patterns with correct extension
rg -l "overflow|float16|dtype" -g "*.py"

# Search specifically in test files
fd "test.*\.py$" --exec rg -l "float|dtype|numpy" {}

# Look for the actual test implementations
rg -A 5 "class.*Test.*Matrix" -g "test_*.py"

Length of output: 462


Script:

#!/bin/bash
# Check content of count_matrix.py for context
rg -B 2 -A 2 "float|dtype|numpy" gffquant/counters/count_matrix.py

# Find test files with correct pattern
fd "test.*matrix.*\.py$" --exec cat {}

# Check other files for dtype usage context
rg -B 2 -A 2 "dtype" gffquant/profilers/feature_quantifier.py gffquant/alignment/bamreader.py gffquant/annotation/count_writer.py gffquant/db/__init__.py

Length of output: 1443

Comment on lines 131 to 138
    def dump(self, state="genes", labels=None,):
        with open(f"CountMatrix.{state}.txt", "wt") as _out:
            if labels is None:
                for index, counts in self:
                    print(index, *counts, sep="\t", file=_out)
            for (index, counts), label in zip(self, labels):
                print(label, *counts, sep="\t", file=_out)

⚠️ Potential issue

Fix logic error in dump method.

The method has several issues:

  1. The labels condition is incorrectly implemented (both loops will execute)
  2. The index variable in the second loop is unused
  3. No error handling for file operations
     def dump(self, state="genes", labels=None,):
+        """Dump matrix contents to a file.
+        
+        Args:
+            state: String identifier for the output file
+            labels: Optional labels for each row
+        """
         with open(f"CountMatrix.{state}.txt", "wt") as _out:
-            if labels is None:
+            try:
+                if labels is None:
                     for index, counts in self:
                         print(index, *counts, sep="\t", file=_out)
-                for (index, counts), label in zip(self, labels):
-                    print(label, *counts, sep="\t", file=_out)
+                else:
+                    for (_, counts), label in zip(self, labels):
+                        print(label, *counts, sep="\t", file=_out)
+            except IOError as e:
+                logger.error("Failed to write matrix: %s", e)
+                raise
🧰 Tools
🪛 Ruff (0.8.2)

136-136: Loop control variable index not used within loop body

(B007)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
gffquant/profilers/feature_quantifier.py (1)

65-69: ⚠️ Potential issue

Verify AlignmentCounter.dump implementation.

The dump method in AlignmentCounter is present but deliberately raises NotImplementedError. There's an actual implementation after the raise statement, but it's unreachable.

To fix this:

  • Remove the raise NotImplementedError() line in gffquant/counters/alignment_counter.py:dump
🧹 Nitpick comments (1)
gffquant/counters/count_matrix.py (1)

11-12: Document the rationale for using float64.

The comment indicates that float16 causes overflow issues, but it would be helpful to document:

  1. The specific overflow scenarios
  2. The performance implications of using float64
  3. Whether float32 was considered as a middle ground
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 929e211 and 12dbfb3.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py

137-137: Loop control variable index not used within loop body

(B007)

🔇 Additional comments (8)
gffquant/counters/count_matrix.py (6)

1-2: Add comprehensive module documentation.

The module lacks proper documentation explaining its purpose, usage, and key components.

Add a detailed docstring that includes:

  • Module purpose and overview
  • Key classes and their relationships
  • Usage examples
  • Any important notes about memory usage or performance considerations
-""" module docstring """
+"""Count matrix management for gene expression quantification.
+
+This module provides the CountMatrix class for efficient storage and manipulation
+of gene expression count data. It supports operations like:
+- Raw and normalized count storage
+- Length normalization
+- Count scaling and grouping
+- Memory-efficient storage and operations
+
+Example:
+    matrix = CountMatrix(ncols=2, nrows=1000)
+    matrix[gene_id] = [raw_count, 0]
+    matrix.generate_gene_counts(lengths)
+"""

38-48: Add input validation to constructor.

The constructor should validate input parameters and document the expected shapes.

-    def __init__(self, ncols=2, nrows=1000, index=None, counts=None,):
+    def __init__(self, ncols: int = 2, nrows: int = 1000, 
+                 index: dict = None, counts: np.ndarray = None) -> None:
+        """Initialize CountMatrix.
+        
+        Args:
+            ncols: Number of columns for the count matrix
+            nrows: Initial number of rows
+            index: Optional pre-existing index dictionary
+            counts: Optional pre-existing count matrix
+            
+        Raises:
+            ValueError: If counts is provided without matching index
+            ValueError: If ncols or nrows are not positive integers
+        """
+        if ncols <= 0 or nrows <= 0:
+            raise ValueError("ncols and nrows must be positive integers")
+        if counts is not None and index is None:
+            raise ValueError("counts provided without matching index")
         if index is not None and counts is not None:
+            if len(index) > counts.shape[0]:
+                raise ValueError("index size exceeds counts dimensions")
             self.index = dict(index.items())
             self.counts = counts

61-71: Add bounds checking to item access methods.

Both getitem and setitem should verify array bounds before access.

     def __getitem__(self, key):
         key_index = self.index.get(key)
         if key_index is None:
+            if len(self.index) >= self.counts.shape[0]:
+                raise IndexError("Matrix capacity exceeded")
             key_index = self.index[key] = self._resize()
         return self.counts[key_index]

     def __setitem__(self, key, value):
         key_index = self.index.get(key)
         if key_index is None:
+            if len(self.index) >= self.counts.shape[0]:
+                raise IndexError("Matrix capacity exceeded")
             key_index = self.index[key] = self._resize()
+        if not isinstance(value, (list, np.ndarray)) or len(value) != self.counts.shape[1]:
+            raise ValueError(f"Value must be array-like with length {self.counts.shape[1]}")
         self.counts[key_index] = value

79-85: Add bounds checking and documentation for in-place modification.

The scale_column method needs bounds checking and should document that it modifies data in-place.

-    def scale_column(self, col_index, factor, rows=None):
+    def scale_column(self, col_index: int, factor: float, rows=None) -> None:
+        """Scale values in the specified column by the given factor.
+        
+        Args:
+            col_index: Index of the column to scale
+            factor: Scaling factor to apply
+            rows: Optional mask for row selection
+            
+        Note:
+            This operation modifies the matrix in-place.
+            
+        Raises:
+            IndexError: If col_index is out of bounds
+        """
+        if not 0 <= col_index < self.counts.shape[1] - 1:
+            raise IndexError(f"Column index {col_index} out of bounds")
         # apply scaling factors
         if rows is None:
             self.counts[:, col_index + 1] = self.counts[:, col_index] * factor
         else:
             self.counts[rows, col_index + 1] = self.counts[rows, col_index] * factor

89-130: Improve readability and add validation in generate_gene_counts.

The generate_gene_counts method has complex array operations that need:

  1. Input validation
  2. Better documentation of the operations
  3. Extraction of magic numbers into named constants
-    def generate_gene_counts(self, lengths):
+    def generate_gene_counts(self, lengths: np.ndarray) -> 'CountMatrix':
+        """Generate normalized and scaled gene counts.
+
+        Args:
+            lengths: Array of gene lengths for normalization
+
+        Returns:
+            self for method chaining
+
+        Raises:
+            ValueError: If lengths shape doesn't match count matrix
+        """
+        if lengths.shape[0] != len(self.index):
+            raise ValueError(f"Lengths array shape {lengths.shape} doesn't match matrix shape {len(self.index)}")
+
+        # Constants for clarity
+        RAW_COL = 0
+        NORM_COL = 1
+        SCALED_COL = 2
+
         logger.info("LENGTHS ARRAY = %s", lengths.shape)
         logger.info("INDEX SIZE = %s", len(self.index))

         # remove the un-indexed rows
         counts = self.counts[0:len(self.index), :]

         # calculate combined_raw
         counts[:, 1:2] += counts[:, 0:1]

+        # Create columns: [raw, length_norm, scaled] for both unique and combined counts
         counts = np.column_stack(
             (
-                counts[:, 0], counts[:, 0], counts[:, 0],  # 0, 1, 2
-                counts[:, 1], counts[:, 1], counts[:, 1],  # 3, 4, 5
+                counts[:, RAW_COL], counts[:, RAW_COL], counts[:, RAW_COL],  # unique counts
+                counts[:, NORM_COL], counts[:, NORM_COL], counts[:, NORM_COL],  # combined counts
             ),
         )

131-139: Fix logic error in dump method.

The method has several issues:

  1. The labels condition is incorrectly implemented (both loops will execute)
  2. The index variable in the second loop is unused
  3. No error handling for file operations
-    def dump(self, state="genes", labels=None,):
+    def dump(self, state: str = "genes", labels: list = None) -> None:
+        """Dump matrix contents to a file.
+        
+        Args:
+            state: String identifier for the output file
+            labels: Optional labels for each row
+        """
         with open(f"CountMatrix.{state}.txt", "wt") as _out:
-            if labels is None:
+            try:
+                if labels is None:
                     for index, counts in self:
                         print(index, *counts, sep="\t", file=_out)
-                for (index, counts), label in zip(self, labels):
-                    print(label, *counts, sep="\t", file=_out)
+                else:
+                    for (_, counts), label in zip(self, labels):
+                        print(label, *counts, sep="\t", file=_out)
+            except IOError as e:
+                logger.error("Failed to write matrix: %s", e)
+                raise
🧰 Tools
🪛 Ruff (0.8.2)

137-137: Loop control variable index not used within loop body

(B007)

gffquant/profilers/feature_quantifier.py (2)

215-262: Remove commented-out code block.

This large block of commented code should be removed if it's no longer needed.

If it contains valuable information, consider:

  1. Moving it to documentation
  2. Creating a ticket to track this technical debt

133-133: 🛠️ Refactor suggestion

Remove unused variable.

The variable count_annotator is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
gffquant/counters/count_matrix.py (4)

11-12: Document the overflow issue in more detail.

The comment about float16 overflow should be expanded to help future maintainers understand the specific scenarios where overflow occurred and why float64 was chosen.

-    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing
+    NUMPY_DTYPE = 'float64'  # Using float64 instead of float16 to prevent overflow
+    # during large count accumulation and scaling operations in testing.
+    # float16 max value (65504) is too small for typical read counts.

14-30: Add type hints and documentation to the classmethod.

The method needs proper type hints and documentation for better maintainability.

     @classmethod
-    def from_count_matrix(cls, cmatrix, rows=None):
+    def from_count_matrix(
+        cls,
+        cmatrix: 'CountMatrix',
+        rows: np.ndarray | None = None
+    ) -> 'CountMatrix':
+        """Create a new CountMatrix from an existing one.
+
+        Args:
+            cmatrix: Source CountMatrix instance
+            rows: Optional boolean mask for row selection
+
+        Returns:
+            A new CountMatrix instance with selected data
+        """

25-29: Remove commented-out code.

The commented-out dictionary comprehension should be removed as it's no longer needed.


32-36: Add type hints and improve documentation.

The method needs proper type hints and clearer documentation about the zero-value handling.

     @staticmethod
-    def calculate_scaling_factor(raw, norm):
+    def calculate_scaling_factor(raw: float, norm: float) -> float:
+        """Calculate scaling factor between raw and normalized values.
+
+        Args:
+            raw: Raw count value
+            norm: Normalized count value
+
+        Returns:
+            float: Scaling factor (raw/norm) or 1.0 if norm is zero
+        """
         if norm == 0.0:
             return 1.0
         return raw / norm
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12dbfb3 and a822d12.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py

137-137: Loop control variable index not used within loop body

(B007)

🔇 Additional comments (10)
gffquant/counters/count_matrix.py (7)

1-2: Add comprehensive module documentation.

The module lacks proper documentation explaining its purpose, usage, and key components.


38-48: Add input validation to constructor.

The constructor should validate input parameters and document the expected shapes.


61-71: Add bounds checking to item access methods.

Both getitem and setitem should verify array bounds before access.


79-85: Add bounds checking and documentation for in-place modification.

The scale_column method needs bounds checking and should document that it modifies data in-place.


89-130: Improve readability and add validation in generate_gene_counts.

The method needs input validation and better documentation of the operations.


174-175: Add input validation and documentation to colsum method.

The method needs bounds checking and proper documentation.
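
The colsum implementation itself is not shown in this diff; purely to illustrate the kind of guard being requested, a minimal stand-in (class layout, signature and attribute names are assumptions):

import numpy as np


class CountMatrixSketch:
    """Tiny stand-in exposing only the attributes a guarded colsum would need."""

    def __init__(self):
        self.index = {"geneA": 0, "geneB": 1}
        # third row is unindexed padding, as produced by _resize
        self.counts = np.array([[3.0, 1.0], [2.0, 4.0], [0.0, 0.0]])

    def colsum(self, col_index):
        """Return the sum of one count column over the indexed rows only."""
        if not 0 <= col_index < self.counts.shape[1]:
            raise IndexError(f"Column index {col_index} out of bounds")
        return self.counts[0:len(self.index), col_index].sum()


print(CountMatrixSketch().colsum(0))  # 5.0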


131-138: ⚠️ Potential issue

Fix unused variable and add error handling.

The method has two issues:

  1. Unused index variable in the loop with labels
  2. Missing error handling for file operations
     def dump(self, state="genes", labels=None,):
+        """Dump matrix contents to a file.
+
+        Args:
+            state: String identifier for the output file
+            labels: Optional labels for each row
+
+        Raises:
+            IOError: If file operations fail
+        """
-        with open(f"CountMatrix.{state}.txt", "wt") as _out:
-            if labels is None:
-                for index, counts in self:
-                    print(index, *counts, sep="\t", file=_out)
-            else:
-                for (index, counts), label in zip(self, labels):
-                    print(label, *counts, sep="\t", file=_out)
+        try:
+            with open(f"CountMatrix.{state}.txt", "wt") as _out:
+                if labels is None:
+                    for index, counts in self:
+                        print(index, *counts, sep="\t", file=_out)
+                else:
+                    for (_, counts), label in zip(self, labels):
+                        print(label, *counts, sep="\t", file=_out)
+        except IOError as e:
+            logger.error("Failed to write matrix: %s", e)
+            raise

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

137-137: Loop control variable index not used within loop body

(B007)

gffquant/profilers/feature_quantifier.py (3)

215-262: Remove commented-out code block.

This large block of commented code should be removed if it's no longer needed.


Line range hint 520-524: Well-structured update to use new counter interface.

The changes to process_alignment_group are well-implemented, making the code more concise and maintainable.


133-133: ⚠️ Potential issue

Remove unused variable.

The count_annotator variable is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

Comment on lines 141 to 172
    def group_gene_counts(self, ggroups):

        ggroup_counts = CountMatrix(ncols=6)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            ggroup_counts[ggroup_id] +=gene_counts

        return ggroup_counts



        ggroup_index = {}
        # for gene_id, gene_counts in self:
        #     ggroup_id = gene_id.split(".")[-1]
        #     g_key_index = ggroup_index.get(ggroup_id)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            g_key_index = ggroup_index.get(ggroup_id)
            # gene_counts = self.counts[self.index[key]]
            if g_key_index is None:
                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
                self.counts[g_key_index] = gene_counts
                # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
            else:
                self.counts[g_key_index] += gene_counts
                # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)

        # replace index with grouped index
        self.index = ggroup_index

        # remove the un-indexed (ungrouped) rows
        self.counts = self.counts[0:len(self.index), :]

        return self

⚠️ Potential issue

Fix duplicate implementation and remove commented code.

The method has two critical issues:

  1. Duplicate implementation (lines 141-148 and 151-172)
  2. Commented-out code that should be removed

Keep the first implementation (lines 141-148) as it's more concise and remove the duplicate implementation and commented code.

     def group_gene_counts(self, ggroups):
+        """Group gene counts by provided group IDs.
+
+        Args:
+            ggroups: Iterable of group IDs corresponding to each gene
+
+        Returns:
+            CountMatrix: New matrix with grouped counts
+        """
         ggroup_counts = CountMatrix(ncols=6)
         for (_, gene_counts), ggroup_id in zip(self, ggroups):
             ggroup_counts[ggroup_id] +=gene_counts
         
         return ggroup_counts
-
-
-
-        ggroup_index = {}
-        # for gene_id, gene_counts in self:
-        #     ggroup_id = gene_id.split(".")[-1]
-        #     g_key_index = ggroup_index.get(ggroup_id)
-        for (_, gene_counts), ggroup_id in zip(self, ggroups):
-            g_key_index = ggroup_index.get(ggroup_id)
-            # gene_counts = self.counts[self.index[key]]
-            if g_key_index is None:
-                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
-                self.counts[g_key_index] = gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-            else:
-                self.counts[g_key_index] += gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-
-        # replace index with grouped index
-        self.index = ggroup_index
-
-        # remove the un-indexed (ungrouped) rows
-        self.counts = self.counts[0:len(self.index), :]
-
-        return self

Comment on lines +145 to +166
        total_gene_counts = self.counter.generate_gene_count_matrix(self.reference_manager)
        logger.info("TOTAL_GENE_COUNTS = %s", total_gene_counts)

        count_writer.write_feature_counts(
            self.adm,
            count_annotator,
            (None, unannotated_reads)[report_unannotated],
        count_writer.write_gene_counts(
            self.counter,
            self.reference_manager,
            gene_group_db=gene_group_db,
        )

        count_writer.write_gene_counts(
            count_annotator.gene_counts,
            count_annotator.scaling_factors["total_gene_uniq"],
            count_annotator.scaling_factors["total_gene_ambi"]
        ggroups = tuple(
            (self.reference_manager.get(key[0] if isinstance(key, tuple) else key))[0] # .split(".")[0]
            for key, _ in self.counter
        )


        self.counter.counts.dump(labels=ggroups)

        self.counter.group_gene_count_matrix(self.reference_manager)
        unannotated_reads = self.counter.get_unannotated_reads() + self.aln_counter["unannotated_ambig"]

        self.counter.counts.dump(state="ggroup")

🛠️ Refactor suggestion

Add error handling for gene count operations.

The gene count operations lack proper error handling and logging of potential failures.

+        try:
             total_gene_counts = self.counter.generate_gene_count_matrix(self.reference_manager)
             logger.info("TOTAL_GENE_COUNTS = %s", total_gene_counts)

             count_writer.write_gene_counts(
                 self.counter,
                 self.reference_manager,
                 gene_group_db=gene_group_db,
             )

             ggroups = tuple(
                 (self.reference_manager.get(key[0] if isinstance(key, tuple) else key))[0]
                 for key, _ in self.counter
             )

             self.counter.counts.dump(labels=ggroups)

             self.counter.group_gene_count_matrix(self.reference_manager)
             self.counter.counts.dump(state="ggroup")
+        except Exception as e:
+            logger.error("Failed to process gene counts: %s", e)
+            raise

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (2)
gffquant/counters/count_matrix.py (2)

25-29: Remove commented-out code.

The commented-out dictionary comprehension should be removed as it's no longer needed and adds noise to the codebase.

-            # index = {
-            #     key: value
-            #     for (key, value), keep in zip(cmatrix.index.items(), rows)
-            #     if keep
-            # }

55-58: Consider exponential growth for array resizing.

The current implementation adds a fixed 1000 rows when resizing. Consider using exponential growth (e.g., doubling the size) for better amortized performance with large datasets.

-                ((0, nrows + 1000), (0, 0),),
+                ((0, nrows * 2), (0, 0),),
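
A rough standalone sketch of the amortised-growth idea (this is not the project's _resize; the growth factor and names are illustrative):

import numpy as np


def grow_rows(arr, used, factor=2.0):
    """Return arr padded with zero rows so at least one free row exists.

    Doubling (factor=2.0) keeps the amortised cost of appends constant,
    at the price of up to ~50% temporarily unused rows.
    """
    capacity = arr.shape[0]
    if used < capacity:
        return arr
    new_capacity = max(capacity + 1, int(capacity * factor))
    return np.pad(arr, ((0, new_capacity - capacity), (0, 0)))


counts = np.zeros((4, 2))
counts = grow_rows(counts, used=4)
print(counts.shape)  # (8, 2)
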
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a822d12 and e3c86c1.

📒 Files selected for processing (2)
  • gffquant/counters/count_matrix.py (1 hunks)
  • gffquant/profilers/feature_quantifier.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
gffquant/profilers/feature_quantifier.py

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)

gffquant/counters/count_matrix.py

137-137: Loop control variable index not used within loop body

(B007)

🔇 Additional comments (6)
gffquant/profilers/feature_quantifier.py (6)

13-20: LGTM! Clean import organization.

The new imports for AlignmentCounter and ReferenceHit are well-organized and properly aligned with the code changes.


Line range hint 89-102: LGTM! Well-structured ReferenceHit implementation.

The ReferenceHit usage properly encapsulates alignment hit information with all necessary parameters.


300-300: LGTM! Clean method call.

The change to use counter.toggle_single_read_handling is clean and consistent with the AlignmentCounter refactoring.


Line range hint 521-535: LGTM! Well-structured counter update.

The counter update implementation properly handles ambiguous counts, pairing information, and includes helpful debug logging.


133-133: 🛠️ Refactor suggestion

Remove unused variable.

The variable count_annotator is assigned but never used.

-        count_annotator = Annotator(self.strand_specific, report_scaling_factors=report_scaling_factors)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

133-133: Local variable count_annotator is assigned to but never used

Remove assignment to unused variable count_annotator

(F841)


148-152: 🛠️ Refactor suggestion

Add error handling for write_gene_counts.

The gene count operations should include error handling to gracefully handle potential failures.

+        try:
             count_writer.write_gene_counts(
                 self.counter,
                 self.reference_manager,
                 gene_group_db=gene_group_db,
             )
+        except Exception as e:
+            logger.error("Failed to write gene counts: %s", e)
+            raise

Likely invalid or redundant comment.

Comment on lines +141 to +148
    def group_gene_counts(self, ggroups):

        ggroup_counts = CountMatrix(ncols=6)
        for (_, gene_counts), ggroup_id in zip(self, ggroups):
            ggroup_counts[ggroup_id] += gene_counts

        return ggroup_counts

⚠️ Potential issue

Remove duplicate implementation in group_gene_counts.

The method has two implementations:

  1. Lines 141-148: A clean implementation using a new CountMatrix
  2. Lines 151-172: A more complex in-place implementation with commented-out code

The first implementation is cleaner and more maintainable. Remove the second implementation.

     def group_gene_counts(self, ggroups):
         ggroup_counts = CountMatrix(ncols=6)
         for (_, gene_counts), ggroup_id in zip(self, ggroups):
             ggroup_counts[ggroup_id] += gene_counts
         
         return ggroup_counts
-
-
-
-        ggroup_index = {}
-        # for gene_id, gene_counts in self:
-        #     ggroup_id = gene_id.split(".")[-1]
-        #     g_key_index = ggroup_index.get(ggroup_id)
-        for (_, gene_counts), ggroup_id in zip(self, ggroups):
-            g_key_index = ggroup_index.get(ggroup_id)
-            # gene_counts = self.counts[self.index[key]]
-            if g_key_index is None:
-                g_key_index = ggroup_index[ggroup_id] = len(ggroup_index)
-                self.counts[g_key_index] = gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to new group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-            else:
-                self.counts[g_key_index] += gene_counts
-                # logger.info("CM.group_gene_counts: Adding %s to group %s (%s).", str(gene_counts), ggroup_id, g_key_index)
-
-        # replace index with grouped index
-        self.index = ggroup_index
-
-        # remove the un-indexed (ungrouped) rows
-        self.counts = self.counts[0:len(self.index), :]
-
-        return self

Also applies to: 151-172

Comment on lines +137 to +138
                for (index, counts), label in zip(self, labels):
                    print(label, *counts, sep="\t", file=_out)

⚠️ Potential issue

Fix unused variable in dump method.

The loop variable index is not used in the loop body. Remove it from the tuple unpacking.

-                for (index, counts), label in zip(self, labels):
+                for (_, counts), label in zip(self, labels):
🧰 Tools
🪛 Ruff (0.8.2)

137-137: Loop control variable index not used within loop body

(B007)



class CountMatrix:
    NUMPY_DTYPE = 'float64'  # float16 causes some overflow issue during testing

💡 Codebase verification

Document the float16 overflow issue with a more detailed comment

The current comment about float16 overflow is too terse. Since this is a critical numerical precision decision that affects the entire counting system, it should be documented more thoroughly:

  • Add a detailed comment explaining what operations caused the overflow
  • Reference the specific test case or scenario that demonstrated the issue
  • Consider adding this information to the class docstring instead
🔗 Analysis chain

Verify and document the overflow issue with float16.

The comment suggests that float16 causes overflow issues during testing. This should be documented more thoroughly with specific examples or test cases that demonstrate the problem.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and test cases related to float16 overflow
rg -l "float16|overflow" --type py

Length of output: 126


Script:

#!/bin/bash
# Show context around float16/overflow mentions
echo "=== Context in count_matrix.py ==="
rg "float16|overflow" -B 2 -A 2 gffquant/counters/count_matrix.py

echo -e "\n=== Context in other files ==="
rg "float16|overflow" -B 2 -A 2 gffquant/db/__init__.py gffquant/alignment/bamreader.py

echo -e "\n=== Related test files ==="
fd "test.*\.py" | xargs rg "float|dtype" -B 2 -A 2

Length of output: 7315

Comment on lines 216 to 263
# functional_counts, category_sums = count_annotator.annotate_gene_counts(
# self.reference_manager,
# self.adm,
# self.counter,
# gene_group_db=gene_group_db,
# )

# logger.info("FC-index: %s", str(list(functional_counts.index.keys())[:10]))
# logger.info("FC-counts: %s", str(functional_counts.counts[0:10, :]))

# categories = self.adm.get_categories()
# for category, category_sum in zip(categories, category_sums):
# features = tuple(self.adm.get_features(category.id))
# feature_names = {
# feature.id: feature.name
# for feature in features
# }
# rows = tuple(
# key[0] == category.id
# for key, _ in functional_counts
# )

# cat_counts = CountMatrix.from_count_matrix(functional_counts, rows=rows)
# # cat_counts = CountMatrix(ncols=6, nrows=len(feature_names))
# # for feature in features:
# # key = (category.id, feature.id)
# # if functional_counts.has_record(key):
# # cat_counts[key] += functional_counts[key]
# # else:
# # _ = cat_counts[key]

# # for category in categories:
# # features = ((feature.name, feature) for feature in db.get_features(category.id))
# # for _, feature in sorted(features, key=lambda x: x[0]):
# # _ = functional_counts[(category.id, feature.id)]


# logger.info("PROCESSING CATEGORY=%s", category.name)
# count_writer.write_category(
# category.id,
# category.name,
# category_sum,
# # functional_counts,
# cat_counts,
# # feature_names,
# features,
# unannotated_reads=(None, unannotated_reads)[report_unannotated],
# )

🛠️ Refactor suggestion

Remove commented-out code block.

This large block of commented code should be removed to improve maintainability. If it contains valuable information, consider moving it to documentation.

Comment on lines +65 to 69
        self.counter = AlignmentCounter(
            distribution_mode=distribution_mode,
            region_counts=run_mode.overlap_required,
            strand_specific=strand_specific and not run_mode.overlap_required,
            strand_specific=strand_specific,
            paired_end_count=paired_end_count,
        )

💡 Codebase verification

The dump method in AlignmentCounter is not implemented

The dump method in the AlignmentCounter class (gffquant/counters/alignment_counter.py) raises NotImplementedError and contains commented-out implementation code. To implement it properly:

  • Remove the raise NotImplementedError()
  • Implement the method using the CountMatrix interface instead of the commented-out code that references the non-existent self.index
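
A minimal sketch of such a delegation, assuming the counter keeps its CountMatrix in a counts attribute (the attribute name and signature are assumptions, not the actual implementation):

class AlignmentCounterSketch:
    """Illustrative stand-in; the real class wires up its CountMatrix elsewhere."""

    def __init__(self, count_matrix):
        self.counts = count_matrix  # assumed attribute holding the CountMatrix

    def dump(self, state="genes", labels=None):
        # delegate serialisation to the CountMatrix instead of re-implementing it
        # (and instead of iterating a self.index that no longer exists)
        self.counts.dump(state=state, labels=labels)
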
🔗 Analysis chain

Verify AlignmentCounter.dump implementation.

The initialization of AlignmentCounter looks good, but the dump method needs to be properly implemented to avoid runtime failures.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify if dump method is properly implemented
ast-grep --pattern 'class AlignmentCounter {
  $$$
  dump($$$) {
    $$$
  }
  $$$
}'

Length of output: 87


Script:

#!/bin/bash
# Find the AlignmentCounter class definition
rg "class AlignmentCounter" -l

# If found, show the class implementation
rg "class AlignmentCounter" -A 50

Length of output: 4212


Script:

#!/bin/bash
# Show the complete AlignmentCounter implementation
cat gffquant/counters/alignment_counter.py

Length of output: 6124
