Skip to content

Conversation

rcastelo
Copy link
Contributor

@rcastelo rcastelo commented Feb 14, 2023

Hi,

This PR provides support for using the strandMode parameter with GAlignmentsList objects, following our discussion at the Bioconductor support site. In principle, I've made all necessary edits in the code and in the documentation, and added a unit test in inst/unitTests/test_readGAlignmentsList.R. I've tested that the package builds and checks without errors and without warnings caused by this PR.

I've tried to mimic where possible the implementation of strandMode for GAlignmentPairs object, but there's the following important difference. In GAlignmentPairs objects, there is a GAlignments object for each mate of a paired alignment, stored in the slots first and last. This implicitly stores the information about which mate is first and which is not. In the case of GAlignmentsList objects, each pair of aligned reads, or set of ambiguously paired aligned reads, is stored as an element of a CompressedRangesList object, and therefore, we need to additionally store which aligned read is the first mate, and which is not the first mate. This information is in the flag metadata, so the current implementation reads that metadata column from the BAM file and uses it to figure out the "real" strand of each mate, according to the strandMode parameter, in a modified strand() method. The other key point of the implementation is in the new findOverlaps() methods for GAlignmentsList objects, which prior to calculating the overlaps, they need to create a new version of the input GAlignmentsList object with the real strand according to the strandMode parameter. This is done by instructions similar to this one:

query2 <- query
strand(query2) <- strand(query2)

and perform the findOverlaps() operation on the new version query2 of the GAlingmentsList object. This has the consequence of increasing the memory footprint, first because we're storing the flag information (an integer vector), and second because in the findOverlaps() operation we are duplicating the GAlignmentsList object. Probably there's a more efficient way of implementing this, but I couldn't come up with anything better. In any case, the current solution at least calculates the overlaps correctly according to the real strand of the aligned reads, which I think is very important.

@hpages
Copy link
Contributor

hpages commented Feb 14, 2023

Thanks @rcastelo but do we really need to add the strandMode slot and accessor to the GAlignmentsList class? Wouldn't it be enough to add the strandMode argument to the readGAlignmentsList() function to "fix" the strand of the fragments at load-time? I thought this was all you needed to support your summarizeOverlaps( , fragments=TRUE) use case but maybe I misunderstood.

@jmacdon
Copy link
Collaborator

jmacdon commented Feb 14, 2023

@hpages In Robert's original example, there are multiple list objects like this:

> gall[lengths(gall) == 1L]
GAlignmentsList object of length 127:
[[1]]
GAlignments object with 1 alignment and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq2      +         35M        35      1520      1554        35
          njunc
      <integer>
  [1]         0
  -------
  seqinfo: 2 sequences from an unspecified genome

And without knowing which end of the read that alignment comes from, how do you attribute the correct strand? In other words, if it's the first read and it's strandMode = 2, the strand should get flipped. But if it's the second read and strandMode = 2, it's on the + strand. So you have to read in from the flag metadata and store that somewhere, no?

@rcastelo
Copy link
Contributor Author

I've tried to mimic where possible the implementation of strandMode for GAlignmentPairs objects. I saw that the class definition of GAlignmentPairs includes a slot for strandMode, and that the way in which this was implemented there was to preserve the original strand and calculate the real strand when necessary. So, I thought that should be the strategy for GAlignmentsList objects. There are also coercion methods between GAlignmentPairs and GAlignmentsList objects, which by the way now I realize I should have modified accordingly as well, and I didn't, and such coercion methods would need to have the original strand in GAlignmentsList objects to turn into GAlignmentPairs objects. The summarizeOverlaps( , fragments=TRUE) case is one of the reasons to have this working, but in more general terms, we need to use GAlignmentsList objects in the atena package for certain cases and have access in them to the correct strand.

@jmacdon I was writing this comment as yours popped up, I think Hervé is proposing to minimize the memory footprint by setting the real strand during the call to readGAlignmentsList(), so that one doesn't need to store the strandMode, flag info and, more importantly, the original strand anymore, which I think it would be a good strategy if we are sure we don't need the original strand anymore after reading the object, but the current implementation of GAlignmentPairs and the coercion methods, make me doubt about this.

@hpages
Copy link
Contributor

hpages commented Feb 14, 2023

Yes I'm proposing that readGAlignmentsList() gets a strandMode argument that can be used to "fix" the strand of the fragments based on the flag metadata. So the strand gets fixed at load time once for all. I realize that this means that the original strand is lost and that you won't be able to restore it or change your mind like you can do right now by using the strandMode() setter on a GAlignmentPairs object. But if all we want is support the summarizeOverlaps( , fragments=TRUE) use case then maybe that's all we need?

Touching the internals of the GAlignmentsList class is a big deal and is quite invasive so I'd rather avoid it if we can.

@rcastelo
Copy link
Contributor Author

As you say, we won't have the strandMode setter, and we won't have coercions with GAlignmentPairs and, in general, the functionality of GAlignmentsList will not be entirely parallel to the functionality of GAlignmentPairs. I'm not sure we want that, and in fact, this PR in principle updates all the necessary internals of the GAlignmentsList class, with the exception of the coercion methods, which I forgot about it, but I could fix as well in this PR. Shouldn't we give it a chance with a month and a halve before the next release, or do you think is it too risky?

@hpages
Copy link
Contributor

hpages commented Feb 14, 2023

Touching the internals of an S4 class is a big deal because it breaks all serialized instances. I don't expect that but how do we know for sure that there are no serialized GAlignmentsList objects around? That's something that would need to be investigated. If we go that route, then we want to make sure that:

  • The coercions methods between GAlignmentPairs and GAlignmentsList are adjusted. A round-trip from GAlignmentPairs to GAlignmentsList and back to GAlignmentPairs should always be a no-op, whatever the strandMode on the original GAlignmentPairs is.
  • By default readGAlignmentsList() should return an object that is semantically equivalent to the object that was returned before this change. This means that the new object should produce the same results as the old object when coercing to GAlignmentPairs or GRangesList or GRanges, or when using it in the context of findOverlaps() and family. If we don't get this right then downstream packages could start failing or silently produce different results.
  • We should provide an updateObject() method for GAlignmentsList objects.

I probably forgot a few things.

But keep in mind that whatever approach you choose, we will only be able to know what we break once this is merged and we have a full build report. Then someone will have to deal with the breakage on the build report.

So I'm just trying to avoid embarking on a big and time-consuming adventure by proposing a simple way to support your summarizeOverlaps( , fragments=TRUE) use case. This would also lead to a PR that is much simpler for me to review (again I don't have much time to dedicate to this).

@rcastelo
Copy link
Contributor Author

Ok, I'll do another PR modifying only the readGAlignmentsList() function, separately from this one, to leave these edits available for the future, just in case we'd need to consider going this route at some point. Let me point out that by modifying only the readGAlignmentsList() function, the coercion method from GAlignmentsList to GAlignmentPairs will do the wrong thing, because it will copy the real strand and the GAlignmentPairs object will interpret it as the original strand. Anyway, the current code doesn't seem right to me either, because is not using the flag info to figure out which mate is first and which is last and, because the flag info is not there, has to assume that all aligned pairs are "proper".

@hpages
Copy link
Contributor

hpages commented Feb 15, 2023

Sounds good. Yes we can always add the strandMode slot to the GAlignmentsList class later. Thanks!

@rcastelo
Copy link
Contributor Author

Hi again, for completeness with respect to this PR, I have pushed changes implementing coercion methods between GAlignmentPairs and GAlignmentsList objects and an updateObject method, as you request in one of your messages above. Actually, when implementing the coercion methods, I realized that these methods do not work in the current release version of GenomicAlignments:

library(GenomicAlignments)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
galp <- readGAlignmentPairs(bamfile)
galp2gall <- as(galp, "GAlignmentsList")
Error in validObject(.Object) : invalid class “GAlignmentsList” object: 
    'mcols(x)' is not parallel to 'x'
gall <- readGAlignmentsList(bamfile)
gall <- gall[mcols(gall)$mate_status == "mated"]
gall2galp <- as(gall, "GAlignmentPairs")
gallback <- as(gall2galp, "GAlignmentsList")
Error in validObject(.Object) : invalid class “GAlignmentsList” object: 
    'mcols(x)' is not parallel to 'x'

This PR fixes these problems and provides a no-op round trip between the two classes:

library(GenomicAlignments)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
galp <- readGAlignmentPairs(bamfile)
galp2gall <- as(galp, "GAlignmentsList")
galpback <- as(galp2gall, "GAlignmentPairs")
identical(galp, galpback)
[1] TRUE
gall <- readGAlignmentsList(bamfile)
gall <- gall[mcols(gall)$mate_status == "mated"] ## GAlignmentPairs only include mated pairs
gall2galp <- as(gall, "GAlignmentPairs")
gallback <- as(gall2galp, "GAlignmentsList")
identical(gall, gallback)
[1] TRUE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants