Support for strandMode in GAlignmentsList objects #25
base: devel
Thanks @rcastelo, but do we really need to add the […]
@hpages In Robert's original example, there are multiple list objects like this:
And without knowing which end of the read that alignment comes from, how do you attribute the correct strand? In other words, if it's the first read and it's strandMode = 2, the strand should get flipped. But if it's the second read and strandMode = 2, it's on the + strand. So you have to read in from the flag metadata and store that somewhere, no?
> I've tried to mimic where possible the implementation of strandMode for […]
@jmacdon I was writing this comment as yours popped up. I think Hervé is proposing to minimize the memory footprint by setting the real strand during the call to […]
Yes, I'm proposing that […]. Touching the internals of the GAlignmentsList class is a big deal and is quite invasive, so I'd rather avoid it if we can.
As you say, we won't have the […]
Touching the internals of an S4 class is a big deal because it breaks all serialized instances. I don't expect that but how do we know for sure that there are no serialized GAlignmentsList objects around? That's something that would need to be investigated. If we go that route, then we want to make sure that:
I probably forgot a few things. But keep in mind that whatever approach you choose, we will only be able to know what we break once this is merged and we have a full build report. Then someone will have to deal with the breakage on the build report. So I'm just trying to avoid embarking on a big and time-consuming adventure by proposing a simple way to support your […]
OK, I'll do another PR modifying only the […]
Sounds good. Yes, we can always add the strandMode slot to the GAlignmentsList class later. Thanks!
… and an updateObject method
Hi again, for completeness with respect to this PR, I have pushed changes implementing coercion methods between […]
This PR fixes these problems and provides a no-op round trip between the two classes:
Hi,

This PR provides support for using the `strandMode` parameter with `GAlignmentsList` objects, following our discussion at the Bioconductor support site. In principle, I've made all necessary edits in the code and in the documentation, and added a unit test in `inst/unitTests/test_readGAlignmentsList.R`. I've tested that the package builds and checks without errors and without warnings caused by this PR.

I've tried to mimic where possible the implementation of `strandMode` for `GAlignmentPairs` objects, but there's the following important difference. In `GAlignmentPairs` objects, there is a `GAlignments` object for each mate of a paired alignment, stored in the slots `first` and `last`. This implicitly stores the information about which mate is first and which is not. In the case of `GAlignmentsList` objects, each pair of aligned reads, or set of ambiguously paired aligned reads, is stored as an element of a `CompressedRangesList` object, and therefore we need to additionally store which aligned read is the first mate and which is not. This information is in the `flag` metadata, so the current implementation reads that metadata column from the BAM file and uses it to figure out the "real" strand of each mate, according to the `strandMode` parameter, in a modified `strand()` method.

The other key point of the implementation is in the new `findOverlaps()` methods for `GAlignmentsList` objects, which, prior to calculating the overlaps, need to create a new version of the input `GAlignmentsList` object with the real strand according to the `strandMode` parameter. This is done by instructions similar to this one: […] and then performing the `findOverlaps()` operation on the new version, `query2`, of the `GAlignmentsList` object. This has the consequence of increasing the memory footprint, first because we're storing the `flag` information (an integer vector), and second because in the `findOverlaps()` operation we are duplicating the `GAlignmentsList` object. There is probably a more efficient way of implementing this, but I couldn't come up with anything better. In any case, the current solution at least calculates the overlaps correctly according to the real strand of the aligned reads, which I think is very important.
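
To make the strand logic discussed in this thread concrete, here is a minimal, language-neutral sketch in Python. It is not the code in this PR and not the GenomicAlignments API: the function name `real_strand` and its arguments are hypothetical. It assumes the standard SAM flag bits `0x40` (first mate) and `0x80` (last mate), and it assumes the `strandMode` semantics documented for `GAlignmentPairs` (0: strand is `*`; 1: the pair follows the first mate, so the last mate is inverted; 2: the pair follows the last mate, so the first mate is inverted). Leaving the strand unchanged when neither mate bit is set is also an assumption.

```python
# Standard SAM flag bits (from the SAM specification).
FIRST_MATE = 0x40  # read is the first segment of the template
LAST_MATE = 0x80   # read is the last segment of the template


def invert(strand):
    """Swap '+' and '-'; anything else becomes '*'."""
    return {"+": "-", "-": "+"}.get(strand, "*")


def real_strand(strand, flag, strand_mode):
    """Hypothetical helper: 'real' strand of one aligned mate.

    strand      -- strand as aligned ('+', '-' or '*')
    flag        -- integer SAM flag of the alignment
    strand_mode -- 0, 1 or 2 (assumed GAlignmentPairs semantics)
    """
    if strand_mode == 0:
        return "*"
    is_first = bool(flag & FIRST_MATE)
    is_last = bool(flag & LAST_MATE)
    if strand_mode == 1:
        # Pair strand follows the first mate: invert the last mate.
        return invert(strand) if is_last else strand
    if strand_mode == 2:
        # Pair strand follows the last mate: invert the first mate.
        return invert(strand) if is_first else strand
    raise ValueError("strand_mode must be 0, 1 or 2")


# One pair: first mate aligned to '+', last mate aligned to '-'.
pair = [("+", 0x1 | FIRST_MATE), ("-", 0x1 | LAST_MATE)]
mode1 = [real_strand(s, f, 1) for s, f in pair]
mode2 = [real_strand(s, f, 2) for s, f in pair]
```

With these assumptions, both mates of a properly oriented pair end up on the same real strand, which is what a strand-aware overlap needs. The sketch also shows why the flag has to be available: without it, the two alignments inside a list element cannot be told apart.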