ahl27
Contributor

@ahl27 ahl27 commented Jan 7, 2025

This is part 2 of a PR; you can find part 1 here: Bioconductor/S4Vectors#127

Background is covered in that PR description; this one will just cover what wasn't mentioned there.

This PR implements the XVector methods required to allow comparisons between XVectors. After merging both these PRs, the following are now supported:

x <- DNAString("ATGC")
x[order(x)]
## 4-letter DNAString object
## seq: ACGT

x == "A"
## [1] FALSE FALSE FALSE  TRUE

pcompare(x, DNAString("GCCC")) == 0
## [1]  TRUE  TRUE FALSE FALSE

x <- as(1:10, "XInteger")
x == 1:10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

x <= 10:1
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

identical(10:1 < x, 10:1 < as.integer(x))
## [1] TRUE

And similarly for XRaw and XDouble objects.

Known Issues:

  1. Different sorting methods aren't supported for SharedVector.order or XVector.order. That implementation will depend on the backend being implemented in S4Vectors (see this comment).
  2. No sameAsPreviousROW function exists for SharedVector objects, only XVector. Do we need one for SharedVector? Since the class isn't exported I'm not sure if it's necessary.

@hpages
Contributor

hpages commented Jan 18, 2025

Hi Aidan,

Sorry for the slow response.

Comparing 2 DNAString objects (or other XString derivatives) with == or != works atomically i.e. it doesn't compare the 2 objects letter-wise but as a whole:

library(Biostrings)

DNAString("GACC") == DNAString("GACC")
# [1] TRUE

x <- DNAString("GACC")
y <- DNAString("GACCTAT")
x == y
# [1] FALSE

I made this decision a long time ago, based on my feeling at the time that this semantic would be more useful than a vectorized letter-wise comparison. The letter-wise comparison can be useful too, and you can obtain it by coercing the 2 objects to raw vectors:

as.raw(x) == as.raw(y)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
# Warning message:
# In as.raw(x) == as.raw(y) :
#   longer object length is not a multiple of shorter object length

I don't know if this is documented somewhere but it probably should be.

Had I made the decision to go the other way around (i.e. have == do a letter-wise comparison), then the user would still have the ability to perform atomic comparison with something like length(x) == length(y) && all(x == y). However, note that this wouldn't be as efficient as an optimized atomic comparison, especially when comparing long sequences of hundreds of millions of nucleotides (like Human chromosomes). In such a case, x == y would expand to a big logical vector of 4 * length(x) bytes in memory (i.e. about 1 GB for Human chr1).
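
To make the trade-off concrete, here is a minimal base R sketch that uses plain raw vectors as stand-ins for the sequences (no Biostrings involved). The variable names and the chr1 length of roughly 249 million bases are illustrative assumptions. It contrasts a whole-object comparison with the letter-wise-then-reduce form, and estimates the size of the intermediate logical vector:

```r
# Plain raw vectors standing in for two sequences (not XString objects).
x <- charToRaw("GACC")
y <- charToRaw("GACCTAT")

# Atomic comparison: one answer for the whole object, no per-letter result.
atomic_equal <- identical(x, y)

# Letter-wise comparison reduced to a single answer. The length check
# short-circuits, so the recycled element-wise comparison is never run
# when the lengths differ.
letterwise_equal <- length(x) == length(y) && all(x == y)

# A logical vector stores 4 bytes per element, so the intermediate result
# of a letter-wise comparison grows linearly with sequence length.
bytes_per_logical <- 4
chr1_length <- 249e6   # assumed approximate length of Human chr1
intermediate_gb <- bytes_per_logical * chr1_length / 1e9

print(atomic_equal)
print(letterwise_equal)
print(intermediate_gb)
```

Both forms agree on the answer here; the difference is only in how much temporary memory the letter-wise form would need on long inputs.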

Note that XRaw objects (which XString objects derive from) also compare atomically:

xraw1 <- as(charToRaw('q5^#.*A'), "XRaw")
xraw2 <- as(charToRaw('q5^#.*a'), "XRaw")
xraw1 != xraw2
# [1] TRUE

Anyways, for better or worse, we're stuck with the atomic comparison 😉

This means that <=, >=, <, and > should follow.

Note that all these comparisons are already implemented for XStringSet derivatives so an easy way to make them work on XString objects is to coerce the latter to the former (this coercion is a no-copy operation so is very cheap). Something like this:

setMethod("<=", c("XString", "XString"),
    function(e1, e2) as(e1, "XStringSet") <= as(e2, "XStringSet")
)

Note that there's no need to define the >=, <, and > methods. These work out-of-the-box, as long as <= works, thanks to the default >=, <, and > methods defined in the S4Vectors package for all Vector derivatives e.g.:

> selectMethod('<', c("DNAString", "DNAString"))
Method Definition:

function (e1, e2) 
{
    !(e2 <= e1)
}
<bytecode: 0x60cf37611818>
<environment: namespace:S4Vectors>

Signatures:
        e1          e2         
target  "DNAString" "DNAString"
defined "Vector"    "Vector"   
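
The same pattern can be tried outside of the Bioconductor stack. Below is a minimal sketch, assuming a made-up ToySeq class and a hypothetical lex_compare() helper (neither exists in S4Vectors or Biostrings). Only == and <= are written by hand; < mirrors the !(e2 <= e1) definition shown above, and the remaining derived definitions are my own assumption of how they could look:

```r
library(methods)

# Hypothetical toy class standing in for an XString derivative.
setClass("ToySeq", representation(codes = "integer"))
ToySeq <- function(s) new("ToySeq", codes = utf8ToInt(s))

# Hypothetical helper: three-way lexicographic comparison of two integer
# vectors, returning -1, 0 or 1.
lex_compare <- function(a, b) {
    n <- min(length(a), length(b))
    d <- a[seq_len(n)] - b[seq_len(n)]
    first_diff <- which(d != 0L)
    if (length(first_diff) > 0L)
        return(sign(d[first_diff[1L]]))
    sign(length(a) - length(b))
}

# Only == and <= are written by hand ...
setMethod("==", c("ToySeq", "ToySeq"),
    function(e1, e2) lex_compare(e1@codes, e2@codes) == 0
)
setMethod("<=", c("ToySeq", "ToySeq"),
    function(e1, e2) lex_compare(e1@codes, e2@codes) <= 0
)

# ... and the other operators are derived from them.
setMethod("!=", c("ToySeq", "ToySeq"), function(e1, e2) !(e1 == e2))
setMethod(">=", c("ToySeq", "ToySeq"), function(e1, e2) e2 <= e1)
setMethod("<",  c("ToySeq", "ToySeq"), function(e1, e2) !(e2 <= e1))
setMethod(">",  c("ToySeq", "ToySeq"), function(e1, e2) !(e1 <= e2))

x <- ToySeq("GACC")
y <- ToySeq("GACCTAT")
print(x <= y)
print(x < y)
print(x > y)
```

Every comparison returns a single logical value, which is the atomic semantic described earlier in this comment.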

Thanks for working on this. The S4Vectors/IRanges/XVector/Biostrings code base is a maze with 2 levels, the R level and the C level, each of them split across 4 packages. So 8 regions! It can be hard to navigate and understand the interactions between the 8 regions. It's great that you were able to find your way from the R level in Biostrings all the way down to the C level in S4Vectors. This was a good exercise that will help you in the future.

Thanks again and let me know if you have questions,

H.

@ahl27
Contributor Author

ahl27 commented Jan 22, 2025

Sorry for the slow response -- thanks for the feedback!

Comparing 2 DNAString objects (or other XString derivatives) with == or != works atomically i.e. it doesn't compare the 2 objects letter-wise but as a whole.

Makes sense. I can change this to match that implementation. At the very least, there's an order functionality implemented now 😅 .

Note that all these comparisons are already implemented for XStringSet derivatives so an easy way to make them work on XString objects is to coerce the latter to the former (this coercion is a no-copy operation so is very cheap).

Also makes sense. I can open a PR with that change after I fix these PRs.

I'll work on updating these tomorrow.


The only remaining issue is what to do in an XVector, ANY or ANY, XVector comparison. There are a few ways this could go. Of note are the following comparisons:

1:2 == 1:2
## [1]  TRUE  TRUE

XInteger(2,1:2) == XInteger(2,1:2)
## [1] TRUE

1:2 == XInteger(2,1:2)
## [1]  TRUE  TRUE     (?)
## [1]  TRUE           (?)

I'm thinking the best implementation is to just coerce the ANY argument to the corresponding XVector type, e.g.:

function(x_vec, y_any){
    # Coerce the non-XVector argument to the XVector's class, then compare
    x_vec == as(y_any, class(x_vec))
}

While this breaks consistency with the base R types, it does preserve the expected behavior of comparing XVector objects and gives some degree of type checking:

XInteger(2,1:2) == as("A", "XInteger")
## Error in as("A", "XInteger") : 
##   no method or default for coercing “character” to “XInteger”

which is probably a nice feature. Maybe I'll also add some documentation updates to the man page to address the element-wise comparison you mentioned; I agree that it should probably be documented somewhere.
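
A minimal sketch of the coerce-then-compare idea, using a made-up ToyXInteger class in place of XInteger so that it runs with only the methods package. The class, its data slot, and the setAs() definition are assumptions for illustration, not the XVector implementation:

```r
library(methods)

# Hypothetical toy class standing in for XInteger.
setClass("ToyXInteger", representation(data = "integer"))

# Only integer input can be coerced; anything else fails inside as().
setAs("integer", "ToyXInteger",
    function(from) new("ToyXInteger", data = from)
)

# Atomic comparison between two toy objects: a single TRUE or FALSE.
setMethod("==", c("ToyXInteger", "ToyXInteger"),
    function(e1, e2) identical(e1@data, e2@data)
)

# Mixed comparisons coerce the other argument first, then compare atomically.
setMethod("==", c("ToyXInteger", "ANY"),
    function(e1, e2) e1 == as(e2, class(e1))
)
setMethod("==", c("ANY", "ToyXInteger"),
    function(e1, e2) as(e1, class(e2)) == e2
)

x <- as(1:2, "ToyXInteger")
same_left  <- x == 1:2
same_right <- 1:2 == x
differs    <- x == 2:1

# A character vector has no coercion method, so the comparison errors.
bad <- tryCatch(x == "A", error = function(e) "coercion error")

print(same_left)
print(same_right)
print(differs)
print(bad)
```

With this setup the result is a single logical value regardless of argument order, and unsupported types fail at the coercion step rather than being silently recycled.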

@ahl27
Contributor Author

ahl27 commented Jan 23, 2025

New update in response to these comments -- the following comparisons now work:

x <- XInteger(5, 1:5)
y <- XInteger(5, 5:1)

x == y    ## FALSE
x <= y    ## TRUE
y <= x    ## FALSE
x <  y    ## TRUE
x >  y    ## FALSE
x >= y    ## FALSE

y <- 5:1
x == y    ## FALSE
x <= y    ## TRUE
y <= x    ## FALSE
x <  y    ## TRUE
x >  y    ## FALSE
x >= y    ## FALSE

I've also left in the order and pcompare functionality. I think the workflow is a little more straightforward than the previous workflow of coercing with as.raw:

pcompare(x, y)     ## [1] -1 -1  0  1  1
pcompare(y, x)     ## [1]  1  1  0 -1 -1
pcompare(x, 5:1)   ## [1] -1 -1  0  1  1
pcompare(5:1, x)   ## [1]  1  1  0 -1 -1
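
For readers without the PR installed, the element-wise result above can be reproduced on plain integer vectors. This is only a base R sketch of the semantic (sign of the difference), not how pcompare is implemented:

```r
# Plain integer vectors standing in for the XInteger objects above.
x <- 1:5
y <- 5:1

# Element-wise three-way comparison:
# -1 where x < y, 0 where they are equal, 1 where x > y.
elementwise <- sign(x - y)

# Atomic equality for contrast: a single value for the whole vector.
atomic_equal <- identical(x, y)

print(elementwise)
print(atomic_equal)
```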

I've also updated the relevant documentation files to add a description of both of these functionalities for users.

Some brief notes on decisions made here:

  1. A setMethod("<=", signature(e1="ANY", e2="XVector"), ...) definition is required to ensure comparisons are atomic and consistent regardless of argument order. The same applies to the equality comparison.
  2. order on multiple inputs doesn't work properly using do.call. I'm not sure why that is. Everything seems to be consistent if I use lapply, so I've done that instead.

Example of faulty order behavior:

setMethod("order", "XVector",
    function(..., na.last=TRUE, decreasing=FALSE, method=c("auto", "shell", "radix")){
        args <- list(...)
        if (length(args) == 1L) {
            x <- args[[1L]]
            SharedVector.order(x, decreasing)
        } else {
            args <- unname(args)
            do.call(order, c(args, list(na.last=na.last,
                                        decreasing=decreasing,
                                        method=method)))
        }
    }
)

x <- XInteger(10)
y <- XInteger(10)
order(x,y) ## errors using do.call, works using lapply(args, order, decreasing=decreasing)
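
For reference, on plain base R vectors (not XVector objects; key1 and key2 are made-up inputs) the two call styles return different things, which may matter when choosing between them. A small sketch:

```r
# Two plain key vectors with a tie in the first key.
key1 <- c(2L, 1L, 2L)
key2 <- c(3L, 9L, 1L)
args <- list(key1, key2)

# do.call passes every key to a single order() call, so ties in key1
# are broken by key2 and one permutation comes back.
multi_key <- do.call(order, c(args, list(decreasing = FALSE)))

# lapply calls order() once per key, giving a list with one independent
# permutation per input rather than a single tie-broken ordering.
per_key <- lapply(args, order, decreasing = FALSE)

print(multi_key)
print(per_key)
```

This does not explain the error seen with XVector arguments; it only shows that the lapply form yields one ordering per input instead of a combined multi-key ordering.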

@ahl27
Contributor Author

ahl27 commented Jan 23, 2025

A couple of force pushes to amend commits; no major changes since the last comment.

@hpages
Contributor

hpages commented May 7, 2025

Hi Aidan,

There's a lot going on here and I think it's going to be easier/simpler if we leave aside the XInteger comparison for now. I think the PR should just focus on XString objects. More precisely, it should focus on:

  • (a) comparison between 2 XString derivatives
  • (b) comparison between an XString derivative and a character vector of length 1

Right now most of these comparisons are broken, as documented in Biostrings' TODO file here: https://github.com/Bioconductor/Biostrings/blob/44e8aa36353658a53a8ff7141082c024b6b39b9c/TODO#L62-L114

Comparisons between XInteger objects or between XDouble objects are also broken in a similar way, and they will need to be addressed at some point. However, fixing them will have additional challenges. Plus these comparisons are not related to or used by Biostrings, so I'd say it's ok to leave them alone for now.

So do you think you can modify this PR, or create a new PR (up to you), that focuses on (a) and (b) above? Given my previous comment from Jan 18 above, the new (or modified) PR should be much simpler. In particular, I don't anticipate the need to touch anything at the C level.

Hope this makes sense,

Thanks,
H.

@ahl27
Contributor Author

ahl27 commented Aug 18, 2025

Sorry for the slow followup -- haven't had a lot of bandwidth outside of finishing up my dissertation.

I've adapted this PR into a Biostrings-specific one available here: Bioconductor/Biostrings#130

This addresses these issues specifically for XStrings (and XString-character comparisons). That should be sufficient for now; feel free to close this PR or leave it open for reference when the aforementioned issues with XInteger/XDouble need to be addressed.
