Possible Bug in findCorrelation_exact: Incorrect comparison logic for variable removal #1409

@alepr

Description

https://github.com/topepo/caret/blob/c98cc1a3ba5f0b087d51f5c4362a3b751515e243/pkg/caret/R/findCorrelation.R#L66C9-L71C64

There appears to be a logical error in the findCorrelation_exact function (in findCorrelation.R) in how it compares two highly correlated variables to decide which one to remove.

mn1 <- mean(x2[i,], na.rm = TRUE)
mn2 <- mean(x2[-j,], na.rm = TRUE)  # <- Issue here

The comparison is between:

mn1: Mean correlation of variable i with all other variables
mn2: Mean of the entire correlation matrix excluding row j (but still including row i)
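To make the indexing concrete, here is a minimal sketch on a hypothetical 4 x 4 absolute-correlation matrix. The values, the NA diagonal, and the choice of i = 1 and j = 2 are assumptions for illustration and are not taken from caret. In R, a negative row index drops that row, so x2[-j, ] keeps every row except j, including row i.

```r
# Hypothetical absolute-correlation matrix (illustration only).
# The diagonal is assumed to be NA, which is why na.rm = TRUE matters.
x2 <- matrix(c(
  NA,   0.95, 0.10, 0.20,
  0.95, NA,   0.60, 0.70,
  0.10, 0.60, NA,   0.30,
  0.20, 0.70, 0.30, NA
), nrow = 4, byrow = TRUE)

i <- 1
j <- 2

row_i     <- x2[i, ]   # numeric vector: correlations of variable i only
all_but_j <- x2[-j, ]  # matrix of rows 1, 3 and 4; row i is still included

length(row_i)                            # 4
dim(all_but_j)                           # 3 4
identical(all_but_j, x2[c(1, 3, 4), ])   # TRUE
```

So mean(x2[-j, ], na.rm = TRUE) averages over three rows of the matrix rather than over the single row belonging to variable j.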

I'm not sure whether this is intended, or whether the algorithm should instead compare the mean correlations of the two highly correlated variables (i and j) directly to decide which one to remove:

mn1 <- mean(x2[i,], na.rm = TRUE)
mn2 <- mean(x2[j,], na.rm = TRUE) 
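The two versions can lead to different choices. The sketch below uses a hypothetical matrix and assumes the variable with the larger mean is the one removed (drop i when mn1 > mn2, otherwise drop j). That rule, the names mn2_current, mn2_proposed, drop_current and drop_proposed, and the data are illustrative assumptions. The sketch does not reproduce the rest of findCorrelation_exact.

```r
# Hypothetical absolute-correlation matrix with an NA diagonal (illustration only).
x2 <- matrix(c(
  NA,   0.95, 0.10, 0.20,
  0.95, NA,   0.60, 0.70,
  0.10, 0.60, NA,   0.30,
  0.20, 0.70, 0.30, NA
), nrow = 4, byrow = TRUE)

i <- 1
j <- 2

mn1          <- mean(x2[i, ],  na.rm = TRUE)  # row i only
mn2_current  <- mean(x2[-j, ], na.rm = TRUE)  # every row except j
mn2_proposed <- mean(x2[j, ],  na.rm = TRUE)  # row j only

# Assumed decision rule: remove the variable with the larger mean.
drop_current  <- if (mn1 > mn2_current)  i else j
drop_proposed <- if (mn1 > mn2_proposed) i else j

round(c(mn1 = mn1, mn2_current = mn2_current, mn2_proposed = mn2_proposed), 3)
c(drop_current = drop_current, drop_proposed = drop_proposed)
```

On this toy input, mn1 is about 0.417, the current expression gives about 0.383, and the proposed one gives 0.75, so the current comparison would remove variable 1 while the direct comparison would remove variable 2.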

Could someone confirm if this is the intended behavior or if it should be corrected?
