https://github.com/topepo/caret/blob/c98cc1a3ba5f0b087d51f5c4362a3b751515e243/pkg/caret/R/findCorrelation.R#L66C9-L71C64
There appears to be a logical error in the findCorrelation_exact function (findCorrelation.R), in the step that compares two variables to decide which one to remove:
mn1 <- mean(x2[i,], na.rm = TRUE)
mn2 <- mean(x2[-j,], na.rm = TRUE) # <- Issue here
The comparison is between:
mn1: Mean correlation of variable i with all other variables (row i of x2)
mn2: Mean of all entries of x2 except those in row j, so row i and every other variable's row are still included
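To make the indexing concrete, here is a minimal sketch of what the two expressions select. The 3x3 matrix and its values are made up for illustration and are not taken from caret; only the indexing mirrors the two lines quoted above.

```r
# Toy absolute-correlation matrix with the diagonal set to NA
# (hypothetical values, chosen only to illustrate the indexing).
x2 <- matrix(c(NA,  0.9, 0.2,
               0.9, NA,  0.4,
               0.2, 0.4, NA),
             nrow = 3, byrow = TRUE)
i <- 1
j <- 2

row_i     <- x2[i, ]   # numeric vector: row i only
all_but_j <- x2[-j, ]  # matrix: every row except row j (rows 1 and 3 here)

dim(all_but_j)                        # 2 rows, 3 columns
mn1 <- mean(row_i, na.rm = TRUE)      # (0.9 + 0.2) / 2
mn2 <- mean(all_but_j, na.rm = TRUE)  # (0.9 + 0.2 + 0.2 + 0.4) / 4
```

So a negative index removes row j rather than selecting it, and mn2 ends up averaging over rows that include row i itself.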
I’m not sure whether this is intended. Should the algorithm instead compare the mean correlations of the two highly correlated variables directly to decide which one to remove? For example:
mn1 <- mean(x2[i,], na.rm = TRUE)
mn2 <- mean(x2[j,], na.rm = TRUE)
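As a rough check, the sketch below evaluates both definitions of mn2 on a toy matrix. The values are hypothetical, i and j are picked by hand (the reordering done earlier in findCorrelation_exact is not reproduced), and it assumes the variable with the larger mean is the one that gets dropped.

```r
# Hypothetical absolute correlations among four variables A-D, diagonal set to NA.
x2 <- matrix(c(NA,   0.95, 0.2, 0.1,
               0.95, NA,   0.6, 0.5,
               0.2,  0.6,  NA,  0.3,
               0.1,  0.5,  0.3, NA),
             nrow = 4, byrow = TRUE,
             dimnames = list(LETTERS[1:4], LETTERS[1:4]))
i <- 1  # A
j <- 2  # B, correlated with A at 0.95

mn1          <- mean(x2[i, ],  na.rm = TRUE)  # row A only
mn2_current  <- mean(x2[-j, ], na.rm = TRUE)  # all rows except B
mn2_proposed <- mean(x2[j, ],  na.rm = TRUE)  # row B only

# Assumed decision rule: TRUE means "drop i", FALSE means "drop j".
drop_i_current  <- mn1 > mn2_current
drop_i_proposed <- mn1 > mn2_proposed
```

Here mn1 is about 0.417, mn2_current about 0.361, and mn2_proposed about 0.683, so under that assumed rule the first comparison flags A while the second flags B.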
Could someone confirm whether the current behavior is intended or whether it should be corrected?