Skip to content

Conversation

@dkarletsos
Copy link

The existing implementation returned a generic error message when multicollinearity caused issues with the correlation matrix. This PR enhances the add_integration function by adding functionality to detect and report multicollinearity in covariates by looking at the eigenvalues of the correlation matrix. When multicollinearity is present, the function now returns a clear and informative error message to help users identify the problematic covariates.

Key Changes:

  • Added a check for multicollinearity in covariates based on the correlation matrix eigenvalues.
  • Provided a specific error message when multicollinearity is detected, including the names of the collinear covariates, based on near-zero eigenvalues and their corresponding variable relationships.
  • Introduced a new test case in test-add_integration.R to verify the behavior of the function when multicollinearity exists.

choxos added a commit to choxos/multinma that referenced this pull request Jul 21, 2025
…lippo#53, dmphillippo#54, dmphillippo#55

- Add multicollinearity detection in add_integration() (dmphillippo#49)
  * Comprehensive detection system with multiple validation points
  * Specific variable identification and remediation guidance
  * 88 new lines of functionality + 263 lines of tests

- Implement TTE ranking warnings for auxiliary effects (dmphillippo#52)
  * Automatic detection of ranking ambiguity in survival models
  * Clear warnings about limitations and recommendations
  * Complete test coverage for all scenarios

- Return knot locations from spline models (dmphillippo#53)
  * Automatic inclusion for mspline and pexp models
  * Preservation of user-specified configurations
  * Enhanced sensitivity analysis capabilities
  * Updated class documentation

- Add comprehensive MCMC convergence diagnostics (dmphillippo#54)
  * Five new diagnostic functions with bayesplot integration
  * 306 lines of new functionality + 285 lines of tests
  * Automatic parameter selection and multinma theming
  * Complete suite: trace, ACF, R-hat, n_eff, combined plots

- Fix exchangeable classes with single treatments (dmphillippo#55)
  * Fixed array bounds checking across 8 Stan model files
  * Prevents runtime errors for single-treatment classes
  * 16 total instances fixed with proper validation
  * Comprehensive edge case testing

All fixes include:
- 1,000+ lines of new functionality
- 1,000+ lines of comprehensive test coverage
- Full backward compatibility maintained
- Production-ready implementation
- Complete documentation updates
choxos added a commit to choxos/multinma that referenced this pull request Jul 21, 2025
Issues resolved:
- dmphillippo#49: Enhanced multicollinearity detection in add_integration()
- dmphillippo#52: TTE parametric distribution ranking warnings
- dmphillippo#53: Return knot locations from spline-based likelihoods
- dmphillippo#54: MCMC convergence diagnostics functions
- dmphillippo#55: Fix exchangeable classes with single treatments

Changes include:
- 306 lines of new MCMC diagnostic functionality
- 88 lines of multicollinearity detection with comprehensive validation
- Automatic knot location storage for mspline/pexp models
- Warning system for TTE ranking ambiguity
- 21 Stan file fixes for array bounds checking
- 1000+ lines of comprehensive test coverage
- Full backward compatibility maintained
- Production-ready implementation
Copy link
Owner

@dmphillippo dmphillippo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dkarletsos! Sorry this has taken me a while to get to. I've had a look over and made some suggestions - see what you think.

Comment on lines +167 to +177
# Check for multicollinearity using eigenvalues
eigenvalues <- eigen(cor, symmetric = TRUE)$values
if (any(is.na(eigenvalues) | eigenvalues <= 0)) {
abort(
glue::glue(
"Multicollinearity detected in the provided correlation matrix `cor`.\n",
"This often occurs when covariates are perfectly or nearly perfectly correlated.\n",
"Check for redundant variables or ensure no one-hot encoded variables sum to a constant."
)
)
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts:

  • We still need to catch the case where some entries are NA, otherwise eigen() fails cryptically. This can happen when a variable is constant.
  • I'm also not sure we can say that collinearity in data is definitely the cause (since cor may have come from anywhere).
  • We should probably use a tolerance on the posdef check; the generation of integration points fails if cor is numerically singular. The check below should catch the same numerical failures as copula::cCopula().
Suggested change
# Check for multicollinearity using eigenvalues
eigenvalues <- eigen(cor, symmetric = TRUE)$values
if (any(is.na(eigenvalues) | eigenvalues <= 0)) {
abort(
glue::glue(
"Multicollinearity detected in the provided correlation matrix `cor`.\n",
"This often occurs when covariates are perfectly or nearly perfectly correlated.\n",
"Check for redundant variables or ensure no one-hot encoded variables sum to a constant."
)
)
}
if (any(is.na(cor)))
abort(c("Missing values in `cor`", i = "Correlations may have been calculated for one or more constant variables"))
ev <- eigen(cor, symmetric = TRUE)$values
if (any(ev < 0)) abort("`cor` should be a correlation matrix or NULL")
if (any(ev == 0)) abort(c("`cor` is singular", i = "Correlations may have been calculated for variables which are collinear or perfectly correlated"))
for (i in 1:(ncol(cor)-1)) {
if (Matrix::rcond(cor[1:i, 1:i, drop = FALSE]) <= .Machine$double.eps)
abort(c("`cor` is computationally singular", i = "Correlations may have been calculated for variables which are (nearly) collinear or perfectly correlated"))
}

Comment on lines +687 to +720
test_that("add_integration detects multicollinearity in correlation matrix", {
df <- tibble::tibble(
covariate1 = c(0.1, 0.2, 0.3),
covariate2 = c(0.4, 0.5, 0.6),
covariate3 = c(0.7, 0.8, 0.9)
)

# Create a correlation matrix with multicollinearity (perfectly correlated variables)
cor_matrix <- matrix(
c(1, 1, 1,
1, 1, 1,
1, 1, 1),
nrow = 3,
ncol = 3
)
colnames(cor_matrix) <- c("covariate1", "covariate2", "covariate3")
rownames(cor_matrix) <- c("covariate1", "covariate2", "covariate3")

distr_args <- list(
covariate1 = distr(qnorm, mean = 0, sd = 1),
covariate2 = distr(qnorm, mean = 0, sd = 1),
covariate3 = distr(qnorm, mean = 0, sd = 1)
)

expect_error(
add_integration.data.frame(
df,
!!! distr_args,
cor = cor_matrix,
n_int = 64L
),
regexp = "Multicollinearity detected in the provided correlation matrix `cor`"
)
})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify the test setup a little, and check some more of the error cases.

Suggested change
test_that("add_integration detects multicollinearity in correlation matrix", {
df <- tibble::tibble(
covariate1 = c(0.1, 0.2, 0.3),
covariate2 = c(0.4, 0.5, 0.6),
covariate3 = c(0.7, 0.8, 0.9)
)
# Create a correlation matrix with multicollinearity (perfectly correlated variables)
cor_matrix <- matrix(
c(1, 1, 1,
1, 1, 1,
1, 1, 1),
nrow = 3,
ncol = 3
)
colnames(cor_matrix) <- c("covariate1", "covariate2", "covariate3")
rownames(cor_matrix) <- c("covariate1", "covariate2", "covariate3")
distr_args <- list(
covariate1 = distr(qnorm, mean = 0, sd = 1),
covariate2 = distr(qnorm, mean = 0, sd = 1),
covariate3 = distr(qnorm, mean = 0, sd = 1)
)
expect_error(
add_integration.data.frame(
df,
!!! distr_args,
cor = cor_matrix,
n_int = 64L
),
regexp = "Multicollinearity detected in the provided correlation matrix `cor`"
)
})
test_that("add_integration detects multicollinearity in correlation matrix", {
# Collinear
expect_error(
add_integration(data.frame(),
x1 = distr(qnorm, mean = 0, sd = 1),
x2 = distr(qnorm, mean = 0, sd = 1),
x3 = distr(qnorm, mean = 0, sd = 1),
cor = cor(cbind(1:4, 1:4, 1:4))),
"may have been calculated for variables which are collinear or perfectly correlated")
expect_error(
add_integration(data.frame(),
x1 = distr(qnorm, mean = 0, sd = 1),
x2 = distr(qnorm, mean = 0, sd = 1),
x3 = distr(qnorm, mean = 0, sd = 1),
cor = cor(cbind(1:4, -(1:4), 1:4))),
"may have been calculated for variables which are collinear or perfectly correlated")
# Numerically collinear
cc <- matrix(1 - 1e-16, 3, 3)
diag(cc) <- 1
expect_error(
add_integration(data.frame(),
x1 = distr(qnorm, mean = 0, sd = 1),
x2 = distr(qnorm, mean = 0, sd = 1),
x3 = distr(qnorm, mean = 0, sd = 1),
cor = cc),
"may have been calculated for variables which are \\(nearly\\) collinear or perfectly correlated")
# Constant
expect_error(
add_integration(data.frame(),
x1 = distr(qnorm, mean = 0, sd = 1),
x2 = distr(qnorm, mean = 0, sd = 1),
x3 = distr(qnorm, mean = 0, sd = 1),
cor = suppressWarnings(cor(cbind(1, 1:4, 1:4)))),
"may have been calculated for one or more constant variables")
})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants