Conversation
Pull Request Overview
This PR fixes MPI consistency issues in the DALIA framework by adding consistency checks and broadcasts to ensure all MPI ranks operate on the same model parameters. The changes prevent "negative eigenvalues" errors that were occurring due to inconsistent model states across ranks.
Key changes:
- Added a `check_vector_consistency` function to verify all MPI ranks have identical parameter vectors
- Added broadcast operations to synchronize the `theta_star` and `x_star` parameters after minimization and covariance estimation
- Added consistency checks at critical points in the workflow to catch parameter divergence early
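The comparison at the heart of such a check can be sketched without MPI; the helper name and tolerance below are illustrative assumptions, not DALIA's actual implementation:

```python
import numpy as np

def vectors_consistent(theta, theta_ref, tol=1e-10):
    """Return True when the local vector matches the reference within tol.

    In the MPI version, theta_ref would be the root rank's vector
    (obtained via comm.bcast); here it is passed in directly.
    """
    norm_diff = np.linalg.norm(np.asarray(theta) - np.asarray(theta_ref))
    return norm_diff <= tol

theta_ref = np.ones(5)
theta_bad = theta_ref.copy()
theta_bad[0] = 42.0  # mimic the divergence seen on one rank
print(vectors_consistent(theta_ref, theta_ref))  # True
print(vectors_consistent(theta_bad, theta_ref))  # False
```

In the distributed setting, each rank would run this comparison against the broadcast reference and raise if the norm difference exceeds the tolerance.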
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/dalia/utils/multiprocessing.py | Implements new check_vector_consistency function with test code and adds commented debug print |
| src/dalia/utils/__init__.py | Exports the new consistency check function |
| src/dalia/core/dalia.py | Adds consistency checks and broadcasts throughout the optimization workflow |
src/dalia/utils/multiprocessing.py (Outdated)

```python
        The communication group. Default is MPI.COMM_WORLD.
    """

    # print("Broadcasting data from root:", root, "to all processes.")
```

This commented debug print statement should be removed. Debug prints should not be committed to production code unless they serve a specific documented purpose.

Suggested change:
```diff
-    # print("Broadcasting data from root:", root, "to all processes.")
```
src/dalia/utils/multiprocessing.py (Outdated)

```python
if __name__ == "__main__":
    # Initialize MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    mpi_size = comm.Get_size()

    # Create a vector, intentionally make rank 1 different
    theta = np.ones(5)
    if backend_flags["mpi_avail"] and rank == 1:
        theta[0] = 42  # Make it inconsistent on rank 1

    check_vector_consistency(theta, comm)
```

Test code in the `__main__` block should not be committed to production code. This testing logic should be moved to a proper test file or removed entirely.

Suggested change: remove the entire `__main__` block shown above.
src/dalia/utils/multiprocessing.py (Outdated)

```python
# theta_host = get_host(theta)
# theta_ref_host = get_host(theta_ref)

# if not np.array_equal(theta_host, theta_ref_host):
#     norm_diff = np.linalg.norm(theta_host - theta_ref_host)
#     raise ValueError(
#         f"Process {comm.Get_rank()} has a different theta than the reference process."
#         f" Expected: {theta_ref_host}, but got: {theta_host}. diff = {norm_diff:.4e}"
#     )
```

Large blocks of commented code should be removed. If this alternative implementation is needed for future reference, it should be documented or moved to a separate development branch.

Suggested change: delete the commented-out block shown above.
```python
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)

# compute covariance of the hyperparameters theta at the mode
cov_theta = self.compute_covariance_hp(theta_star)

# need to update theta_star and x_star to be the same across all ranks
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
```

Note that in mpi4py the lowercase `bcast` pickles the root's object and returns a new one; it does not modify the input array in place. The slice assignment is therefore what actually copies the root's values into the existing buffers on non-root ranks (on the root it is a harmless self-copy). If the pickling overhead matters, the buffer-based `Bcast` broadcasts NumPy arrays in place:

Suggested change:
```diff
-theta_star[:] = self.comm_world.bcast(theta_star, root=0)
-x_star[:] = self.comm_world.bcast(x_star, root=0)
+self.comm_world.Bcast(theta_star, root=0)
+self.comm_world.Bcast(x_star, root=0)
```
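Why the slice assignment matters can be illustrated with plain NumPy (no MPI needed): rebinding a name to a new object leaves other references to the old buffer stale, while slice assignment updates the existing buffer in place. Here `received` stands in for the object `comm.bcast` returns:

```python
import numpy as np

theta_star = np.zeros(3)
model_view = theta_star  # another reference to the same buffer,
                         # like an array held inside the model

received = np.array([1.0, 2.0, 3.0])  # stand-in for comm.bcast's return value

# Plain rebinding: theta_star now points at a new object; model_view is stale.
theta_star = received
print(np.array_equal(model_view, received))  # False

# Slice assignment: copies into the original buffer, so every holder sees it.
theta_star = model_view
theta_star[:] = received
print(np.array_equal(model_view, received))  # True
```

This is why `theta_star[:] = self.comm_world.bcast(theta_star, root=0)` keeps any other objects holding the same array in sync, whereas a bare rebinding would not.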
```python
# need to update theta_star and x_star to be the same across all ranks
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
```

Same as above: `bcast` returns a new object rather than updating its argument in place, so the slice assignment is what synchronizes the existing buffers; the buffer-based `Bcast` would broadcast in place directly.

Suggested change:
```diff
-x_star[:] = self.comm_world.bcast(x_star, root=0)
+self.comm_world.Bcast(x_star, root=0)
```
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```python
if norm_diff > 1e-10:
    raise ValueError(
        f"Process {comm.Get_rank()} has a different theta than the reference process."
        f" Expected: {theta_ref}, but got: {theta}. diff = {norm_diff:.4e}"
    )
```

The hardcoded tolerance value `1e-10` should be made configurable as a parameter to allow flexibility for different precision requirements and array types.
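One way to make the tolerance configurable is a keyword argument with a dtype-aware default; the `1e4` scaling factor below is an illustrative assumption, not a recommendation, and this sketch is not DALIA's implementation:

```python
import numpy as np

def check_consistency(theta, theta_ref, tol=None):
    """Raise ValueError if theta differs from theta_ref by more than tol.

    When tol is None, derive it from the dtype's machine epsilon, so
    float32 inputs automatically get a looser tolerance than float64.
    Assumes floating-point input arrays.
    """
    theta = np.asarray(theta)
    theta_ref = np.asarray(theta_ref)
    if tol is None:
        tol = 1e4 * np.finfo(theta.dtype).eps  # illustrative scaling choice
    norm_diff = np.linalg.norm(theta - theta_ref)
    if norm_diff > tol:
        raise ValueError(f"inconsistent vectors, diff = {norm_diff:.4e}")

check_consistency(np.ones(5), np.ones(5))           # passes silently
check_consistency(np.ones(5), np.ones(5), tol=0.0)  # diff is exactly 0
```

Callers with known precision requirements can pass `tol` explicitly; everyone else gets a default proportional to the input dtype's epsilon.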
```python
def check_vector_consistency(
    theta: ArrayLike,
    comm,
):
```

Missing type annotation for the `comm` parameter. It should be annotated with the appropriate communicator type (e.g. `MPI.Comm`) for consistency with other functions in the module.
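A hedged sketch of how the signature could be annotated; the `TYPE_CHECKING` guard keeps the module importable even when mpi4py is absent, and the body here is a stub, not DALIA's implementation:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

from numpy.typing import ArrayLike

if TYPE_CHECKING:
    from mpi4py import MPI  # imported only by static type checkers

def check_vector_consistency(
    theta: ArrayLike,
    comm: MPI.Comm,
) -> None:
    ...  # comparison logic as before

print(check_vector_consistency.__annotations__["comm"])  # MPI.Comm
```

With `from __future__ import annotations`, the `MPI.Comm` annotation is evaluated lazily as a string, so no MPI installation is required at import time.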
…or rescaling function for hyperparameters within prior hyperparameter class and from where it needs to be called
…al ordering and no pivoting numerically quite unstable. Only seems to work for very small examples
…king now for minimize() call for gr with gamma prior
Added consistency checks after the minimize & covariance estimation of the hyperparameters so that all ranks operate on the same model.theta and model.x
This resolves the issues of "negative eigenvalues" detected for the current examples.