Skip to content

Mpi issues#124

Open
lisa-gm wants to merge 31 commits intodevfrom
mpi_issues
Open

Mpi issues#124
lisa-gm wants to merge 31 commits intodevfrom
mpi_issues

Conversation

@lisa-gm
Copy link
Collaborator

@lisa-gm lisa-gm commented Sep 12, 2025

Added consistency checks after the minimize & covariance estimation of the hyperparameters so that all ranks operate on the same model.theta and model.x
This resolves the issues of "negative eigenvalues" detected for the current examples.

@lisa-gm lisa-gm requested review from Copilot and vincent-maillou and removed request for vincent-maillou September 12, 2025 14:49
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes MPI consistency issues in the DALIA framework by adding consistency checks and broadcasts to ensure all MPI ranks operate on the same model parameters. The changes prevent "negative eigenvalues" errors that were occurring due to inconsistent model states across ranks.

Key changes:

  • Added check_vector_consistency function to verify all MPI ranks have identical parameter vectors
  • Added broadcast operations to synchronize theta_star and x_star parameters after minimization and covariance estimation
  • Added consistency checks at critical points in the workflow to catch parameter divergence early

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/dalia/utils/multiprocessing.py Implements new check_vector_consistency function with test code and adds commented debug print
src/dalia/utils/init.py Exports the new consistency check function
src/dalia/core/dalia.py Adds consistency checks and broadcasts throughout the optimization workflow

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

The communication group. Default is MPI.COMM_WORLD.
"""

# print("Broadcasting data from root:", root, "to all processes.")
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented debug print statement should be removed. Debug prints should not be committed to production code unless they serve a specific documented purpose.

Suggested change
# print("Broadcasting data from root:", root, "to all processes.")

Copilot uses AI. Check for mistakes.
Comment on lines +259 to +271
if __name__ == "__main__":

# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
mpi_size = comm.Get_size()

# Create a vector, intentionally make rank 1 different
theta = np.ones(5)
if backend_flags["mpi_avail"] and rank == 1:
theta[0] = 42 # Make it inconsistent on rank 1

check_vector_consistency(theta, comm)
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test code in the __main__ block should not be committed to production code. This testing logic should be moved to a proper test file or removed entirely.

Suggested change
if __name__ == "__main__":
# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
mpi_size = comm.Get_size()
# Create a vector, intentionally make rank 1 different
theta = np.ones(5)
if backend_flags["mpi_avail"] and rank == 1:
theta[0] = 42 # Make it inconsistent on rank 1
check_vector_consistency(theta, comm)

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +245
# theta_host = get_host(theta)
# theta_ref_host = get_host(theta_ref)

# if not np.array_equal(theta_host, theta_ref_host):
# norm_diff = np.linalg.norm(theta_host - theta_ref_host)
# raise ValueError(
# f"Process {comm.Get_rank()} has a different theta than the reference process."
# f" Expected: {theta_ref_host}, but got: {theta_host}. diff = {norm_diff:.4e}"
# )

Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large blocks of commented code should be removed. If this alternative implementation is needed for future reference, it should be documented or moved to a separate development branch.

Suggested change
# theta_host = get_host(theta)
# theta_ref_host = get_host(theta_ref)
# if not np.array_equal(theta_host, theta_ref_host):
# norm_diff = np.linalg.norm(theta_host - theta_ref_host)
# raise ValueError(
# f"Process {comm.Get_rank()} has a different theta than the reference process."
# f" Expected: {theta_ref_host}, but got: {theta_host}. diff = {norm_diff:.4e}"
# )

Copilot uses AI. Check for mistakes.
Comment on lines +309 to +317
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)

# compute covariance of the hyperparameters theta at the mode
cov_theta = self.compute_covariance_hp(theta_star)

# need to update theta_star and x_star to be the same across all ranks
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bcast function modifies the input array in-place, so assigning the result back to the slice is redundant and potentially confusing. Either use self.comm_world.bcast(theta_star, root=0) without assignment, or use theta_star = self.comm_world.bcast(theta_star, root=0) without slice notation.

Suggested change
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
# compute covariance of the hyperparameters theta at the mode
cov_theta = self.compute_covariance_hp(theta_star)
# need to update theta_star and x_star to be the same across all ranks
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
self.comm_world.bcast(theta_star, root=0)
self.comm_world.bcast(x_star, root=0)
# compute covariance of the hyperparameters theta at the mode
cov_theta = self.compute_covariance_hp(theta_star)
# need to update theta_star and x_star to be the same across all ranks
self.comm_world.bcast(theta_star, root=0)
self.comm_world.bcast(x_star, root=0)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not true.


# need to update theta_star and x_star to be the same across all ranks
theta_star[:] = self.comm_world.bcast(theta_star, root=0)
x_star[:] = self.comm_world.bcast(x_star, root=0)
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above - the bcast function modifies the input array in-place, so assigning the result back to the slice is redundant and potentially confusing. Either use self.comm_world.bcast(theta_star, root=0) without assignment, or use theta_star = self.comm_world.bcast(theta_star, root=0) without slice notation.

Suggested change
x_star[:] = self.comm_world.bcast(x_star, root=0)
self.comm_world.bcast(x_star, root=0)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not true.

@lisa-gm lisa-gm requested a review from Copilot September 12, 2025 14:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +240 to +244
if norm_diff > 1e-10:
raise ValueError(
f"Process {comm.Get_rank()} has a different theta than the reference process."
f" Expected: {theta_ref}, but got: {theta}. diff = {norm_diff:.4e}"
)
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded tolerance value 1e-10 should be made configurable as a parameter to allow flexibility for different precision requirements and array types.

Copilot uses AI. Check for mistakes.
Comment on lines +214 to +217
def check_vector_consistency(
theta: ArrayLike,
comm,
):
Copy link

Copilot AI Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing type annotation for the comm parameter. It should be annotated with the appropriate communicator type for consistency with other functions in the module.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants