Mpi issues by lisa-gm · Pull Request #124 · dalia-project/DALIA

lisa-gm · 2025-09-12T14:49:16Z

Added consistency checks after the minimize() call and the covariance estimation of the hyperparameters so that all ranks operate on the same model.theta and model.x.
This resolves the issues of "negative eigenvalues" detected for the current examples.

Copilot

Pull Request Overview

This PR fixes MPI consistency issues in the DALIA framework by adding consistency checks and broadcasts to ensure all MPI ranks operate on the same model parameters. The changes prevent "negative eigenvalues" errors that were occurring due to inconsistent model states across ranks.

Key changes:

Added check_vector_consistency function to verify all MPI ranks have identical parameter vectors
Added broadcast operations to synchronize theta_star and x_star parameters after minimization and covariance estimation
Added consistency checks at critical points in the workflow to catch parameter divergence early

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
src/dalia/utils/multiprocessing.py	Implements new `check_vector_consistency` function with test code and adds commented debug print
src/dalia/utils/__init__.py	Exports the new consistency check function
src/dalia/core/dalia.py	Adds consistency checks and broadcasts throughout the optimization workflow

Copilot · 2025-09-12T14:50:36Z

src/dalia/utils/multiprocessing.py

        The communication group. Default is MPI.COMM_WORLD.
    """
+
+    # print("Broadcasting data from root:", root, "to all processes.")


This commented debug print statement should be removed. Debug prints should not be committed to production code unless they serve a specific documented purpose.

Suggested change

-    # print("Broadcasting data from root:", root, "to all processes.")

Copilot · 2025-09-12T14:50:37Z

src/dalia/utils/multiprocessing.py

+if __name__ == "__main__":
+
+    # Initialize MPI
+    comm = MPI.COMM_WORLD
+    rank = comm.Get_rank()  
+    mpi_size = comm.Get_size()
+
+    # Create a vector, intentionally make rank 1 different
+    theta = np.ones(5)
+    if backend_flags["mpi_avail"] and rank == 1:
+        theta[0] = 42  # Make it inconsistent on rank 1
+
+    check_vector_consistency(theta, comm)


Test code in the __main__ block should not be committed to production code. This testing logic should be moved to a proper test file or removed entirely.

Suggested change

-if __name__ == "__main__":
-
-    # Initialize MPI
-    comm = MPI.COMM_WORLD
-    rank = comm.Get_rank()
-    mpi_size = comm.Get_size()
-
-    # Create a vector, intentionally make rank 1 different
-    theta = np.ones(5)
-    if backend_flags["mpi_avail"] and rank == 1:
-        theta[0] = 42  # Make it inconsistent on rank 1
-
-    check_vector_consistency(theta, comm)

Copilot · 2025-09-12T14:50:37Z

src/dalia/utils/multiprocessing.py

+    # theta_host = get_host(theta)
+    # theta_ref_host = get_host(theta_ref)
+
+    # if not np.array_equal(theta_host, theta_ref_host):
+    #     norm_diff = np.linalg.norm(theta_host - theta_ref_host)
+    #     raise ValueError(
+    #         f"Process {comm.Get_rank()} has a different theta than the reference process."
+    #         f" Expected: {theta_ref_host}, but got:  {theta_host}. diff = {norm_diff:.4e}"
+    #     )
+


Large blocks of commented code should be removed. If this alternative implementation is needed for future reference, it should be documented or moved to a separate development branch.

Suggested change

-    # theta_host = get_host(theta)
-    # theta_ref_host = get_host(theta_ref)
-
-    # if not np.array_equal(theta_host, theta_ref_host):
-    #     norm_diff = np.linalg.norm(theta_host - theta_ref_host)
-    #     raise ValueError(
-    #         f"Process {comm.Get_rank()} has a different theta than the reference process."
-    #         f" Expected: {theta_ref_host}, but got:  {theta_host}. diff = {norm_diff:.4e}"
-    #     )

Copilot · 2025-09-12T14:50:37Z

src/dalia/core/dalia.py

+        theta_star[:] = self.comm_world.bcast(theta_star, root=0)
+        x_star[:] = self.comm_world.bcast(x_star, root=0)
+
        # compute covariance of the hyperparameters theta at the mode
        cov_theta = self.compute_covariance_hp(theta_star)

+        # need to update theta_star and x_star to be the same across all ranks
+        theta_star[:] = self.comm_world.bcast(theta_star, root=0)
+        x_star[:] = self.comm_world.bcast(x_star, root=0)


The bcast function modifies the input array in-place, so assigning the result back to the slice is redundant and potentially confusing. Either use self.comm_world.bcast(theta_star, root=0) without assignment, or use theta_star = self.comm_world.bcast(theta_star, root=0) without slice notation.

Suggested change

-        theta_star[:] = self.comm_world.bcast(theta_star, root=0)
-        x_star[:] = self.comm_world.bcast(x_star, root=0)
-
-        # compute covariance of the hyperparameters theta at the mode
-        cov_theta = self.compute_covariance_hp(theta_star)
-
-        # need to update theta_star and x_star to be the same across all ranks
-        theta_star[:] = self.comm_world.bcast(theta_star, root=0)
-        x_star[:] = self.comm_world.bcast(x_star, root=0)
+        self.comm_world.bcast(theta_star, root=0)
+        self.comm_world.bcast(x_star, root=0)
+
+        # compute covariance of the hyperparameters theta at the mode
+        cov_theta = self.compute_covariance_hp(theta_star)
+
+        # need to update theta_star and x_star to be the same across all ranks
+        self.comm_world.bcast(theta_star, root=0)
+        self.comm_world.bcast(x_star, root=0)
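
Whether the slice assignment is redundant depends on what self.comm_world is, which the diff does not show. If it is an mpi4py communicator (an assumption), the lowercase bcast pickles the object and returns the root's copy rather than writing into the caller's array; only the uppercase, buffer-based Bcast fills its argument in place. Under that assumption, dropping the assignment would leave non-root ranks with their old values. The sketch below uses a hypothetical single-process stand-in (StandInComm is not part of DALIA or mpi4py) and plain lists to illustrate the difference between discarding the return value and assigning it through a slice.

```python
import copy


class StandInComm:
    """Hypothetical single-process stand-in for a communicator whose
    lowercase bcast returns the root's object (as mpi4py's pickle-based
    bcast does) instead of writing into the caller's buffer."""

    def __init__(self, root_value):
        self._root_value = root_value

    def bcast(self, obj, root=0):
        # The object passed on a non-root rank is ignored; a copy of the
        # root's object is returned.
        return copy.deepcopy(self._root_value)


comm = StandInComm(root_value=[1.0, 2.0, 3.0])

# Pretend this is a non-root rank whose local values have drifted.
theta_star = [1.0, 2.0, 3.5]
alias = theta_star  # another reference to the same vector

# 1) Return value discarded: the local vector is unchanged.
comm.bcast(theta_star, root=0)
unchanged = list(theta_star)

# 2) Slice assignment: the existing object is overwritten in place, so
#    every reference to it (including alias) sees the root's values.
theta_star[:] = comm.bcast(theta_star, root=0)
```

With a return-value bcast, the slice form also has a practical difference from plain rebinding (theta_star = ...): it updates the existing array, so other references to the same vector stay in sync.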

Copilot · 2025-09-12T14:50:38Z

src/dalia/core/dalia.py


+        # need to update theta_star and x_star to be the same across all ranks
+        theta_star[:] = self.comm_world.bcast(theta_star, root=0)
+        x_star[:] = self.comm_world.bcast(x_star, root=0)


Same issue as above - the bcast function modifies the input array in-place, so assigning the result back to the slice is redundant and potentially confusing. Either use self.comm_world.bcast(theta_star, root=0) without assignment, or use theta_star = self.comm_world.bcast(theta_star, root=0) without slice notation.

Suggested change

-        x_star[:] = self.comm_world.bcast(x_star, root=0)
+        self.comm_world.bcast(x_star, root=0)

Copilot

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot · 2025-09-12T14:55:23Z

src/dalia/utils/multiprocessing.py

+    if norm_diff > 1e-10:
+        raise ValueError(
+            f"Process {comm.Get_rank()} has a different theta than the reference process."
+            f" Expected: {theta_ref}, but got:  {theta}. diff = {norm_diff:.4e}"
+        )


The hardcoded tolerance value 1e-10 should be made configurable as a parameter to allow flexibility for different precision requirements and array types.
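
One way the tolerance could be exposed is sketched below. The helper name vectors_match and the parameters atol and rtol are hypothetical, not part of the PR; the default atol reproduces the hardcoded 1e-10. To stay self-contained the sketch works on plain sequences with the standard library, whereas the real function would presumably compute the norm on host arrays.

```python
import math


def vectors_match(theta, theta_ref, atol=1e-10, rtol=0.0):
    """Return (ok, norm_diff) for two equal-length sequences of floats.

    atol and rtol are hypothetical parameter names; with the defaults the
    check is equivalent to `norm_diff <= 1e-10`.
    """
    # Euclidean norm of the difference between the two vectors.
    norm_diff = math.dist(theta, theta_ref)
    # Scale the relative part of the tolerance by the reference norm.
    ref_norm = math.hypot(*theta_ref)
    return norm_diff <= atol + rtol * ref_norm, norm_diff


# Default tolerance flags a 1e-8 perturbation; a looser atol accepts it.
strict_ok, _ = vectors_match([1.0 + 1e-8, 1.0], [1.0, 1.0])
loose_ok, _ = vectors_match([1.0 + 1e-8, 1.0], [1.0, 1.0], atol=1e-6)
```

A relative term can be useful when hyperparameters span several orders of magnitude, since a purely absolute threshold may be too strict for large values.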

Copilot · 2025-09-12T14:55:23Z

src/dalia/utils/multiprocessing.py

+def check_vector_consistency(
+    theta: ArrayLike,
+    comm,
+):


Missing type annotation for the comm parameter. It should be annotated with the appropriate communicator type for consistency with other functions in the module.
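
A possible annotated signature is sketched below. It assumes the communicator is an mpi4py object and uses MPI.Comm as the type; the module may already define its own communicator alias, in which case that should be used instead. The body is a simplified illustration that assumes rank 0 holds the reference vector, and it uses plain sequences rather than NumPy or CuPy arrays.

```python
import math
from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:
    # Imported only for static type checking, so the module still loads
    # when mpi4py is not installed.
    from mpi4py import MPI


def check_vector_consistency(theta: Sequence[float], comm: "MPI.Comm") -> None:
    """Raise ValueError if this rank's theta differs from rank 0's."""
    # Assumption: the reference vector is rank 0's copy of theta.
    theta_ref = comm.bcast(theta, root=0)
    norm_diff = math.dist(theta, theta_ref)
    if norm_diff > 1e-10:
        raise ValueError(
            f"Process {comm.Get_rank()} has a different theta than the reference process."
            f" diff = {norm_diff:.4e}"
        )
```

The string annotation together with the TYPE_CHECKING guard keeps mpi4py optional at import time, which fits a code base that checks backend_flags["mpi_avail"] at runtime.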

lisa-gm added 4 commits July 24, 2025 17:31

added check consistency function. bcast somehow not working for cupy the way im calling it now

1f2a204

first steps in the direction of ar1

5a41257

ar1 model runs through. not sure if it computes the correct the thing or not

f568917

no print in broadcasting function

9d75a6e

lisa-gm requested a review from vincent-maillou as a code owner September 12, 2025 14:49

lisa-gm requested review from Copilot and vincent-maillou and removed request for vincent-maillou September 12, 2025 14:49

Copilot AI reviewed Sep 12, 2025

removed extra print statements

0cf7d77

lisa-gm requested a review from Copilot September 12, 2025 14:54

Copilot AI reviewed Sep 12, 2025

lisa-gm and others added 18 commits September 21, 2025 10:21

added print gs small example

6a1a3fa

gaussian ar1 seems to be working. still needs a bit of clean up

4ab26f4

added verbosity level to not always show timer, added initial draft for rescaling function for hyperparameters within prior hyperparameter class and from where it needs to be called

9af4a3c

updated selected inversion to be dense inverse. probably due to natural ordering and no pivoting numerically quite unstable. only seems to work for very small expamples

f5d4a0c

added doc string and normalizing constant for both gaussian priors

533fbe6

first real progress with the reparametrization issue. seems to be working now for minimize() call for gr with gamma prior

627fcff

work in progress, lots of quadrature. lots of test. nothing in core yet

4aa9ab5

incorporating new functions into dalia

06fe721

Merge branch 'gamma_prior' into mpi_issues

6a698e1

Update theta handling in check_vector_consistency

b645424

Check backend flags for cupy availability

cc01726

just updated run script

16e0178

fixed merge conflict

26b3053

resolved more merge conflicts

d7dbe35

incorporated copilot suggestions

3d50427

fixed it again ...

6339156

no print in broadcasting function

37514e9

removed extra print statements

420cd71

lisa-gm and others added 8 commits October 8, 2025 14:36

added print gs small example

faf8639

Update theta handling in check_vector_consistency

de81cfc

Check backend flags for cupy availability

b07ad00

incorporated copilot suggestions

beea79b

fixed it again ...

fba4ebb

did it resolve now ...

0bc030e

first version seems to be working for some priors

3ddbb6b

still fixing stuff

ba74de2

vincent-maillou mentioned this pull request Oct 27, 2025

Solver Testing and Addressed Bugs Along the Way #134

Merged

lisa-gm wants to merge 31 commits into dev from mpi_issues
