Fixing bug in multiwrite with multiple MPI writes #37

eganhila · 2018-11-21T21:16:32Z

This fixes a bug in my use case where particles were not being written correctly to restart files for RHybrid+corsair. Previously, simulations would either silently write particles at nonsense locations, leaving large empty gaps at the correct file offset, or die with a segfault, depending on the MPI implementation (Open MPI, Intel MPI). This was triggered any time a processor held more particles than correspond to the system's MaxBytesPerWrite.
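To make the trigger condition concrete, here is a minimal sketch of how many write passes one process needs. The helper `writePassesNeeded` and its parameters are hypothetical names for illustration, not part of the vlsv API. It assumes only that a write is split whenever the payload exceeds the per-write byte limit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a vlsv function: number of write passes one
// process needs if a single pass may carry at most maxBytesPerWrite bytes.
std::uint64_t writePassesNeeded(std::uint64_t particleCount,
                                std::uint64_t bytesPerParticle,
                                std::uint64_t maxBytesPerWrite) {
    assert(maxBytesPerWrite > 0);
    const std::uint64_t totalBytes = particleCount * bytesPerParticle;
    // Ceiling division: a partial final chunk still costs one pass.
    return (totalBytes + maxBytesPerWrite - 1) / maxBytesPerWrite;
}
```

Any result above 1 means the data goes through the multi-pass path, which is where the misplaced offsets described here would show up.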

…correctly

sandroos

So basically you're saying that dataSize is different from datatypeByteSize in your case? If so, it might create some other issues too, or there may be a bug elsewhere in hyb/corsair. Can you post an example of the data you're trying to write?

eganhila · 2018-11-29T17:12:10Z

So when using corsair+rhybrid, the dataSize ends up being 56 while the datatypeByteSize is 1. The true particle size is 56, but this is already wrapped up in amount. I thought this was the intended behavior because of the discussion in lines 236-244 of corsair/src/kernel/restart_writer.cpp:

// NOTE: addMultiwriteUnit function call here binds to the template wrapper function
// which creates an MPI datatype for char pointer, when it should be a continuous
// array of chars with arrayByteSizes[i] elements. This is corrected by writing
// arraySizes[i][c]*arrayByteSizes[i] elements instead of arraySizes[i][c].

So amount contains 56 times the actual particle count and the datatypeByteSize is given as 1, but because the datatype vlsv receives is "unknown", it interprets dataSize as 56. Everything seems to be written correctly if I switch the increment to unitOffset (checked using analysator/pyVlsv), but I think there is still a problem with reading the subsequent particle data back in, even when combined with PR #31.
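The mismatch can be illustrated with a small sketch. It is not the vlsv implementation: `chunkStartOffsets` and its parameters are hypothetical, and it assumes the file offset between write passes is advanced by the number of elements just written times a per-element stride. The elements here are the 1-byte chars counted by amount.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: byte offsets (relative to the start of this process's
// data) at which successive write passes begin. `amount` counts 1-byte
// elements; `strideBytes` is the per-element size used to advance the offset.
std::vector<std::uint64_t> chunkStartOffsets(std::uint64_t amount,
                                             std::uint64_t elementsPerChunk,
                                             std::uint64_t strideBytes) {
    assert(elementsPerChunk > 0);
    std::vector<std::uint64_t> starts;
    std::uint64_t offset = 0;
    std::uint64_t remaining = amount;
    while (remaining > 0) {
        const std::uint64_t n = std::min(remaining, elementsPerChunk);
        starts.push_back(offset);
        offset += n * strideBytes;  // stride must match the bytes actually written
        remaining -= n;
    }
    return starts;
}
```

For 100 particles (amount = 5600) written in passes of 2800 elements, a stride of 1 starts the second pass at byte 2800, directly after the first. A stride of 56 starts it at byte 156800, leaving a gap of the kind described in this thread.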

I'm guessing based on your reaction that this is not exactly how this is supposed to work though?

To test all this, I've been running isolated test simulations where I just inject 1000 particles per grid cell each time step with a small velocity, so they remain approximately in the same place, and watching for when the outputs start to break down. My test script is here, and the modifications I've made to the uniform injector in rhybrid are here. I've also been working with corsair@ commit 67654ed398, and rhybrid@ commit 2b6c32585. I can share the resulting data files, but they're quite large (since the particle data has to approach maxBytesPerWrite to trigger this), so let me know what would work best for you.

Changing offset amount for multiwrites to reflect datatype byte size …

28ced2d

…correctly

sandroos reviewed Nov 27, 2018

rjarvinen mentioned this pull request Dec 2, 2025

Restart does not work or is not reliable fmihpc/rhybrid#11

Open
