Skip to content

Conversation

@eganhila
Copy link

This fixes a bug in my use case where particles were not being output correctly for RHybrid+corsair in restart files. Previously simulations would either quietly write out particles in nonsense locations leaving large empty gaps in the correct file offset or die with a segfault, depending on the MPI implementation (OpenMpi, intel mpi). This was triggered anytime a processor had more than the number of particles that would correspond to the systems MaxBytesPerWrite.

Copy link
Contributor

@sandroos sandroos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically you're saying that dataSize is different than datatypeByteSize in your case? If this is the case it might create some other issues too / there may be a bug elsewhere in hyb/corsair. Can you post an example of the data you're trying to write.

@eganhila
Copy link
Author

So when using corsair+rhybrid, the dataSize ends up being 56 while the datatypeByteSize is 1. While the true particle size is 56, this is already wrapped up in amount. I thought this was the intended behavior because of the discussion in lines 236-244 of corsair/src/kernel/restart_writer.cpp

// NOTE: addMultiwriteUnit function call here binds to the template wrapper function // which creates an MPI datatype for char pointer, when it should be a continuous // array of chars with arrayByteSizes[i] elements. This is corrected by writing // arraySizes[i][c]*arrayByteSizes[i] elements instead of arraySizes[i][c].

So amount contains 56*actual particle number, and the datatypeByteSize is given as 1, but then because the datatype vlsv gets is "unknown", it interprets dataSize as 56. It all seems to output correctly if I switch the increment to unitOffset (checking using analysator/pyVlsv), but I think there is still a problem with reading in the subsequent particle data even when combined with PR#31.

I'm guessing based on your reaction that this is not exactly how this is supposed to work though?

To test this all I've been runnning isolated test simulations where I just inject 1000 particles per grid cell each time step with a small velocity so they remain approximately in the same place, and watch when the outputs start to break down. My test script is here, and the modifications I've made to rhybrid to the uniform injector are here. I've also been working with corsair@ commit 67654ed398, and rhybrid@ commit 2b6c32585. I can share the subsequent datafiles, but they're quite large (since the particle data has to approach maxBytesPerWrite to trigger this), so let me know how would be best for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants