VeloC plugin #661
…rrectly read and, in case of non-failure, checkpointing is done but entirely using the decl_hdf5 plugin.
…ence and check the definition of a recovery file by storing it as a class data member (and querying its value) rather than using the length of the node.
…nality: If the datasets' paths depend on simulation parameters, the user can define the last dataset paths. If not, the same dataset paths as defined by the "datasets" key are assumed
…per around the VeloC library. Implemented functionalities of writing checkpoints and restoring the latest checkpoint. Added the CMake file to build PDI with the new plugin. Next step: testing for correctness.
… than on data expose because otherwise multiple checkpoint files were being written for the same iteration. Added tests to: 1) check correct writing of checkpoints 2) check correct restoration after a failure. Tests pass with basic requirements. These need to be expanded.
…ons and fixed naming mismatch in veloc_wrapper.cxx
…ype divided by the number of elements in the case of an array datatype. Added additional check to ensure an error is thrown if a checkpoint event is called before all data to be included in checkpoints has been exposed to PDI.
…nd tests to assert that the expected number of checkpoint files has been written. Added CMake changes to compile tests with the library, but it still does not work.
…has been made to store the configuration of the veloc plugin. The plugin creates an object of the configuration class and calls its getter methods to function correctly. Events of types RECOVER and SYNC_STATE have been added; recovering at the moment is not automatic on expose. The pdi example ran successfully but only testing checkpointing so far.
…ts of logging to debug recover_var, which is still not working.
… variable names of the VeloC plugin and changed minor logic: returning 1 rather than version for subsequent check of return value to be valid
…t. Removed the use of "assert" from existing tests.
…d warning bug in check_conforlity()
…var and recover_rest functionalities.
… expected yaml tree to make it easier for the user, and added a test for manual recovery of a previous checkpoint file
…over to be coherent with "custom_configuration"
…uired for temporary directories where tests are run
c257c27 to ce93f25
…an older version. Added VeloC plugin to the plugin related parts of the CMakeLists.txt and Source_installation.md
…ation number in order for the formula to calculate the next reduce iteration to be accurate. Also moved some plugin logic to not overload the logger
| cmake --build tests_api_mockfind ${MAKEFLAGS} | ||
| ctest --output-on-failure --timeout 90 ${CTEST_FLAGS} ${CTEST_DIR:+--output-junit "${CTEST_DIR}/tests_api_mockfind.xml"} ${EXCLUDED_PDI_TESTS:+-E $EXCLUDED_PDI_TESTS} --test-dir tests_api_mockfind | ||
| fi | ||
| fi No newline at end of file |
Synchronisation issue with main; this change can be removed.
| #include <mpi.h> | ||
| #include <iostream> | ||
| #include <assert.h> |
Unify the includes across all new tests; this include of the unused assert.h can be removed.
| void write_checkpoint(PDI::Context& ctx, std::string label, int version); | ||
| int read_checkpoint(PDI::Context& ctx, std::string label, int cp_id); // is this needed? can't remember anymore |
Referenced two times in veloc.cxx at L210 and L223:
https://github.com/iole-bolognesi/pdi/blob/d0548ce8502859d015bad6fc3755d538c7f5b78d/plugins/VeloC/veloc.cxx#L210
| Dependencies of **the VeloC plugin**: | ||
| * the PDI library, | ||
| * the [VeloC](https://veloc.readthedocs.io/en/latest/userguide.html) library version 1.8 or above (not provided) |
| * the [VeloC](https://veloc.readthedocs.io/en/latest/userguide.html) library version 1.8 or above (not provided) | |
| * the [VeloC](https://veloc.readthedocs.io/en/latest/userguide.html) library version 1.8 or above (not provided), |
| static void error_handler(PDI_status_t status, const char* message, void* ctx) | ||
| { | ||
| if (status) { | ||
| std::cerr << "[PDI error] " << message << "\n"; |
Consider removing std::cerr and std::cout from all tests and using fprintf to stderr instead (cout output is not needed with ctest). See the HDF5 tests for reference. EXPECT_EQ could be used if the tests move to gtest.
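A minimal sketch of the fprintf variant suggested above, assuming the handler keeps its current shape. `Status` is a hypothetical stand-in for PDI_status_t so that the snippet builds without PDI, and the error counter is an illustrative addition that a test could inspect; neither is part of the PR.

```cpp
#include <cstdio>

// Hypothetical stand-in for PDI_status_t so that the sketch builds without PDI.
enum Status { STATUS_OK = 0, STATUS_ERR = 1 };

// Illustrative counter (not in the PR) that a test can inspect afterwards.
static int g_error_count = 0;

// Same shape as the handler in the test, but writing to stderr with fprintf.
static void error_handler(Status status, const char* message, void* /*ctx*/)
{
    if (status) {
        std::fprintf(stderr, "[PDI error] %s\n", message);
        ++g_error_count;
    }
}
```

With this form the tests no longer need <iostream>, and ctest still captures the message through `--output-on-failure`.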
| #include <pdi/pdi_fwd.h> | ||
| #include <pdi/context.h> | ||
| #include <pdi/context_proxy.h> | ||
| #include <pdi/data_descriptor.h> | ||
| #include <pdi/datatype.h> | ||
| #include <pdi/error.h> | ||
| #include <pdi/expression.h> | ||
| #include <pdi/plugin.h> | ||
| #include <pdi/ref_any.h> | ||
| #include <iostream> | ||
| #include <optional> | ||
| #include "veloc_wrapper.h" | ||
| using PDI::Context; | ||
| using PDI::Ref_r; | ||
| using std::string; |
All of these are unused, except for #include "veloc_wrapper.h" and using std::string;
(also, with using std::string; in place, the std:: prefix is not needed on string in this file)
Most std::string parameters in this file could be const std::string, since they are only read.
| } | ||
| } | ||
| int read_checkpoint(PDI::Context& ctx, std::string label, int version) |
Incompatibility of read_checkpoint() and init_restart() for iteration index 0 from VELOC_Restart_test()
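If the incompatibility comes from a returned version of 0 being read as "no checkpoint", one option is to carry the result as a std::optional. The sketch below assumes that a VELOC_Restart_test-like call signals "nothing found" with a negative value (to be checked against VELOC_FAILURE in veloc.h). `NO_CHECKPOINT`, `latest_version` and `should_restart` are hypothetical names, not part of the plugin.

```cpp
#include <optional>

// Assumption: a negative value means "no checkpoint found"; the real
// constant to compare against would be VELOC_FAILURE from veloc.h.
constexpr int NO_CHECKPOINT = -1;

// Converts the raw result of a VELOC_Restart_test-like call into an optional
// version, so that version 0 is never confused with "nothing found".
std::optional<int> latest_version(int restart_test_result)
{
    if (restart_test_result < 0) {
        return std::nullopt;
    }
    return restart_test_result;
}

// Hypothetical helper: true whenever a checkpoint exists, including version 0.
bool should_restart(int restart_test_result)
{
    return latest_version(restart_test_result).has_value();
}
```

Both read_checkpoint() and init_restart() could then test `has_value()` rather than comparing the version against 0.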
| void protect_data(PDI::Context& ctx, int id, void* ptr, size_t n, size_t sub_bytes) | ||
| { | ||
| if (VELOC_Mem_protect(id, ptr, n, sub_bytes) != VELOC_SUCCESS) { | ||
| ctx.logger().error("Memory protect failed for id {} with ptr = {} and size = {}", id, ptr, (n * sub_bytes)); |
This logs the error but does not stop here; could that corrupt memory further? Should it throw an error directly? More exit(2); calls may be warranted in this file.
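A sketch of the "throw directly" option, under stated assumptions: `fake_mem_protect` and the `FAKE_*` constants are hypothetical stand-ins for VELOC_Mem_protect and VELOC_SUCCESS so that the snippet builds without veloc.h, and std::runtime_error stands in for whichever error type the plugin would actually raise.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so that the sketch builds without veloc.h.
constexpr int FAKE_SUCCESS = 0;
constexpr int FAKE_FAILURE = -1;

// Stand-in for VELOC_Mem_protect: fails when given a null pointer.
int fake_mem_protect(int /*id*/, void* ptr, std::size_t /*n*/, std::size_t /*sub_bytes*/)
{
    return ptr ? FAKE_SUCCESS : FAKE_FAILURE;
}

// Throws instead of only logging, so that a failed registration cannot be
// followed by a checkpoint that silently misses this memory region.
void protect_data(int id, void* ptr, std::size_t n, std::size_t sub_bytes)
{
    if (fake_mem_protect(id, ptr, n, sub_bytes) != FAKE_SUCCESS) {
        throw std::runtime_error(
            "Memory protect failed for id " + std::to_string(id)
            + " and size = " + std::to_string(n * sub_bytes));
    }
}
```

Compared with exit(2), an exception leaves the caller the choice of aborting or skipping the checkpoint.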
| void unprotect_data(PDI::Context& ctx, int id) | ||
| { | ||
| if (VELOC_Mem_unprotect(id) != VELOC_SUCCESS) { | ||
| ctx.logger().error("Memory unprotect failed for id {}", id); |
| void end_checkpoint(PDI::Context& ctx) | ||
| { | ||
| if (VELOC_Checkpoint_end(1) != VELOC_SUCCESS) { |
The error may need to be propagated if the user made a mistake between START_CHECKPOINT and END_CHECKPOINT, so that the success status passed at END_CHECKPOINT is accurate and an invalid simulation checkpoint is not committed.
| if (VELOC_Checkpoint_end(1) != VELOC_SUCCESS) { | |
| if (VELOC_Checkpoint_end(success ? 1 : 0) != VELOC_SUCCESS) { |
(needs a change in Event_type::END_CHECKPOINT)
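The suggested change could be wired up roughly as follows. This is only a sketch: `Checkpoint_session` is a hypothetical helper, and `fake_checkpoint_end` is a stand-in for VELOC_Checkpoint_end(int success) that records the flag it receives so the behaviour can be checked without VeloC.

```cpp
// Hypothetical stand-in for VELOC_Checkpoint_end(int success); it records
// the flag it was given so that the behaviour can be checked without VeloC.
static int g_last_end_flag = -1;

int fake_checkpoint_end(int success)
{
    g_last_end_flag = success;
    return 0;
}

// Tracks whether every step since START_CHECKPOINT succeeded.
class Checkpoint_session
{
    bool m_ok = true;

public:
    // Called on START_CHECKPOINT: resets the accumulated status.
    void start() { m_ok = true; }

    // Called after each intermediate step; one failure taints the session.
    void record(bool step_ok) { m_ok = m_ok && step_ok; }

    // Called on END_CHECKPOINT: forwards the accumulated status.
    int end() { return fake_checkpoint_end(m_ok ? 1 : 0); }
};
```

In the plugin, the equivalent of `record()` would be fed from the return values of the calls made between the two events.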
List of things to check before making a PR
Before merging your code, please check the following:
….clang-format;
…Fix #issue keyword to autoclose the issue when merged.