fix: restrict policy var save for distributed setup #491
jatinsharechat wants to merge 20 commits into tensorflow:master
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Hi @jatinsharechat, thanks for your contribution! The CLA needs to be signed; please follow the guidance: https://github.com/tensorflow/recommenders-addons/pull/491/checks?check_run_id=39908185786. cc @jq @MoFHeka
I've signed the CLA and the rescan is green.
The code format check is failing; you can run yapf to fix it.
def _save_de_model(self, filepath):
def _maybe_save_restrict_policy_params(de_var, proc_size=1, proc_rank=0):
Could we use a single _maybe_save_restrict_policy_params?
Since the code is pretty minimal and just calls de_var.save_to_file_system under the hood, I thought it might be okay to replicate the function.
Any suggestions on where to move the util function so the two can share it? Should callbacks.py just import tensorflow_recommenders_addons.dynamic_embedding.python.keras.models._maybe_save_restrict_policy_params and use it, or something else?
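One possible shape for a shared helper, as a rough sketch only. Everything below uses hypothetical stand-ins (FakeVariable, FakeRestrictPolicy, the restrict_var attribute name, the extra dirpath argument, and the two caller functions are all assumptions for illustration, not the actual TFRA code); the point is just that the model save path and the callback could both call one function instead of each carrying its own copy.

```python
# Hypothetical stand-ins for illustration; these are NOT the real TFRA classes.
class FakeVariable:
    def __init__(self, name, restrict_policy=None):
        self.name = name
        self.restrict_policy = restrict_policy
        self.save_calls = []  # records (dirpath, proc_size, proc_rank)

    def save_to_file_system(self, dirpath, proc_size=1, proc_rank=0):
        # Record the call instead of touching a real file system.
        self.save_calls.append((dirpath, proc_size, proc_rank))


class FakeRestrictPolicy:
    def __init__(self, restrict_var):
        self.restrict_var = restrict_var  # assumed attribute name


def maybe_save_restrict_policy_params(de_var, dirpath, proc_size=1, proc_rank=0):
    """Single shared helper: save the restrict-policy variable if one exists."""
    policy = getattr(de_var, "restrict_policy", None)
    if policy is None:
        return False
    policy.restrict_var.save_to_file_system(
        dirpath, proc_size=proc_size, proc_rank=proc_rank)
    return True


def save_from_model(de_vars, dirpath, proc_size, proc_rank):
    # Stand-in for the model save path; delegates to the shared helper.
    return [maybe_save_restrict_policy_params(v, dirpath, proc_size, proc_rank)
            for v in de_vars]


def save_from_callback(de_vars, dirpath, proc_size, proc_rank):
    # Stand-in for the checkpoint callback; reuses the same helper.
    return [maybe_save_restrict_policy_params(v, dirpath, proc_size, proc_rank)
            for v in de_vars]


restrict_var = FakeVariable("status")
with_policy = FakeVariable("emb_a", FakeRestrictPolicy(restrict_var))
without_policy = FakeVariable("emb_b")
results = save_from_model([with_policy, without_policy], "/tmp/ckpt",
                          proc_size=2, proc_rank=1)
```

Whether the helper should live in models.py and be imported from callbacks.py, or sit in a separate utility module, is left open here.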
Description
- … rank == 0.
- … rank != 0, the restrict_var is not restored, leading to an unsynchronized state between the embedding variable and restrict_var.
- … _maybe_save_restrict_policy_params function:
  - … de_var has a restrict_policy.
- … _traverse_emb_layers_and_save to:
  - … _maybe_save_restrict_policy_params for each distributed embedding variable (de_var).
  - … hvd.rank()).
Type of change
Checklist:
How Has This Been Tested?
- … tensorflow_recommenders_addons/dynamic_embedding/python/kernel_tests/horovod_embedding_restrict_save_test.py that trains a dummy model for some steps, then saves the model.
- … horovodrun for CPU-based distributed setup
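The failure mode described above (restrict_var missing on ranks other than 0 after a restore) can be illustrated with a toy model. This is a plain-Python sketch with made-up data, not the test in this PR and not TFRA code: per-rank dictionaries stand in for each worker's restrict-policy state, and the function names are hypothetical.

```python
def save_restrict_vars(per_rank_state, only_rank_zero):
    """Return a toy checkpoint mapping rank -> copy of that rank's state."""
    checkpoint = {}
    for rank, state in enumerate(per_rank_state):
        if only_rank_zero and rank != 0:
            continue  # mimics saving the restrict state on rank 0 only
        checkpoint[rank] = dict(state)
    return checkpoint


def restore_restrict_vars(checkpoint, proc_size):
    """Each rank restores its own shard; a missing shard comes back empty."""
    return [dict(checkpoint.get(rank, {})) for rank in range(proc_size)]


def in_sync(embedding_keys, restrict_state):
    """True if every rank tracks exactly the keys its embedding shard holds."""
    return all(set(keys) == set(state)
               for keys, state in zip(embedding_keys, restrict_state))


# Made-up shards for two workers: keys held by each rank's embedding
# variable, and the matching restrict-policy state (key -> status).
embedding_keys = [[0, 2], [1, 3]]
restrict_state = [{0: 5, 2: 7}, {1: 6, 3: 8}]

rank_zero_only = restore_restrict_vars(
    save_restrict_vars(restrict_state, only_rank_zero=True), proc_size=2)
every_rank = restore_restrict_vars(
    save_restrict_vars(restrict_state, only_rank_zero=False), proc_size=2)

print(in_sync(embedding_keys, rank_zero_only))
print(in_sync(embedding_keys, every_rank))
```

In this toy setup, saving from every rank keeps the restored restrict state aligned with the embedding shards, while saving from rank 0 alone leaves rank 1 with nothing to restore.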