taylanbil
reviewed
Nov 3, 2020
- initial changes to support training on tpus
- changed tpu configuration to use training.device
- replace parallelLoader with mpLoader to solve loader exhaust issue
- removed debug message
- updated the comment
- added comments for drop_last change
- removed pdb lines
- removed redundant device config
- added comments for pending changes
- default init not applicable for xla device type
- moved wrapping of dataloader to build
- added line-debug function metsumm
- removed some .item calls from reporting
- xla equivalents in the distributed module; earlier eval was failing at the metrics all reduce step
- implemented broadcast in terms of all_to_all
- changes for checkpoint saving
- change to make execution even across cores
- corrected the is_master logic
- one more fix for is_master
- clean up of debug messages
taylanbil
reviewed
Nov 23, 2020
```diff
 v = v.mean()
-v = v.item()
 assert isinstance(v, (float, int))
+#v = v.item()
```
instead of commenting out here, let's have a util function like

```python
def item(self, v):
    if torch.is_tensor(v) and v.device.type == "xla":
        return v
    return v.item()
```

and use `v = self.item(v)`, and then assert on `assert isinstance(v, (float, int)) or v.device.type == "xla"`
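A runnable sketch of the dispatch pattern the suggested helper uses. Since `torch`/`torch_xla` may not be available here, `FakeTensor` and `FakeDevice` are illustrative stand-ins for `torch.Tensor` and its `.device`; in the real code the check would be `torch.is_tensor(v)`:

```python
class FakeDevice:
    """Stand-in for torch.Tensor.device (illustrative only)."""
    def __init__(self, type_):
        self.type = type_

class FakeTensor:
    """Stand-in for a scalar torch.Tensor (illustrative only)."""
    def __init__(self, value, device_type="cpu"):
        self._value = value
        self.device = FakeDevice(device_type)

    def item(self):
        return self._value

def item(v):
    # On XLA, .item() forces a device-to-host sync, so keep the tensor as-is;
    # everywhere else, unwrap it to a plain Python number.
    if isinstance(v, FakeTensor) and v.device.type == "xla":
        return v
    return v.item()

cpu_val = item(FakeTensor(3.5, "cpu"))  # unwrapped to a Python float
xla_val = item(FakeTensor(3.5, "xla"))  # left as a (fake) tensor
assert isinstance(cpu_val, float)
assert isinstance(xla_val, FakeTensor)
```

The point of the pattern is to keep the hot path free of implicit host transfers while leaving non-XLA behavior unchanged.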
```python
# since other device types such as xla can be passed,
# falling back to cpu should only happen when device_type
# is set to cuda but cuda is not available.
if not torch.cuda.is_available() and device == "cuda":
```
ordering as `device == "cuda" and not torch.cuda.is_available()` will save you the cuda availability check when the device isn't cuda.
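The suggestion relies on Python's left-to-right short-circuiting of `and`: putting the cheap string comparison first means the (comparatively slow) availability probe never runs on non-CUDA devices. A sketch with a counter standing in for `torch.cuda.is_available()`:

```python
calls = {"cuda_probe": 0}

def cuda_available():
    # Stand-in for torch.cuda.is_available(), which can be slow to evaluate.
    calls["cuda_probe"] += 1
    return False

device = "xla"

# Original ordering: the probe runs even though device is not "cuda".
fallback = not cuda_available() and device == "cuda"
assert calls["cuda_probe"] == 1

# Suggested ordering: the string check is False, so `and` short-circuits
# and the probe is never called.
fallback = device == "cuda" and not cuda_available()
assert calls["cuda_probe"] == 1  # unchanged: probe was skipped
assert fallback is False
```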
```python
# Since this is an iterator, we need to return total length == number of batches
batch_size = get_batch_size()
# Changed the length to accommodate drop_last == True.
# drop_last is required if the batch is split into multiple cores;
# some of the cores may not have enough examples.
if is_xla():
```
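For reference, the length change amounts to floor vs. ceiling division over the number of examples. This is a hedged sketch (`loader_length` is a hypothetical helper, not from the PR):

```python
import math

def loader_length(num_examples, batch_size, drop_last):
    # With drop_last=True an incomplete final batch is discarded (floor);
    # with drop_last=False it still counts as a batch (ceil).
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

assert loader_length(103, 10, drop_last=False) == 11
assert loader_length(103, 10, drop_last=True) == 10

# When each of several cores takes a slice of the batch, a short final batch
# may leave some cores without examples -- hence drop_last in the XLA path.
```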
can you use the bool `drop_last` here instead of `is_xla`?
```python
self.device = xm.xla_device()
self.distributed = True
self.local_rank = xm.get_local_ordinal()
self.tpu = True
```
I think using `self.xla` to denote xla usage is better than using `self.tpu`
```diff
     bool -- Tells whether early stopping occurred or not
     """
-    if not is_master():
+    if not is_master() and not is_xla():
```
I don't understand this. why always False if not the master ordinal?
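One reading of the changed guard, based on the "change to make execution even across cores" commit message above: on XLA every ordinal must execute the same graph, so only non-master, non-XLA workers may return early. A sketch of the condition (`returns_early` is an illustrative name, not from the PR):

```python
def returns_early(is_master, is_xla):
    # Mirrors the changed guard: `if not is_master() and not is_xla():`
    return (not is_master) and (not is_xla)

# On GPU/CPU distributed runs, non-master workers still skip the body:
assert returns_early(is_master=False, is_xla=False) is True

# On XLA, every ordinal falls through, keeping execution identical
# across cores; the master likewise never returns early:
assert returns_early(is_master=False, is_xla=True) is False
assert returns_early(is_master=True, is_xla=True) is False
```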
```python
if init_distributed:
    distributed_init(config)
```

```python
if config.distributed.init_method is None:
    infer_init_method(config)
```

```python
if is_xla():
    import torch_xla.distributed.xla_multiprocessing as xmp

    torch.multiprocessing.set_sharing_strategy("file_system")
```
For Internal review.