… fix filling w/ -inf in wav2vec2, minor cleanups
…item in log_scalar, minor cleanups
```python
loss = F.binary_cross_entropy_with_logits(
    logits, target.float(), weights,
    reduction="sum" if reduce else "none",
    ignore_index=-1,
```
This won't actually work because `binary_cross_entropy_with_logits` does not have an `ignore_index` argument. You need to use `reduction="none"` here and then zero out the loss coming from the unmasked states.
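A minimal sketch of that fix (the shapes and the use of `-1` as the ignore value are assumptions for illustration, not from this PR):

```python
import torch
import torch.nn.functional as F

# Toy logits/targets of shape (B, T); -1 marks positions to ignore.
logits = torch.tensor([[0.5, -1.2, 2.0]])
target = torch.tensor([[1.0, -1.0, 0.0]])

keep = target != -1
# reduction="none" gives a per-element loss we can mask ourselves,
# since binary_cross_entropy_with_logits has no ignore_index.
per_elem = F.binary_cross_entropy_with_logits(
    logits, target.clamp(min=0.0), reduction="none"
)
per_elem = per_elem * keep  # zero out loss from ignored states
loss = per_elem.sum()
```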
```python
sample_size = target.numel() if self.infonce else target.long().sum().item()
if 'sample_size' in sample and self.infonce:
    sample_size = sample['sample_size']
elif 'mask_indices' in sample['net_input'] and self.infonce:
```
Maybe remove the `and self.infonce` part?
```python
    across workers prior to calling `reduce_metrics`. Setting this
    to True will improve distributed training speed.
    """
    return False
```
If you keep it at False, do you still see the larger accuracy etc.? (I know we tried 1 node, but still.)
```python
return y.new(0)
```

```python
(bsz, tsz), fsz = mask_indices.shape, self.args.final_dim
high = mask_indices.sum(-1).max().item()
```
This means that sometimes you will sample negatives from masked timesteps for examples that are shorter than the longest one. Why not sample separately for each example in the batch and use the correct `high` for each example?
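One way to sketch the per-example version (`num_masked`, the per-example count of masked timesteps, is a hypothetical name for illustration):

```python
import torch

torch.manual_seed(0)
bsz, n_negatives = 2, 4
# hypothetical per-example counts of masked timesteps
num_masked = torch.tensor([5, 3])

neg_idxs = []
for b in range(bsz):
    high = int(num_masked[b])
    # sample negatives only from this example's own masked range
    neg_idxs.append(torch.randint(low=0, high=high, size=(n_negatives,)))
neg_idxs = torch.stack(neg_idxs)
```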
```python
if self.n_negatives > 0:
    for i in range(1, bsz):
        neg_idxs[i] += i * high
```
This is problematic because previously this assumed `high` is the number of timesteps, but we've redefined `high` above to be something smaller. You need to do `neg_idxs[i] += i * tsz` here.
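A toy illustration of why the offset has to be the row stride `tsz` rather than the redefined `high` when indexing into a tensor flattened over `(bsz, tsz)` (all shapes and names here are illustrative):

```python
import torch

bsz, tsz, fsz = 3, 7, 2
feats = torch.arange(bsz * tsz * fsz, dtype=torch.float).view(bsz, tsz, fsz)
flat = feats.view(-1, fsz)  # flattened over (bsz * tsz)

# per-example timestep indices into that example's own row
neg_idxs = torch.tensor([[0, 2], [1, 3], [4, 5]])
for i in range(1, bsz):
    neg_idxs[i] += i * tsz  # offset by tsz, the true row stride

gathered = flat[neg_idxs.view(-1)].view(bsz, 2, fsz)
```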
```python
neg_idxs = torch.randint(
    low=0, high=high - 1, size=(bsz, self.n_negatives * num)
)
neg_idxs = torch.randint(low=0, high=high-1, size=(bsz, self.n_negatives * tsz))
```
Here, for XLA, you should loop over `bsz` and sample negatives individually for each example, setting `high` to be `tsz - sum(padding[b])`, as we discussed. Otherwise you might be sampling negatives from states that are padded. I guess `padding[b]` would come from `padding_counts`, which is currently not used.
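A rough sketch of that per-example loop (reusing the `padding_counts` name from the diff; the concrete shapes and the `high - 1` bound mirroring the quoted code are assumptions):

```python
import torch

torch.manual_seed(0)
bsz, tsz, n_negatives = 2, 6, 3
# hypothetical number of padded timesteps per example
padding_counts = torch.tensor([0, 2])

neg_idxs = []
for b in range(bsz):
    high = tsz - int(padding_counts[b])  # exclude padded timesteps
    neg_idxs.append(torch.randint(low=0, high=high - 1, size=(n_negatives,)))
neg_idxs = torch.stack(neg_idxs)
```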
```python
pdb.set_trace()
negs = negs.view(
    bsz, num, self.n_negatives + self.cross_sample_negatives, fsz
    bsz, tsz, self.n_negatives + self.cross_sample_negatives, fsz
```
```python
y = self.project_q(y)
```

```python
num = y.size(1) if tszs_after_mask is None else max(tszs_after_mask)
```
You don't need to do this; just set it to `y.size(1)` always.
```python
if self.negatives_from_everywhere:
    negs, _ = self.sample_negatives(unmasked_features, y.size(1))
    negs, _ = self.sample_negatives(
        unmasked_features, num, padding_counts=padding_counts,
```
No need to change it to `num` here.
```python
else:
    negs, _ = self.sample_negatives(y, y.size(1))
    negs, _ = self.sample_negatives(
        y, num,
```