Enable w2v2 tpu#7

Open
taylanbil wants to merge 82 commits into enable-w2v2-tpu-BASECOMMIT from enable-w2v2-tpu
Conversation

@taylanbil
Owner

Commits are separated into logical pieces and organized, with informative messages.

patrickvonplaten and others added 14 commits February 10, 2021 14:04
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes facebookresearch#3227

All models that do **not** make use of group norm, such as
- Wav2Vec 2.0 Large (LV-60)*
- Wav2Vec 2.0 Large (LV-60) + Self Training *

do need this fix, IMO, to be able to correctly run batches through the model. Before this PR, the
following code snippet failed:

```python
import fairseq
import torch

# get model
wav2vec_path = "data/wav2vec2_vox_960h_new.pt"
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    [wav2vec_path], arg_overrides={"data": "./data"}
)
model = model[0]
model.eval()

# create single input
input_wav_0 = torch.randn((1, 2000))
input_wav_1 = torch.randn((1, 3000))

# create batched input
batch_input_wav = torch.zeros((2, 3000))
batch_input_wav[0, :input_wav_0.shape[-1]] = input_wav_0
batch_input_wav[1, :input_wav_1.shape[-1]] = input_wav_1

# create padding mask
padding_mask = torch.zeros((2, 3000), dtype=torch.bool)
padding_mask[0, input_wav_0.shape[-1]:] = True

# run batch & single
output = model(source=input_wav_0, padding_mask=None)["encoder_out"]
batch_output = model(source=batch_input_wav, padding_mask=padding_mask)["encoder_out"]

# is equal?
print("Is batched forward and simple forward equal?", torch.allclose(output[:,0], batch_output[:output.shape[0], 0], atol=1e-3))
```
Note: It is assumed that both https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt and https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt were downloaded and stored in the `data` folder.

Also, see [this](https://colab.research.google.com/drive/1ASZ4lVZbKkj-dvRHDl1lo0mCcsaOERlG?usp=sharing) notebook for reproducibility.

This PR should fix the behavior and make the above code snippet / notebook run successfully.

## PR review

Gently pinging alexeib for Wav2Vec2

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3228

Reviewed By: aconneau

Differential Revision: D26373721

Pulled By: alexeib

fbshipit-source-id: 3d5aca2f8136d1a8c4b5b4bc9c03cd05a69a3b52
Reviewed By: myleott, chtran

Differential Revision: D26348808

fbshipit-source-id: 010ef00024e02c09ec35b624f0713ce5f1f387b4
Summary:
At the start of the half there were some expired handles, and it was annoying to track down which datasets were responsible (when sampling data among multiple datasets) and which flows were running them. Let's improve the error message to address several pain points:

1. Explicitly tell the user which dataset has expired handles
2. Link to a scuba query to enable the user to find all flows that have expired handles
3. Fail job if 10k handles have expired, rather than if 10k handles in a row have expired. This can detect failures from datasets that have for example 50% expired handles
4. Add logging when handles fail
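The distinction in point 3 (a total budget of expirations, rather than a consecutive run) can be sketched as follows. This is an illustrative counter with hypothetical names, not the actual internal code:

```python
class ExpiredHandleTracker:
    """Fail after a total budget of expired handles, not a consecutive run.

    A consecutive-run check misses datasets where, say, 50% of handles are
    expired, because healthy handles keep resetting the streak.
    """

    def __init__(self, max_expired=10_000):
        self.max_expired = max_expired
        self.total_expired = 0

    def record(self, expired, dataset_name="unknown"):
        # count every expiration; never reset on success
        if expired:
            self.total_expired += 1
            if self.total_expired >= self.max_expired:
                raise RuntimeError(
                    f"{self.total_expired} handles expired; "
                    f"dataset '{dataset_name}' has expired handles"
                )
```

With a budget of 3, two expirations interleaved with successes do not fail the job, but the third does.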

Reviewed By: cruvadom

Differential Revision: D26187820

fbshipit-source-id: 771a359ea01de80b38932921346e98cff812f2f7
Summary:
fairscale.nn.Pipe has been ported to PyTorch:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/pipeline/sync/pipe.py#L138.
As a result, modifying the pipeline transformer to use PyTorch pipe if available. This change depends on pytorch/pytorch#50860.
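The "use PyTorch Pipe if available" behavior presumably amounts to an import fallback along these lines (a sketch, not the exact fairseq code; the final `None` fallback is only for environments with neither backend):

```python
try:
    # PyTorch >= 1.8 ships the pipeline ported from fairscale
    from torch.distributed.pipeline.sync import Pipe
    TORCH_PIPE = True
except ImportError:
    TORCH_PIPE = False
    try:
        # fall back to the original fairscale implementation
        from fairscale.nn import Pipe
    except ImportError:
        Pipe = None  # neither backend available in this environment
```

Callers can then branch on `TORCH_PIPE` for any API differences between the two implementations.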

Pull Request resolved: facebookresearch#3149

Test Plan:
```
python train.py ru_en_bin/ --arch transformer_iwslt_de_en_pipeline_parallel --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --eval-bleu --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' --eval-bleu-detok moses --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --pipeline-model-parallel --pipeline-balance '[1,3,5,3,3,1]' --pipeline-devices '[0,1,0,2,3,0]' --pipeline-chunks 16 --distributed-world-size 1 --distributed-no-spawn --disable-validation --max-epoch 1
```

Output with torch pipe:
```
2021-01-20 16:13:35 | INFO | train | epoch 001 | loss 12.676 | nll_loss 12.331 | ppl 5151.97 | wps 5108 | ups 1.66 | wpb 3081.6 | bsz 131.6 | num_updates 380 | lr 4.75e-05 | gnorm 2.08 | train_wall 229 | wall 233
2021-01-20 16:13:36 | INFO | fairseq_cli.train | done training in 233.1 seconds
```

Output with fairscale pipe:
```
2021-01-20 14:13:59 | INFO | train | epoch 001 | loss 12.677 | nll_loss 12.331 | ppl 5152.07 | wps 5198.9 | ups 1.69 | wpb 3081.6 | bsz 131.6 | num_updates 380 | lr 4.75e-05 | gnorm 2.08 | train_wall 224 | wall 228
2021-01-20 14:13:59 | INFO | fairseq_cli.train | done training in 228.0 seconds
```

Reviewed By: myleott

Differential Revision: D26204633

Pulled By: shruti-bh

fbshipit-source-id: 535f816e8d149b47fc6ba8385981accf67257257
…ch#3231)

Summary:
More informative exception when numpy version changes to ask the user to recompile Cython files

# Before submitting

- [With myleott  ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [N/A ] Did you make sure to update the docs?
- [N/A ] Did you write any new necessary tests?

## What does this PR do?
Raises a more informative error to tell the user to recompile Cython files after an update to the numpy version.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3231

Reviewed By: myleott

Differential Revision: D26375174

Pulled By: mwillwork

fbshipit-source-id: f0a93e162bc4cf84619581110d21bea907baf7fc
Summary:
this allows tasks to declare some properties they'd like to save in the checkpoint (such as a dictionary), which are loaded when checkpoint is restored.
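The mechanism described can be sketched as a `state_dict`/`load_state_dict` pair on the task, mirroring how modules and optimizers persist state (an illustrative class, not the actual fairseq API):

```python
class TaskWithState:
    """Sketch: a task exposing properties to be stored in the checkpoint,
    e.g. a dictionary built during pre-training that fine-tuning and
    decoding should reuse instead of rebuilding."""

    def __init__(self, dictionary=None):
        self.dictionary = dictionary

    def state_dict(self):
        # properties the task wants persisted in the checkpoint
        return {"dictionary": self.dictionary}

    def load_state_dict(self, state):
        # restore the saved properties when the checkpoint is loaded
        self.dictionary = state.get("dictionary")
```

The trainer would call `state_dict()` when saving a checkpoint and `load_state_dict()` when restoring it, so the dictionary is loaded once and then travels with the checkpoint.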

Pull Request resolved: fairinternal/fairseq-py#1562

Test Plan: tested by training a new wav2vec model, then finetuning it, then decoding it and making sure the dict only loaded once, during fine tuning process (and was obtained from checkpoint for decoding)

Reviewed By: myleott, gwenzek

Differential Revision: D25937974

Pulled By: alexeib

fbshipit-source-id: b9908042f76ec8cda943f33885eb9b1f121662ae
Summary:
- I don't think there is a convention for the shapes of `encoder_out` and `encoder_padding_mask` in fairseq but `fst_external_decoder.py` expects `encoder_padding_mask` to be of shape T x B. `encoder_padding_mask` also seems unused in the fairseq [CTC criterion and w2l decoder integration](https://fburl.com/diffusion/ms1zi2px) so taking the easy way out and changing its shape.
- Also checking in some changes to the pyspeech audio_pretraining task required to make decoding work
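The shape change amounts to transposing the usual B x T padding mask to the T x B layout that `fst_external_decoder.py` expects; in plain PyTorch terms:

```python
import torch

# batch of 2 sequences, max length 4; True marks padded positions
padding_mask = torch.tensor([
    [False, False, True, True],    # sequence 0: length 2
    [False, False, False, False],  # sequence 1: length 4
])
assert padding_mask.shape == (2, 4)      # B x T

encoder_padding_mask = padding_mask.t()  # T x B
assert encoder_padding_mask.shape == (4, 2)
```

Since the mask is reportedly unused in the CTC criterion and w2l decoder integration, changing its shape here is safe for those consumers.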

Reviewed By: alexeib

Differential Revision: D26382442

fbshipit-source-id: 87c8f9433026c0e011847f4e2e094beb2cd2182c
Summary:
fixes fairseqlm integration with flashlight (formerly wav2letter) decoder

Pull Request resolved: fairinternal/fairseq-py#1617

Reviewed By: xuqiantong

Differential Revision: D26415650

Pulled By: alexeib

fbshipit-source-id: 813684ba55047e92378f508101ff1eec55754420
Summary:
raise an exception if trying to use wav2vec seq2seq finetuning without autoregressive flag
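The guard presumably looks something like this (a sketch with illustrative names; the exact config attribute and message are assumptions):

```python
def require_autoregressive(task_cfg):
    """Sketch of the new check: wav2vec seq2seq fine-tuning only makes
    sense when the task is configured as autoregressive."""
    if not getattr(task_cfg, "autoregressive", False):
        raise ValueError(
            "wav2vec seq2seq fine-tuning requires the autoregressive flag; "
            "please enable it in the task config"
        )
```

Failing fast here turns a confusing downstream error into an actionable message at startup.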

Pull Request resolved: fairinternal/fairseq-py#1618

Reviewed By: xuqiantong

Differential Revision: D26417249

Pulled By: alexeib

fbshipit-source-id: 777b6d170b0f8196746e03b399e4d7c21ac0b837
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3240

Reviewed By: aconneau

Differential Revision: D26420073

Pulled By: alexeib

fbshipit-source-id: 5939535b945a64e61d655cd36dc955ae46410bfb
Summary:
somehow merging previous pull request deleted the readme

Pull Request resolved: fairinternal/fairseq-py#1621

Reviewed By: michaelauli

Differential Revision: D26429893

Pulled By: alexeib

fbshipit-source-id: 3e6ed1e4698e67e56e0b88d304f42907a4f6cf41
Summary:
OSS removed the 'partition' key in their state dict to accommodate changing partition sizes. This requires an update on the fairseq side to not look into the parameter partition, just broadcast everything, and let the optimizer on each rank decide which parameters are relevant.

This diff also needs D26419095 to function completely, and blefaudeux has made fixes upstream in facebookresearch/fairscale#383

Reviewed By: myleott

Differential Revision: D26382917

fbshipit-source-id: 95af1022be59e88814748acaee36a1a350f7dc5b
…s. (facebookresearch#3237)

Summary:
…ith BLEU scores

# Before submitting

- [no] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [yes] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [no need] Did you make sure to update the docs?
- [no need] Did you write any new necessary tests?

## What does this PR do?
Fixes bugs in evaluation with BLEU score when training with multiple GPUs. No error occurs if there is no distributed training.

When `--eval-bleu` is set to `True` (by default it is `False`, and the best checkpoint is selected according to loss) and training uses multiple GPUs (i.e., the number of GPUs participating in distributed training is greater than 1), the following error occurs:

```bash
Traceback (most recent call last):
  File "/data/cordercorder/anaconda3/envs/nmt/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 450, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data1/cordercorder/fairseq/fairseq/distributed/utils.py", line 349, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/data1/cordercorder/fairseq/fairseq/distributed/utils.py", line 326, in distributed_main
    main(cfg, **kwargs)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 259, in train
    cfg, trainer, task, epoch_itr, valid_subsets, end_of_epoch
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 345, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "/data1/cordercorder/fairseq/fairseq_cli/train.py", line 413, in validate
    trainer.valid_step(sample)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data1/cordercorder/fairseq/fairseq/trainer.py", line 834, in valid_step
    logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
  File "/data1/cordercorder/fairseq/fairseq/trainer.py", line 1157, in _reduce_and_log_stats
    self.task.reduce_metrics(logging_outputs, self.get_criterion())
  File "/data1/cordercorder/fairseq/fairseq/tasks/translation.py", line 410, in reduce_metrics
    metrics.log_scalar("_bleu_counts", np.array(counts))
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/tensor.py", line 480, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

(The same traceback is raised on every rank, with cuda:0 through cuda:3; the interleaved per-rank output has been deduplicated here.)

Traceback (most recent call last):
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/data/cordercorder/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/data/cordercorder/anaconda3/envs/nmt/bin/python', '-u', '/data/cordercorder/anaconda3/envs/nmt/bin/fairseq-train', '--local_rank=3', 'tiny_data_bin', '--distributed-world-size', '4', '--arch', 'transformer', '--share-decoder-input-output-embed', '--optimizer', 'adam', '--adam-betas', '(0.9, 0.98)', '--clip-norm', '0.0', '--lr-scheduler', 'inverse_sqrt', '--warmup-init-lr', '1e-07', '--warmup-updates', '3000', '--lr', '0.0005', '--stop-min-lr', '1e-09', '--dropout', '0.25', '--weight-decay', '0.0001', '--criterion', 'label_smoothed_cross_entropy', '--label-smoothing', '0.1', '--max-tokens', '5000', '--batch-size', '64', '--update-freq', '4', '--max-epoch', '30', '--save-dir', 'checkpoint', '--skip-invalid-size-inputs-valid-test', '--eval-bleu', '--eval-bleu-args', '{"beam": 5}', '--eval-bleu-remove-bpe', 'sentencepiece', '--eval-bleu-print-samples', '--eval-tokenized-bleu', '--best-checkpoint-metric', 'bleu', '--maximize-best-checkpoint-metric', '--validate-interval-updates', '1']' returned non-zero exit status 1.
```

The error is caused by the fact that numpy version 1.20.1 does not support code like the following:
```python
import torch
import numpy as np
a = torch.tensor(0, device="cuda:0")
b = np.array([a])
```
The above code leads to the error "TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.", but it runs fine if the numpy version is 1.18.1 or 1.17.0 (any version below 1.20.0 is probably OK). However, it seems that the latest version of fairseq needs a numpy package of version 1.20.0 or higher (issue facebookresearch#3203).
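The standard workaround, and roughly what the fix amounts to inside `reduce_metrics`, is to move the tensor to host memory before handing it to numpy. Sketched on the toy example above (with a CPU fallback so it also runs on machines without CUDA):

```python
import numpy as np
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
a = torch.tensor(0, device=device)

# .cpu() copies the tensor to host memory first, so numpy's __array__
# conversion succeeds regardless of the numpy version
b = np.array([a.cpu()])
```

Calling `.cpu()` on a tensor that is already on the CPU is a cheap no-op, so the fix is safe in the single-GPU and CPU cases too.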

### Reproduce the error
Download the source code of fairseq (commit ID: 7061a0f) and run the following commands:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
data_bin_dir=tiny_data_bin

python -m torch.distributed.launch --nproc_per_node=4 \
    --master_addr="127.0.0.1" \
    --master_port=12345 \
    $(which fairseq-train) ${data_bin_dir} \
    --distributed-world-size 4 \
    --arch transformer \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 3000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --dropout 0.25 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 5000 \
    --batch-size 64 \
    --update-freq 4 \
    --max-epoch 30 \
    --save-dir checkpoint \
    --skip-invalid-size-inputs-valid-test \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5}' \
    --eval-bleu-remove-bpe sentencepiece \
    --eval-bleu-print-samples \
    --eval-tokenized-bleu \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --validate-interval-updates 1
```

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3237

Reviewed By: myleott

Differential Revision: D26429732

Pulled By: alexeib

fbshipit-source-id: bc887ce952d28541cb07dbbdc7e80e99428a6b34
Summary:
fixes a previous change that turned state/dataset/etc. into class variables instead of instance variables

Pull Request resolved: fairinternal/fairseq-py#1623

Reviewed By: michaelauli

Differential Revision: D26439560

Pulled By: alexeib

fbshipit-source-id: ab9e75a425a47ac7ace006419259e254770e560e
Myle Ott and others added 14 commits February 16, 2021 15:52
…coder (facebookresearch#1559)

Summary:
Pull Request resolved: fairinternal/fairseq-py#1559

This matches the behavior of RobertaEncoder.

Test Plan: Imported from OSS

Reviewed By: gwenzek

Differential Revision: D25936937

Pulled By: myleott

fbshipit-source-id: 795ec8d50298a41d9e9638101436faa01cdf1586
Summary:
This is long overdue, but finally deprecating the RobertaEncoder components and just using TransformerEncoder directly. This will make it easier for some upcoming online backtranslation changes, and will eventually make migrating it to dataclasses/Hydra easier too. It also fixes some longstanding inconsistencies in layernorm placement in the model parallel roberta code.

Pull Request resolved: fairinternal/fairseq-py#1560

Test Plan:
- confirmed that training gives identical losses as before:
https://gist.github.com/myleott/9a4d213fb88a02b00094ea074f5a2e2d
- confirmed that old roberta models can be loaded and produce identical results
- confirmed that old linformer models can be loaded and produce identical results (reran commands from D25938236 (facebookresearch@bf54551))
- confirmed that old model parallel models can be loaded and produce identical results:
```
python -m fairseq_cli.validate --path checkpoint.mp1/checkpoint_last.pt --task dummy_masked_lm --criterion masked_lm --max-sentences 8 --dataset-size 100 --model-parallel-size 2 --distributed-world-size 2

before:
2021-01-19 19:04:14 | INFO | valid |  | valid on 'valid' subset | loss 14.62 | ppl 25174.3 | wps 0 | wpb 53248 | bsz 104

after:
2021-01-19 19:06:59 | INFO | valid |  | valid on 'valid' subset | loss 14.62 | ppl 25174.3 | wps 0 | wpb 53248 | bsz 104
```

Reviewed By: gwenzek, ngoyal2707

Differential Revision: D25937145

Pulled By: myleott

fbshipit-source-id: 1ce0bc93e28e03fb926534ea4134684a49232599
Summary: Pull Request resolved: fairinternal/fairseq-py#1570

Test Plan: Imported from OSS

Reviewed By: gwenzek, ngoyal2707

Differential Revision: D25967675

Pulled By: myleott

fbshipit-source-id: 7c7f8d25b87ef9b4f0a85331548bb3a2886a1e92
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: fairinternal/fairseq-py#1629

Reviewed By: myleott

Differential Revision: D26484942

Pulled By: sshleifer

fbshipit-source-id: 9dcbab5c404c14d8f35628d823102ad9ce59dffd
Summary:
Integrating LASER (Language-Agnostic SEntence Representations) training code

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ Y] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ N/A] Did you make sure to update the docs?
- [ Y] Did you write any new necessary tests?  => an additional test in `test_iterators.py`

## What does this PR do?

This diff introduces the training code for LASER.
It includes a specific `laser` task in `laser_task.py` which reads a
json configuration file describing the binarized datasets of language
pairs.

`multitask_data_utils.py` defines dataset wrappers and iterators used by
`laser` task.
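The JSON configuration presumably maps language pairs to their binarized data, along these lines (field names and layout are illustrative, not the actual schema read by `laser_task.py`):

```python
import json

# hypothetical layout: one entry per language pair per split,
# pointing at the binarized dataset for that pair
config = {
    "train": [
        {"src": "en", "tgt": "fr", "data": "bin/en-fr"},
        {"src": "en", "tgt": "de", "data": "bin/en-de"},
    ],
    "valid": [
        {"src": "en", "tgt": "fr", "data": "bin/en-fr"},
    ],
}

text = json.dumps(config, indent=2)   # what would live on disk
loaded = json.loads(text)             # what the task would read back
```

The task would iterate over the entries per split and wrap each pair's dataset with the iterators defined in `multitask_data_utils.py`.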

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Yes!

Pull Request resolved: fairinternal/fairseq-py#1207

Reviewed By: myleott

Differential Revision: D26454296

Pulled By: Celebio

fbshipit-source-id: c987672aa66abf31b039ee11867b06912d3486e5
…1626)

Summary:
Add back a couple speed optimizations in the original roberta code that got lost in the refactor

Pull Request resolved: fairinternal/fairseq-py#1626

Reviewed By: gwenzek

Differential Revision: D26478534

Pulled By: myleott

fbshipit-source-id: b945de5e9bffd51cd63630cc3aa1f0078a41cca8
…ookresearch#3253)

Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?
- updates audio_utils to handle multi-channel audio as well as mono, with no change needed for existing recipes
- adds speech-to-text example for Multilingual TEDx (http://openslr.org/100) data

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3253

Reviewed By: yuntang

Differential Revision: D26514419

Pulled By: kahne

fbshipit-source-id: 699e428affda5b1347f96a8310691ab152dd6769
Summary: after D26382917 (facebookresearch@02803a1) shipped, somehow self._device was removed in the optimizer (or maybe I didn't test it the right way in the previous diff?). Fortunately OSS doesn't need it anyway.

Reviewed By: myleott

Differential Revision: D26523538

fbshipit-source-id: 637c1e344670340ae40b32635ef51f5501966b0c
Summary:
This is the pull request for the code for the paper
[SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation](https://www.aclweb.org/anthology/2020.aacl-main.58/)

The model will also be used for [IWSLT 2021 shared task on simultaneous translation
](https://iwslt.org/2021/simultaneous)
This pull request includes

- Convtransformer offline model
- Convtransformer simultaneous translation model with fixed pre-decision module
- The agent files for inference for the convtransformer simultaneous translation model

jmp84: The README is still missing. Just curious, where should I place it?

Pull Request resolved: fairinternal/fairseq-py#1607

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

**********
One of the failing landing integration tests
```
buck test mode/dev //multimo/fb/models/test:multimo_fb_model_test
https://fburl.com/testinfra/oxq2cn5n
```

Reviewed By: jmp84

Differential Revision: D26439663

Pulled By: sravyapopuri388

fbshipit-source-id: b127cb4962756af221b65e3ccb6598a42fc75f7f
Summary:
This diff integrates simul ST training into pyspeech with very minor modifications to the open sourced code. Specific changes made are
- In fixed_pre_decision.py remove self as argument to p_choose function as it is already called with super in line 101
- In monotonic_multihead_attention.py remove pdb.set_trace()
- Move label_smoothed_cross_entropy_latency_augmented.py to fairseq/criterions folder and add missing arguments to parser
- In fairseq/data/data_utils.py type cast max_tokens to int to avoid type error.
- Update fairseq/convtransformer.py to pyspeech/convtransformer.py

# Next steps:
- Verify decoding using the model trained
- Support everstore handle based decoding in simuleval and integrate it into pyspeech.

Reviewed By: jmp84

Differential Revision: D26478861

fbshipit-source-id: 3b02b2aee757e5464b71dbdd7ebdba42659faee5
Summary:
Fix LibriSpeech data prep script
* Lowercasing transcript to be consistent with the pre-trained models
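The normalization step can be sketched as follows (a minimal sketch of the lowercasing, not the full data-prep script):

```python
def normalize_transcript(text):
    """Lowercase a LibriSpeech transcript so it matches the casing
    convention the pre-trained models were trained with."""
    return text.lower().strip()
```

Applied uniformly during data prep, this keeps fine-tuning labels consistent with the pre-trained vocabulary.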

Reviewed By: jmp84

Differential Revision: D26538845

fbshipit-source-id: 0885f99e2c85f0e722a24f3cb83f2635ce9429bc
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes the KeyError mentioned in #3211.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3212

Reviewed By: alexeib

Differential Revision: D26513255

Pulled By: myleott

fbshipit-source-id: 5a11cb369c9d4202fab6998d269e7da5f3d3e534
…kresearch#3249)

Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes facebookresearch#3178 (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding! (I did ;)

Pull Request resolved: facebookresearch#3249

Reviewed By: alexeib

Differential Revision: D26513275

Pulled By: myleott

fbshipit-source-id: 2785098a945404c07eb72c079177654b1739a7a2
Summary:
I tried resuming a run from a checkpoint in f250883864, but ran into:

AssertionError: Criterion does not match; please reset the optimizer (--reset-optimizer). DistributedTimeoutWrapper vs ContrastiveLabelsCriterion

Based on this, I believe since D25836853 (facebookresearch@d68a353) we are no longer saving the actual criterion's name, but DistributedTimeoutWrapper in the checkpoint.

This is kind of weird though, as I would expect more people to run into this issue. Not sure if I am doing something wrong, let me know if so, thanks!

Reviewed By: myleott

Differential Revision: D26478656

fbshipit-source-id: bc3c7c925f5505140d9df4438af3a73d65d4f531
EricZLou and others added 10 commits March 3, 2021 10:50
Summary:
Pull Request resolved: fairinternal/fairseq-py#1669

Unit tests for async writes integration done in D26467815 (facebookresearch@3100d0b).

Ongoing performance tests: https://fb.quip.com/kjM7Atb1kKbO

Reviewed By: myleott

Differential Revision: D26732660

fbshipit-source-id: faf8cac67b9167af4195358c1a2592804c13562c
Reviewed By: vimalmanohar

Differential Revision: D26220694

fbshipit-source-id: ed13f8527a1b203e1a9d004fa8a86e1ad6423d60
Summary:
The sampling process in multi_corpus_dataset is very inefficient. It turns out we can significantly optimize it by sampling in batches rather than one by one. This allows:

1. fast local development and iteration with corpus sampling, since the turnaround time was long before
2. jobs to start training sooner, giving earlier signal if, for example, there is a configuration issue
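The batched-sampling idea can be sketched outside fairseq with plain numpy (the function name and signature below are illustrative, not the actual `MultiCorpusDataset` API): drawing many indices per RNG call amortizes the per-call overhead that makes one-by-one sampling slow.

```python
import numpy as np

def sample_corpora_batched(dist, num_samples, batch_size=10000, seed=0):
    """Sample corpus indices according to the distribution `dist`.

    Instead of one np.random call per sample, draw up to `batch_size`
    indices at a time and concatenate, which amortizes RNG overhead.
    """
    rng = np.random.RandomState(seed)
    out = []
    remaining = num_samples
    while remaining > 0:
        n = min(batch_size, remaining)
        out.append(rng.choice(len(dist), size=n, p=dist))
        remaining -= n
    return np.concatenate(out)
```

Fixing the seed keeps sampling deterministic across restarts, which matters for resumable training.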

Reviewed By: zhengwy888

Differential Revision: D26187821

fbshipit-source-id: b4f7f6b7c187b3785499308226e2af671a6c354f
Summary:
there are a few changes here:
- convert config persisted in checkpoints into a plain dict when saving and back to omegaconf config when loading: this helps avoid compatibility issues between different versions of python, omegaconf, etc
- update checkpoints that have old print_alignment saved
- add lr_float to composite optimizer to enable sweeping on lr with auto sweepers like ax
- fixing some edge cases for config loading
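The "persist config as a plain dict" change can be illustrated with a minimal, framework-free sketch. In fairseq this conversion goes through omegaconf; the helper below is a hypothetical stand-in showing why a plain container avoids pinning the checkpoint to a particular omegaconf/python version.

```python
def to_plain_container(cfg):
    """Recursively convert a nested config into plain dicts/lists so the
    pickled checkpoint carries no omegaconf class references.
    Hypothetical stand-in for an OmegaConf-based conversion."""
    if isinstance(cfg, dict):
        return {k: to_plain_container(v) for k, v in cfg.items()}
    if isinstance(cfg, (list, tuple)):
        return [to_plain_container(v) for v in cfg]
    return cfg
```

On load, the plain dict is converted back into an omegaconf config, so callers see the same interface as before.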

Pull Request resolved: fairinternal/fairseq-py#1671

Reviewed By: myleott

Differential Revision: D26791583

Pulled By: alexeib

fbshipit-source-id: 124dec74932052925c43b6a93130f4428803cb46
Summary:
Provide an ability to pass attn_mask to TransformerSentenceEncoder. The default is None and hence this is backwards compatible.

The attention mask can either be a 2D tensor of shape [tgt_seq_len, src_seq_len] or a 3D tensor of shape [bsz * num_heads, tgt_seq_len, src_seq_len].

In the case of self-attention, tgt_seq_len = src_seq_len.
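A small numpy sketch of the two mask shapes described above (illustrative only; the encoder itself consumes torch tensors): a 2D causal mask of shape [tgt_seq_len, src_seq_len], and its expansion to the 3D per-head form.

```python
import numpy as np

def causal_attn_mask(tgt_len, src_len):
    # 2D mask of shape [tgt_seq_len, src_seq_len]: -inf strictly above the
    # diagonal blocks attention to future positions, 0 elsewhere.
    return np.triu(np.full((tgt_len, src_len), float("-inf")), k=1)

def expand_attn_mask(mask_2d, bsz, num_heads):
    # 3D mask of shape [bsz * num_heads, tgt_seq_len, src_seq_len]: the same
    # 2D mask replicated for every head of every sequence in the batch.
    return np.broadcast_to(mask_2d, (bsz * num_heads,) + mask_2d.shape).copy()
```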

Reviewed By: myleott

Differential Revision: D26790767

fbshipit-source-id: 937d6c6cf08790c7d43d33fda97a30425f31ea06
Summary:
Pull Request resolved: fairinternal/fairseq-py#1666

Context: the checkpoint saving call stack has become a bit convoluted:
```
train.py
+ checkpoint_utils.save_checkpoint
 + trainer.save_checkpoint
  + checkpoint_utils.save_state
   + checkpoint_utils.torch_persistent_save
```

This diff slightly simplifies the checkpoint saving logic by exposing a `state_dict` method inside the Trainer. This simplifies the call stack to:
```
train.py
+ checkpoint_utils.save_checkpoint
 + trainer.save_checkpoint
  + checkpoint_utils.torch_persistent_save
```

This new structure is important for the FullyShardedDataParallel diff (next diff in the stack), since it enables the Trainer to save multiple checkpoints for the different optimizer state shards.

Test Plan:
- unit tests
- trained WMT En-De models; confirmed checkpoints save/load properly, resuming from a checkpoint gives identical results
- `buck test fblearner/flow/projects/langtech/translation:tests` (2 failures are in trunk too): https://www.internalfb.com/intern/testinfra/testconsole/testrun/2533274840914654/

Reviewed By: zhengwy888

Differential Revision: D26771146

Pulled By: myleott

fbshipit-source-id: 10f91979cd42205c1d8abcaa9ab56f63eba31e93
facebookresearch#1667)

Summary:
Pull Request resolved: fairinternal/fairseq-py#1667

Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded)

This enables fully parameter + optimizer state sharding by using
FullyShardedDataParallel (FSDP) from fairscale. The user just needs to provide
`--ddp-backend=fully_sharded` to enable. Other common options work
out-of-the-box (e.g., `--fp16`, `--memory-efficient-fp16`, `--update-freq`,
etc.). This should be a drop-in replacement for the "c10d" backend.

This yields pretty big speedups for small models and enables training ~13B
parameter models on 8 GPUs and 175B parameter models on 128 GPUs, without model
parallelism.

This also adds a new option `--cpu-offload` that offloads the optimizer state
and FP32 model copy to CPU, which is particularly useful when combined with
`--optimizer=cpu_adam`.

Note: after enabling this, each GPU will save a checkpoint file, since the
optimizer state is sharded. Each checkpoint will contain a single shard of the
optimizer state and the rank 0 checkpoint will contain the full model weights.

Note: a known limitation of the current implementation is that you cannot
resume training on a different world_size. This constraint will be relaxed in
future iterations.
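Putting the options above together, a fully sharded run might be launched like this (a command sketch only: the data path, task, and architecture are placeholders, not part of this diff):

```shell
# Illustrative invocation: paths and --arch are placeholders.
fairseq-train /path/to/data-bin \
  --task language_modeling --arch transformer_lm_gpt \
  --ddp-backend fully_sharded --fp16 \
  --cpu-offload --optimizer cpu_adam \
  --tokens-per-sample 512 --batch-size 8 --lr 0.0001
```

As noted above, `--cpu-offload` pairs with `--optimizer cpu_adam` so the offloaded FP32 state is updated where it lives.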

Test Plan: Imported from OSS

Reviewed By: sshleifer

Differential Revision: D26771144

Pulled By: myleott

fbshipit-source-id: 74c2f46f57719e24e2dcfc9d9ee7c2fc0aeedb46
Summary:
1. In fblearner flow we are dumping cmvn stats into a json file (e.g. f253830726). Previously there was only the --config option, which took the .npz path from a yaml file, and that was the only usage for the config. This diff adds a --global-stats option to import from json.

2. Inherit FairseqSimulSTAgent from nn.Module instead of SpeechAgent (whose root class is object) to prepare for scripting methods. Copy over and simplify all the necessary methods from SpeechAgent/Agent.

Reviewed By: jmp84

Differential Revision: D26800957

fbshipit-source-id: 74be527f8473c13405a60bb16ce6da5a7dc0b888
Summary:
Fix bug on converting stereo audio in audio_utils.py
- Github issue: facebookresearch#3303

Reviewed By: jmp84

Differential Revision: D26825964

fbshipit-source-id: 26905e71540bc52e98d76996b199ac0fbe78357b
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fix a typo in the gcmvn path key used for config yaml generation (actual: gcvmn_cvmn_path, correct: gcmvn_path)

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: facebookresearch#3307

Reviewed By: jmp84

Differential Revision: D26826231

Pulled By: kahne

fbshipit-source-id: 6b60f2a8a8b4ba1c0c088299a08ef04fdfe870a8

@alexeib alexeib left a comment


sorry for the delay in reviewing. this is looking very good. could you please add a usage example to examples/wav2vec/README.md and then i will merge

alexeib commented Mar 9, 2021

also can you do this as a PR to fairseq repo at https://github.com/pytorch/fairseq/?

Myle Ott and others added 13 commits March 9, 2021 06:31
…#3327)

Summary: Pull Request resolved: facebookresearch#3327

Reviewed By: sshleifer

Differential Revision: D26899416

Pulled By: myleott

fbshipit-source-id: bbb493a5c4e0a51f3b26fe8f94e3962b6206d6f6
Summary: Pull Request resolved: facebookresearch#3331

Reviewed By: sshleifer

Differential Revision: D26912554

Pulled By: myleott

fbshipit-source-id: b45a161fbd52a12da13d7e011d562d35a5b5a1a7
Summary:
update audio_utils and fix mTEDx example
- Updated `audio_utils`
  - Added support for OGG Vorbis (the only supported lossy compressed format)
  - Added a separate `convert_to_mono()` helper function
  - Updated `get_waveform()`
    - added new arguments `frames` and `start` for reading part of audios
    - added new argument `mono` for auto conversion to mono-channel audio
    - unified returned waveform shape to channels x length (same as torchaudio default)
- Updated mTEDx and MUST-C data prep scripts
  - Replaced `torchaudio.info()` with `soundfile.info()` (the latter is faster, and the former has an incompatible interface between versions <0.8 and the latest 0.8)
  - Replaced `torchaudio.load()` with `get_waveform` for auto conversion to mono channel
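A minimal sketch of what a channels-first `convert_to_mono()` helper can look like under the conventions described above (the actual fairseq implementation may differ; this version just averages channels):

```python
import numpy as np

def convert_to_mono(waveform):
    """Downmix a (channels, length) waveform to (1, length) by averaging
    across channels. Matches the channels-first convention that
    get_waveform() is described as returning."""
    if waveform.shape[0] > 1:
        waveform = waveform.mean(axis=0, keepdims=True)
    return waveform
```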

Reviewed By: jmp84

Differential Revision: D26901114

fbshipit-source-id: fa9560c9714d51a91157d5141564574d4eee454d
Summary: Pull Request resolved: fairinternal/fairseq-py#1683

Reviewed By: jmp84

Differential Revision: D26914869

Pulled By: xutaima

fbshipit-source-id: a5d2efdcff1852e56304e77838840b3aad5124b0
Summary:
### Changes:
- `PlasmaArray` saves the underlying data to `self.array`, `PlasmaView` never does that, instead it fetches the data from `plasma_store` shared memory when it is needed.
- `PlasmaArray` starts a new, ephemeral plasma_store and puts a new array in it when it is pickled. If `--use-plasma-view`, there is one server started before `spawn` and arrays are only put into it once, in `PlasmaArray.__init__` to accommodate this.
- user can now pass `--plasma-path` to explicitly control where server is started.
- We now make plasma keys based on `(split_path, (block_size, document_sep_len, str(break_mode), len(dataset)))`, so two jobs sharing a plasma server but with different datasets, or the same dataset but different clargs, will not read each other's arrays.
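One way to turn that identity tuple into a plasma key can be sketched as follows (the function name is hypothetical; plasma ObjectIDs are 20 bytes, which is exactly the size of a sha1 digest):

```python
import hashlib

def make_plasma_key(split_path, block_size, document_sep_len, break_mode, dataset_len):
    """Derive a deterministic 20-byte key from the dataset identity tuple,
    so jobs sharing one plasma server only map to the same object when
    they would produce an identical array."""
    ident = repr((split_path, (block_size, document_sep_len, str(break_mode), dataset_len)))
    return hashlib.sha1(ident.encode("utf-8")).digest()
```

Determinism matters here: every worker process must compute the same key for the same dataset without coordination.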

### Results [pre March 1]
This saves some CPU memory (5-15%), according to both `psutil` and `psrecord`:
here we run base_cmd (below) with num_workers=0,2,8, 2 GPUS and collect the logs. `branch` refers to `--use-plasma-view`, `master` uses `PlasmaArray`

```
+-------------------------+----------------+---------+-------+
| setting                 |   cpu_mem_used |     wps |   ppl |
+=========================+================+=========+=======+
| branch_nw0_gpu2_ddm.log |          12    | 55143.2 | 429.1 |
+-------------------------+----------------+---------+-------+
| branch_nw2_gpu2_ddm.log |          13.67 | 43377.6 | 429.1 |
+-------------------------+----------------+---------+-------+
| branch_nw8_gpu2_ddm.log |          18.36 | 53019.9 | 429.1 |
+-------------------------+----------------+---------+-------+
| master_nw0_gpu2_ddm.log |          12.26 | 56733   | 429.1 |
+-------------------------+----------------+---------+-------+
| master_nw2_gpu2_ddm.log |          14.58 | 53337.9 | 429.1 |
+-------------------------+----------------+---------+-------+
| master_nw8_gpu2_ddm.log |          21.1  | 53217.2 | 429.1 |
+-------------------------+----------------+---------+-------+
```

### Replication

1) get this branch
```bash
git fetch && git checkout share-plasma-server
```

2) Train tiny model and save logs

```bash

base_cmd () {
  fairseq-train --fp16 /private/home/sshleifer/data-bin/stories_mmap \
            --task language_modeling \
            --arch transformer_lm_gpt2_tiny \
            --sample-break-mode complete --tokens-per-sample 512 \
            --optimizer adam --clip-norm 0.0 --lr 0.0005 \
            --batch-size 1 \
            --max-update 200 --max-epoch 1 \
            --log-format simple --log-interval 100 \
            --restore-file x.pt --no-save \
            --skip-invalid-size-inputs-valid-test --disable-validation $@
}

USE_LOCK=1 CUDA_VISIBLE_DEVICES=0,1 base_cmd --num-workers 0 --use-plasma-view | tee branch_nw0_gpu2_ddm.log
```

### TODO:

- [x] test larger dataset
- [x] make it optional, cleanup
- [x] 1 GPU
- [x] unit-tests
- [x] ask hashing Q on stackoverflow https://stackoverflow.com/questions/66354598/deterministic-method-to-hash-np-array-int
- [ ] measure whether `PlasmaArray` disable for small array's logic helps
- [x] test with fb_sweep
- [x] measure 4 GPU savings

Pull Request resolved: fairinternal/fairseq-py#1645

Test Plan: Read github PR description: fairinternal/fairseq-py#1645

Reviewed By: myleott

Differential Revision: D26630365

Pulled By: sshleifer

fbshipit-source-id: b0c4163fbc97a7aefb116de70265fba11f6d7b42
…1690)

Summary: Pull Request resolved: fairinternal/fairseq-py#1690

Reviewed By: jmp84

Differential Revision: D27025669

Pulled By: xutaima

fbshipit-source-id: 8125365adedfdc938813d08e911e1f6ebe4f584b
… early

Summary: I had some issues with loading checkpoints from 5B parameter models (60 GB checkpoint files) due to OOM.

Reviewed By: myleott

Differential Revision: D27027616

fbshipit-source-id: 2b816e8e46ec80f0ec721aa7a6702cee531b94eb
- xmp.spawn 8 or 1 processes instead of always 8.
- util function to get the xla metrics report.
- util functions to move stuff to/from tpu.
- make utils.item a no-op for xla; it is not critical on xla and causes a big performance hit.
- util function to check if a tensor is on an xla device.
- util function to do torch.index_put efficiently on xla.
- add util function to mark step and send a given tensor/container to cpu.
  - instead of 1 transfer per tensor (N total) in `logging_outputs`, we can do 1 transfer total.
- remove redundant mark_step's
- remove redundant compilation check on each device. XLA metrics are global even if they come from one device.
- s/GPU/device/g
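The "move stuff to/from tpu" and "send a given tensor/container to cpu" utilities both reduce to one recursive walk over nested containers. A framework-free sketch (in the real code `fn` would move a torch tensor to or from the xla device):

```python
def apply_to_container(fn, obj):
    """Recursively apply `fn` to every leaf of a nested dict/list/tuple,
    preserving container types. With fn = "move to device", one call
    relocates an entire logging_outputs structure at once."""
    if isinstance(obj, dict):
        return {k: apply_to_container(fn, v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(apply_to_container(fn, v) for v in obj)
    return fn(obj)
```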
- XLA compiles every time it sees a new graph, this includes dynamic
input shapes.
  - This commit introduces bucketing to raw_audio_dataset.
  - Tweaks bucket_pad_length_dataset and data_utils.py to enable this.
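The bucketing idea can be sketched in a few lines (names are illustrative, not the `BucketPadLengthDataset` API): round every raw-audio length up to one of a small fixed set of bucket sizes, so XLA compiles at most one graph per bucket instead of one per distinct utterance length.

```python
import bisect

def bucket_length(length, buckets):
    """Round `length` up to the nearest bucket size in the sorted list
    `buckets`, so the padded input takes only len(buckets) distinct
    shapes. Raises if the sample is longer than the largest bucket."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]
```

The trade-off is a little extra padding per sample in exchange for a bounded number of compilations.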
- Computing mask indices in wav2vec2's `forward` is costly on XLA.
  - Move the computation to the data preparation phase: optional for gpus, forced for tpus.
- Use the util functions from previous commits in order to route the XLA codepath better.
- In model
  - Compute mask_indices only if it's not pre-computed in data prep phase.
  - Remove the dynamicity in model's forward caused by mask_indices.
    - adjust loss computation in criterion accordingly.
  - Adjust sampling of negatives, by integrating the padding_count that
  comes from data prep phase.
    - future work; sampling of negatives could also be taken out of
    model and to the data prep phase. I experimented w/ this and
    observed speed gains.
  - Copy hydra params from model to task, in order for dataset's to have
  the necessary mask arguments to enable mask indices creation.
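A much-simplified sketch of what precomputing span masks in the data pipeline looks like (this is not fairseq's `compute_mask_indices`, which handles overlap and min-space constraints; the point is that a fixed number of spans yields a static shape for the on-device graph):

```python
import numpy as np

def precompute_mask_indices(seq_len, mask_prob, mask_length, rng):
    """Pick a fixed number of span starts and mark `mask_length` frames
    from each. Doing this on the host keeps dynamic shapes out of the
    model's forward(), which would otherwise trigger XLA recompilation."""
    num_spans = int(mask_prob * seq_len / mask_length)
    mask = np.zeros(seq_len, dtype=bool)
    starts = rng.choice(seq_len - mask_length + 1, size=num_spans, replace=False)
    for s in starts:
        mask[s : s + mask_length] = True
    return mask
```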
Per the previous commit, the audio_pretraining task tries to copy mask preparation related arguments to pass on to fairseq_dataset.

For the downstream finetuning job, fairseq uses the same task, and even though
the task arguments are optional, it errors when it tries to copy them from the
model and can't (for a GPU-built model). Maybe there's a better way to do this
in hydra, by passing a kwarg to `II`?