Description
I ran the pipedream2bw branch with 6 pipeline stages on 48 GPUs, and the loss went to NaN after about 16k steps. I used the following arguments:
```bash
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt2.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 6 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 0.00001 \
    --lr-warmup-fraction 0.01 \
    --data-path $DATA_PATH \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 1 \
    --clip-grad 1.0 \
    --fp16 \
    --DDP-impl local \
    --loss-scale 16384 \
    --apply-query-key-layer-scaling \
    --bias-gelu-fusion \
    --bias-dropout-fusion \
    --exit-interval 320000 \
    --save $CHECKPOINT_PATH \
    --save-interval 300 \
    --load $CHECKPOINT_PATH \
    --max-num-ckpts 16 \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
```
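For reference, here is the parallel layout I believe these flags imply (a quick sanity check, assuming all 48 GPUs participate):

```python
# Parallel dimensions implied by the flags above.
world_size = 48
tensor_mp = 1       # --tensor-model-parallel-size
pipeline_mp = 6     # --pipeline-model-parallel-size
data_parallel = world_size // (tensor_mp * pipeline_mp)  # 8 replicas

micro_batch = 4     # --micro-batch-size
global_batch = 512  # --global-batch-size
# Micro-batches each pipeline processes per global batch:
num_microbatches = global_batch // (micro_batch * data_parallel)  # 16

print(data_parallel, num_microbatches)  # 8 16
```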
Am I invoking 2BW training correctly? Also, in forward_step in pretrain_gpt2.py, the loss is averaged across data-parallel workers on every micro-batch. Can these reductions be combined so they happen only once per global batch?
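To make the second question concrete, the per-micro-batch all-reduce only seems needed for the logged value, so it could in principle be deferred. A minimal sketch of what I have in mind (the function names here are hypothetical, not existing Megatron APIs; only the final all-reduce touches the network):

```python
import torch
import torch.distributed as dist

def forward_step_local(loss, accum):
    # Instead of all-reducing across the data-parallel group here
    # (as forward_step does now), just accumulate the loss locally.
    accum.append(loss.detach())
    return loss

def report_batch_loss(accum, dp_group):
    # One all-reduce per global batch rather than one per micro-batch.
    batch_loss = torch.stack(accum).mean()
    dist.all_reduce(batch_loss, group=dp_group)       # sum over replicas
    batch_loss /= dist.get_world_size(group=dp_group)  # then average
    accum.clear()
    return batch_loss.item()
```

With 16 micro-batches per batch (per the arithmetic above), this would cut the logging all-reduces by 16x; the training gradients themselves would be unaffected since the reduction is only for reporting.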