
GPT2 355M model convergence with 2BW training #64

@nitikasaran68


I ran the pipedream2bw branch with 6 pipeline stages on 48 GPUs, and the loss went to NaN after about 16k steps. I used the following arguments:

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt2.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 6 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 0.00001 \
    --lr-warmup-fraction 0.01 \
    --data-path $DATA_PATH \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 1 \
    --clip-grad 1.0 \
    --fp16 \
    --DDP-impl local \
    --loss-scale 16384 \
    --apply-query-key-layer-scaling \
    --bias-gelu-fusion \
    --bias-dropout-fusion \
    --exit-interval 320000 \
    --save $CHECKPOINT_PATH \
    --save-interval 300 \
    --load $CHECKPOINT_PATH \
    --max-num-ckpts 16 \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
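
(For what it's worth, my understanding of the batch math here: with 48 GPUs, tensor-parallel size 1 and pipeline-parallel size 6, the data-parallel size should be 48 / 6 = 8, so a global batch of 512 with micro-batch size 4 works out to 512 / (4 × 8) = 16 micro-batches per data-parallel replica per iteration. Please correct me if 2BW counts this differently.)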

Am I invoking 2BW training correctly? Also, in forward_step in pretrain_gpt2.py the loss is averaged across data-parallel workers on every micro-batch. Could these reductions be combined so they happen only once per batch?
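
To make the second question concrete, here is a rough sketch of the kind of change I have in mind (names are mine; dp_group stands in for whatever mpu.get_data_parallel_group() returns, and this only concerns the reported/logged loss, not the backward pass):

import torch
import torch.distributed as dist

def batch_averaged_loss(micro_batch_losses, dp_group=None):
    # Accumulate the per-micro-batch losses locally instead of
    # all-reducing each one across the data-parallel group.
    local_mean = torch.stack([l.detach() for l in micro_batch_losses]).mean()
    # Single all-reduce per global batch, used only for reporting.
    dist.all_reduce(local_mean, op=dist.ReduceOp.SUM, group=dp_group)
    local_mean /= dist.get_world_size(group=dp_group)
    return local_mean

That is, forward_step would just return the raw per-micro-batch loss, and the cross-worker averaging would happen once after all micro-batches of an iteration have run.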
