Description
I ran the pipedream2bw branch with 6 pipeline stages on 48 GPUs, and the loss went to NaN after about 16k steps. I used the following arguments:
```bash
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt2.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 6 \
    --scatter-gather-tensors-in-pipeline \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 0.00001 \
    --lr-warmup-fraction 0.01 \
    --data-path $DATA_PATH \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 1 \
    --clip-grad 1.0 \
    --fp16 \
    --DDP-impl local \
    --loss-scale 16384 \
    --apply-query-key-layer-scaling \
    --bias-gelu-fusion \
    --bias-dropout-fusion \
    --exit-interval 320000 \
    --save $CHECKPOINT_PATH \
    --save-interval 300 \
    --load $CHECKPOINT_PATH \
    --max-num-ckpts 16 \
    --pipeline-no-flushes \
    --checkpoint-activations --checkpoint-num-layers 1
```
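For reference, here is the parallel layout I believe these flags imply (a quick sanity check, assuming all 48 GPUs participate):

```python
# Parallel dimensions implied by the flags above.
world_size = 48
tensor_mp = 1       # --tensor-model-parallel-size
pipeline_mp = 6     # --pipeline-model-parallel-size
data_parallel = world_size // (tensor_mp * pipeline_mp)  # 8 replicas

micro_batch = 4     # --micro-batch-size
global_batch = 512  # --global-batch-size
# Micro-batches each pipeline processes per global batch:
num_microbatches = global_batch // (micro_batch * data_parallel)  # 16

print(data_parallel, num_microbatches)  # 8 16
```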
Am I invoking 2BW training correctly? Also, in forward_step in pretrain_gpt2.py, the loss is averaged across data-parallel workers on every micro-batch. Can these reductions be combined so they happen only once per global batch?
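To make the second question concrete, the per-micro-batch all-reduce only seems needed for the logged value, so it could in principle be deferred. A minimal sketch of what I have in mind (the function names here are hypothetical, not existing Megatron APIs; only the final all-reduce touches the network):

```python
import torch
import torch.distributed as dist

def forward_step_local(loss, accum):
    # Instead of all-reducing across the data-parallel group here
    # (as forward_step does now), just accumulate the loss locally.
    accum.append(loss.detach())
    return loss

def report_batch_loss(accum, dp_group):
    # One all-reduce per global batch rather than one per micro-batch.
    batch_loss = torch.stack(accum).mean()
    dist.all_reduce(batch_loss, group=dp_group)       # sum over replicas
    batch_loss /= dist.get_world_size(group=dp_group)  # then average
    accum.clear()
    return batch_loss.item()
```

With 16 micro-batches per batch (per the arithmetic above), this would cut the logging all-reduces by 16x; the training gradients themselves would be unaffected since the reduction is only for reporting.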