Multi-node, multi-GPU training gives no speedup

Hello, I’m training on Kinetics-600 with V100 GPUs, using 17-frame clips at 128×128 resolution, a latent size of 5×16×16, and a per-GPU batch size of 4. I’m puzzled because nothing I try makes training faster: using multiple nodes gives no speedup, adding more GPUs does not help, training on a smaller subset of the dataset is no faster, and gradient accumulation actually makes it slower. Have you encountered this issue?
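To make "faster" concrete, here is a minimal sketch of the quantities that could be compared across configurations. It assumes plain data-parallel training in which every GPU processes its own batch of 4 per iteration. The helper names, the dataset size of 100,000 clips, and the 0.5 s iteration time are hypothetical placeholders, not measurements from my runs.

```python
import math


def global_batch_size(per_gpu_batch, num_gpus, accum_steps=1):
    """Samples consumed per optimizer step under data parallelism."""
    return per_gpu_batch * num_gpus * accum_steps


def optimizer_steps_per_epoch(num_samples, per_gpu_batch, num_gpus, accum_steps=1):
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / global_batch_size(per_gpu_batch, num_gpus, accum_steps))


def samples_per_second(per_gpu_batch, num_gpus, seconds_per_iteration):
    """Throughput, where one iteration is one forward/backward pass on every GPU."""
    return per_gpu_batch * num_gpus / seconds_per_iteration


if __name__ == "__main__":
    num_samples = 100_000  # placeholder dataset size, NOT the real Kinetics-600 count
    sec_per_iter = 0.5     # placeholder timing, NOT a measurement

    for gpus in (1, 8, 16):
        steps = optimizer_steps_per_epoch(num_samples, per_gpu_batch=4, num_gpus=gpus)
        rate = samples_per_second(per_gpu_batch=4, num_gpus=gpus,
                                  seconds_per_iteration=sec_per_iter)
        print(f"{gpus:>2} GPUs: {steps} steps/epoch, {rate:.0f} samples/s")
```

Under these assumptions, a constant time per iteration is expected when GPUs are added, since each GPU still processes the same batch of 4. The gain should show up as fewer steps per epoch and higher samples per second. Gradient accumulation also lengthens each optimizer step by design, because it runs several forward/backward passes before updating. If the time per epoch stays flat as well, that would point to something else, possibly data loading or inter-node communication.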