Have you run into the problem where the loss quickly converges to zero within two epochs, even with very large swap noise (>0.5) or dropout? Meanwhile, the transformed features do not contain any useful information. I am not sure whether this is caused by the dataset or not...
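
To make the setup concrete, here is a minimal sketch of what "swap noise" is usually taken to mean: each cell is replaced, with probability `p`, by the value from the same column in a randomly chosen row. The function name `swap_noise`, the random toy data, and the NumPy-only implementation are assumptions for illustration, not the code actually used here.

```python
import numpy as np

def swap_noise(x, p, rng):
    """Replace each cell, with probability p, by the same column's value from a random row."""
    n_rows, n_cols = x.shape
    # Cells selected for corruption.
    mask = rng.random((n_rows, n_cols)) < p
    # For every cell, pick a donor row; the column stays the same.
    donor_rows = rng.integers(0, n_rows, size=(n_rows, n_cols))
    col_idx = np.broadcast_to(np.arange(n_cols), (n_rows, n_cols))
    swapped = x[donor_rows, col_idx]
    return np.where(mask, swapped, x)

# Toy data (hypothetical): 1000 rows, 8 continuous features.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
x_noisy = swap_noise(x, p=0.5, rng=rng)

# Fraction of cells that actually changed; useful as a sanity check
# that the noise is really being applied to the model input.
changed_fraction = float((x_noisy != x).mean())
```

Checking `changed_fraction` on a real batch is a cheap way to confirm that the corrupted input, and not the clean one, is what the network sees. With low-cardinality or mostly constant columns the fraction can be far below `p`, because many swaps pick an identical value.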