I continued training the model on the vitra-1M dataset based on human_pretrain.json, and tried learning rates of 1e-4 and 1e-5 for the action branch. In both cases, the fine-tuned model performed worse than the original weights.
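
For anyone trying to reproduce this, here is a minimal sketch of what a separate learning rate for the action branch usually looks like, assuming PyTorch-style parameter groups. The prefix `action_model.`, the helper `build_param_groups`, the base learning rate, and the toy parameter names are all hypothetical and are not taken from the actual training code or from human_pretrain.json.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    action_prefix: str,
    action_lr: float,
    base_lr: float,
) -> List[Dict[str, Any]]:
    """Split parameters into two groups with separate learning rates.

    The returned list of dicts uses the "params"/"lr" keys that
    torch.optim optimizers accept for per-parameter options.
    """
    action_params, other_params = [], []
    for name, param in named_params:
        # Hypothetical rule: the action branch is identified by a name prefix.
        if name.startswith(action_prefix):
            action_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": other_params, "lr": base_lr},
        {"params": action_params, "lr": action_lr},
    ]


# Toy stand-in for model.named_parameters(); names and values are made up.
toy_params = [
    ("backbone.layer0.weight", "w0"),
    ("action_model.head.weight", "w1"),
    ("action_model.head.bias", "b1"),
]

# base_lr is an assumed value; only the action-branch rates (1e-4, 1e-5)
# come from the report above.
groups = build_param_groups(
    toy_params, "action_model.", action_lr=1e-5, base_lr=1e-6
)
```

With a real model, `named_params` would be `model.named_parameters()` and the result could be passed directly to an optimizer such as `torch.optim.AdamW(groups)`. Checking which names actually land in the action group is a quick way to confirm that the intended learning rate is being applied where expected.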