Size mismatch when loading 60M no tri checkpoint #13

@WesleyHsieh0806

Description

Hi,

I tried to load the 60M no tri checkpoint and ran into the error below.
The config and the checkpoint appear to be misaligned: for example, the checkpoint stores 196-dimensional pair-representation and 128-dimensional conditioning tensors, while the model built from the current config expects 256 and 512.

RuntimeError: Error(s) in loading state_dict for Proteina:
        size mismatch for nn.init_repr_factory.linear_out.weight: copying a param with shape torch.Size([512, 200]) from checkpoint, the shape in current model is torch.Size([512, 132]).
        size mismatch for nn.cond_factory.feat_creators.1.embedding_C.weight: copying a param with shape torch.Size([6, 196]) from checkpoint, the shape in current model is torch.Size([6, 256]).
        size mismatch for nn.cond_factory.feat_creators.1.embedding_A.weight: copying a param with shape torch.Size([44, 196]) from checkpoint, the shape in current model is torch.Size([44, 256]).
        size mismatch for nn.cond_factory.feat_creators.1.embedding_T.weight: copying a param with shape torch.Size([1473, 196]) from checkpoint, the shape in current model is torch.Size([1473, 256]).
        size mismatch for nn.cond_factory.linear_out.weight: copying a param with shape torch.Size([128, 784]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
        size mismatch for nn.transition_c_1.swish_linear.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
        size mismatch for nn.transition_c_1.linear_out.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
        size mismatch for nn.transition_c_2.swish_linear.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
        size mismatch for nn.transition_c_2.linear_out.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
        size mismatch for nn.pair_repr_builder.init_repr_factory.ln_out.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.pair_repr_builder.init_repr_factory.ln_out.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.pair_repr_builder.init_repr_factory.linear_out.weight: copying a param with shape torch.Size([196, 319]) from checkpoint, the shape in current model is torch.Size([256, 319]).
        size mismatch for nn.pair_repr_builder.cond_factory.ln_out.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.pair_repr_builder.cond_factory.ln_out.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.pair_repr_builder.cond_factory.linear_out.weight: copying a param with shape torch.Size([128, 196]) from checkpoint, the shape in current model is torch.Size([512, 256]).
        size mismatch for nn.pair_repr_builder.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.pair_repr_builder.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.pair_repr_builder.adaln.to_gamma.0.weight: copying a param with shape torch.Size([196, 128]) from checkpoint, the shape in current model is torch.Size([256, 512]).
        size mismatch for nn.pair_repr_builder.adaln.to_gamma.0.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.pair_repr_builder.adaln.to_beta.weight: copying a param with shape torch.Size([196, 128]) from checkpoint, the shape in current model is torch.Size([256, 512]).
        size mismatch for nn.transformer_layers.0.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.0.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.0.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.0.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.0.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.0.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.1.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.1.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.1.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.1.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.1.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.2.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.2.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.2.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.2.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.2.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.3.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.3.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.3.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.3.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.3.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.4.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.4.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.4.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.4.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.4.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.5.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.5.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.5.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.5.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.5.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.6.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.6.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.6.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.6.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.6.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.7.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.7.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.7.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.7.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.7.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.8.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.8.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.8.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.8.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.8.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.mhba.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.mhba.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_qkv.weight: copying a param with shape torch.Size([1512, 512]) from checkpoint, the shape in current model is torch.Size([1536, 512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_qkv.bias: copying a param with shape torch.Size([1512]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_g.weight: copying a param with shape torch.Size([504, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_g.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_out_node.weight: copying a param with shape torch.Size([512, 504]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.q_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.q_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.k_layer_norm.weight: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.k_layer_norm.bias: copying a param with shape torch.Size([504]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.mhba.mha.to_bias.weight: copying a param with shape torch.Size([12, 196]) from checkpoint, the shape in current model is torch.Size([8, 256]).
        size mismatch for nn.transformer_layers.9.mhba.mha.pair_norm.weight: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.9.mhba.mha.pair_norm.bias: copying a param with shape torch.Size([196]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for nn.transformer_layers.9.mhba.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.transition.adaln.norm_cond.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.transition.adaln.norm_cond.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for nn.transformer_layers.9.transition.adaln.to_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.transition.adaln.to_beta.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
        size mismatch for nn.transformer_layers.9.transition.scale_output.to_adaln_zero_gamma.0.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([512, 512]).
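For reference, a minimal sketch of one way to compare the checkpoint's stored parameter shapes against the instantiated model. This assumes a Lightning-style .ckpt that keeps its weights under a "state_dict" key and that the key names already line up with `model.state_dict()`; the helper below is illustrative, not part of the repo.

```python
import torch
from torch import nn


def report_shape_mismatches(ckpt_path: str, model: nn.Module) -> None:
    """Print every parameter whose shape differs between a checkpoint and a model.

    Assumes a Lightning-style .ckpt that stores its weights under a "state_dict"
    key (falls back to treating the file as a raw state_dict) and that the key
    names already match model.state_dict().
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    ckpt_sd = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    model_sd = model.state_dict()

    for name, param in ckpt_sd.items():
        if name in model_sd and tuple(param.shape) != tuple(model_sd[name].shape):
            print(f"{name}: checkpoint {tuple(param.shape)} "
                  f"vs model {tuple(model_sd[name].shape)}")
```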
