Description
Hello,
Thank you for your great work. I am attempting to fine-tune ProteinBERT on another task, and I am trying to understand some of the strategies you followed.
First, why did you implement a different tokenization strategy? I see that you somehow included the mechanism labels during tokenization, but if these were the target labels, why not use them simply as the "second column" of the input file? (I am following the tutorial provided by the ProteinBERT developers, which is why I refer to the labels as the second column.)
Related to the previous question, I see that your input example file has many columns and differs from the native ProteinBERT fine-tuning setup, where only the sequence and the label are provided as input (I sketch the native flow below for context). I would therefore like to know how you handle the labels, and whether it is possible to use multiple features at the same time when fine-tuning the model.
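For context, this is the native flow I am referring to, adapted from the official ProteinBERT demo notebook. The file names, column names, and hyperparameter values here are placeholders, not your code:

```python
import pandas as pd
from proteinbert import (OutputType, OutputSpec, FinetuningModelGenerator,
                         load_pretrained_model, finetune)

# Tutorial-style input: one sequence column plus a single label "second column".
train_set = pd.read_csv('train.csv')  # hypothetical file with 'seq' and 'label' columns
valid_set = pd.read_csv('valid.csv')

OUTPUT_TYPE = OutputType(False, 'binary')      # one global binary label per sequence
OUTPUT_SPEC = OutputSpec(OUTPUT_TYPE, [0, 1])

pretrained_model_generator, input_encoder = load_pretrained_model()
model_generator = FinetuningModelGenerator(pretrained_model_generator, OUTPUT_SPEC,
                                           dropout_rate=0.5)

finetune(model_generator, input_encoder, OUTPUT_SPEC,
         train_set['seq'], train_set['label'],
         valid_set['seq'], valid_set['label'],
         seq_len=512, batch_size=32)
```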
Following on from this, I see that fine-tuning was performed six times; does this mean that ProteinBERT was fine-tuned separately for each feature, something like the loop I sketch below?
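This is purely my guess at what "fine-tuned six times" could mean: one independent binary fine-tuning per feature column, reusing `pretrained_model_generator`, `input_encoder`, `OUTPUT_SPEC`, `train_set`, and `valid_set` from the sketch above (the column names are invented for illustration):

```python
# Hypothetical feature columns in the training table, one binary label each.
feature_columns = ['feature_1', 'feature_2', 'feature_3',
                   'feature_4', 'feature_5', 'feature_6']
models = {}
for col in feature_columns:
    # Fresh fine-tuning head per feature, starting from the same pretrained weights.
    model_generator = FinetuningModelGenerator(pretrained_model_generator, OUTPUT_SPEC,
                                               dropout_rate=0.5)
    finetune(model_generator, input_encoder, OUTPUT_SPEC,
             train_set['seq'], train_set[col],
             valid_set['seq'], valid_set[col],
             seq_len=512, batch_size=32)
    models[col] = model_generator
```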
Now, regarding the prediction process: how is the output from the model(s) handled? (I will understand this better once I have the answer to the previous question.) I mean, the output would be six vectors of probabilities, one per feature; how is this information combined into the final prediction? I sketch one possibility below.
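For example, continuing the names from the sketches above (and with `test_seqs` as a hypothetical list of sequences), one common way to combine six per-feature probability vectors would be to stack them and either threshold each one independently or take the argmax, though I do not know whether this is what you do:

```python
import numpy as np

seq_len = 512
X = input_encoder.encode_X(test_seqs, seq_len)   # encode the test sequences once

# Query each per-feature model; each binary head yields one probability per sequence.
probs = np.column_stack([
    models[col].create_model(seq_len).predict(X).ravel()
    for col in feature_columns
])  # shape: (n_sequences, 6)

multi_label_pred = probs >= 0.5           # independent call per feature, or...
single_label_pred = probs.argmax(axis=1)  # ...pick the single highest-scoring feature
```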
Finally, you mentioned that during the final epoch (third stage) you used sequences longer than 1024 aa. My questions are: did you use only shorter sequences in the previous stages, as in the staged schedule I sketch below? I understand that ProteinBERT randomly splits sequences that are longer than 1024, so how did you then handle these longer sequences?
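For reference, this is how I currently read the built-in multi-stage schedule of `finetune()` (parameter names taken from the official demo notebook; the values are illustrative, and this may not match what you actually did):

```python
finetune(model_generator, input_encoder, OUTPUT_SPEC,
         train_set['seq'], train_set['label'],
         valid_set['seq'], valid_set['label'],
         seq_len=512,                               # stages 1-2: shorter context
         begin_with_frozen_pretrained_layers=True,  # stage 1: frozen pretrained layers
         n_final_epochs=1,                          # stage 3: one final epoch...
         final_seq_len=1024,                        # ...at the longer context length
         final_lr=1e-05)
```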
I appreciate your help and any insight about these topics.
Best regards,
Jeferyd