Code for Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models

Code for reproducing the analyses in our preprint "Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models": PREPRINT LINK HERE

For datasets and model checkpoint weights, see our Zenodo repository: ZENODO LINK HERE

Requirements

To create a conda environment named protran, build it from requirements.yml (for example, conda env create -f requirements.yml).

Scripts to extract representations from pretrained models can be found under:

run_pregen_emb.py - extracts representations from every layer in a model given a dataset and pretrained model
execute_gen_emb_carp.sh - batch script to extract representations over all datasets and pretrained models for CARP
execute_gen_emb_esm.sh - batch script to extract representations over all datasets and pretrained models for ESM

As these representations are time-consuming to extract, we provide them in our Zenodo repository (ZENODO LINK HERE).
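
The idea of keeping one representation per layer can be illustrated with a minimal, standard-library sketch. It is not the code in the extraction script: the toy layers, the function name collect_layer_outputs, and the mean-pooling step over sequence positions are all assumptions made for this example.

```python
from typing import Callable, Dict, List

Vector = List[float]
Hidden = List[Vector]  # one feature vector per sequence position
Layer = Callable[[Hidden], Hidden]


def mean_pool(hidden: Hidden) -> Vector:
    """Average over sequence positions to get one fixed-size vector."""
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]


def collect_layer_outputs(layers: List[Layer], tokens: Hidden) -> Dict[int, Vector]:
    """Pass tokens through each layer in order, keeping a pooled output per layer.

    Key 0 holds the pooled input embedding; key i holds the output of layer i.
    """
    pooled = {0: mean_pool(tokens)}
    hidden = tokens
    for i, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        pooled[i] = mean_pool(hidden)
    return pooled


# Hypothetical stand-ins for the blocks of a real pretrained model.
def double(hidden: Hidden) -> Hidden:
    return [[2.0 * x for x in row] for row in hidden]


def shift(hidden: Hidden) -> Hidden:
    return [[x + 1.0 for x in row] for row in hidden]


toy_tokens = [[1.0, 2.0], [3.0, 4.0]]  # 2 positions, 2 features each
layer_embs = collect_layer_outputs([double, shift], toy_tokens)
for layer_index, embedding in layer_embs.items():
    print(layer_index, embedding)
```

With a real model, the same loop shape would typically be obtained by asking the model for all hidden states or by registering hooks, and the per-layer vectors would be written to disk so that downstream training does not need to rerun the model.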

Scripts to train linear models for each of the downstream tasks:

For the classification tasks (secondary structure and subcellular localization), we implement models in PyTorch:

run_protran_pytorch.py - trains and evaluates classifiers for each layer in a model given a dataset and pretrained model
execute_run_pytorch_carp.sh - batches over all datasets/models for CARP
execute_run_pytorch_esm.sh - batches over all datasets/models for ESM

For the regression tasks (all other downstream tasks), we implement models in scikit-learn:

run_protran_sklearn.py - trains and evaluates regressors for each layer in a model given a dataset and pretrained model
execute_run_sklearn_carp.sh - batches over all datasets/models for CARP
execute_run_sklearn_esm.sh - batches over all datasets/models for ESM
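
The layer-wise pattern shared by the training scripts above (fit one linear model per layer, then compare held-out scores) can be sketched as follows. This is a simplified illustration rather than the repository's implementation: it uses neither PyTorch nor scikit-learn, it reduces each layer's embedding to a single scalar feature so that least squares has a closed form, and the names fit_line, mse, and probe_layers as well as the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Closed-form least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("feature is constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(x: List[float], y: List[float], slope: float, intercept: float) -> float:
    """Mean squared error of the fitted line on (x, y)."""
    return sum((slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def probe_layers(
    train_feats: Dict[int, List[float]],
    test_feats: Dict[int, List[float]],
    y_train: List[float],
    y_test: List[float],
) -> Dict[int, float]:
    """Fit one linear model per layer and return its held-out MSE."""
    scores = {}
    for layer, x_train in train_feats.items():
        slope, intercept = fit_line(x_train, y_train)
        scores[layer] = mse(test_feats[layer], y_test, slope, intercept)
    return scores


# Toy data: layer 1 is linearly related to the target, layer 0 is not.
y_train, y_test = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
train_feats = {0: [0.0, 1.0, 0.0, 1.0], 1: [2.0, 4.0, 6.0, 8.0]}
test_feats = {0: [0.0, 1.0], 1: [10.0, 12.0]}

scores = probe_layers(train_feats, test_feats, y_train, y_test)
best_layer = min(scores, key=scores.get)  # lowest held-out error
print(scores, best_layer)
```

In practice each layer contributes a full embedding vector rather than one scalar, so a regularized multivariate linear model would replace fit_line, but the per-layer loop and the comparison of held-out scores keep the same shape.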

Scripts to reproduce our analysis:

To produce the plots shown in our manuscript, run run_results_analysis.py
