A Hugging Face compatible version of Uni-RNA, designed to be more efficient and easier to use.
If you unpacked the code from a compressed archive, please run `git init` to initialize the git repository; we rely on git information to track the code version.
```bash
conda create -n unirna python=3.10
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install -e .
```
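As a quick sanity check (not part of the official instructions), you can confirm that the editable install and the CUDA wheel were picked up:

```python
# Quick post-install sanity check (an assumed workflow, not from the original docs).
import unirna_tf   # succeeds if `pip install -e .` worked
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if the cu126 wheel and a GPU are available
```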
- Released checkpoints: L8 / L12 / L16. The default configs use 64-dimensional attention heads, a GELU feed-forward block of width `3 * hidden_size`, rotary position embeddings, and a vocabulary size of 10 (see the config check after the table).
| Model | Layers | Hidden dim | Heads | FFN dim | Notes |
|---|---|---|---|---|---|
| L8 | 8 | 512 | 8 | 1536 | Fastest, lightest |
| L12 | 12 | 768 | 12 | 2304 | Balanced |
| L16 | 16 | 1024 | 16 | 3072 | Highest capacity |
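To verify these dimensions for a downloaded checkpoint, you can inspect its config (a minimal sketch; the attribute names below follow standard Hugging Face conventions and are assumptions about this package):

```python
import unirna_tf  # assumed to register the Uni-RNA classes with transformers
from transformers import AutoConfig

# Path assumes the weights were extracted as described below.
config = AutoConfig.from_pretrained("./weights/unirna_L16")

# Standard Hugging Face config attribute names (assumed to apply here).
print(config.num_hidden_layers)    # 16 for the L16 checkpoint
print(config.hidden_size)          # 1024
print(config.num_attention_heads)  # 16 -> 64-dimensional heads
print(config.intermediate_size)    # 3072 = 3 * hidden_size
print(config.vocab_size)           # 10
```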
`UniRNAForMaskedLM` (a MaskedLM-style class): outputs the MLM head `logits` for each position (shape `[batch, seq_len + 2, vocab_size]`), representing per-token prediction scores; `loss` is provided when `labels` are supplied.
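A short usage sketch for the MaskedLM head (assumptions: `unirna_tf` registers the class with the `AutoModelForMaskedLM` factory, and the weights live under `./weights/unirna_L16` as described below):

```python
import torch
import unirna_tf  # assumed to register UniRNAForMaskedLM with transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("./weights/unirna_L16")
model = AutoModelForMaskedLM.from_pretrained("./weights/unirna_L16")

inputs = tokenizer("AUCGGUGACA", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [batch, seq_len + 2, vocab_size]; the +2 comes from the special tokens
# the tokenizer adds around the sequence.
print(outputs.logits.shape)
```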
We provide a Jupyter notebook demonstrating how to use the pretrained model; you can find it in the examples directory.
For the model weights, please download the archive from Google Drive, copy it to the root directory of the project, and run:

```bash
tar -zxvf weights.tar.gz
```

The model weights will be extracted into the `weights` directory.
!!! You must convert the sequence string to uppercase before passing it to the model !!!
The sequence "AUcg" is different from "AUCG": all lowercase letters will be merged and converted to the `unk_token` by the tokenizer.
```python
import torch
import unirna_tf  # registers the Uni-RNA model classes with transformers
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./weights/unirna_L16")
model = AutoModel.from_pretrained("./weights/unirna_L16")

seq = "AUCGGUGACA"
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)

# To also return the attention weights, set output_attentions=True:
outputs = model(**inputs, output_attentions=True)

# To get the sequence embeddings without tracking gradients:
with torch.no_grad():
    outputs = model(**inputs)
sequence_embeddings = outputs.last_hidden_state
```
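If you need one vector per sequence rather than per-token embeddings, a common approach is masked mean pooling over the token dimension (a minimal sketch continuing from the snippet above; this is an assumed pooling strategy, not necessarily what `unirna_tf/infer.py` does):

```python
# Masked mean pooling: average the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # [batch, seq_len + 2, 1]
summed = (sequence_embeddings * mask).sum(dim=1)       # [batch, hidden_size]
per_sequence_embedding = summed / mask.sum(dim=1)      # [batch, hidden_size]
print(per_sequence_embedding.shape)                    # torch.Size([1, 1024]) for L16
```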
Prepare a FASTA file in the same format as the `example/fasta/example_0.fasta` file; it should contain the sequences you want to embed. The following command automatically collects all FASTA files in the `example/fasta` directory and extracts the embedding for each sequence:

```bash
python unirna_tf/infer.py --fasta_path example/fasta --output_dir example/output --batch_size 1 --concurrency 1 --pretrained_path weights/unirna_L16
```

`--concurrency` is the number of worker threads to use, which should match the number of GPUs you want to use. `--batch_size` is the batch size for each thread and depends on the GPU memory of your machine. `--pretrained_path` is the path to the pretrained model.
For commercial inquiries, please contact wenh@aisi.ac.cn.