Uni-RNA: The Large-Scale Pre-Trained Model for RNA Science

License: CC BY-NC 4.0

The Hugging Face-compatible version of Uni-RNA, designed to be more efficient and easier to use.

Installation

If you unzipped the code from a compressed archive, please run git init to initialize the git repository; we need the git info to track the version of the code.

conda create -n unirna python=3.10
conda activate unirna
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install -e .

Model summary

  • Released checkpoints: L8 / L12 / L16. All default configs use 64-dimensional attention heads, a GELU feed-forward block of width 3 * hidden_size, rotary position embeddings, and a vocabulary size of 10.
| Model | Layers | Hidden dim | Heads | FFN dim | Notes             |
|-------|--------|------------|-------|---------|-------------------|
| L8    | 8      | 512        | 8     | 1536    | Fastest, lightest |
| L12   | 12     | 768        | 12    | 2304    | Balanced          |
| L16   | 16     | 1024       | 16    | 3072    | Highest capacity  |
  • UniRNAForMaskedLM (a MaskedLM-style class): outputs the MLM head logits for each position (shape [batch, seq_len+2, vocab_size]), which a softmax turns into per-token prediction probabilities; a loss is also returned when labels are supplied (see the sketch below).
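
A minimal masked-LM sketch, assuming the MLM head can be loaded through transformers' AutoModelForMaskedLM from the same weights directory (the exact loading path may differ in your setup):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

import unirna_tf  # importing the package makes the Uni-RNA classes available to transformers

tokenizer = AutoTokenizer.from_pretrained("./weights/unirna_L16")
mlm_model = AutoModelForMaskedLM.from_pretrained("./weights/unirna_L16")

inputs = tokenizer("AUCGGUGACA", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits    # [batch, seq_len + 2, vocab_size]

probs = logits.softmax(dim=-1)             # per-position token probabilities
predicted_ids = probs.argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist()))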

How to use

We provide a Jupyter notebook demonstrating how to use the pretrained model; you can find it in the examples directory.

For model weights, please download the archive from Google Drive, copy it to the root directory of the project, and run: tar -zxvf weights.tar.gz. The model weights will be extracted into the weights directory.

Quick Start

!!! You must convert the sequence string to uppercase before passing it to the model !!!

Sequence "AUcg" is different from "AUCG", all the lowercase letters will be merged and converted to unk_token in the tokenizer.

import torch
import unirna_tf
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./weights/unirna_L16")
model = AutoModel.from_pretrained("./weights/unirna_L16")

seq = "AUCGGUGACA"
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)

# to also return attention weights, set output_attentions=True
outputs = model(**inputs, output_attentions=True)

# to get the sequence embeddings without tracking gradients:
with torch.no_grad():
    outputs = model(**inputs)
    sequence_embeddings = outputs.last_hidden_state
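
If you need one fixed-size vector per sequence, a common choice is to average the token embeddings. A minimal sketch for a single, unpadded sequence, assuming the first and last positions are special tokens (consistent with the seq_len + 2 output shape noted above):

# average over the RNA tokens only, dropping the two special-token positions
# (for padded batches, also mask out padding before averaging)
per_token = outputs.last_hidden_state                  # [batch, seq_len + 2, hidden_dim]
sequence_vector = per_token[:, 1:-1, :].mean(dim=1)    # [batch, hidden_dim]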

Ultra-fast embedding inference

Prepare the data

Prepare a FASTA file in the same format as example/fasta/example_0.fasta, containing the sequences you want to embed. The command below automatically collects all FASTA files in the example/fasta directory and extracts an embedding for each sequence.
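
For reference, a FASTA file is plain text with a ">" header line followed by the (uppercase) sequence; the actual headers and sequences in example_0.fasta may differ:

>seq_1
AUCGGUGACA
>seq_2
GGGAAACCCUUUGGG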

Run your inference

python unirna_tf/infer.py --fasta_path example/fasta --output_dir example/output --batch_size 1 --concurrency 1 --pretrained_path weights/unirna_L16

--concurrency is the number of worker threads, which should match the number of GPUs you want to use. --batch_size is the batch size per thread; choose it according to the GPU memory of your machine. --pretrained_path is the path to the pretrained model weights.
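
For example, on a machine with 4 GPUs you might run the same command with larger values (the numbers here are illustrative; adjust the batch size to your GPU memory):

python unirna_tf/infer.py --fasta_path example/fasta --output_dir example/output --batch_size 32 --concurrency 4 --pretrained_path weights/unirna_L16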

Acknowledgments

For commercial inquiries, please contact wenh@aisi.ac.cn
