Kirigami: large convolutional kernels improve deep learning-based RNA secondary structure prediction
Kirigami is a state-of-the-art (SOTA) AI model for RNA secondary structure prediction. On a standardized test set from bpRNA, Kirigami exceeds the performance of other programs like SPOT-RNA, MXfold2, and UFold.
The easiest way to download and interact with Kirigami is via PyTorch Hub. Simply run
import torch
model = torch.hub.load('marc-harary/kirigami', 'kirigami', pretrained=True)
For a given FASTA sequence, run
model('GGGGCGAGCUGCAGCCCCAGUGAAUCAAGUGCAGC')
# '.((((........))))..................'
to invoke a convenience __call__ method that embeds the FASTA string and returns a prediction in dot-bracket notation (DBN).
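To make the dot-bracket output easier to work with downstream, the following minimal sketch converts a DBN string into a list of paired positions. The helper name `dbn_to_pairs` is hypothetical and not part of Kirigami. It assumes only round brackets appear, so pseudoknot symbols such as `[` and `]` are treated as unpaired.

```python
from typing import List, Tuple


def dbn_to_pairs(dbn: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for matched parentheses.

    Hypothetical helper for illustration; any character other than
    '(' or ')' is treated as an unpaired position.
    """
    stack: List[int] = []              # indices of '(' awaiting a partner
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(dbn):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# The prediction shown above, as returned for the example sequence.
example_pairs = dbn_to_pairs(".((((........))))..................")
print(example_pairs)
```

Each tuple gives the 0-based indices of two nucleotides predicted to pair, which is a convenient form for comparing a prediction against a reference structure.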
All experiments were performed via PyTorch Lightning. Although the weights of the production model are located at weights/main.ckpt, Kirigami can be retrained with varying hyperparameters. Run
python run.py --help
for an exhaustive list of configurations, displayed via Lightning's CLI. The appropriate configuration files are located in configs.
Data used for training, validation, and testing are taken from the bpRNA database in the form of the standard TR0, VL0, and TS0 datasets used by SPOT-RNA, MXfold2, and UFold. These contain 10,814, 1,300, and 1,305 non-redundant structures, respectively. The .dbn files located in this repo were generated by scraping the data originally uploaded by the authors of SPOT-RNA. The RNAStrAlign, archiveII, bpRNAnew, and bpRNAnew_mutate datasets, scraped from UFold, are likewise in the data directory.
From Wikipedia:
Kirigami (切り紙) is a variation of origami, the Japanese art of folding paper. In kirigami, the paper is cut as well as being folded, resulting in a three-dimensional design that stands away from the page.
The Kirigami pipeline both folds RNA molecules via a fully convolutional neural network (FCN) and uses Nussinov-style dynamic programming to recursively cut them into subsequences for post-processing.
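For readers unfamiliar with Nussinov-style dynamic programming, here is a minimal sketch of the classic Nussinov recurrence, which maximizes the number of nested base pairs by recursively splitting an interval into subsequences. It is not Kirigami's post-processing code: the function name `nussinov`, the `min_loop` parameter, and the set of allowed pairs (Watson-Crick plus G-U wobble) are assumptions made for illustration, and the real pipeline works from the network's output rather than from the sequence alone.

```python
from typing import List, Tuple

# Assumed pairing rules for this sketch: Watson-Crick plus G-U wobble.
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> List[Tuple[int, int]]:
    """Return one maximum-cardinality set of nested base pairs (0-based)."""
    n = len(seq)
    # dp[i][j] = maximum number of pairs within seq[i..j], inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case 1: j is unpaired
            for k in range(i, j - min_loop):         # case 2: j pairs with k
                if (seq[k], seq[j]) in ALLOWED:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: cut each interval into the subintervals that produced dp[i][j].
    pairs: List[Tuple[int, int]] = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                                 # too short to hold a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in ALLOWED:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))     # subsequence left of k
                    stack.append((k + 1, j - 1))     # subsequence enclosed by (k, j)
                    break
    return sorted(pairs)


print(nussinov("GGGAAAACCC"))
```

The traceback shows the "cutting" idea: once a pair (k, j) is chosen, the problem splits into the subsequence to the left of k and the subsequence enclosed by the pair, and each is solved independently.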