RNA-MSM

Multiple sequence-alignment-based RNA language model and its application to structural inference

This repository contains the code and pre-trained weights for the MSA RNA language model (RNA-MSM), the downstream RNA secondary structure and solvent accessibility tasks, and the corresponding RNA datasets.

RNA-MSM is the first unsupervised MSA RNA language model based on aligned homologous sequences. It outputs both embeddings and attention maps to suit different types of downstream tasks.

The resulting RNA-MSM model produces attention maps and embeddings that correlate directly with RNA secondary structure and solvent accessibility without supervised training. Further supervised training led to predicted secondary structure and solvent accessibility that are significantly more accurate than current state-of-the-art techniques. Unlike many previous studies, we took great care to avoid overtraining, a significant problem in applying deep learning to RNA, by choosing validation and test sets that are structurally different from the training set.

Pre-requisites

Create Environment with Anaconda

Download this repository and create the RNA-MSM environment.

git clone git@github.com:yikunpku/RNA-MSM.git
cd ./RNA-MSM
conda env create -f environment.yaml
conda activate RNA-MSM

Data Preparation

Pretrain Data

The RNA-MSM model operates on homologous RNA sequences (a multiple sequence alignment; MSA), which contain information about conserved properties, co-evolution, and evolutionary relationships between species (phylogenetics) in the nucleotide sequences of the constituent RNAs.

The effectiveness of predictions made by the RNA-MSM model is largely dependent on the quantity and quality of MSAs. Therefore, we recommend utilizing our recently developed RNAcmap3 tool to search for homologous sequences of the target RNA sequences to serve as input for the RNA-MSM model.

You can also use our online web server: submit a target sequence, and you will receive the MSA files and the prediction results for the two downstream tasks by email.

The input MSA file should be placed in the ./results folder, and its suffix must be .a2m_msa2.
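Before running inference, it can help to confirm that every target has an MSA file in the expected place. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that rna_id.txt lists one RNA ID per line, which the README does not state explicitly.

```python
from pathlib import Path

MSA_SUFFIX = ".a2m_msa2"  # suffix required for input MSA files


def read_ids(id_file):
    """Read RNA IDs, assuming one ID per line (an assumption about rna_id.txt)."""
    lines = Path(id_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def missing_msas(ids, msa_dir="./results"):
    """Return the IDs that have no <id>.a2m_msa2 file inside msa_dir."""
    msa_dir = Path(msa_dir)
    return [
        rna_id
        for rna_id in ids
        if not (msa_dir / f"{rna_id}{MSA_SUFFIX}").is_file()
    ]
```

For example, `missing_msas(read_ids("rna_id.txt"))` returns an empty list when every listed target has its MSA file in ./results.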

Downstream Task Data

The training, validation, and testing datasets used for our downstream tasks are currently available to the public and can be downloaded via this link.

Access Pre-trained Model

Download the pre-trained models and place the .ckpt files into the ./pretrained folder.

Inference

Feature Extraction

The following command can be used to extract the embedding and attention map features of the target RNA sequence:

python RNA_MSM_Inference.py \
data.root_path=./ \
data.MSA_path=./results \
data.model_path=./pretrained \
data.MSA_list=rna_id.txt

Generated files are saved at data.root_path/data.MSA_path

The RNA-MSM model inference results include 2 files:

*_atp.npy: Attention head weights of the target RNA sequence generated by our RNA-MSM model, with dimension (seq_len, seq_len, 120), saved in .npy format. You can apply this attention feature to your own tasks.
*_emb.npy: Embedding representation of the target RNA sequence generated by our RNA-MSM model, with dimension (seq_len, 768), saved in .npy format. You can apply this embedding feature to your own tasks.
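As a starting point for using these features in your own tasks, the sketch below loads both files with NumPy and checks the shapes described above. The function names are hypothetical, and averaging the 120 attention channels into one symmetric map is only an illustrative choice, not the procedure used by the downstream models in this repository.

```python
import numpy as np


def load_features(prefix):
    """Load <prefix>_emb.npy and <prefix>_atp.npy and check that their shapes agree."""
    emb = np.load(f"{prefix}_emb.npy")  # expected shape: (seq_len, 768)
    atp = np.load(f"{prefix}_atp.npy")  # expected shape: (seq_len, seq_len, 120)
    seq_len = emb.shape[0]
    if emb.ndim != 2 or atp.ndim != 3 or atp.shape[:2] != (seq_len, seq_len):
        raise ValueError(f"unexpected shapes: emb {emb.shape}, atp {atp.shape}")
    return emb, atp


def mean_attention_map(atp):
    """Average over the last axis and symmetrize to get one (seq_len, seq_len) map."""
    avg = atp.mean(axis=-1)
    return 0.5 * (avg + avg.T)
```

For the example target, `load_features("./results/2DRB_1")` would read 2DRB_1_emb.npy and 2DRB_1_atp.npy.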

Downstream Prediction - RNA Secondary Structure (SS)

cd ./_downstream_tasks/SS
python predict.py \
--rnaid 2DRB_1 \
--device cpu \
--featdir ./results

The following arguments need to be specified:

--rnaid: target RNA name, e.g. 2DRB_1

--device: run inference on GPU or CPU

--featdir: directory containing the feature extraction (inference) output

Generated files are saved at data.root_path/data.MSA_path

RNA secondary structure prediction results include 3 files:

*.ct: CT file. The connect format is column based. The first column specifies the sequence index, starting at one. The second column contains the base in one-letter notation. Columns 3, 4, and 6 redundantly give sequence indices (plus/minus one). Column 5 specifies the pairing partner of this base if it is involved in a base pair. If the base is unpaired, this column is zero.
*.bpseq: The structural information in the bpseq format is given in three columns. The first column contains the sequence position, starting at one. The second column contains the base in one-letter notation. The third column contains the pairing partner of the base if the base is paired. If the base is unpaired, the third column is zero.
*.prob: a 2-dimensional matrix that contains the probabilities of all base pairs.
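The three-column bpseq layout described above is simple to read programmatically. The following is a minimal sketch with hypothetical helper names; it assumes whitespace-separated columns and skips any line that does not start with a numeric position. The dot-bracket rendering is only meaningful for nested structures, since a single bracket type cannot represent pseudoknots.

```python
def parse_bpseq(text):
    """Parse bpseq text into (sequence, pairs); pairs are sorted 1-based (i, j) with i < j."""
    bases = {}
    pairs = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any header or comment lines.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        pos, base, partner = int(fields[0]), fields[1], int(fields[2])
        bases[pos] = base
        if partner != 0:  # zero marks an unpaired base
            pairs.add((min(pos, partner), max(pos, partner)))
    sequence = "".join(bases[i] for i in sorted(bases))
    return sequence, sorted(pairs)


def to_dot_bracket(length, pairs):
    """Render pairs as dot-bracket notation (assumes nested pairs, no pseudoknots)."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i - 1] = "("
        symbols[j - 1] = ")"
    return "".join(symbols)
```

A file can be read with `parse_bpseq(open("./results/SS_result/2DRB_1.bpseq").read())`, using the path from the example run.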

Downstream Prediction - RNA Solvent Accessibility Prediction (RSA)

cd ./_downstream_tasks/RSA
python predict.py \
--rnaid 2DRB_1 \
--device cpu \
--featdir ./results

Generated files are saved at data.root_path/data.MSA_path

Solvent accessibility prediction results include 6 files:

*_asa.png: Graph of ASA predicted by the ensemble model.
*_rsa.png: Graph of RSA predicted by the ensemble model.
Results predicted by single models: model_0 is the best single model; the other 2 files come from the remaining models.
Results predicted by the ensemble model: ensemble holds the results predicted by the ensemble model.

Results

The final result directory looks as follows:

./results
|-- 2DRB_1.a2m_msa2
|-- 2DRB_1_atp.npy
|-- 2DRB_1_emb.npy
|-- RSA_result
|   |-- 2DRB_1_asa.png
|   |-- 2DRB_1_rsa.png
|   |-- ensemble
|   |   `-- 2DRB_1.txt
|   |-- model_0
|   |   `-- 2DRB_1.txt
|   |-- model_1
|   |   `-- 2DRB_1.txt
|   `-- model_2
|       `-- 2DRB_1.txt
`-- SS_result
    |-- 2DRB_1.bpseq
    |-- 2DRB_1.ct
    `-- 2DRB_1.prob
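To confirm that a full run (feature extraction plus both downstream tasks) produced everything in the tree above, the file list can be checked with a short script. This is a sketch with hypothetical helper names that simply mirrors the layout shown for 2DRB_1; it assumes other targets follow the same naming pattern.

```python
from pathlib import Path


def expected_outputs(rna_id, results_dir="./results"):
    """List the files shown in the result tree for one RNA ID."""
    root = Path(results_dir)
    rsa = root / "RSA_result"
    ss = root / "SS_result"
    files = [
        root / f"{rna_id}.a2m_msa2",  # input MSA
        root / f"{rna_id}_atp.npy",   # attention features
        root / f"{rna_id}_emb.npy",   # embedding features
        rsa / f"{rna_id}_asa.png",
        rsa / f"{rna_id}_rsa.png",
    ]
    # One text file per RSA model directory, as in the tree.
    for model in ("ensemble", "model_0", "model_1", "model_2"):
        files.append(rsa / model / f"{rna_id}.txt")
    # Secondary structure outputs.
    for suffix in (".bpseq", ".ct", ".prob"):
        files.append(ss / f"{rna_id}{suffix}")
    return files


def missing_outputs(rna_id, results_dir="./results"):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(rna_id, results_dir) if not p.is_file()]
```

Calling `missing_outputs("2DRB_1")` after the three commands above should return an empty list if every step completed.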

Online RNA-MSM Server

We also built a freely accessible web server for the RNA-MSM models. You can submit tasks to the server and receive the results by email, without configuring the environment or using your own computational resources.

As a preview, take a swift glance at the website:

Reference

If you find our work useful in your research, or if you use parts of this code, please consider citing our paper:

@article{zhang2023multiple,
  title={Multiple sequence-alignment-based RNA language model and its application to structural inference},
  author={Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and others},
  journal={bioRxiv},
  pages={2023--03},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
