All data needed to reproduce the results can be found here. Descriptions of the data files can be found in `DATA.md`. Trained model weights can be accessed here.
All results can be reproduced by running the `.ipynb` notebooks in the `scripts` folder. The code can be run either on Google Colab or on a local machine. On Google Colab, simply upload the notebooks one at a time and run the `pip install` cells to install the required packages as needed. If running the code locally, one can create a conda environment from `lazbfenv.yaml`, which lists the versions of all software libraries used in this work (see the example below).
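For example (a minimal sketch; the environment name is whatever `lazbfenv.yaml` defines, and `lazbfenv` below is an assumption):

```bash
# Create and activate the conda environment from the provided yaml.
# The environment name is set inside lazbfenv.yaml; "lazbfenv" is assumed here.
conda env create -f lazbfenv.yaml
conda activate lazbfenv
```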
The Jupyter notebooks are numbered in the order in which they should be run (i.e., start with `1_VanillaESMEmbeddings.ipynb`). Each notebook contains comments that guide the user through the code. Below is a brief description of each notebook and its purpose; illustrative code sketches for several of these steps follow the list.
- `1_VanillaESMEmbeddings.ipynb`: Code used to extract LazBF/DEF sequence representations from Vanilla-ESM (see the embedding-extraction sketch after this list).
- `2_LazBFESMEmbeddings.ipynb`: Code used to train LazBF-ESM and extract LazBF/DEF sequence representations from LazBF-ESM.
- `3_LazDEFESMEmbeddings.ipynb`: Code used to train LazDEF-ESM and extract LazBF/DEF sequence representations from LazDEF-ESM.
- `4_PeptideESMEmbeddings.ipynb`: Code used to extract LazBF/DEF sequence representations from Peptide-ESM.
- `5_LazBCDEF.ipynb`: Code used to train LazBCDEF-ESM and extract LazBF/DEF sequence representations from LazBCDEF-ESM.
- `6_DownstreamModelTraining.ipynb`: Code used to train LazBF/DEF substrate classification models on embeddings from each of the five language models for the high-, medium-, and low-N conditions (see the classifier sketch below).
- `7_tsne.ipynb`: Code for t-SNE visualization of language model embeddings (see the t-SNE sketch below).
- `8_FineTuning.ipynb`: Code for fine-tuning the 35M- and 650M-parameter versions of ESM-2 for LazBF/DEF/BCDEF substrate prediction (see the fine-tuning sketch below).
- `9_Interpretation_650M.ipynb`: Code for zero-shot prediction with the 650M-parameter models and code to reproduce Figure 7 (see the zero-shot sketch below).
- `10_interpretation_35M.ipynb`: Code for zero-shot prediction with the 35M-parameter models and code to reproduce Figures 8, S3, and S4.
- `Figures.ipynb`: Code to reproduce Figures 4 and 6. Data were collected from `6_DownstreamModelTraining.ipynb`.
- `PeptideESMTraining.ipynb`: Pretraining code for the Peptide-ESM model described in the paper.
- `OptionalDataPreprocessing.ipynb`: Code for data preprocessing. Optional, since preprocessed sequences are provided; see `DATA.md`.
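To make the list above concrete, the sketches below illustrate, under stated assumptions, the kind of code each group of notebooks contains; they are illustrations, not the notebooks' exact implementations. First, embedding extraction (notebooks 1-5): a minimal sketch assuming the Hugging Face `transformers` ESM-2 checkpoints and mean-pooled final-layer representations; the checkpoint name and toy sequences are placeholders.

```python
# Minimal sketch of ESM-2 embedding extraction (cf. notebooks 1-5).
# Assumptions: Hugging Face transformers checkpoints and mean pooling;
# the actual notebooks may load models and pool representations differently.
import torch
from transformers import AutoTokenizer, EsmModel

MODEL = "facebook/esm2_t33_650M_UR50D"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmModel.from_pretrained(MODEL).eval()

peptides = ["MKTAYIAKQR", "GSHMLEDPAT"]  # toy sequences, not from the dataset
batch = tokenizer(peptides, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch)

# Mean-pool residue embeddings (excluding padding) into one vector per peptide.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 1280) for the 650M model
```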
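Next, downstream classifier training (notebook 6): a sketch with hypothetical file names (`lazbf_embeddings.npy`, `lazbf_labels.npy`) and scikit-learn logistic regression standing in for whatever classifier the notebook actually trains.

```python
# Sketch of downstream substrate classification on frozen embeddings (cf. notebook 6).
# File names are hypothetical; logistic regression is a stand-in classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.load("lazbf_embeddings.npy")  # (n_sequences, hidden_dim)
y = np.load("lazbf_labels.npy")      # 1 = substrate, 0 = non-substrate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```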
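For notebook 7, a minimal t-SNE sketch using scikit-learn and matplotlib, again with the hypothetical file names above.

```python
# Sketch of a 2-D t-SNE projection of language-model embeddings (cf. notebook 7).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

X = np.load("lazbf_embeddings.npy")  # hypothetical file name
y = np.load("lazbf_labels.npy")      # hypothetical file name

coords = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=5, cmap="coolwarm")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```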
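For notebook 8, a sketch of fine-tuning ESM-2 for binary substrate classification with the `transformers` `Trainer`; the checkpoint, toy data, and hyperparameters are placeholders.

```python
# Sketch of fine-tuning ESM-2 for substrate classification (cf. notebook 8).
# Checkpoint, toy data, and hyperparameters are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, EsmForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL = "facebook/esm2_t12_35M_UR50D"  # 35M-parameter checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForSequenceClassification.from_pretrained(MODEL, num_labels=2)

class PeptideDataset(Dataset):
    """Tokenized peptides with binary substrate labels."""
    def __init__(self, seqs, labels):
        self.enc = tokenizer(seqs, padding=True, return_tensors="pt")
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = PeptideDataset(["MKTAYIAKQR", "GSHMLEDPAT"], [1, 0])  # toy data
args = TrainingArguments(output_dir="ft_out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```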
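Finally, for notebooks 9 and 10, one common zero-shot scoring scheme is the masked pseudo-log-likelihood: mask each position in turn and sum the model's log-probability of the true residue. This is shown as an assumption; the notebooks' exact scoring may differ.

```python
# Sketch of zero-shot scoring by masked pseudo-log-likelihood (cf. notebooks 9-10).
# The scoring scheme is an assumed, common choice, not necessarily the notebooks' method.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t12_35M_UR50D"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_log_likelihood(seq: str) -> float:
    ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total = 0.0
    with torch.no_grad():
        for pos in range(1, ids.shape[1] - 1):  # skip BOS/EOS special tokens
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            logprobs = model(masked).logits[0, pos].log_softmax(-1)
            total += logprobs[ids[0, pos]].item()
    return total

# Higher scores indicate sequences the model finds more plausible.
print(pseudo_log_likelihood("MKTAYIAKQR"))
```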
The code was originally run on a single A100 GPU via Google Colab.
Citation:
@article{Clark2024,
  title = {Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning},
  ISSN = {2635-098X},
  url = {http://dx.doi.org/10.1039/D4DD00170B},
  DOI = {10.1039/d4dd00170b},
  journal = {Digital Discovery},
  publisher = {Royal Society of Chemistry (RSC)},
  author = {Clark, Joseph D. and Mi, Xuenan and Mitchell, Douglas A. and Shukla, Diwakar},
  year = {2024}
}