A toolkit for extracting embeddings from various protein language models (PLMs). This repository provides standardized interfaces for generating embeddings from protein sequences using different PLM architectures.
- ANKH: Large and Base models
- ESM:
  - ESM-2 (15B, 3B, 650M parameters)
  - ESM-1b (650M parameters)
  - ESM-1v (650M parameters)
- ProtT5: XL-U50
- ProteinBERT: Base model
- UniRep: Original implementation
- One-hot encoding: Basic sequence encoding
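To illustrate the simplest representation in the list above, one-hot encoding maps each residue to a 20-dimensional indicator vector. The following is a minimal sketch, not the repository's implementation: the helper name `one_hot_encode`, the alphabet ordering, and the choice to encode non-standard residues as all-zero rows are assumptions.

```python
# Hypothetical helper for illustration; not part of the repository.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 matrix (list of lists) for a protein sequence.

    Residues outside the 20 standard amino acids (e.g. X, B, Z) become
    all-zero rows, which is only one of several possible conventions.
    """
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            row[index] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("MKV")
print(len(encoded), len(encoded[0]))  # 3 20
```

Unlike PLM embeddings, this representation carries no learned context, which makes it a useful baseline when comparing models.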
├── data/ # Data directory
│ └── isoform/ # Isoform-specific data
├── sandbox/ # Main package directory
│ ├── plm/ # PLM implementations
│ │ ├── ankh/ # ANKH model
│ │ ├── esm/ # ESM models
│ │ ├── one-hot/ # One-hot encoding
│ │ ├── prot_t5/ # ProtT5 model
│ │ ├── proteinbert/ # ProteinBERT
│ │ └── unirep/ # UniRep model
│ └── src/ # Core utilities
├── scripts/ # Runtime scripts
│ ├── evaluate/ # Evaluation scripts
│ ├── plm/ # SLURM job scripts
│ └── process/ # Data processing scripts
- Clone the repository:
git clone https://github.com/cheeseman-lab/plm_sandbox.git
cd plm_sandbox
- Set up the embedding analysis environment:
conda env create -f environment.yml
conda activate emmentalembed
pip install -e .
- Set up the PLM environment and download models:
conda env create -f plm_environment.yml
conda activate plm
./setup_plm.sh
Convert your protein sequences into the required format. For example, isoform data is processed as follows:
from sandbox.src.process import process_isoform_data
process_isoform_data(
input_file='data/isoform/isoform_localization.csv',
output_label_file='output/isoform/process/isoform_labels.csv',
output_fasta_file='output/isoform/process/isoform_sequences.fasta'
)
To handle other types of protein data, add further functions to the sandbox.src.process module.
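A new processing function could follow the same pattern of one input file and two outputs (labels and FASTA). The sketch below is hypothetical: the function name `process_custom_data` and the column names `protein_id`, `sequence`, and `label` are assumptions and do not reflect the repository's actual schema.

```python
import csv
import os


def process_custom_data(input_file, output_label_file, output_fasta_file,
                        id_column="protein_id", sequence_column="sequence",
                        label_column="label"):
    """Hypothetical processing function: split a CSV into a label CSV and a FASTA file.

    Column names are assumptions. Returns the number of records written.
    """
    # Create output directories if they do not exist yet.
    for path in (output_label_file, output_fasta_file):
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)

    count = 0
    with open(input_file, newline="") as src, \
            open(output_label_file, "w", newline="") as labels, \
            open(output_fasta_file, "w") as fasta:
        reader = csv.DictReader(src)
        writer = csv.writer(labels)
        writer.writerow([id_column, label_column])
        for row in reader:
            sequence = row[sequence_column].strip().upper()
            if not sequence:
                continue  # skip records without a sequence
            writer.writerow([row[id_column], row[label_column]])
            fasta.write(f">{row[id_column]}\n{sequence}\n")
            count += 1
    return count
```

Keeping the record identifier identical in the label file and the FASTA header makes it straightforward to join labels with embeddings later.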
Each PLM provides a standardized entry point, an extract.py script in its own directory. Basic usage:
python sandbox/plm/<model>/extract.py -i input.fasta -o output.csv [additional_options]
Model-specific examples (note that argument conventions differ between models):
# ANKH
python sandbox/plm/ankh/extract.py -i input.fasta -o output.csv --model large
# ESM-2
python sandbox/plm/esm/extract.py esm2_t48_15B_UR50D input.fasta output_dir \
--include mean --concatenate_dir results/
# ProtT5
python sandbox/plm/prot_t5/extract.py -i input.fasta -o output.csv --per_protein 1
# One-hot encoding
python sandbox/plm/one-hot/extract.py input.fasta --method one_hot --results_path results/
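Options such as `--include mean` and `--per_protein 1` in the commands above appear to request a single vector per protein rather than one per residue. A common way to obtain such a vector is to average the per-residue embeddings; the sketch below shows that idea on toy data. It is an assumption about what these options compute, not a description of the scripts' internals, and `mean_pool` is a hypothetical name.

```python
def mean_pool(per_residue_embeddings):
    """Average an L x D list of per-residue vectors into one D-dimensional vector."""
    if not per_residue_embeddings:
        raise ValueError("cannot pool an empty embedding")
    length = len(per_residue_embeddings)
    dim = len(per_residue_embeddings[0])
    return [
        sum(vector[d] for vector in per_residue_embeddings) / length
        for d in range(dim)
    ]


# Toy example: 3 residues, 2 embedding dimensions.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(toy))  # [3.0, 4.0]
```

Mean pooling yields fixed-size vectors regardless of sequence length, which is what makes embeddings of different proteins directly comparable.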
For HPC environments, use the provided SLURM scripts in scripts/plm/:
sbatch scripts/plm/<model>.sh
As an example, we evaluate the similarities between pairs of embeddings in evaluate/evaluate_isoforms.ipynb.
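Similarity between two embeddings is often measured with cosine similarity. The sketch below shows one way to compute it for every pair in a small collection. It is illustrative only: the notebook may use a different metric, and the function names and toy identifiers here are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def pairwise_similarities(embeddings):
    """Map each unordered pair of identifiers to its cosine similarity.

    `embeddings` is a dict of identifier -> vector (list of floats).
    """
    return {
        (id_a, id_b): cosine_similarity(embeddings[id_a], embeddings[id_b])
        for id_a, id_b in combinations(sorted(embeddings), 2)
    }


# Toy embeddings with made-up identifiers.
toy_embeddings = {
    "isoform_a": [1.0, 0.0],
    "isoform_b": [0.0, 1.0],
    "isoform_c": [2.0, 0.0],
}
for pair, score in pairwise_similarities(toy_embeddings).items():
    print(pair, round(score, 3))
```

Cosine similarity ignores vector magnitude, so `isoform_a` and `isoform_c` in the toy data score 1.0 even though their norms differ.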
The project uses two conda environments:
- emmentalembed: For analysis and processing of embeddings
- plm: For running protein language models and generating embeddings
Main components:
- sandbox.plm: PLM implementations and extraction scripts
- sandbox.src: Core utilities for data processing
- scripts: Runtime and submission scripts