EmmentalEmbed

A toolkit for extracting embeddings from various protein language models (PLMs). This repository provides standardized interfaces for generating embeddings from protein sequences using different PLM architectures.

Supported Models

ANKH: Large and Base models
ESM:
  - ESM-2 (15B, 3B, 650M parameters)
  - ESM-1b (650M parameters)
  - ESM-1v (650M parameters)
ProtT5: XL-U50
ProteinBERT: Base model
UniRep: Original implementation
One-hot encoding: Basic sequence encoding

Project Structure

├── data/                     # Data directory
│   └── isoform/             # Isoform-specific data
├── sandbox/                  # Main package directory
│   ├── plm/                 # PLM implementations
│   │   ├── ankh/           # ANKH model
│   │   ├── esm/            # ESM models
│   │   ├── one-hot/        # One-hot encoding
│   │   ├── prot_t5/        # ProtT5 model
│   │   ├── proteinbert/    # ProteinBERT
│   │   └── unirep/         # UniRep model
│   └── src/                 # Core utilities
├── scripts/                  # Runtime scripts
│   ├── evaluate/            # Evaluation scripts
│   ├── plm/                 # SLURM job scripts
│   └── process/             # Data processing scripts

Installation

Clone the repository:

git clone https://github.com/cheeseman-lab/plm_sandbox.git
cd plm_sandbox

Set up the embedding analysis environment:

conda env create -f environment.yml
conda activate emmentalembed
pip install -e .

Set up the PLM environment and download models:

conda env create -f plm_environment.yml
conda activate plm
./setup_plm.sh

Usage

Processing Data

Convert your protein sequences into the required format: a label CSV and a FASTA file. For example, isoform data is processed as follows:

from sandbox.src.process import process_isoform_data

process_isoform_data(
    input_file='data/isoform/isoform_localization.csv',
    output_label_file='output/isoform/process/isoform_labels.csv',
    output_fasta_file='output/isoform/process/isoform_sequences.fasta'
)
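The internals of process_isoform_data are not shown in this README. As a rough sketch of what a processing function of this kind might do, the example below splits a CSV into a label table and a FASTA file. The function name csv_to_labels_and_fasta and the column names id, sequence, and localization are assumptions made for illustration; the real columns in isoform_localization.csv may differ.

```python
import csv
from pathlib import Path


def csv_to_labels_and_fasta(input_file, output_label_file, output_fasta_file,
                            id_col="id", seq_col="sequence",
                            label_col="localization"):
    """Split a CSV of proteins into a label CSV and a FASTA file.

    Column names are hypothetical defaults, not the repository's schema.
    Returns the number of records written.
    """
    # Create output directories if they do not exist yet.
    for out in (output_label_file, output_fasta_file):
        Path(out).parent.mkdir(parents=True, exist_ok=True)

    written = 0
    with open(input_file, newline="") as src, \
            open(output_label_file, "w", newline="") as labels, \
            open(output_fasta_file, "w") as fasta:
        reader = csv.DictReader(src)
        writer = csv.writer(labels)
        writer.writerow([id_col, label_col])
        for row in reader:
            seq = row[seq_col].strip().upper()
            if not seq:
                continue  # skip entries without a sequence
            writer.writerow([row[id_col], row[label_col]])
            fasta.write(f">{row[id_col]}\n{seq}\n")
            written += 1
    return written
```

Keeping the FASTA headers identical to the IDs in the label file makes it straightforward to join embeddings back to labels later.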

To handle other kinds of protein data, add further processing functions to the sandbox.src.process module.

Generating Embeddings

Each PLM has its own extract.py script. Most follow this basic pattern, though some models take different arguments (see the model-specific examples):

python sandbox/plm/<model>/extract.py -i input.fasta -o output.csv [additional_options]

Model-specific examples:

# ANKH
python sandbox/plm/ankh/extract.py -i input.fasta -o output.csv --model large

# ESM-2
python sandbox/plm/esm/extract.py esm2_t48_15B_UR50D input.fasta output_dir \
    --include mean --concatenate_dir results/

# ProtT5
python sandbox/plm/prot_t5/extract.py -i input.fasta -o output.csv --per_protein 1

# One-hot encoding
python sandbox/plm/one-hot/extract.py input.fasta --method one_hot --results_path results/
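For readers unfamiliar with the one-hot baseline, here is a minimal sketch of the idea in plain Python. It is not taken from sandbox/plm/one-hot/extract.py; the function names, the 20-letter alphabet ordering, the handling of non-standard residues, and the mean-pooling step are all assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 matrix (list of lists) for a protein sequence.

    Residues outside the 20 standard amino acids (e.g. X, B, Z) become
    all-zero rows; this is a choice made for the sketch.
    """
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix


def mean_pool(matrix):
    """Average over sequence positions to get one fixed-length vector."""
    if not matrix:
        return [0.0] * len(AMINO_ACIDS)
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]
```

Mean pooling a one-hot matrix yields the amino acid composition of the sequence, which gives a simple fixed-length baseline to compare against PLM embeddings.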

SLURM Scripts

For HPC environments, use the provided SLURM scripts in scripts/plm/:

sbatch scripts/plm/<model>.sh

Evaluating Embeddings

As an example, we evaluate the similarities between pairs of embeddings in evaluate/evaluate_isoforms.ipynb.
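The notebook's contents are not reproduced here. As an illustration of one common way to compare two embeddings, the sketch below computes cosine similarity in plain Python. The choice of cosine similarity and the function name are assumptions; the notebook may use a different metric.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors.

    Returns 0.0 if either vector has zero norm (a convention chosen
    for this sketch to avoid division by zero).
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Values near 1 indicate embeddings pointing in nearly the same direction, values near 0 indicate little alignment, and negative values indicate opposing directions.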

Development

Environment Management

The project uses two conda environments:

emmentalembed: For analysis and processing of embeddings
plm: For running protein language models and generating embeddings

Package Structure

Main components:

sandbox.plm: PLM implementations and extraction scripts
sandbox.src: Core utilities for data processing
scripts: Runtime and submission scripts
