Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"
GPT for protein sequences
- Lucidrains
- Agorians
```bash
pip install progen-torch
```
```python
import torch
from progen.model import ProGen

# Dummy batch of token ids: batch size 1, sequence length 1024
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,      # the size of the vocabulary
    dim=512,             # the dimension of the embeddings
    seq_len=1024,        # the length of the sequences
    depth=6,             # the number of layers in the model
    window_size=256,     # the size of the window for local attention
    global_mlp_depth=2,  # the depth of the MLP in the global attention mechanism
    heads=8,             # the number of attention heads
    dim_head=512,        # the dimension of each attention head
    ff_mult=4,           # the multiplier for the feed-forward network's hidden layer size
    ff_glu=True,         # whether to use a GLU activation in the feed-forward network
    attn_dim=None,       # the dimension of the attention mechanism (None defaults to `dim`)
    clamp_gate=True,     # whether to clamp the gate values in the GLU activation
    shift_tokens=True,   # whether to shift the tokens for the causal attention mechanism
    dropout=0.1,         # the dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits over the vocabulary for each position in the input sequence
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # torch.Size([1, 1024, 100])
```
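To generate new sequences, the logits can be sampled autoregressively. Below is a minimal sketch, assuming the model accepts prompts shorter than `seq_len`; the `generate` helper, the prompt length, and the temperature are illustrative and not part of the library's API.

```python
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens=64, temperature=1.0):
    # prompt: (batch, t) tensor of token ids
    model.eval()
    seq = prompt
    for _ in range(max_new_tokens):
        logits = model(seq)                        # (batch, t, num_tokens)
        next_logits = logits[:, -1] / temperature  # logits for the next position
        probs = torch.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one id per row
        seq = torch.cat([seq, next_token], dim=1)  # append and continue
    return seq

sample = generate(model, x[:, :16])  # continue a 16-token prompt
print(sample.shape)  # torch.Size([1, 80])
```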
Here is a table of the datasets used in the paper with metadata and source links:
| Dataset | Description | Source |
|---|---|---|
| UniParc | Contains protein sequences from various sources | https://www.uniprot.org/uniparc/ |
| UniProtKB | Contains protein sequences and annotations | https://www.uniprot.org/uniprot/ |
| SWISS-PROT | Curated protein sequence database | https://www.uniprot.org/swiss-prot/ |
| TrEMBL | Computer-annotated protein sequences | https://www.uniprot.org/trembl/ |
| Pfam | Database of protein families | https://pfam.xfam.org/ |
| NCBI taxonomy | Taxonomic classification of organisms | https://www.ncbi.nlm.nih.gov/taxonomy |
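These databases distribute sequence records in FASTA format. Below is a minimal sketch of parsing a downloaded FASTA file and mapping residues to token ids; the file name, alphabet, and id offsets are assumptions for illustration, not the paper's actual tokenization (which also includes conditioning tags).

```python
# Minimal FASTA parser; 'uniparc_subset.fasta' is a hypothetical local file
def read_fasta(path):
    sequences, current = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):  # header line starts a new record
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences

# Illustrative residue-to-id mapping over the 20 standard amino acids;
# 0 is reserved for padding, and unknown residues fall back to 0 here
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_of = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

seqs = read_fasta("uniparc_subset.fasta")
ids = [[token_of.get(aa, 0) for aa in seq] for seq in seqs]
```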
Here is a diagram showing the data preprocessing flow:
```mermaid
graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[SWISS-PROT] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]
```
The UniParc, UniProtKB, SWISS-PROT, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split into training, in-distribution (ID) test, and out-of-distribution (OOD) test sets in step H.
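As an illustrative sketch of this flow (not the paper's exact procedure), the merge can be thought of as deduplicating sequences across sources and then holding out entire protein families for the OOD split; the `sources` structure and split fractions below are assumptions.

```python
import random

def merge_and_split(sources, ood_fraction=0.1, test_fraction=0.05, seed=0):
    """sources: mapping of family id -> list of sequences, a toy stand-in
    for the merged UniParc/UniProtKB/Pfam records."""
    rng = random.Random(seed)

    # Deduplicate sequences within each family across sources
    families = {fam: sorted(set(seqs)) for fam, seqs in sources.items()}

    # Hold out whole families for the out-of-distribution test set
    fam_ids = sorted(families)
    rng.shuffle(fam_ids)
    n_ood = max(1, int(len(fam_ids) * ood_fraction))
    ood_fams, rest = fam_ids[:n_ood], fam_ids[n_ood:]
    ood_test = [s for fam in ood_fams for s in families[fam]]

    # In-distribution split: random held-out sequences from remaining families
    pool = [s for fam in rest for s in families[fam]]
    rng.shuffle(pool)
    n_test = int(len(pool) * test_fraction)
    return pool[n_test:], pool[:n_test], ood_test  # train, ID test, OOD test
```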
License: MIT