Chem-xLSTM

This repository provides the code necessary to reproduce the chemistry experiments presented in the paper Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences.

Quickstart

Installation

git clone https://github.com/ml-jku/Chem-xLSTM.git
cd Chem-xLSTM
mamba env create -f chem_xlstm_env.yml # you can also use conda ;)
mamba activate chem_xlstm
pip install -e .

This package also supports S4 (via s4dd), as well as Mamba and the Llama Transformer model. If you want to enable flash attention for the Transformer model, install flash-attn separately.
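For example, flash-attn is commonly installed with a command along the lines of the following (the exact invocation depends on your CUDA toolkit and torch version, so treat it as a hint rather than a pinned recipe):

pip install flash-attn --no-build-isolation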

Note: a minor modification to xlstm/blocks/slstm/src/cuda_init.py may be required: in the line os.environ["CUDA_LIB"] = os.path.join(os.path.split(torch.utils.cpp_extension.include_paths(cuda=True)[-1])[0], "lib"), the cuda argument has to be dropped, depending on your torch version. Be aware that this issue might exist in your setup.
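A minimal sketch of the adjusted line, assuming a torch version whose torch.utils.cpp_extension.include_paths no longer accepts a cuda argument:

# xlstm/blocks/slstm/src/cuda_init.py -- sketch of the local workaround, not the upstream code
import os
import torch.utils.cpp_extension  # both imports already exist in the original file

# The original line passed cuda=True; newer torch versions removed that argument,
# so the CUDA include paths are queried without it.
os.environ["CUDA_LIB"] = os.path.join(
    os.path.split(torch.utils.cpp_extension.include_paths()[-1])[0], "lib"
)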

Setting up the env was the hardest part ;)

Applications

For an easy start with Chem-xLSTM applications, we provide a sample notebook:

  • examples/train.ipynb: This notebook demonstrates how to train a Chem-xLSTM model. (currently under construction)

Repository Structure

  • chemxlstm/: Implementation of Chem-xLSTM.
  • data/: Train, validation, and test splits of the datasets (ChEMBL v31 and ICST).
  • scripts/: Scripts to reproduce experiments.
  • examples/: Example notebooks for Chem-xLSTM applications.

Pretraining

Data

The repository comes with both datasets: the ChEMBL v31 SMILES dataset as well as the In-Context Style Transfer (ICST) dataset for conditional molecule generation.
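If you want to peek at the data before training, the SMILES splits can be inspected directly from the zip archives. The snippet below is a small sketch; the file names inside the archives are discovered at runtime rather than assumed.

import zipfile

# Inspect the ChEMBL v31 training split shipped with the repository.
with zipfile.ZipFile("./data/chemblv31/train.zip") as zf:
    member = zf.namelist()[0]  # file name inside the archive is discovered, not assumed
    with zf.open(member) as f:
        for _, line in zip(range(5), f):
            print(line.decode().strip())  # expected to be one SMILES string per line (assumption)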

Model Training

To train an xLSTM model, run for example (adjust the batch size and model dimension to your GPU memory):

xLSTM 1.6M model, which takes ~8.5 GB of GPU memory; training one epoch takes ~25 minutes:

python chemxlstm/train.py --model_class xLSTM --model_dim 256 --state_dim 64 --n_layers 4 --n_heads 1 --n_max_epochs 100 --batch_size 512 --device cuda:0 --learning_rate 5e-3 --dropout 0.25 --vocab_size 37 --sequence_length 100 --logdir ./models/ --training_molecules_path ./data/chemblv31/train.zip --val_molecules_path ./data/chemblv31/valid.zip --patience 5 --delta 1e-5 --save_per_epoch 3 --no_denovodesign

xLSTM 14.8M model:

python chemxlstm/train.py --model_class=xLSTM --device=cuda:1 --n_layers=9 --model_dim=512 --n_heads=8 --batch_size=1024 --warmup_steps=4000

To train a model on in-context molecular generation, simply set --training_molecules_path=./data/icst/train.zip and --val_molecules_path=./data/icst/valid.zip:

python ./chemxlstm/train.py --model_class=xLSTM --device=cuda:0 --n_layers=9 --model_dim=512 --n_heads=8 --logdir=./models/icst/ --vocab_size=100 --sequence_length=2048 --training_molecules_path=./data/icst/train.zip --val_molecules_path=./data/icst/valid.zip --no_denovodesign --batch_size=8 --warmup_steps=400 --permute_augmentation --learning_rate=1e-3 --patience=10 --accumulation_steps=4

If you want to fine-tune a pretrained model, set the model path, e.g. --model_path=./models/xLSTM-14.8M-ed512_hid64_l9_he8_162/ (here you need to set --n_heads explicitly). The maximum sequence length and vocab_size of the pretrained model are overridden automatically if you state them explicitly on the command line.
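A hedged sketch of such a fine-tuning call, composed only from the flags shown above (the checkpoint path and hyperparameter values are placeholders, not recommended settings):

python chemxlstm/train.py --model_class=xLSTM --device=cuda:0 --model_path=./models/xLSTM-14.8M-ed512_hid64_l9_he8_162/ --n_heads=8 --n_layers=9 --model_dim=512 --vocab_size=100 --sequence_length=2048 --training_molecules_path=./data/icst/train.zip --val_molecules_path=./data/icst/valid.zip --learning_rate=1e-4 --batch_size=8 --no_denovodesign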

Evaluation

Evaluation of unconditional generation

If you remove the parameter --no_denovodesign from the training script, 102,400 molecules will be generated from the last model epoch at different temperatures. The resulting molecules can be evaluated by calling:

python chemxlstm/evaluate.py --model_path ./models/xLSTM-0.0M-ed64_hid64_l1_he1_270/epoch-006 --model_class xLSTM --n_heads=1

The evaluation is saved as metrics.csv in the corresponding folders (along with the cross-entropy loss per molecule as a .npy file).
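A minimal sketch for loading these outputs (apart from metrics.csv, the exact file names under the epoch folder are an assumption; adjust the path to your run):

import glob
import numpy as np
import pandas as pd

run_dir = "./models/xLSTM-0.0M-ed64_hid64_l1_he1_270/epoch-006"  # placeholder, taken from the example above

metrics = pd.read_csv(f"{run_dir}/metrics.csv")  # aggregated generation metrics written by evaluate.py
print(metrics.head())

for npy_path in glob.glob(f"{run_dir}/*.npy"):  # per-molecule cross-entropy losses
    losses = np.load(npy_path)
    print(npy_path, float(losses.mean()))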

Conditional Generation Evaluation

To evaluate the model on the Conditional Generation ICST dataset, run the following and adjust the parameters to your setup:

The parameter --n_context_molecules is the number of molecules used as context for each generation; --n_designs is the number of designs per context molecule.

python ./chemxlstm/evaluate_cond_gen.py --model_path="./models/icst/xLSTM-14.8M-ed512_hid64_l8_he8_465/" --model_class xLSTM --device cuda:0 --n_heads 8 --batch_size 32 --context_path=./data/icst/test.zip --n_designs 1024 --mode gen --n_context_molecules 1

Acknowledgments

The underlying code was adapted from the S4DD repository. We further include original code from the xLSTM repository. Thanks for the great work.

Citation

If you find this work helpful in your project, please cite:

@article{schmidinger2024bio-xlstm,
  title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
  author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
  journal={arXiv},
  doi = {},
  year={2024},
  url={}
}
