Ligand Expo molecules #81

Draft
wants to merge 1 commit into
base: main

Conversation

peastman
Member

This will eventually be a collection of molecules from Ligand Expo. I've written a draft of the script to identify molecules and variants we want to process. The current implementation of processLigand() is just a placeholder, showing a minimal implementation with RDKit. It takes a SMILES string and returns a new set of SMILES strings for variants. @jchodera will replace it with a better implementation.

I wrapped the calculation in a ThreadPoolExecutor on the assumption you would want to parallelize it. You can change it to a ProcessPoolExecutor if needed.
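
A minimal sketch of the parallel pattern described above, using only the standard library. `process_ligand` here is a hypothetical stand-in that returns the input SMILES as its only variant; it is not the RDKit placeholder from this PR, and `expand_all` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def process_ligand(smiles: str) -> set[str]:
    """Hypothetical stand-in: return the variant SMILES for one molecule.

    A real implementation would enumerate tautomers/protonation states;
    this one simply returns the input unchanged.
    """
    return {smiles}


def expand_all(smiles_list: list[str], max_workers: int = 4) -> dict[str, set[str]]:
    """Map each input SMILES to its set of variant SMILES, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so zip pairs inputs with results.
        results = executor.map(process_ligand, smiles_list)
        return dict(zip(smiles_list, results))
```

If the per-molecule work turns out to be CPU-bound Python code, swapping in `ProcessPoolExecutor` is likely the better choice; in that case the worker function has to be defined at module level so it can be pickled.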

@peastman
Member Author

We need to figure out how this is going to work. Based on the discussion in #67 it seems like we're leaning toward something like this:

  1. Select every Ligand Expo molecule up to some fairly small maximum size.
  2. Identify all low-energy tautomer/protonation states for each one (or perhaps just protonation states?).
  3. Somehow generate conformations for them.

@jchodera what do you suggest for generating the conformations? You seemed to have specific ideas about how to do it. As long as we only include very small molecules, we can reasonably include up to a few hundred thousand total conformations if necessary.

@peastman
Member Author

@jchodera any ideas about this?

@jchodera
Member

Apologies for the delay; the OpenMM/OpenFF renewal proposals ate up a bunch of time.

Ideally, the pipeline would use the following steps:

  1. Apply filtering rules (e.g. elements, min/max number of heavy atoms)
  2. Sort by popularity so that the most-used chemical components appear first in the list
  3. Expand protonation/tautomeric states with Epik, keeping states whose solution population exceeds a minimum threshold (e.g. exp(-6), which retains states up to a ~6 kT penalty)
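
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline itself: the `Component` record, the allowed-element set, the heavy-atom bounds, and the `usage_count` popularity measure are all hypothetical, and the per-state penalties (in units of kT, relative to the most populated state) are assumed to be supplied by whatever tool does the state expansion.

```python
import math
from dataclasses import dataclass

KT_CUTOFF = 6.0  # maximum penalty in kT, taken from the ~6 kT figure above

# Illustrative element whitelist; the actual filtering rules are still undecided.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"})


@dataclass
class Component:
    component_id: str
    elements: frozenset
    heavy_atoms: int
    usage_count: int  # hypothetical popularity measure


def select_components(components, min_heavy=1, max_heavy=30):
    """Steps 1 and 2: filter by elements and size, then sort by popularity."""
    kept = [
        c for c in components
        if c.elements <= ALLOWED_ELEMENTS and min_heavy <= c.heavy_atoms <= max_heavy
    ]
    return sorted(kept, key=lambda c: c.usage_count, reverse=True)


def keep_states(penalties_kt, cutoff_kt=KT_CUTOFF):
    """Step 3: keep states whose relative population exp(-penalty) is at least exp(-cutoff)."""
    min_population = math.exp(-cutoff_kt)
    return [state for state, penalty in penalties_kt.items()
            if math.exp(-penalty) >= min_population]
```

For example, `keep_states({"neutral": 0.0, "anion": 2.5, "rare": 7.0})` keeps the first two states and drops the one with a 7 kT penalty.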

In terms of how we specifically generate conformations, we've talked about one or both of two approaches:
A. After expanding protonation/tautomeric states, run MD at some temperature (300 K? 400 K?) with a surrogate potential, such as GFN2-xTB, in vacuum.
B. Before expanding protonation/tautomeric states, enumerate <10 conformers with the OpenFF Molecule.generate_conformers method (which can use either the RDKit or OpenEye toolkit), vary the protonation/tautomer states of each conformation, and then subject these directly to at most ~3 steps of optimization in an OptimizationDataset.

Questions

  • For (1), what filtering rules should we apply? Should we really use total number of atoms in [3,100] as the filter criterion, or filter on heavy atoms or molecular weight? Should we expand to more main group elements, which our level of theory seems to handle well, or would we need pseudopotentials to make these efficient?
  • For (3), I think we established that 6 kT was fine.
  • I think you prefer starting with (A), but it would be great to include a (B) dataset as well, even if just for the OpenFF level of theory
  • In terms of naming of each molecule, is there a preference? Can we use something like {PDB ID}_{conformer index}_{protonation/tautomer state index} (or some permutation)? Or should we be using IUPAC names? Or even SMILES?
  • Do we want to consider a separate dataset that includes transition metal elements?
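
To make the naming question concrete, here is a small sketch of the `{PDB ID}_{conformer index}_{protonation/tautomer state index}` option. The function names are hypothetical, and the field order is just one of the permutations mentioned above.

```python
def make_record_name(component_id: str, conformer_index: int, state_index: int) -> str:
    """Build a name like 'ATP_3_1' from the component ID and the two indices."""
    return f"{component_id}_{conformer_index}_{state_index}"


def parse_record_name(name: str) -> tuple[str, int, int]:
    """Invert make_record_name.

    Splitting from the right means a component ID that itself contains
    an underscore would still round-trip.
    """
    component_id, conformer, state = name.rsplit("_", 2)
    return component_id, int(conformer), int(state)
```

A scheme like this keeps names short and machine-parseable, whereas IUPAC names or SMILES strings would be longer and may contain characters that are awkward in file names or dataset keys.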

I should be able to tackle this early this week.

Tagging @wiederm for additional comment.

@peastman
Member Author

There are two distinct datasets we've talked about creating based on Ligand Expo. They're for different purposes.

One possible dataset would look at the effect of protonation. It would include pairs of molecules that differ only in the presence of a single hydrogen. They would be in identical conformations, so the only difference would be the extra hydrogen. This would be limited to very small molecules, maybe 10 atoms or so. Applying it to large molecules would be wasted computation. If two 100-atom molecules are identical except for a single hydrogen and are in identical conformations, most atoms see nearly identical environments and experience nearly identical forces.

The other possible dataset would be to increase our sampling of chemical space. It would include all molecules up to a fairly large size limit, maybe 100 atoms, possibly augmented with tautomers and protonation variants. That would be a lot of molecules, some of them quite large, so we would need to limit it to a very small number of conformations for each molecule, possibly only a single conformation.

An alternative we've discussed for sampling more chemical space is to use Enamine molecules. We would build either the Ligand Expo chemical-space dataset or the Enamine one, not both. They would have the same goal, and both would be very expensive.

Also note that the dataset from #72 already includes conformations for all Ligand Expo molecules with up to 36 atoms.
