Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance significantly affected by "flatness" of the compounds #757

Closed
chrisjonesBSU opened this issue Aug 23, 2023 · 6 comments · Fixed by #760
Closed

Performance significantly affected by "flatness" of the compounds #757

chrisjonesBSU opened this issue Aug 23, 2023 · 6 comments · Fixed by #760

Comments

@chrisjonesBSU
Copy link
Contributor

@marjanAlbouye and I were experiencing some very slow performance with gmso.parameterization.apply and have been doing some digging into what the problem could be.

It doesn't seem to be an issue of number of particles or molecules, and given the examples below, the only thing I can think of is that flat, planar molecules, or in this case polymers, are causing issues.

I'm curious to see if anyone else can re-create this, or has any ideas of what the issue could be. Possibly something related to graphs and sub graphs as a function of topology size and structure?

Examples:

1. Creating a 20mer poly(ethylene) from SMILES compared to a 20mer using the Polymer Builder.

  • Creating the chain from SMILES gives a very windy, twisty chain. Creating it from the polymer builder gives a flat, straight chain.
  • Applying a forcefield to the topology for the molecules from SMILES takes a few seconds. I'm not sure how long it takes for the system using the Polymer Builder because after letting it run for about an hour, it still isn't done.
import mbuild as mb
from mbuild.lib.recipes import Polymer
import gmso
from gmso.external import from_mbuild
from gmso.parameterization import apply
import forcefield_utilities as ffutils

import time

oplsaa = ffutils.FoyerFFs().load('oplsaa').to_gmso_ff()

# Polyethylene from SMILES

pe = mb.load("C"*40, smiles=True)
pe_box = mb.fill_box(compound=pe, box=[20, 20, 20], n_compounds=11)
pe_top = from_mbuild(pe_box)
print(pe_box.n_particles)

start = time.time()
apply(top=pe_top, forcefields=oplsaa, identify_connections=True)
print(time.time() - start)

1342 # Num particles
5.7355194091796875 # Time (seconds)

Screenshot from 2023-08-23 13-19-12

# Polyethylene from Polymer Builder

monomer = mb.load("CC", smiles=True)

chain = Polymer()
chain.add_monomer(monomer, indices=[2, -2], separation=0.145)
chain.build(n=20)

chain_box = mb.fill_box(compound=chain, n_compounds=11, box=[20, 20, 20])
chain_top = from_mbuild(chain_box)
print(chain_box.n_particles)

start = time.time()
apply(top=chain_top, forcefields=oplsaa, identify_connections=True)
print(time.time() - start)

Screenshot from 2023-08-23 13-19-45

I don't think it is an issue specific to the Polymer builder.

Example 2: Poly(benzene) from the polymer builder

Here, I choose different values for the bonding indices. One gives a twisty chain, one gives a perfectly flat and planar chain. The twisty one finishes in a reasonable amount of time. The flat one never finishes.

# Poly(benzene) from Polymer Builder

benzene = mb.load("c1ccccc1", smiles=True)
chain = Polymer()
chain.add_monomer(benzene, indices=[6, 8], separation=0.14)
chain.build(n=15)

box = mb.fill_box(compound=chain, n_compounds=10, box=[20, 20, 20])
top = from_mbuild(box)

print(box.n_particles)

start = time.time()
apply(top=top, forcefields=oplsaa, identify_connections=True)
finish = time.time()

print(finish-start)

1520 # Num particles
6.535366058349609 # Time (seconds)

Screenshot from 2023-08-23 13-16-46

# Flat Poly(benzene) from Polymer Builder

benzene = mb.load("c1ccccc1", smiles=True)
chain = Polymer()
chain.add_monomer(benzene, indices=[6, 9], separation=0.14)
chain.build(n=15)

box = mb.fill_box(compound=chain, n_compounds=10, box=[20, 20, 20])
top = from_mbuild(box)

print(box.n_particles)

start = time.time()
apply(top=top, forcefields=oplsaa, identify_connections=True)
finish = time.time()

print(finish-start)

Screenshot from 2023-08-23 13-17-57

I'm not super familiar with the graph/networkx operations GMSO is doing under the hood, but I wonder if there are some edge cases where you get stuck in an infinite while loop or for loop? For example, I have a bigger system than these examples (but not crazy big, only 13,000 particles) running on a cluster that is approaching 20 hours in the apply step (which is what prompted this issue). The polymer chains are totally flat like this benzene example.

@chrisjonesBSU
Copy link
Contributor Author

It seems that if I set identify_connected_components=False instead of True then the issue goes away.

@daico007
Copy link
Member

Yeah, I can see that issue, the identify_connected_components is a function called on the bond graph (to combine those that have similar bond graph), which supposed to help speed up the apply step. This option, however, only seems to work for solvent box specifically, but easily backfire (i.e., make things extremely slow, like in your case) in other cases.
Probably a good case/reason for it to be turned off by default.

@CalCraven
Copy link
Contributor

Hey @chrisjonesBSU, thanks for this issue. These examples are nicely fleshed out. So the GMSO apply step takes advantage of a lot of assumptions about repeated structure in a topology.

My guess is that for some reason, the polymer builder gets the residues correct in the flat molecule, and doesn't get it right in the twisted chain (looking just at example 2). You might gain a lot of information by printing out top.unique_site_labels("molecule"); top.unique_site_labels("residue") to see if that is different between the two examples.

You can also see if the identify_connections=False fixes anything. That step has been shown to be slow in the past, and is different from the atomtyping.

identify_connected_components is only good for lots of small molecules. Small numbers of large molecules is going to be inherently bad for it. For "most" cases it adds a lot of speedups. That's why it's the default. I can see a reason to remove that as the default though.

For this specific case (a polymer with many sub monomers), the ideal method for applying is actually with this argument turned on:
use_molecule_info: bool, optional, default=False A flag to determine whether or not to look at site.residue_name to look parameterize each molecule only once. Currently unused.
This would atomtype every residue the same way, so the monomer would only have to be applied once. But for now, we focused on making it work on many molecules, because that's a much simpler case, so that uses the identify_connected_components. I could also see a reason to rename these two arguments. One could be: many_molecules=True, and the other would be many_residues=True. That's more useful potentially to people that are trying to do the atomtyping with a particular system.

@chrisiacovella
Copy link
Contributor

A difference between smiles and polymer builder is the overall packaging (i.e., hierarchy). Could you try using flatten and/or condense on the chain before filling the box? There could be something in identify_connections that is getting stuck in an infinite loop because because of the underlying hierarchy.

@chrisjonesBSU
Copy link
Contributor Author

chrisjonesBSU commented Aug 23, 2023

Thanks for the detailed responses. It seems like a case could be made for either default value for identify_connected_components. It seems like the performance hit is potentially much worse (going from seconds to hours) if someone accidentally leaves it set to True when it should be False than if the opposite were to happen.

Whatever choice is made for the default value, maybe we add some warnings if possible, and definitely some more information to the doc strings?

@chrisiacovella
I called just the identify_connections() method on all of the topologies, and it is similarly fast for all of the examples.

@CalCraven
Copy link
Contributor

Also an added point is that use_molecule_info in the apply step claims it is unused in the docstring, but it actually is. In fact, you can turn off identify_connected_components if the molecules are all already labeled correctly in the topology, and you don't have to go down to compare the networkx graphs. Here is where it get's used, and here is the details about it in the parameterizer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants